LAMENT: A tale of constant struggle of what it's like trying to scale on ICP

saikatdas0790 · October 4, 2024, 11:30am

Some History

Yral, as the name suggests, has been designed and architected to “go viral”. We have always endeavoured to build an app that can scale and absorb traffic for millions of users. However, every time we start driving traffic, we’ve been bottlenecked by constraints and limits imposed by the IC network

We had started with a single canister architecture for the very first version of our app which was written in Motoko back in late 2021 - early 2022 back when we were called GoBazzinga. It started receiving some initial traffic but around the 10,000 active users mark we noticed significant bottlenecking and slowdowns which led us to realize that a single canister (which is single threaded) can only scale so much.

So, around late 2022, when we launched a new game and we rebranded to Hot or Not, we rewrote our entire backend to a multi canister system, where we would spin up individual canisters for every individual signing up. This would ensure that every user signing up would have enough compute/storage/bandwidth resources that they needed to do whatever we could possibly imagine for them to do.

We were acutely aware of the limitations of a single subnet, however, we’ve had multiple conversations with the DFINITY team during that time that subnet splitting was actively being worked on and would be available soon. Subnet splitting essentially allows a subnet to split into two and divide canisters into two groups and divide all canisters into two different subnets. We were hoping this would be able to solve our scaling challenges as every time a subnet for us became full, we would just split the subnet to two and continue scaling.

In the meantime, the Hot or Not game was a hit and we quickly started getting traction as the subnet started filling up. At its peak, we had 300k active users with 50k users signed in with their own canisters. At this point, we had to spill over to more subnets to be able to scale further. However, subnet splitting was still not available and we had to shut down sign-ups to the game as we were running out of canisters to spin up for new users.

This was devastating for us as we had to SHUT DOWN sign-ups to our app for MONTHS. All of this, while we were trying to complete our SNS sale. That was another whole can of worms. You think testing for SNS is difficult now? Imagine what it was like for us doing it as the 2nd ecosystem app on the IC doing the SNS sale independently.

So we do empathise with @jamesbeadle as we look at instances like this as we sense their clear frustration with what that entire process looks like.

Where we currently are

However, with a lot of sleepless nights and loads of effort, we managed to complete our SNS sale. Shout out to all of our investors and all the supportive friends who’d believed in us till this stage.

At this stage, we had a successful SNS sale and we had a treasury with which to fund our subsequent growth story.

This time, we truly wanted to build a backend that could scale to millions of signed in users and not have to rely on Dfinity shipping features like subnet splitting for us to scale.

So, we started building our own multi-subnet, multi-canister, dynamically load balancing backend. It required us to keep our heads down and build. During this time, we had to deal with a lot of criticism as we were not shipping any user facing features. We were also walking a path that no one had walked before and hence there was a certain uncertainty as to whether we would be able to pull it off.

However, early this year, we shipped our new backend that could dynamically load balance between multiple subnets and multiple canisters. And we were itching to put it through its paces and stress the entire network.

In the meantime, we shipped a bunch of user facing features like:

a completely new authentication mechanism that takes seconds to sign up instead of tens of seconds like II
a frontend written in Rust and WASM for the fastest experience that browsers can offer
a AI powered video recommendation feed that can recommend videos to users based on their preferences

Subsequently, we looked around at the market and noticed that memecoins are the flavour of the season and with our new systems in place, we could ship a memecoin creation platform in weeks that could rival and beat the best of the best and do it ALL ON CHAIN with support for ALL the open ICRC standards.

As we’ve started to scale the meme token platform, we’ve enabled every user to spin up their own tokens WITHOUT even needing to SIGN UP or PAY A SINGLE DOLLAR IN CYCLE COSTS. And we do it ALL ON CHAIN.

However, as we start to scale, almost every other app on the IC has started to bottleneck as we start to massively drive traffic to the IC on all the subnets.

Here’s @Manu from Dfinity acknowledging that most of the load is from Yral presence on all the subnets. He’s since retracted his original statement, but due to the way Discourse functions, you can still see it on this reply

Here’s links to a couple of threads which all raise the issue of instability and bottlenecks arising from us starting to drive traffic to all the subnets on the IC:

You can find way more, all originating around 20 or so days back, when we started with our first airdrop as documented here:

This entire effort also required a significant amount of cycles to fund. This is 1 such screenshot of us converting to cycles to fund our canisters.

We did this 4-5 times in the last 2-3 weeks. We’ve spent close to $200,000 in cycle costs in the last 2 months owing to how regressive cycle reservation and inefficient the current infrastructure is.

What are we asking for

As I was writing this, I refrained from posting the original draft as there was a lot of frustration that was surfacing from the REPEATED ROADBLOCKS that we run into every time we START SCALING AND DRIVING TRAFFIC, GAIN MOMENTUM, ONBOARD USERS but the momentum gets KILLED due to network constraints and bottlenecks that are outside our control.

This time as well, we’ve been noticing the short term discussion seems to be towards raising cost to the point that we’re artificially constraining high growth apps.

There’s recently been a new change to how subnets charge canisters called the “cycle reservation mechanism” which essentially increases the cost of cycles on a subnet exponentially as the subnet starts to fill up. This is essentially a mechanism to artificially constrain growth of apps on a subnet.

We’re affected significantly by this as some of the subnets that we’re on have started to have cycle costs where canisters are starting to reserve 2-5 Trillion cycles for a single canister. Imagine the possible cost for a fleet of canisters that currently number > 200_000 canisters and that intend to grow to a number larger than 1 Million canisters

We’re not asking for a free ride. We’re asking for a fair shot at being able to scale and grow our app without being artificially constrained by the network.

IF the IC intends to be a world computer, it needs to be able to handle a measly growth of tens of thousands of canisters without breaking a sweat. We’re not even talking about millions of canisters yet.

We’re asking for focus on improving the protocol drastically to add mechanisms for

easy migrations
load detection
dynamic load balancing

Exponentially growing cycle costs on subnets to limit apps from growing citing aversion to spiky traffic is absolutely the wrong way to go. The entire internet operates on spiky traffic from Black Friday sales to Taylor Swift concert ticket sales.

If the IC wants to grow, it needs to support the apps that are trying to scale and bring new users onboard instead of artificially trying to constrain their growth.

We will however continue building and breaking new ground, no matter what it takes.

Thank you for patiently reading through this.

– Just A frustrated developer trying to scale on the IC

Gwojda · October 4, 2024, 12:30pm

100% aligned here.
It’s even more frustrating when you know that there’s still a pair of unused nodes available for scale.
Moving a canister to another subnet is problematic because the canister id changes and you have to change it in several places each time, which is very annoying.
Ideally, a canister should be able to change subnet according to the load of the subnet in question, without changing canister ids. A bit like on kubernetes where pods can change nodes without any impact whatsoever, it’s transparent.
I imagine it’s easy to say, but quite difficult to implement and that’s why we are where we are actually : scalability issue, with tons of unused nodes.

Gabriel · October 4, 2024, 1:00pm

Hey, thanks for writing this up. Your frustration is almost palpable.

Out of curiosity do you have any benchmarks of a subnet limitation that you’ve run into?

Ajki · October 4, 2024, 2:18pm

You guys really had a tough journey, and it’s amazing you made it through. You’re the real pioneers in scaling on ICP, and DFINITY should definitely step up and support you.

jamesbeadle · October 4, 2024, 2:26pm

I agree with this, it’s been such a frustrating process. My codebase hasn’t really changed since I was ready to do my SNS at the start of the year. All I’ve done for months is stumble across incomplete documentation, bare bones dev tooling and disperate support for issues largely out of my control.

Just the sns-testing repo itself highlights the problem for me, the first part creates a local NNS to deploy the token, a really useful dev tool to make things easy. But then you actually try to get your project in there and it becomes an exercise in linux commands and shell scripts. I don’t understand how this repo hasn’t been updated to be fit for purpose in the last 6 months.

As for the scaling it looks like at least a x15 increase in price for a canister (2T to 32T min cycles to add to create with compute_allocation=1), add the additional compute costs I do hope this is only temporary.

I would just rather see more focus on getting what has been created to be perfect over running ahead in what seems like chasing the latest fad (AI) to increase the price of ICP.

Manu · October 4, 2024, 2:29pm

Hi @saikatdas0790!

I understand the challenges you’re facing. I fully agree that there is a lot of progress to be made in terms of balancing load on ICP, handling busy subnets more gracefully, perhaps allowing easier canister migration between subnets, and more. As I mentioned in another thread, we are very much looking into this and hoping to make significant improvements soon. The compute load on ICP has skyrocketed lately, and the canister count has doubled in the past year (from ~300k to 600k), so naturally some growing pains show up and DFINITY will prioritize addressing them. First improvements are coming in the replica version that DFINITY will propose today.

That being said, the two main issues you bring up (subnets currently not handling huge numbers of canisters well, and the cycle cost of a fleet of canisters) boil down to the architecture that Yral chose to follow, where it uses a new canister for every user. DFINITY R&D has repeatedly warned Yral that this is not a scalable architecture, urged the Yral team to revisit this choice, and offered help to make that change. Yral chose to stick to the one-canister-per-user approach, and now run into scalability challenges and the per-canister costs. My advice remains the same: don’t use a new canister per user, for projects that aim to onboard a big amount of users.

fwiw this is not true, i wrote exactly what you see now, nothing is “redacted”, what you’re seeing is another user’s comment modifying a quote of my message.

Manu · October 4, 2024, 2:36pm

This is not how it works. The cost of 1% compute allocation is roughly 35T a month, it does not “scale” the cost in any way. Because it’s a continuous cost, it changes the freezing limit of your canister. So if you have configured that you want your canister to have a freezing threshold of 30 days, adding a 1% compute allocation increases the freezing limit of your canister by roughly 35T. So when increasing a compute allocation, the system may tell you you should first add more cycles to avoid immediately freezing your canister.

saikatdas0790 · October 4, 2024, 7:59pm

These are not hard limitations that you’d usually run into. It would mostly start manifesting as timeouts or delayed execution as the subnets start to fill up.

To do a synthetic benchmark on mainnet, you’d have to generate significant load and burn cycles to be able to replicate.

We just try to make sound assumptions and ship product and then try and drive traffic and that automatically validates most assumptions that we are trying to validate.

So, yeah, benchmark/test on production

saikatdas0790 · October 4, 2024, 9:33pm

Thank you for the response. Before I provide counter arguments to the points made, let me preface this discussion with this relevant famous Steve Jobs’ incident. “Don’t hold your phone that way”

Instead of simply dismissing an earnest and heartfelt frustration delivered as feedback by one of the largest and long-time ecosystem builder (Yral), by platform and protocol builders (Dfinity), it would have been much more helpful if we have a conversation with an acknowledgement of the underlying inefficiencies of the protocol and what we can do to improve the situation because it’s only going to get worse from here on out.

Now, onto the meat of the discussion.

Let’s look at some numbers:

Snapshots of canister counts for subnets running into failures:

lspz2 → Yral - 24k canisters
fuqsr → old Bob/other miner? - 2k canisters
6pbhf → Yral - 24k canisters
e66qm → Yral - 25k canisters
bkfrj → current Bob - 20k canisters
3hhby → Yral - 21k canisters
nl6hn → Yral - 20k canisters
opn46 → Yral - 22k canisters
lhg73 → Yral - 25k canisters
k44fs → Yral - 29k canisters

Let’s compare with subnets hosting significant number of canisters:

eq6en → Dmail? + Yral - 91k canisters
2fq7c → OpenChat - 91k canisters
o3ow2 → Yral - 56k canisters

You can check out subnet stats here

The ones with significantly more canisters are running completely fine because projects stopped deploying to them close to the 450GiB memory limit. They are also not bottlenecked because activity on those subnets is relatively low as compared to the above ones where we are actively driving new traffic. This is evident from the MIEPs on those subnets.

So claiming the one-canister-one-user approach is the problem is DEFINITELY NOT THE ROOT CAUSE HERE.

Let’s look at currently who are the apps that are looking to onboard a large number of users or are already doing so:

OpenChat - lead devs are ex-Dfinity engineers. Uses a single canister per user model. Has overwhelming support from Dfinity for R&D and engineering. Not claiming favouritism, but pointing out that they are privy to a lot of early protocol related discussions owing to their close association with Dfinity. Them doubling down on an approach probably points to thoughtful consideration of pros and cons. Dfinity helped acquire dedicated subnets. Dfinity helped lower per canister cost from 1T cycles to 0.1T cycles to help reduce user acquisition cost
Dscvr - one of the earliest social Dapps on IC that had/has a significant user base. Pushed to the very limits of what’s possible on a single canister model even on multiple subnets. Dfinity engineers specifically consulted with them to help them alleviate bottlenecks while they migrated to a multi canister solution. Eventually gave in and moved to a multi-canister model where every community spawned on the Dscvr platform is a separate canister.
Dragginz - the most ambitious game yet on the IC with a Minecraft like world building capability. They also have these concepts of worlds/hubs. They’ve modeled them as a single canister per world/hub.
Catalyze - they quickly outgrew their single canister model and moved to a multi-canister model where every community gets a separate canister for hosting data
Yral - little ole us

Maybe NOT that LITTLE

Oversimplified, APPROX. 300_000 is us, 200_000 is OpenChat, rest is the entire ecosystem is what I approximate.

We can come back with hard numbers though

Now, let’s look at the alternative architecture that Dfinity is suggesting.

“Dfinity R&D” has suggested that instead of spinning up a new canister for every new canister, we should stuff as many users as possible into a single canister and when that canister is full, then spill over into a separate canister. Let’s look at the cons of that approach and the pros of the per canister approach

Cons of stuffing as many users as possible into a single canister:

Figure out sharding strategy both inside the canister and across canisters running on the same and different subnet. Lots of complexity here. Additional complexity as you write more code that breaches the 2MB limit
Instead of using platform provided sharding, you have to write your own sharding logic. This is a significant amount of work and complexity that you have to manage
Once individual user data outgrow the canister, you have to figure out how to atomically rebalance the data across multiple canisters, possibly across subnets
As far as I recall, a single canister can APPROX consume at max 1/4 of the maximum compute/storage capability available on a subnet. So, you still have to figure out how to scale across multiple canisters and multiple subnets, but now with the added overhead of managing internal data sharding inside canisters
Figure out robust/secure data boundaries between users in a single canister for data sharing between users inside a single canister but also account for user data living in other canisters

Pros of per canister approach:

Very closely follows the actor model which is the fundamental building block of the IC/canisters
Very clear upgrade/install mechanism for atomic upgrades to individual user canister data stores
Platform enforced memory and compute isolation between canisters
Platform enforced security isolation between canisters
A network of canisters message passing to each other is a clear manifestation of a social graph (which is what Yral is) with nodes message passing to each other via their edges
Spilling over to a new subnet on current subnet full is a very clear scaling mechanism of just deploying new canisters to a new subnet
High Growth in a single canister doesn’t require developer attention to rebalance as individual canisters have significant growth headroom

Let’s look at it from another perspective:

It has taken Dfinity over a year to ship subnet splitting, a data sharding mechanism for splitting subnet/canister loads, and it’s still not generally available considering Dfinity R&D has over 200 of the world’s best engineers/researcher as claimed.

It is QUITE UNFAIR to expect Yral, an org 1/100th the size of Dfinity, with access to a treasury, 1/1000th the size, an ecosystem project, to just figure the above out on their own with “help” from Dfinity.

This is definitely not a shot at Dfinity, but more of a grounding of facts to consider, when just pointing to us and expecting us to “make it work” without appreciating the kind of effort that goes into building something at this scale.

We are always grateful for all the help and support we receive(d) from Dfinity and we’re always open to suggestions and feedback. However, we’re also very clear about the kind of architecture we want to build and the kind of user experience we want to provide to our users.

Finally, do consider that we have significant plans to add more functionality to our canisters which requires us to maintain growth headroom in our canisters so that we can put more functionality in them without having to worry about running out of space. We don’t want to reveal all of our plans at this stage and hence it’s premature to just expect us to move to a different architecture without understanding the full context of what we’re trying to achieve.

Consider for example, we’ve been pushing towards user owned data/wallet canisters and that is simply not possible with the Dfinity suggested solution. There are nuances to such decisions that are only appreciated when you’re in the trenches building these systems.

However, we will continue to have free discussions and deliberations with Dfinity and remain humble and loyal towards any suggestion that is a short and long term solution.

Happy to hear more as things improve.

P.S.

I believe my assertion earlier about Manu redacting is incorrect and in this case that edit was made by another user. My apologies.

However the facts still remain that those loads are being caused by canisters we’ve been deploying owing to increased traffic that we’re driving.

zensh · October 5, 2024, 3:07am

When designing the horizontal scalability of ICPanda Message, I chose a canister pool model. A canister is randomly selected from the pool to create a message channel. Once the data in a canister reaches a certain threshold, it is removed from the pool and becomes a matured canister.

github.com

ldclabs/ic-panda/blob/main/src/ic_message/src/store.rs#L861


      
              channel_kek_key, ChannelInfo, ChannelKEKInput, ChannelTopupInput, CreateChannelInput,
          };
          
          pub async fn create_channel(
              caller: Principal,
              now_ms: u64,
              mut input: CreateChannelInput,
          ) -> Result<ChannelInfo, String> {
              let (channel_canister, profile_canister, price) = state::with(|s| {
                  let i = if s.channel_canisters.len() > 1 {
                      now_ms % s.channel_canisters.len() as u64
                  } else {
                      0
                  };
                  (
                      s.channel_canisters.get(i as usize).cloned(),
                      s.profile_canisters.last().cloned(),
                      s.price.clone(),
                  )
              });

senior.joinu · October 5, 2024, 7:47am

Hey, @saikatdas0790

If I understand correctly, this time you’ve hit network limitations during an airdrop(s)?
Could you please elaborate on what exactly caused such a big load? What was the workload and how many users were involved?

From the screenshot of your tweet it seems like the prize pool for the airdrop was only $1000. Since the prize was in BTC, I assume it was intended to be split among all airdrop participants. It seems kinda unusual that such a small potential reward could attract so many users, because the prize pool limits that number agressively.

Or there were other airdrops?

quinto · October 5, 2024, 1:13pm

Really appreciate your sharing of frustrations you have encountered at scaling apps on IC. It is a challenge that many of the fellow developers will face. But one thing that wasn’t made clear to me, before you ran into the subnet CPU bottleneck, why didn’t you start to use more subnets? There are 31 application subnets, and some of them are really empty…

saikatdas0790 · October 5, 2024, 1:51pm

We are in the process of spilling over. You’ll notice those subnets lighting up next week onwards. Most of them were made available recent-ish and we were just using our index canisters already on the current subnets. It was a simple matter of updating our config to target these new subnets.

Also, not all of them are deployable to. Only around 15-ish of them are deployable to for a general app.

Here’s an indicative list:
https://dashboard.internetcomputer.org/proposal/132409

jamesbeadle · October 7, 2024, 9:03am

Hi @Manu

So I am getting an error relating to cycles and this 1% thing, how does it know that it now needs 265T cycles? How do I calculate what a canister needs to run my app like before?

Manu · October 7, 2024, 9:45am

hey @jamesbeadle, could you share a canister id? What is this call doing? It looks like “save team” is also setting a compute allocation? Is it also growing memory a lot by any chance? What is your canister’s freezing_threshold?

The exact computation (from the interface spec) is

freezing_limit(compute_allocation, memory_allocation, freezing_threshold, memory_usage, subnet_size) = idle_cycles_burned_rate(compute_allocation, memory_allocation, memory_usage, subnet_size) * freezing_threshold / (24 * 60 * 60)

A 1% compute allocation should roughly cost 35T a month, so if your freezing_threshold is 30 days, I would expect it would require at most 35T extra cycles to increase the compute allocation by 1. Maybe your freezing_threshold is much bigger than 30 days? If your freezing_threshold would be configured to 300 days, then your canister would need to have a balance of ~350T so you can set a 1% compute allocation without immediately freezing your canister.

jamesbeadle · October 7, 2024, 10:46am

So here are my canister ids:

{
  "OpenFPL_backend": {
    "ic": "y22zx-giaaa-aaaal-qmzpq-cai"
  },
  "OpenFPL_frontend": {
    "ic": "5gbds-naaaa-aaaal-qmzqa-cai"
  },
  "OpenWSL_backend": {
    "ic": "5bafg-ayaaa-aaaal-qmzqq-cai"
  },
  "OpenWSL_frontend": {
    "ic": "5ido2-wqaaa-aaaal-qmzra-cai"
  },
  "data_canister": {
    "ic": "52fzd-2aaaa-aaaal-qmzsa-cai"
  }
}

So I’m not getting that error any more, it may come back. I get an error that I am out of cycles when saving:

So when I save (This is the first save) I create a canister:


    private func createManagerCanister() : async Text {
      Cycles.add<system>(50_000_000_000_000);
      let canister = await ManagerCanister._ManagerCanister();
      await canister.initialise(controllerPrincipalId, fixturesPerClub);
      let IC : Management.Management = actor (NetworkEnvironmentVariables.Default);
      let principal = ?Principal.fromText(controllerPrincipalId);
      let _ = await Utilities.updateCanister_(canister, principal, IC);

      let canister_principal = Principal.fromActor(canister);
      let canisterId = Principal.toText(canister_principal);
      
      activeManagerCanisterId := canisterId;
      return canisterId;
    };

But I have cycles in the canister calling this function:

and

It’s probably something in my code so I’m debugging.

jamesbeadle · October 7, 2024, 11:10am

So I haven’t added any cycles, same call, different error:

But I usually get this reject code undefined when interacting with management canister functions…

But the cycles haven’t changed:

berestovskyy · October 7, 2024, 12:59pm

Hey James,
The Internet Computer uses prepayment model, which I agree, sometimes might create challenges. I’m not a Motoko expert, so I’m assuming the Cycles.add<system>(50_000_000_000_000); means that we’e trying to create a new canister with ~50T cycles on its balance. Please correct me if I’m mistaken.

The IC prepayment model requires that any operation must leave canister in a state with sufficient cycles to maintain its existence for at least freezing_threshold seconds (~2.6M seconds or 30 days by default).

Given a 1% compute allocation, this means that any operation should leave the canister with at least 10M cycles * freezing_threshold seconds, or ~26T cycles on its balance.

Therefore, having 74T cycles on a balance and creating a canister with 50T cycles would leave the canister with just 74 - 50 = 24T cycles, which is less than the required freezing limit of 26T cycles.

I’ve submitted two PRs to make it more clear in the documentation, and simplify calculation.

jamesbeadle · October 7, 2024, 1:51pm

Did the canister section just get removed from the NNS? I swear it was under here, like an hour ago.

jamesbeadle · October 7, 2024, 1:54pm

Thanks for this will get some more cycles and try again as soon as the NNS canisters section appears again…

Topic		Replies	Views
Proposal idea : short term fix to the scaling issue General	2	137	October 8, 2024
Thoughts on Subnet Optimization and Scalability for Future Growth General	1	56	May 15, 2025
Path forward for subnet splitting and protocol scaling Developers	19	363	October 17, 2024
Voting is now live for a new proposal for the scaling of the world-computer NNS Governance	24	682	April 21, 2025
Canister Load Balancing (Community Consideration) Roadmap	5	1359	September 16, 2021

LAMENT: A tale of constant struggle of what it's like trying to scale on ICP

Some History

Where we currently are

What are we asking for

Related topics