Subnets with heavy compute load: what you can do now & next steps

Hi everybody! In recent weeks, we’ve seen the demand for computation on ICP increase significantly. This is of course great news, but it also leads to some challenges. There are now quite a few subnets running at or near maximum compute capacity. This means that canisters have to wait their turn to compute, so messages are processed with much higher latency. You may notice ingress messages being rejected with the error “timed out waiting to start executing”.

The main affected subnets are:

  • lspz2
  • fuqsr
  • 6pbhf
  • e66qm
  • bkfrj
  • 3hhby
  • nl6hn
  • opn46
  • lhg73
  • k44fs

You can see which subnet a canister is on by looking at the canister’s page on the dashboard, e.g. https://dashboard.internetcomputer.org/canister/ryjl3-tyaaa-aaaaa-aaaba-cai.

What can you do now?
If you are a canister developer experiencing issues from being on such a busy subnet, there are two main things you can do.

  1. Migrate your canister to another subnet, by creating a new canister on a quieter subnet (you can see the load of subnets on the dashboard; key things to look at are canister count, million instructions executed per second, and state size) and migrating the state of your canister over to the new subnet (this requires support in your canister code; there is currently no built-in support for it).
  2. Set a compute allocation. A compute allocation is a way for a canister to reserve compute capacity. It is expressed in percentage points, so setting a compute allocation of 5 means your canister is guaranteed to run in 5% of the rounds, i.e. at least once every 20 blocks, even if there are many other canisters that want to compute. The cycles cost is roughly $35 per month per percentage point on a 13-node subnet (a rough sketch of how to set this follows below).
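
As a rough illustration (not official guidance; the exact helper types differ between ic-cdk versions, and the same thing can be done from the command line with `dfx canister update-settings --compute-allocation 5 <canister id>`), here is a minimal sketch of a controller canister setting a compute allocation of 5 on another canister via the management canister:

```rust
// Hypothetical sketch: a controller canister raising another canister's
// compute allocation by calling the management canister's update_settings.
// The argument structs are redefined here to keep the example self-contained;
// field names follow the public IC interface spec.
use candid::{CandidType, Nat, Principal};

#[derive(CandidType)]
struct CanisterSettings {
    controllers: Option<Vec<Principal>>,
    compute_allocation: Option<Nat>, // percentage points, 0..=100
    memory_allocation: Option<Nat>,
    freezing_threshold: Option<Nat>,
}

#[derive(CandidType)]
struct UpdateSettingsArgs {
    canister_id: Principal,
    settings: CanisterSettings,
}

async fn reserve_compute(target: Principal) -> Result<(), String> {
    let args = UpdateSettingsArgs {
        canister_id: target,
        settings: CanisterSettings {
            controllers: None,
            // Guarantee execution in ~5% of rounds, i.e. at least once every 20 blocks.
            compute_allocation: Some(Nat::from(5u64)),
            memory_allocation: None,
            freezing_threshold: None,
        },
    };
    // "aaaaa-aa" is the management canister.
    let mgmt = Principal::from_text("aaaaa-aa").map_err(|e| e.to_string())?;
    ic_cdk::call::<_, ()>(mgmt, "update_settings", (args,))
        .await
        .map_err(|(code, msg)| format!("update_settings failed: {:?} {}", code, msg))
}
```

Note that only a controller of the target canister can call update_settings, and (as far as I understand) the call is rejected if the subnet cannot guarantee the requested allocation.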

Are there improvements to the protocol coming?
Yes, DFINITY is working on two short-term improvements.

  1. The scheduler, which decides which canisters get to run when, can be improved. It currently only guarantees fairness in the sense of which canister gets to be scheduled first in a round. However, the canister that gets the first chance to run may not need the full round duration, in which case other canisters can do more work, but this extra work is not factored into future scheduling. This means that under certain workloads, some canisters get significantly more compute time than others. We plan to upgrade the scheduler to slightly change the definition of fairness, which should reduce latency for canisters on busy subnets. This change may already be included in the replica version that DFINITY proposes later this week.
  2. The canister execution layer limits the amount of work it does every round such that it ends in a timely fashion (it aims for at most 1 second). However, it looks like for certain types of workloads, in particular with many distinct canisters processing small messages, it is too conservative in the work it schedules, meaning it could actually do a lot more work without exceeding the desired time limit. That means that with some adjustments to the bookkeeping, the compute throughput of a subnet could increase significantly, alleviating the congestion. We hope such an improvement can land in the coming weeks.

Are there other ideas for improvements?
Yes, there are many ideas. Subnets can be split, we can imagine further improvements to the scheduler, we could add other ways for canisters to request priority on busy subnets, we could further increase the compute capacity of subnets by using more cores, we could make migrating canisters to a different subnet easier, etc. However, for now I believe it’s best to focus on the two short-term improvements mentioned above, and then see where we stand.

Please feel free to ask questions here, and we’ll keep this thread up to date on developments regarding these busy subnets.


Are subnet ranges two-dimensional? In the case of subnet splitting, if I have two canisters on a subnet but there are canisters in between, it would be nice to be able to move them to a new subnet together.


The routing table assigns ranges of canister IDs to subnets. By default, every subnet starts off with a range of something like 1M canister IDs (so canister IDs 0 through 999,999 are mapped to the NNS, canister IDs 1,000,000 through 1,999,999 to whichever subnet was created after the NNS, and so on).
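
As a toy illustration only (this is not the replica’s actual implementation, and real canister and subnet IDs are principals rather than plain integers), you can think of the routing table as a sorted set of ranges with a “which range contains this canister ID” lookup:

```rust
use std::collections::BTreeMap;

// Toy model of the routing table: each entry maps the *start* of a canister-ID
// range to (end of range, subnet). Plain integers and strings stand in for the
// real principal-based IDs to keep the sketch short.
struct RoutingTable {
    ranges: BTreeMap<u64, (u64, &'static str)>, // start -> (end_inclusive, subnet)
}

impl RoutingTable {
    fn route(&self, canister_id: u64) -> Option<&'static str> {
        // Find the last range starting at or before this canister ID,
        // then check that the ID actually falls inside it.
        self.ranges
            .range(..=canister_id)
            .next_back()
            .and_then(|(_, &(end, subnet))| (canister_id <= end).then_some(subnet))
    }
}

fn main() {
    let table = RoutingTable {
        ranges: BTreeMap::from([
            (0, (999_999, "NNS subnet")),
            (1_000_000, (1_999_999, "second subnet")),
        ]),
    };
    assert_eq!(table.route(42), Some("NNS subnet"));
    assert_eq!(table.route(1_500_000), Some("second subnet"));
    assert_eq!(table.route(5_000_000), None); // not yet assigned to any subnet
}
```

With splits at single-canister granularity, such a table would need one entry per retained or migrated range, which is where the blow-up described below comes from.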

Subnet splitting allows for arbitrary ranges (down to single canisters) to be retained by the original subnets, with the remaining canisters being migrated to a newly created subnet. So technically you could have all even canisters stay behind and all odd canisters migrate. However, there are two issues with that:

  1. We don’t know which canisters belong together. There is no concept of a “canister group” or “dapp” from the point of view of the protocol (you may know that canisters A, B and C are part of dapp X and canister D is part of dapp Y, but no one else does).
  2. Even if we did know, it would probably be a bad idea to go down to single-canister granularity. Right now the routing table holds exactly one range for every subnet (except for the NNS and the II subnets, which each have two ranges after the II canister was manually migrated between them). So the routing table is doing just fine with something like 40 ranges defined.
    If we were to take the extreme approach described above and retain all even-numbered canisters while migrating all odd-numbered ones, we’d go from roughly 40 routing table entries to at least tens of thousands. If we did this repeatedly, we’d end up with a humongous routing table and significantly increased message routing times. So we’d first have to come up with a more scalable routing table implementation.

Hey Manu, thanks for providing a few options forward here.

The current fee for a 1 percent compute allocation per second is 10M cycles. If, as per your suggestion, I wanted to use a 5 percent compute allocation and guarantee execution once every 20 blocks (every few seconds), then this would cost over 4 trillion cycles a day.

10 million * 5 (percent compute allocation) * 86,400 seconds per day = 4.32 trillion cycles per day. This is a significant cost increase for most apps, especially if they run relatively simple and cost-effective logic (on the order of millions or single-digit billions of cycles burned per day).
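
For scale, assuming the usual peg of 1 trillion cycles = 1 XDR (roughly USD 1.3): 4.32 trillion cycles per day is about 130 trillion cycles per month, i.e. roughly $170 per month, which lines up with the ~$35 per percentage point per month figure quoted above (5 × $35 = $175).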


In most of these cases with heavy compute load, it seems like it’s just one application/party that’s causing the load.

Instead of asking pre-existing canisters (with less heavy usage patterns) to move or pay a higher cost to stay, maybe it makes sense to start offering reserved subnets tailored to high-usage/compute apps? There are a number of pre-existing applications on many of these subnets, and it’s certainly easier to ask one party to move than to get every single independent party to move.


Good that it is possible, but yes…seems like a scalability issue. I was thinking about a system where someone gets in over their head on a ‘shared’ subnet and would want to move to a reserved subnet without having to change all the canister IDs.


The upcoming scheduler improvement should (famous last words) significantly improve the situation with the currently observed workloads (something like 1k canisters with heartbeats per subnet). Without a compute allocation, the average canister should get scheduled something like once every 10 rounds (or every 5 seconds) instead of once every few minutes.

Of course, it’s still possible to imagine (or deploy) workloads that would restore the status quo (e.g. if someone deployed thousands of canisters that just ran for loops until they trapped due to running out of instructions). But subnets are basically shared virtual machines (just as if you shared an AWS VM with a few thousand other services).

That is why we must devise ways other than asking people (whether the ones producing the load, or everyone else) nicely to please migrate to a designated subnet. Compute allocations are one such approach (where you essentially reserve a fraction of the subnet for your canister, which is why they’re not exactly cheap). Subnet splitting is another (admittedly blunt) tool. We will need to devise other ways of dealing with load (e.g. explicitly designating interactive vs. batch canisters or endpoints; granular, fast and ideally automated canister migration; congestion pricing; canister groups; etc.). But all of these will require time and effort to design and build. And we need to do this incrementally: apply the proposed improvements, see what the biggest remaining pain points are, decide what to improve next, rinse and repeat.


That would be canister migration, which can either be an extension of subnet splitting or implemented separately, using some form of subnet storage. Either way, it’s not a trivial undertaking. The existing subnet splitting process took something like half a year to implement, and is still a highly manual (but verifiable) and slow process (the subnet would be down for hours). So it’s all doable; we just need to decide on its priority relative to other features (e.g. best-effort messages or small guaranteed responses), and then spend the 6 months to a year designing and implementing whatever we decide should be built next.


What is the number of available instructions that can be shared by all canisters on a single subnet per block?

Let’s say someone was deploying canisters that run at a compute allocation of 100 and just run for loops; each canister has access to 5B instructions per round (before DTS kicks the computation over to the next round).

How many canisters running 5B instructions at full compute would it take to hit subnet compute capacity, and would trying to give additional canisters a compute allocation of 100 fail at some point (because the subnet would have no additional compute to offer)?


The subnet has 4 virtual cores and 3 of them can be reserved via compute allocation.
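
In concrete terms, if the reservable capacity is the sum of compute allocations across all canisters on the subnet, 3 reservable cores correspond to a total budget of 300 percentage points: at most three canisters at an allocation of 100, or, say, sixty canisters at 5. Once that budget is exhausted, a further request for a compute allocation would (as far as I understand) simply be rejected.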


I forgot about DTS. In the compute allocation 0 scenario, if an execution spills over into DTS, does it get priority for scheduling? Or is it possible it’s going to have to wait in the round-robin to pick up again?


The AWS comparison falls a bit short here, when we’re talking about everyone having to share the same machine with the same capacity and processing capabilities. While one-size-fits-all is nice to start with, at some point developers want the flexibility to choose the box (subnet) that works best for their specific use case instead of needing to conform to a one-size-fits-all model. That is why AWS offers a large variety of compute options with clearly defined processing power that isn’t dependent on unpredictable third-party tenant usage.

There are plenty of examples of replica optimizations rolled out to ICP mainnet this year specifically to help scale a particular canister usage pattern, such as the ones made by the replica team to improve checkpointing performance on the OpenChat subnet, which has many canisters but where most of them are idle during a round (low compute).

This optimization doesn’t help much in the Bob/Yral case where you have heartbeats or frequent timers running in all of the canisters (many active canisters with high compute) and can actually cause a hit to performance (if I remember the Global R&D slide correctly). I’d imagine that AI subnets will fall into this category as well.

Trying to then engineer for both scenarios, you end up getting less performance benefit than if you were to build two different subnet types (high performance vs. low compute). There’s also the “storage” subnet type that’s been discussed regularly, although that is more of a hardware/network topology configuration difference.

I’m all for batch endpoints (shameless :electric_plug: for canister_status endpoint and icrc transfer/transfer_from endpoints), but this doesn’t solve the compute capacity problem. Right now, I’d love a solution that provides different subnet types for different intended usage patterns (different hardware & software optimizations), and then provides developers with a happy path to explicitly migrate their canisters from one subnet (type) to another.


That may well be, but right now we’re nowhere near having to worry about that. The first problem we have is that the scheduler implementation has some issues that need to be fixed.

The second, equally big problem is that whatever you optimize for, there is no way to get much useful behavior out of a subnet with tens of thousands of canisters as soon as a few thousand of those actually want to get work done every single round: at that point, it doesn’t matter whether the subnet is optimized for latency or for throughput, you’re not going to get either. It’s as simple as that.


The scheduler segregates DTS executions from “new” executions (we can’t call them “non-DTS”, because for all we know any new message execution could turn into a DTS execution). This is because DTS executions may need to be aborted before a checkpoint, so it’s simply impossible to provide the same kinds of guarantees for them as for short executions (e.g. given one core and 2 canisters that each want to be scheduled 50% of the time, as soon as you abort one canister you’ve broken the promise to one or both). Instead, we provide a strict guarantee for short executions and a somewhat less strict guarantee (e.g. 50% of rounds that did not result in an abort) for DTS executions.

So, after taking into account the total compute allocation of all canisters with DTS executions vs. the total compute allocation of all canisters with “new” executions, some cores are allocated for DTS executions, the remaining cores for “new” executions (which is why you can only allocate 3 out of 4 cores: if DTS should get 1.4 cores and new executions 1.6, then they each get 2).

Finally, within each group, we end up going round-robin over the canisters within (with individual canisters moving back and forth between groups, as necessary).

Hope that answers your question.


Looking forward to seeing the impact of these changes next week!

There seems to be friction here between trying to support a canister-per-user model (100k to 100 million canisters) and pushing teams to develop architectures that align with the current scalability strengths of the IC.

If the number of independent, active canisters is truly a hard scalability limitation, then, following the example of incentivizing behavior through pricing (compute allocation pricing as a stick vs. a carrot), the protocol could charge significantly more for canister creation (by 10-20x, to 1-2 trillion cycles) and reservation (a fee per live canister over time) to encourage mono/several-canister architectures instead of those that spin up 100k to 1 million canisters. I wouldn’t use the graded storage reservation mechanism pricing (charge more above a certain threshold), because this would just incentivize apps to use up initial canister resources on multiple subnets.


I’m all in favor of charging more. Current fees for virtually everything are up to one order of magnitude too low anyway; and beyond that, like you, I would love for high / progressive fees to be used to disincentivize specific behavior (although, again like you, I have doubts regarding progressive fees that penalize canisters that happen to be on busy subnets in the absence of quick and easy canister migration).

Regardless, I also have my doubts regarding the likelihood of success of a proposal to raise fees by 10x for anything.


We discussed in the Event Utility WG that it might be interesting to allow attaching cycles to a call to specifically bid on bumping to the front of the line for the target canister’s scheduler (I guess in theory all incoming requests would be collaborating on the bid, since messages have to be scheduled first). Not sure how this would work for ingress… but I’ve always been for canister wallets anyway… But I guess this whole issue is a huge threat to that canister wallet concept as well. No one likes their wallet to get BoB’d.


I have canisters that control neurons via ECDSA-signed HTTPS outcalls, which requires the public keys of the canisters. When canisters are migrated to other subnets, does this affect their key signatures in any way? I want to confirm that when canister migration becomes available, I’ll be able to migrate canisters that control neurons without losing access to those neurons.


Small update: the next replica version that DFINITY will propose (link) will include changes to the scheduler and the round limits, i.e. both of the potential short-term improvements that I mentioned in the first post. This means that we may see some improvements on the busy subnets next week.


Great news, thanks @Manu! I’m looking forward to having a gander at the changes this weekend :smiley: