Suggested measures to reduce latency and improve ICP scalability

TL;DR

DFINITY’s suggested next steps to improve the handling of high load on certain subnets on mainnet:

  1. Continue focusing on replica improvements
  2. Remove heartbeats from SNS canisters
  3. Propose adjustments to cycles pricing following motion proposal 133388
  4. Propose changes to the target topology to add new subnets

In the mid-term, DFINITY will concentrate on enabling canister migration to facilitate scaling the ICP by adding new subnets and more effectively balancing the load across them.

Background: High mainnet load

The load on mainnet has increased significantly since mid-September, and continued to grow rapidly over the following month. It is great to see more adoption, but this rapid growth also led to increased latency: on many subnets, the replicas could not process all messages, leading to high latency and even ingress messages timing out before they could be processed. This was discussed in a separate forum topic.

There were two main causes that led to these subnets not handling the load well.

In ICP, reaching agreement on messages and the actual processing of those messages are separated. The blockchain decides which messages are accepted, and every time a new block is finalized, an “execution round” is triggered. Every execution round is limited in how much work (measured in “instructions”) it performs, such that it completes in roughly a second. That means that if a block contains many messages that take significant processing, not all messages from that block can be processed immediately in the next execution round, and some messages may wait in canister queues to be processed later. A scheduler component decides, every round, which of the messages waiting in canister queues are processed on the different execution threads.
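To make the carry-over concrete, here is a minimal sketch (in Rust, with made-up numbers; this is not replica code) of a round that stops once its instruction budget is exhausted, leaving the remaining messages queued for the next round:

```rust
// Assumed per-round instruction budget; the real limit is a protocol parameter.
const ROUND_INSTRUCTION_LIMIT: u64 = 2_000_000_000;

/// Process queued messages (each entry is an estimated instruction cost)
/// until the round's budget runs out; returns how many were processed.
fn execution_round(queue: &mut Vec<u64>) -> usize {
    let mut used = 0u64;
    let mut processed = 0;
    while let Some(&cost) = queue.first() {
        if used + cost > ROUND_INSTRUCTION_LIMIT {
            break; // this message (and everything behind it) waits for the next round
        }
        used += cost;
        queue.remove(0); // "execute" the message
        processed += 1;
    }
    processed
}

fn main() {
    // Five messages of 800M instructions each: only two fit per round.
    let mut queue = vec![800_000_000u64; 5];
    let mut round = 0;
    while !queue.is_empty() {
        round += 1;
        let n = execution_round(&mut queue);
        println!("round {round}: processed {n}, {} still queued", queue.len());
    }
}
```

With these assumed numbers, the backlog drains over three rounds instead of one, which is exactly the queueing effect described above.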

The scheduler ensured fairness only in selecting which canister gets the first chance to run in a given round on a given thread. Being chosen as the first canister in a round happens infrequently if many canisters have messages to execute in that round. For example, with 20k active canisters on a single subnet, 4 compute threads, and 1.5 blocks & execution rounds per second, every canister would be scheduled first only once every 20000/4/1.5 = 3333 seconds, or roughly 55 minutes. Luckily there is often time left in an execution round for other canisters to do work, but the scheduler did not factor that into future scheduling decisions, so this extra time in the round was not fairly distributed. With such a high number of active canisters, this led to many canisters being scheduled only very infrequently.
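As a back-of-the-envelope check, the arithmetic above can be reproduced directly (the inputs, 20k canisters, 4 threads, and 1.5 rounds/s, come from this post; nothing else is assumed):

```rust
/// Average time between "first slot" turns for one canister, if only the
/// first-scheduled canister per thread per round is picked fairly.
fn first_slot_interval_secs(canisters: f64, threads: f64, rounds_per_sec: f64) -> f64 {
    canisters / threads / rounds_per_sec
}

fn main() {
    let secs = first_slot_interval_secs(20_000.0, 4.0, 1.5);
    // Prints: 3333 s (~55.6 min) between first-slot turns
    println!("{secs:.0} s (~{:.1} min) between first-slot turns", secs / 60.0);
}
```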

This scheduler issue was addressed by changing the logic of the scheduler to also factor in which canisters could use “leftover computation” in execution rounds even though they were not scheduled first. However, this change alone performed worse on the workload we are seeing on mainnet today, which brings us to the second problem.

Every canister is executed in a sandbox for security reasons, making sure canisters are isolated and cannot maliciously read data from other canisters, or bring down the replica if they find and exploit a weakness in the WebAssembly runtime. However, starting and stopping all these sandboxes is additional work. For this reason the replica maintains a cache of sandbox processes to help speed this up. While executing canisters, the replica might need to evict some of the older sandbox processes if there’s no room in the cache and bring up new ones. If the replica tries to do this too quickly, the system slows down due to thrashing. This limits how many distinct canisters can process messages in every execution round.
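Conceptually, the sandbox cache behaves like a bounded LRU cache of processes. The following is a hypothetical sketch (the names and capacity are assumptions, not the replica's actual data structures) showing why many distinct active canisters combined with a small cache means a cache miss, and a sandbox start, on almost every access:

```rust
use std::collections::VecDeque;

struct SandboxCache {
    capacity: usize,
    live: VecDeque<u64>, // canister ids with live sandboxes; front = most recently used
}

impl SandboxCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, live: VecDeque::new() }
    }

    /// Returns true if a new sandbox process had to be started (a cache miss).
    fn acquire(&mut self, canister_id: u64) -> bool {
        if let Some(pos) = self.live.iter().position(|&c| c == canister_id) {
            let c = self.live.remove(pos).unwrap();
            self.live.push_front(c); // cache hit: just refresh recency
            return false;
        }
        if self.live.len() == self.capacity {
            self.live.pop_back(); // evict the least recently used sandbox
        }
        self.live.push_front(canister_id);
        true // miss: starting a sandbox process costs real time
    }
}

fn main() {
    let mut cache = SandboxCache::new(2);
    // Three active canisters, capacity two: every access is a miss (thrashing).
    for id in [1, 2, 3, 1, 2, 3] {
        println!("canister {id}: started new sandbox = {}", cache.acquire(id));
    }
}
```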

The load on mainnet consisted of a huge number of small heartbeats across many different canisters, which is exactly the kind of workload that is difficult for the replica to process.

The situation finally improved significantly on October 15th with a new replica version, which included the scheduler changes and an increase in the number of sandboxes the replica keeps cached. Since this version, we see very few ingress messages expire on any subnet, but latencies remain elevated on certain subnets.

Suggested short term measures to reduce latency

DFINITY is working hard on bringing further improvements and considers this a top priority. We propose the following immediate next steps:

  • Continue to focus on replica improvements that handle many active canisters better. Concretely, the plan is to further increase the number of canister sandboxes that can remain cached, such that the replica can better handle a large set of active canisters. Additional improvements to the scheduler will also be considered as a second step.

  • A big part of the load originated from many instances of the SNS, each of which contains multiple heartbeat canisters. Note that there are far more instances of the SNS canisters than just the ones that went through a decentralization sale. New versions of the SNS canisters have been created and are in the process of being adopted by the NNS; the new versions no longer use heartbeats but timers, which significantly reduces the cycles consumption and load on the system incurred by these canisters (see the sketch after this list). DFINITY will try to encourage and support upgrading all SNS instances to these new versions.

  • It was observed that this specific load pattern caused a lot of load, but did not burn a huge amount of cycles. A guiding principle should be that a subnet at capacity burns more cycles than node providers receive in rewards. This was also brought forward in adopted motion proposal 133388. DFINITY will propose a concrete change to certain cycles costs, with the aim of ensuring that all workloads have a cycles cost in line with the load they cause on a subnet. We’ll share this on the forum later today. link

  • DFINITY will propose updating the target topology to include more subnets, and if adopted, propose to create more subnets such that more compute capacity is added to ICP. We’ll discuss this in more detail on the forum in the coming days. link
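Regarding the heartbeats-to-timers point above, the difference can be illustrated with a small sketch, assuming the ic-cdk and ic-cdk-timers crates. A heartbeat handler is invoked every single round whether or not there is work to do, while a timer only triggers an execution when its interval elapses:

```rust
// Old style: executed every round (~once per second), causing load and
// burning cycles even when there is nothing to do.
#[ic_cdk::heartbeat]
fn heartbeat() {
    // check whether there is work to do... usually there isn't
}

// New style: the closure only runs when the timer fires.
use std::time::Duration;

#[ic_cdk::init]
fn init() {
    ic_cdk_timers::set_timer_interval(Duration::from_secs(60), || {
        // periodic work, once a minute instead of every round
    });
}
```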

Outlook: ICP scalability and load balancing

The above is mainly focused on ensuring that each subnet can process a lot of load, and that this load costs a proportional amount of cycles. However, this is not the core of ICP’s approach to scalability. ICP’s high level approach to scaling can be summarized as follows:

  • Every subnet has finite “replicated” capacity, but capacity can grow by adding subnets
  • Load can be balanced over subnets
  • A subnet’s query capacity can grow by adding more nodes to the subnet

In ICP today, the weakest link in this story is load balancing: some subnets were highly loaded while others were not, but the load could not easily be balanced over the subnets. There is basic support for splitting a subnet into two via a sequence of NNS proposals, which creates a new subnet that takes over half the load of an existing subnet. There are a few challenges with subnet splitting.

  • Subnet splitting is driven by the NNS. This means that individual dapp controllers cannot make their own decision on what latency is acceptable.
  • One major challenge is that colocation of canisters matters. Two canisters on the same subnet can communicate much faster and with higher throughput than canisters on different subnets. Similarly, composite queries currently only work with canisters on the same subnet. This means that it is critical that in a subnet split, the right canisters remain together on the same subnet post-split. However, there is not complete freedom in defining how the canisters are split, due to how canisters are routed to subnets, and it’s difficult for NNS participants to know what a “good” split would be.

Things would be simpler if canister controllers could individually decide to migrate to another subnet. Today, ICP does not offer built-in support for canister migration, meaning you can only manually migrate your data to a new canister id on another subnet. Changing the canister id, however, can be impactful because, for example, it changes how others talk to your dapp and changes the threshold signing key your canister has.

Next steps: DFINITY plans to focus in the mid-term on supporting canister migration natively in the protocol. That means a canister controller could choose to migrate their canister to another subnet without changing the canister id. We believe this feature would unlock the full scalability of ICP as this would provide a full solution to balance load over subnets. Every developer can make their individual subnet choice, and ensure that canisters that benefit from being colocated remain on the same subnet. Canister migration would also enable better utilization of subnets. For example, compute-heavy dapps would likely move to subnets with little compute load, and storage-heavy dapps would migrate to subnets with lots of free storage. This will likely lead to every subnet getting a nice mix of dapps that together require a mix of resources, increasing the overall work every subnet does and the cycles it burns. There are still big design challenges to overcome to enable canister migration, so we cannot give an accurate timeline, but this is what we’ll focus on, and we’ll make sure to share the progress.

Discussion

We look forward to hearing your thoughts, and we’re happy to answer any questions.


Hey @Manu!

Thank you for posting this write-up, I am grateful that the community can participate in these conversations, we are learning a lot.

I have a few thoughts and questions on this specific part.

I think the above plan, where canister developers can migrate their canisters to a specific subnet of their choice, is great for subnets where a single entity controls who can install canisters. If I am building canisters on the NNS/system subnets, Utopias, or subnet rentals, then this feature sounds very useful.

However, for public subnets where anyone can deploy canisters, there are a few scenarios where that plan might run into problems.

Scenario #1: If a few subnets get full at the same time due to the rapid growth of one or more dapps, all the other dapps on those subnets would, at the same time, quickly look for a low-load subnet to migrate to. It’s likely that they would all pick the same subnet (or the same few subnets), since they will all be looking for the same low-load qualities/indicators. When that happens, the new subnets can get full just as fast, since all the other dapps are migrating to them at the same time. There is no way for dapp developers to coordinate with each other on which subnets to migrate to, so as to balance the load of the whole protocol optimally. This would cause dapp developers to frantically (since their customers are waiting) keep migrating their canisters to different subnets, hoping others don’t pick the same subnets. Every dapp would probably end up having to try multiple different subnets to even things out. This would be very inefficient and might even cause more load across the whole network, as many dapps migrate all of their canisters to the same subnets multiple times at once.

Scenario #2: If some bad actors want to DoS a specific dapp-service, the above plan makes it very easy for them, and there is no way for the dapp-service to stop it. The bad actors can deploy a few thousand canisters running heartbeats to the same subnet as the dapp-service. When the dapp-service tries to migrate to another subnet (under the proposed plan), the bad actors can just follow it and migrate the bad-actor canisters to the same subnet! And there is nothing the dapp-service can do to stop it.


Thanks for this thoughtful reply @levi!

Before going into the two scenarios, I would like to make it clear that I wouldn’t say everything is fully done once we have canister migration. My point is that I believe canister migration is the most valuable thing we can do now to make scaling ICP easier. You can still imagine many things that would be helpful after that, e.g.:

  • give canisters a better way to deal with temporary surges of activity on a subnet. Right now, the only option is allocations, but this does not fit all use cases. Maybe there can be some option for canisters to express they’re willing to pay more whenever things are busy to get more priority.
  • Some form of automatic load balancing. So developers can choose to migrate themselves, but there is also some automatic thing that coordinates load balancing on ICP (perhaps opt-in?).

These ideas are clearly not fully baked yet, but are just examples of things that could still be improved.

So coming back to your scenarios: I agree that scenario 1 could happen, although I also think it wouldn’t be so bad. Different canisters will likely have different latency requirements, so some would migrate earlier than others. Also, more subnets would likely help: if we have e.g. 50 subnets, and some are suddenly busy while all others are not heavily loaded, then it seems likely to me that the load would spread. In other words, the bigger ICP is, the less impactful small bursts of usage are, and the easier it is to spread the load. In the long term, perhaps the automatic load balancing idea that I mentioned above could help here. However, for that too, the first step we need is the ability to move canisters from one subnet to another.

Wrt scenario 2: I also agree that this could happen. As mentioned above, I think some better mechanism that allows a canister to make sure it remains highly responsive even when the subnet is busy could be helpful. And the proposed cycles cost changes would at least make it expensive to DoS.

So in summary, I’m not claiming everything is perfect once we have canister migration, but I do think it would be a huge step in the right direction and that it’s the first thing we should focus on. Does that make sense, or do you see it differently?


Quick update regarding

  2. Remove heartbeats from SNS canisters

With https://dashboard.internetcomputer.org/proposal/133803 being adopted, the latest SNS wasms are now all free of heartbeats. This means that each newly launched SNS DAO is expected to be much more efficient; please upgrade your existing SNSs at your earliest convenience (via the usual process with SNS proposals).


Thanks to Dfinity for working hard to build a system that remains scalable, available, and as cost-effective as possible.

Curious if that could be a zero-downtime migration?

Like others on this thread, I’m not sure about the impact of giving any canister the ability to migrate. It sounds like it could create new attack vectors on the IC, but I understand it’s probably the best solution we have at the moment.

Have we also considered a load balancing feature that will direct traffic to similar canisters (same signature) on different subnets?


Thank you for sharing this info and having this discussion with the community.

One concern from our side is that there are certain types of dapps which simply cannot be migrated to other subnets, as their business logic (e.g. key derivation) is tied to one specific subnet already. As such, a migration would erase all the client accounts in that dapp.

We wonder, in this case, whether there is a way to make sure dapps can still run smoothly and stably even under heavy load on their subnet. It looks to us more important to figure out a way to prevent canister spamming on a subnet. Maybe imposing a higher creation fee is worth trying?


Zero, no. Even with a canister migration implementation based on the existing subnet splitting implementation (where the canister does not need to be stopped and both the original subnet and the new subnet “collaborate” to provide the usual message ordering and delivery guarantees), you still need to “suspend” the canister on the original subnet; move its state to the new subnet; then resume it. If the canister has 100 GB of state, there is no way to do this in “zero” time, even if you went wild and implemented something akin to the state sync protocol (which can rely on an earlier checkpoint plus deltas to hugely reduce the amount of data that needs to be transferred).

That being said, the alternatives we have been discussing for controller-managed canister migration avoid going through the NNS as much as possible (you may still need to update the routing table), so there won’t be support for this kind of transparent migration without stopping the canister. You will (ideally, although you may choose not to) stop your canister; take a snapshot; make a management canister call to migrate your canister; somehow transfer your snapshot to the destination subnet (not yet entirely clear how); restore the snapshot; and start the canister on the new subnet. Pretty much all of this process can be automated, but it will result in downtime, at least for updates.
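As a rough illustration of that flow, here is a hypothetical sketch using the ic-cdk management-canister bindings. The stop/start calls are real; the snapshot-transfer and re-routing steps are comments only, because (as noted above) the corresponding system support does not exist yet:

```rust
use candid::Principal;
use ic_cdk::api::management_canister::main::{
    start_canister, stop_canister, CanisterIdRecord,
};

/// Hypothetical controller-driven migration flow; steps 2-4 are placeholders.
async fn migrate_canister(canister_id: Principal) {
    // 1. Stop the canister so its state stops changing (real call).
    stop_canister(CanisterIdRecord { canister_id })
        .await
        .expect("stop failed");

    // 2. Take a snapshot of the canister state (the management canister
    //    offers take_canister_snapshot; arguments elided for brevity).
    // 3. Transfer the snapshot to the destination subnet: "not yet
    //    entirely clear how", per the text above.
    // 4. Update routing so the canister id points at the new subnet
    //    (hypothetical; no such management call exists today).

    // 5. Restore the snapshot and start the canister again (real call).
    start_canister(CanisterIdRecord { canister_id })
        .await
        .expect("start failed");
}
```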

A canister with the same (Wasm) signature will not have the same state (e.g. I can’t just start using a different ledger with the same API and expect it to have my balance). So I don’t think this is something generally useful for load balancing. But you can probably implement something like this yourself in the client, there’s no need to wait for the protocol to implement FE load balancing.

Key derivation is linked to the canister ID, not the subnet. So if you can migrate the canister to a different subnet but retain its canister ID, everything should Just Work™.

Compute allocations give you (much of) that. Namely, you get a hard guarantee that your canister gets to execute a “full round” every N rounds (you pick and pay for an N between 1 and 100).

Edit: Let me qualify that a bit: you get a “full round” at least once every N rounds; and regardless of subnet load. I.e. in the average case, you still get to run every single round, as does everyone else; in case the subnet is badly backlogged and the average canister is scheduled once in a blue moon, you get scheduled once every N rounds.
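For reference, requesting a compute allocation is a one-off settings change by a controller. A minimal sketch, assuming the ic-cdk management-canister bindings (the 20% value is just an example, corresponding to a guaranteed full round at least once every 5 rounds):

```rust
use candid::{Nat, Principal};
use ic_cdk::api::management_canister::main::{
    update_settings, CanisterSettings, UpdateSettingsArgument,
};

/// Reserve a 20% compute allocation for `canister_id`: a guaranteed full
/// round at least once every 5 rounds, paid for whether it is used or not.
async fn reserve_compute(canister_id: Principal) {
    update_settings(UpdateSettingsArgument {
        canister_id,
        settings: CanisterSettings {
            compute_allocation: Some(Nat::from(20u32)),
            ..Default::default() // leave all other settings unchanged
        },
    })
    .await
    .expect("update_settings failed");
}
```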

The subnet could still suffer from an induction bottleneck (i.e. too many ingress and XNet messages, so a backlog builds up in the block makers), but with a bit of fairness / load balancing thrown into the block makers, this should be less of an issue. And we can even consider something similar to compute allocation for ingress messages if this turns out to be a problem.


I guess a hidden assumption might be that state and compute are co-located in canisters. Another architecture would be to have stateless ‘frontend’ canisters that take the load (serving assets, …) and call stateful canisters to handle state transitions (which may be on a different subnet and may have reserved compute to ensure availability). At a later stage, the developer could also decide to add state to the ‘frontend’ canister to improve availability (with eventual consistency).

My guess is that the ‘unified’ canister model (which is fully CP) will be harder and more expensive to scale.

The nice part is that canisters don’t force you to approach this one way or the other (as e.g. AWS Lambda would). You can decide by yourself whether stateless FE plus stateful BE canisters; or mostly stateless FE plus stateful BE; or the single canister acting as both FE and BE; works best for you. You can spread your canisters across subnets or keep them together on the same subnet (and you’ll eventually be able to shuffle them around, as needed).

Yes, but I don’t really see how to build a highly-available system without the load balancing part. I would like not to have to bet on a specific subnet being available given that subnet DDoS attacks will very likely happen.

You don’t need to bet on a specific subnet. You can create FE canisters on multiple subnets. And you will be able to (initially manually, then in some sort of automated fashion) migrate the canisters across subnets.

As for the load balancing itself, you can have said canisters collect stats (latency, throughput) on themselves and (eventually) the subnets they are on. Then you can query them from the client (regular non-replicated query) and randomly pick one of the less busy ones. Or have a canister collect said stats and ask it which canister to send your transaction to. It’s less convenient than if the protocol did it for you, sure. But OTOH it gives you more freedom. And it can be built as a general-purpose load-balancing library / system, so you build it once and reuse it everywhere.
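A minimal sketch of the canister-side half of that idea (the names are made up; this is just the pattern, not an existing library): each FE canister tracks a simple load metric and exposes it via a cheap query that clients poll before deciding where to send real traffic:

```rust
use std::cell::Cell;

thread_local! {
    // Naive load metric: requests handled since the last reset.
    static REQUEST_COUNT: Cell<u64> = Cell::new(0);
}

#[ic_cdk::update]
fn handle_request() {
    REQUEST_COUNT.with(|c| c.set(c.get() + 1));
    // ... actual work ...
}

/// Cheap, non-replicated query: the client calls this on each FE canister
/// and randomly picks one of the least busy ones.
#[ic_cdk::query]
fn load_stats() -> u64 {
    REQUEST_COUNT.with(|c| c.get())
}
```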

Eventually something like this will be built into (or onto) the IC. We just haven’t gotten around to it.


Hi @Manu,

The bigger ICP is, the more bursts of usage there will be at the same time, so I don’t think it will be different. Since there will also be more bursts of usage across the whole network, it might even make things more chaotic, as the whole of scenario #1 would happen at different parts of the network at the same time, with each instance of scenario #1 increasing the inefficiency of the whole network. Does that make sense?

If the bad actors can choose which specific subnet to migrate to, then they will just set their bad canisters to automatically follow wherever the dapp-service goes, even if the dapp-service opts in to protocol-managed migration. This would cause an indefinite DDoS that no one can stop.

How expensive is it, maybe $500-$1000 per day? For the ability to create an indefinite DDoS on any chosen dapp-service, that is too vulnerable in my view.

Would you be ok with hosting something like the internet-identity canister on a public subnet with this configuration?

If dapp-developers can choose which specific subnet to migrate to on public subnets, then that would open up the possibility of scenarios #1 and #2.

What do you think if we start with the automatic protocol-managed load balancing, and then, if we figure out a way to let dapp developers choose which specific subnets to migrate to without opening up the possibility of scenarios #1 and #2, we can open up that option?

Looking forward to hearing your thoughts.

How will this work with certificates and witnesses? They have some subnet material in the witness, don’t they?

This is a threat now so I don’t think giving more options makes things worse.


Locking-in a known threat as a feature of scalability might not be the best move.

For the automatic protocol-managed canister migration and load balancing, canisters would not be able to choose which subnet they live on. Canister creation would always be on a random subnet, and the protocol would control which canisters move to which subnets and when. This takes away the possibility of a DDoS targeting a specific dapp. First, a bad actor would not be able to choose where to create a canister. To get N canisters on the same specific subnet, a bad actor would need to create {N * number-of-subnets-in-the-whole-network} canisters on average. As the network grows, this becomes harder. Second, if someone does do that and drives up the load of a specific subnet, the protocol can spread the load throughout all the subnets in the whole network. The protocol can send the canisters with the newest load increases onto many separate subnets, and the more subnets there are as the network grows, the easier it is to spread the load. If a bad actor can’t direct its load, it can’t target anything. If someone tries to DDoS the whole network, that is a different case, and is solved by creating new subnets and spreading the load, and maybe a way to delete subnets that are no longer in use.
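As a quick sanity check of the expected-value claim (a hypothetical simulation, not anything protocol-specific): with random placement over S subnets, landing N canisters on one specific subnet takes about N * S creations on average:

```rust
// Simulate creating canisters on random subnets until `target_count`
// of them land on one specific subnet; repeat and average.
fn main() {
    let (subnets, target_count, trials) = (40u64, 100u64, 1_000u64);
    let mut total_creations = 0u64;
    let mut seed = 42u64;
    for _ in 0..trials {
        let (mut hits, mut creations) = (0u64, 0u64);
        while hits < target_count {
            // xorshift PRNG, to keep the example dependency-free
            seed ^= seed << 13;
            seed ^= seed >> 7;
            seed ^= seed << 17;
            creations += 1;
            if seed % subnets == 0 {
                hits += 1; // this creation landed on the target subnet
            }
        }
        total_creations += creations;
    }
    // Prints roughly target_count * subnets = 4000.
    println!("avg creations: {}", total_creations / trials);
}
```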

There are many possible load balancing strategies, and the protocol will be in a position to handle them as long as the protocol stays in control of the specific subnets that canisters live on. It can start with a simple rule: if a subnet reaches some level of sustained load, spread that subnet’s canisters throughout all subnets that have space, and create new subnets if there is no space on other subnets.

If canister-controllers can choose which specific subnets to move to, the protocol would have no chance to manage it.

In my view it looks like the best path forward is to start with the automatic protocol-managed canister-migration and load balancing throughout the network.


Fully agreed. This is a point that I myself brought up a number of times. Particularly since, if your goal is to discredit or otherwise negatively impact someone, you don’t need to DoS them for days or weeks at a time. 5-10 minutes a few times per day during peak traffic is enough to drive any reasonable user away.

The problem is that there’s no straightforward solution: you may be able to differentiate between attacker and target, but that’s not a given. And if you cannot, or you misidentify the attacker, then you only make things worse (whether by actively blocking the victim, or by charging them insane amounts of cycles for the privilege of being attacked).

I will point out that, given a reasonably fast and efficient implementation of manual migration, there should be no meaningful difference in how long it takes to move / load balance canisters manually vs. automatically. There’s a higher chance that independent users deciding to migrate their canisters may overshoot the mark compared to a centrally managed load balancer; but then we may not even want a centrally managed load balancer, so it’s quite possible that a decentralized load balancing implementation will sometimes overshoot the mark too (and migrate too many canisters to an apparently idle subnet).

Point is, there is nothing magical about automatic load balancing vs user-driven canister migration. Automating user-driven canister migration simply means having a user-controlled canister (instead of a human) drive the migration.

The other issue is that, as per the above, automatic load balancing requires all the moving pieces of user-controlled canister migration, plus a controller on top. So if we were to go straight for the former, it would just mean a longer time without any means of balancing load. And a first implementation would likely be unable to handle all kinds of dapp architectures, resulting in dapps being spread out over dozens of subnets, with latencies in the tens of seconds, because we decided that 5-second latency was too much.

What Manu is suggesting is to build automatic load balancing, but in an incremental fashion. IMHO that’s a lot better than deciding we know exactly what’s needed and trying to deliver the finished product directly, on time and under budget.


That’s simply how the certified state is certified. The subnet threshold signs the certified state, so that when you look at e.g. an ingress response you get

  • the response itself, as part of a pruned state tree;
  • a witness that is basically a bunch of hashes that are consistent with each other plus the hash of the ingress response; and
  • a certificate from the subnet, showing that the hash at the root of the witness has indeed been signed by a supermajority of the subnet’s replicas, and including a delegation from the NNS showing that this is an actual IC subnet.

Moving a canister to a different subnet simply means that it’s now a different subnet certifying the ingress response. So you get a different state tree (which is always the case, even on the same subnet, if you query it in different rounds); a different witness that matches this new state; and a certificate for the new root hash, except it’s signed by the new subnet, with an NNS delegation showing that this new subnet is an actual IC subnet. So NNS -> subnet A -> witness1 -> ingress response changes to NNS -> subnet B -> witness2 -> ingress response.
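For reference, the shapes involved look roughly like this (a simplified sketch of the structures from the IC interface spec; field types abbreviated):

```rust
// Simplified sketch of the certificate shape: migrating a canister only
// changes which subnet key signs the witness, not the mechanism itself.
#[allow(dead_code)]
struct Certificate {
    tree: HashTree,                 // pruned state tree containing the response
    signature: Vec<u8>,             // threshold signature over the root hash
    delegation: Option<Delegation>, // present for all subnets except the NNS
}

#[allow(dead_code)]
struct Delegation {
    subnet_id: Vec<u8>,
    certificate: Vec<u8>, // NNS-signed certificate vouching for the subnet key
}

#[allow(dead_code)]
enum HashTree {
    Empty,
    Fork(Box<HashTree>, Box<HashTree>),
    Labeled(Vec<u8>, Box<HashTree>),
    Leaf(Vec<u8>),
    Pruned([u8; 32]), // hash of a subtree the verifier does not need
}
```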


That does not work. Two canisters on the same subnet are often able to do several messaging roundtrips within the same round. Two canisters on different subnets need something like 5 or 6 rounds for a roundtrip; and this is a physical limitation, not something that can simply be optimized away later.

If we were to force every new canister to be created on a random subnet; and/or randomly shuffle canisters belonging to an app and designed to work together across subnets; this would make it impossible to implement certain kinds of architectures. At the very least, you’d need to build monolithic (i.e. single canister) dapps, which limits you to a single thread of execution; whereas, if you carefully break down your application into components and across canisters you can handle an order of magnitude more load.

It is tempting, when trying to solve a problem (such as IC load balancing), to only look at the problem at hand and design a solution exclusively for it. And it’s possible, for people not going over the same limitations in their mind day in and day out, to not be acutely aware of other constraints or characteristics of the protocol. OTOH, having my head stuck in it all the time, I will often miss obvious, clearly beneficial changes or approaches.

So while I still don’t see automatic load balancing as a silver bullet, keep the ideas coming. It is definitely going to solve a lot of problems, but it needs a lot of work even to get off the ground; and if we give dapps the possibility to control how they’re distributed across subnets, which we must, then there will still be the possibility to abuse the system. Engineering is all about compromises.
