Subnets with heavy compute load: what can you do now & next steps

As Manu said, the proper way to address this is canister migration (ideally along with a way for canister controllers to designate, e.g., which canisters should stay together on the same subnet and which groups of canisters should be hosted on different subnets, plus rules such as “only on GDPR subnets”). Then we could automatically balance load across the whole of the IC. However, something like this is (very optimistically) at least a year out, even if we started working on it full time on Monday.

The systemic problem you are hinting at is the combination of limited resources (after all, a subnet is a replicated virtual machine that must run, with all of the overhead of consensus, deterministic computation, virtualization, encryption, sandboxing, etc., on top of a replica machine) and up to 100k competing canisters crammed onto a single subnet. If you think of a canister as a process (which is what it is), then there is no way to give even 1% of those canisters a reasonable share of the CPU, should they all want it at the same time. Try running 1k processes, each doing a meaningful amount of active computation, on your laptop or desktop (the comparison is more or less fair, given all the overhead mentioned above).
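To make the analogy concrete, here is a minimal sketch (the helper names `busy_work` and `run_contention_demo` are illustrative, not IC code) that spawns CPU-bound processes and measures wall time. Once the process count exceeds the core count, each process only gets a slice of the CPU, so total wall time grows roughly linearly with the number of processes, which is exactly what happens to canisters competing for a subnet's compute:

```python
import multiprocessing as mp
import time

def busy_work(duration_s: float) -> None:
    # Spin the CPU for roughly `duration_s` seconds of active computation.
    end = time.monotonic() + duration_s
    x = 0
    while time.monotonic() < end:
        x += 1

def run_contention_demo(n_procs: int, work_s: float) -> float:
    # Launch n_procs CPU-bound processes at once and measure total wall time.
    start = time.monotonic()
    procs = [mp.Process(target=busy_work, args=(work_s,)) for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.monotonic() - start

if __name__ == "__main__":
    # With far more runnable processes than cores, wall time degrades:
    # each process (read: canister) sees only a small slice of the CPU.
    # Push n_procs toward 1000 to approximate a heavily loaded subnet.
    elapsed = run_contention_demo(n_procs=8, work_s=0.5)
    print(f"8 procs x 0.5s of work took {elapsed:.1f}s wall time")
```

On a machine with, say, 8 cores, 1000 such processes each wanting a full core's worth of work would take on the order of 125x longer than one process alone, regardless of how clever the scheduler is.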

As things stand, 100k canisters on a subnet can only work for a project like OpenChat, which has its own subnets and can thus more or less control the load. On the average subnet, where there is zero control over how load is spread among tens of thousands of canisters belonging to thousands of controllers, the only way to deal with spikes (or persistent load, such as the Bob canisters) is to be able to shift load across subnets more or less instantly. There is nothing more to it than that (aside from smaller or larger optimizations here and there, none of which would actually solve the problem of thousands of canisters suddenly deciding they all want sub-2-second latency at the same time).
