Considering that the high-traffic incident was mostly caused by query calls, read replicas within a single subnet should have helped significantly (boundary node issues aside). And adding read replicas to an existing subnet ad hoc strikes me as a much simpler engineering problem than splitting up existing subnets.
Subnet load balancing might make sense when a subnet is storing too much data in memory, or perhaps for update calls. But query calls, I think, can be handled in this simpler way.
Most of the load was indeed from query calls. But much of it was due to a less-than-efficient orthogonal persistence implementation, which makes 3 syscalls for every 4 KB heap page that is written to. This also affected query calls: on the one hand, the heap image has to be reconstructed from a checkpoint plus overlaid deltas; on the other, the inter-canister query prototype we’re working on requires orthogonal persistence support for queries too, since they get suspended while awaiting responses and then need to be resumed. Because of that (and because of earlier issues with the pjljw subnet), both updates and queries were limited to a single core each, so query throughput was very significantly limited.
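For a rough sense of why 3 syscalls per dirty 4 KB page hurts, here is a back-of-envelope sketch. The per-syscall cost below is an assumed illustrative figure, not a measured IC number:

```python
# Back-of-envelope: syscall overhead of an orthogonal persistence scheme
# that issues 3 syscalls per dirty 4 KiB heap page (as described above).
PAGE_SIZE = 4 * 1024       # bytes per heap page
SYSCALLS_PER_PAGE = 3      # from the description above
SYSCALL_COST_US = 1.0      # ASSUMED ~1 microsecond per syscall (illustrative)

def persistence_overhead_ms(bytes_written: int) -> float:
    """Rough syscall overhead (ms) for dirtying `bytes_written` of heap."""
    pages = -(-bytes_written // PAGE_SIZE)  # ceiling division
    return pages * SYSCALLS_PER_PAGE * SYSCALL_COST_US / 1000.0

# e.g. execution that dirties 100 MiB of heap pages in a round:
overhead_ms = persistence_overhead_ms(100 * 1024 * 1024)
```

Even with an optimistic 1 µs per syscall, dirtying 100 MiB costs tens of milliseconds of pure syscall overhead per round, which is why cutting the per-page syscall count matters so much for throughput.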
We have been working on optimizing the orthogonal persistence implementation, particularly for the multi-threaded execution case, which is now over 10x faster in our testing. This should allow us to revert the single-core limitation on pjljw and, more generally, to support significantly higher query throughput across all subnets.
It would likely be a somewhat simpler problem to solve, but it would still require significant effort, unless we’re talking about the simplest possible implementation: adding replicas to a subnet that execute all updates and handle some of the query load, but don’t participate in consensus. Even then, a full state sync (synchronizing the full state of the subnet) is likely to take on the order of 1 hour (if another replica is not already present in the same data center), and executing updates is also likely to take up significant resources that the replica could otherwise devote entirely to executing queries.
At that point, you may as well add another regular replica that takes part in consensus and scale query capacity that way, at the cost of a tiny increase in consensus latency. That is something that can be done today (again, after a 1-hour state sync, which would not have been particularly useful in the ICPunks case).
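To illustrate where an "on the order of 1 hour" state sync could come from, here is a back-of-envelope calculation. Both numbers are assumed for illustration, not actual subnet figures:

```python
# Rough state-sync duration: pulling a full subnet state over the network.
STATE_SIZE_GB = 300       # ASSUMED subnet state size (illustrative)
BANDWIDTH_GBIT_S = 0.8    # ASSUMED sustained cross-DC throughput (illustrative)

def state_sync_hours(state_gb: float, gbit_per_s: float) -> float:
    """Hours to transfer `state_gb` gigabytes at `gbit_per_s` gigabits/s."""
    seconds = state_gb * 8 / gbit_per_s
    return seconds / 3600

sync_hours = state_sync_hours(STATE_SIZE_GB, BANDWIDTH_GBIT_S)
```

With a few hundred GB of state and sub-gigabit sustained throughput between data centers, the transfer alone lands in the ~1 hour range, before any verification or catch-up work.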
A much more elegant solution would be the ability to state sync single canisters (as opposed to the whole subnet state) and then either keep applying deltas every round, or use a single core to execute the same updates as the subnet while using all other cores to process queries. Neither of these is trivial: we can’t currently state sync a single canister; applying deltas every round may be expensive (they could be hundreds of MB that already-overloaded replicas would have to send out); and as for executing updates, some of those updates might come from canisters on the same subnet, possibly produced within the same round.
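To give a sense of why streaming per-round deltas to read replicas is expensive for the consensus replicas, here is a rough sketch; the delta size, round time, and replica count are all assumed illustrative figures:

```python
# Rough extra egress imposed on consensus replicas by streaming per-round
# state deltas to read replicas (all figures ASSUMED, for illustration).
DELTA_MB_PER_ROUND = 200  # ASSUMED per-round state delta ("hundreds of MB")
ROUND_TIME_S = 1.0        # ASSUMED ~1 s block time
READ_REPLICAS = 4         # ASSUMED number of read replicas to feed

def extra_egress_gbit_s(delta_mb: float, round_s: float, replicas: int) -> float:
    """Extra outbound bandwidth (Gbit/s) to stream deltas to all read replicas."""
    return delta_mb * 8 * replicas / round_s / 1000

egress = extra_egress_gbit_s(DELTA_MB_PER_ROUND, ROUND_TIME_S, READ_REPLICAS)
```

Even a handful of read replicas would add multiple Gbit/s of sustained egress on replicas that are, by assumption, already overloaded; that is the delta-shipping cost referred to above.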
I guess all I’m trying to say is that it’s far from a simple problem to implement efficiently, and it would only solve query load problems. Canister migration, on the other hand, might be more generally useful (load balancing, grouping related canisters on the same subnet, or spreading them across subnets to better deal with load, etc.).
How does the IC currently decide what subnet a newly created canister gets deployed to? Was it pure coincidence that Entrepot, ICPunks, and Dfinance canisters were all sharing the same subnet?
Also, I’m curious how accurate this technical overview from Sept 2020 is in describing the current state of the IC mainnet:
Here are a few interesting features that Medium post describes:

- The NNS can split subnets to distribute load without interrupting canisters. Wouldn’t this have solved this issue?
- Canisters running in “fiduciary” subnets can hold ICP and have a replication factor of 28. AFAIU, the recent NNS proposal makes a “data” subnet fiduciary-like but doesn’t actually add a new fiduciary subnet type.
- Unlike updates, query calls run concurrently in different threads. This seems at odds with @free’s comment about updates and queries currently being limited to a single core (unless multiple threads are running on a single core?).
A canister gets deployed onto the same subnet as the (wallet or other) canister that created it. Wallet canisters get created on random subnets, AFAIK.
Neither of these features is actually implemented yet (although the technical overview makes it sound like they are). The latter is being voted on (in the thread you posted) and is already in the works; the former is still under discussion, both on this thread and internally (we’re still at the stage of trying to come up with the most efficient and safe way to implement it, along with the roll-out stages, if necessary).
Queries are independent and can run in parallel on multiple threads. But if you re-read my comment above, somewhere in the middle of that wall of text I mention that the pjljw subnet was limited to running a single execution thread and a single query thread because of performance issues we encountered with our orthogonal persistence implementation in the presence of multithreading. We (and that doesn’t include me) have been working on improving this over the past couple of weeks and will deploy those changes over the coming week (including reverting the single-thread limit on pjljw).
Thanks for the detailed response. I’ll just let my comments stand and assume my ideas here have been understood and taken into consideration during this design process. It’s definitely not simple, and there seem to be some nuanced tradeoffs.