Most of the load was indeed from query calls. But much of it was due to a less-than-efficient orthogonal persistence implementation (one that does 3 syscalls for every 4 KB heap page that is written to), which also affected query calls: on the one hand because the heap image has to be reconstructed from a checkpoint plus overlaid deltas; on the other hand because the inter-canister query prototype we're working on requires orthogonal persistence support for queries too, as they get suspended while awaiting responses and then need to be resumed. Because of that (and because of earlier issues with the pjljw subnet), both updates and queries were limited to a single core each, so query throughput was very significantly limited.
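For anyone unfamiliar with the term, here is a minimal sketch (not the actual replica code; `HeapImage` and `page` are made-up names) of what "checkpoint plus overlaid deltas" means in practice: reading a page has to search the per-round deltas before falling back to the base checkpoint, which is extra work that queries pay as well.

```rust
use std::collections::BTreeMap;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];

/// Hypothetical heap image: a base checkpoint plus one map of dirty
/// pages for each round executed since that checkpoint.
struct HeapImage {
    checkpoint: Vec<Page>,
    deltas: Vec<BTreeMap<usize, Page>>,
}

impl HeapImage {
    /// Reconstruct a page: scan the deltas newest-first and fall back to
    /// the checkpoint if no round has dirtied the page.
    fn page(&self, idx: usize) -> &Page {
        self.deltas
            .iter()
            .rev()
            .find_map(|round| round.get(&idx))
            .unwrap_or(&self.checkpoint[idx])
    }
}

fn main() {
    let mut image = HeapImage {
        checkpoint: vec![[0u8; PAGE_SIZE]; 4],
        deltas: vec![BTreeMap::new()],
    };
    // A write in some round dirties page 2.
    image.deltas[0].insert(2, [1u8; PAGE_SIZE]);
    assert_eq!(image.page(2)[0], 1); // served from the overlaid delta
    assert_eq!(image.page(0)[0], 0); // falls back to the checkpoint
}
```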
We have been working on optimizing the orthogonal persistence implementation, particularly for multi-threaded execution; in our testing it is now over 10x faster. This should allow us to lift the single-core limitation on pjljw
and, more generally, to support significantly higher query throughput across all subnets.
It would likely be a somewhat simpler problem to solve, but it would still require significant effort, unless we're talking about the simplest possible implementation: adding replicas to a subnet that execute all updates and handle some of the query load, but don't participate in consensus. In that case, a full state sync (synchronizing the full state of the subnet) is likely to take on the order of 1 hour (if another replica is not already present in the same data center), and executing updates is also likely to take up significant resources that the replica could otherwise use entirely for executing queries.
At that point, you may as well add another regular replica that takes part in consensus and scale query capacity that way, at the cost of a very tiny increase in consensus latency. That is something that can be done today (again, after a 1-hour state sync, which would not have been particularly useful in the ICPunks case).
A much more elegant solution would be the ability to state sync single canisters (as opposed to the whole subnet state) and then either keep applying deltas every round, or else use a single core to execute the same updates as the subnet while using all other cores to process queries. Neither of which is trivial: we can't currently state sync a single canister; applying deltas every round may be expensive (they could be hundreds of MB that already overloaded replicas would have to send out); and as for executing updates, some of them might come from canisters on the same subnet, even produced within the same round.
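To make those two options a bit more concrete, here is a rough sketch (all names hypothetical, not replica code) of what a query-serving replica that has synced a single canister might do at the end of each round under either scheme:

```rust
/// One round's worth of dirty pages for the canister, as the subnet would
/// have to ship them (option 1; potentially hundreds of MB).
struct PageDelta {
    dirty_pages: Vec<(usize, Vec<u8>)>,
}

/// The update messages executed against the canister in that round
/// (option 2; includes same-subnet, even same-round, inter-canister calls
/// that would have to be reproduced locally).
struct RoundUpdates {
    messages: Vec<Vec<u8>>, // opaque payloads for the sketch
}

enum FollowMode {
    /// Apply the subnet's page deltas every round: cheap on CPU, heavy on
    /// bandwidth for the already overloaded replicas sending them out.
    ApplyDeltas,
    /// Re-execute the same updates on one core, keeping the other cores
    /// free for queries: cheap on bandwidth, but the replica must see the
    /// full update stream, including intra-subnet calls.
    ReplayUpdates,
}

fn advance_round(mode: FollowMode, delta: PageDelta, updates: RoundUpdates) {
    match mode {
        FollowMode::ApplyDeltas => {
            for (page, bytes) in delta.dirty_pages {
                println!("overwriting page {page} with {} bytes", bytes.len());
            }
        }
        FollowMode::ReplayUpdates => {
            for msg in updates.messages {
                println!("re-executing an update of {} bytes", msg.len());
            }
        }
    }
}

fn main() {
    let delta = PageDelta { dirty_pages: vec![(7, vec![0u8; 4096])] };
    let updates = RoundUpdates { messages: vec![vec![1, 2, 3]] };
    advance_round(FollowMode::ApplyDeltas, delta, updates);
}
```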
I guess all I'm trying to say is that it's far from a simple problem to implement efficiently, and it would only solve query load problems. Whereas canister migration might be more generally useful (load balancing, grouping related canisters on the same subnet or spreading them across subnets to better deal with load, etc.).