[This is a follow-up to a five-minute presentation given yesterday during the (internal) DFINITY Global R&D meeting, regarding the first, MVP version of subnet splitting that we’ve started working on. There were a number of follow-up questions during the presentation and a suggestion was made to move the discussion to the forum (cc @lastmjs, @ggreif). Please refer to Manu’s subnet splitting one-pager above as a baseline.]
The effort was kickstarted by a “full subnet splitting” design that defines a process driven entirely by a series of NNS proposals – doubling the size of the subnet; halting it; updating the routing table; etc. – with some synchronization points in between (e.g. starting the two subnets must wait for the updated routing table to propagate to all subnets).
The MVP that we are building first provides for a slower, more manual process: one still entirely controlled by more or less the same set of NNS proposals and, crucially, still allowing for end-to-end verification. But it requires e.g. some of the verification to be performed by neuron holders before voting, rather than automated by canister or protocol logic.
Why build an MVP
The full subnet splitting design calls for significant, coordinated changes across 3 out of 4 IC protocol layers (Consensus, Message Routing and Execution) before we have anything that works.
The MVP, on the other hand, is mostly limited to Message Routing, with a bit of Consensus code thrown in. But:
- it is a lot less effort, concentrated in one protocol layer;
- it is an incremental step towards full subnet splitting (meaning that virtually all the code and process needed for the MVP can be reused as is by the full feature);
- and importantly, it gives us a much needed tool for use in case of an emergency.
So far the State Manager team has been able to optimize aspects of the implementation and protocol to allow for repeated subnet size increases. But this approach can only get us so far, and a tested, documented scaling solution to fall back on may come in handy.
How will the MVP work
This is a grossly simplified summary, as “simply” ensuring message delivery guarantees throughout the process, and providing the mechanisms and tooling for straightforward verification, involves a lot of logic and process.
But, at a very high level:
- We make a proposal to halt the subnet at a CUP height. The NNS votes it through.
- When the subnet has halted (and certified its final state), we download the state and peel off some of the canisters into a brand new subnet state.
- Finally, we perform subnet recoveries on both the original subnet and the new subnet, with enough information in the respective proposals (certifying the genesis states of the two subnets) to allow anyone to verify these genesis states against the final state of the original subnet (already certified by the subnet before halting).
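To make the verification step above concrete, here is a minimal sketch in Rust. All of the types and function names are hypothetical (the real certified state is a hash tree, not a flat map of hashes), but it captures what a verifier would check before voting: that the two genesis states exactly partition the parent subnet’s final state, with no canister lost, duplicated or modified.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a subnet's certified state:
/// canister ID -> canister state hash. The real state is a hash tree.
type SubnetState = BTreeMap<u64, &'static str>;

/// Peel off the canisters whose IDs fall in `range` into a new subnet
/// state, leaving the remaining canisters on the original subnet.
fn split(
    parent: &SubnetState,
    range: std::ops::RangeInclusive<u64>,
) -> (SubnetState, SubnetState) {
    let mut retained = SubnetState::new();
    let mut peeled = SubnetState::new();
    for (&id, &state) in parent {
        if range.contains(&id) {
            peeled.insert(id, state);
        } else {
            retained.insert(id, state);
        }
    }
    (retained, peeled)
}

/// The check a verifier would perform before voting: the two genesis
/// states are disjoint and together reconstruct the parent's final state.
fn verify(parent: &SubnetState, a: &SubnetState, b: &SubnetState) -> bool {
    if a.keys().any(|k| b.contains_key(k)) {
        return false; // a canister ended up on both subnets
    }
    let mut merged = a.clone();
    merged.extend(b.iter().map(|(&k, &v)| (k, v)));
    &merged == parent // nothing lost, nothing modified
}

fn main() {
    let parent: SubnetState =
        [(1, "h1"), (2, "h2"), (3, "h3"), (4, "h4")].into();
    let (retained, peeled) = split(&parent, 3..=4);
    assert!(verify(&parent, &retained, &peeled));
    println!("{} retained, {} peeled", retained.len(), peeled.len());
}
```

In the actual MVP this comparison is done against the state hash certified by the original subnet before halting, so the check is anchored in something the subnet itself signed off on.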
The core differences from what Manu described above are:
- No doubling of the number of subnet replicas before halting.
- Manual verification (supported by tooling) of the two genesis states, instead of the protocol handling it automatically.