Long Term R&D: Subnet splitting (proposal)

diegop · December 7, 2021, 3:48am

1. Summary

This is a motion proposal for the long-term R&D of the DFINITY Foundation, as part of the follow-up to this post: Motion Proposals on Long Term R&D Plans (Please read this post for context).

This project’s objective

In the near future, subnets under heavy load can be split into two subnets via multiple NNS proposals to balance the load. This project is about upgrading the mechanisms for subnet splitting, for example splitting via a single proposal, or taking into consideration which canisters together form a single dapp and should therefore remain on the same subnet.

2. Discussion lead

Manu Drijvers

3. How this R&D proposal is different from previous types

Previous motion proposals have revolved around specific features and tended to have clear, finite goals that are delivered and completed. They tended to be measured in days, weeks, or months.

These motion proposals are different and are defining the long-term plan that the foundation will use, e.g., for hiring and organizational build-out. They have the following traits and patterns:

Their scope is years, not weeks or months as in previous NNS motions
They have a broad direction but are active areas of R&D so they do not have an obvious line of execution.
They involve deep research in cryptography, networking, distributed systems, language, virtual machines, operating systems.
They are meant to match the strengths of where the DFINITY foundation’s expertise is best suited.
Work on these proposals will not start immediately.
There will be many follow-up discussions and proposals on each topic when work is underway and smaller milestones and tasks get defined.

An example may be the R&D for “Scalability” where there will be a team investigating and improving the scalability of the IC at various stages. Different bottlenecks will surface and different goals will be met.

3. How this R&D proposal is similar to what we have seen

We want to double down on the behaviors we think have worked well. These include:

Publicly identifying owners of subject areas to engage and discuss their thinking with the community
Providing periodic updates to the community as things evolve, milestones are reached, proposals are needed, etc.
Presenting more and more R&D thinking early and openly.

This has worked well for the last 6 months so we want to repeat this pattern.

4. Next Steps

Developer forum intro posted
1-pager from the discussion lead posted
NNS Motion proposal submitted

5. What we are asking the community

Ask questions
Read 1-pager
Give feedback
Vote on the motion proposal

Frankly, we do not expect many nitty-gritty details because these are meant to address projects that go on for long time horizons.

The DFINITY foundation’s only goal is to improve the adoption of the IC so we want to sanity-check the projects we see necessary for growing the IC by having you (the ICP community) tell us what you all think of these active R&D threads we have.

6. What this means for the existing Roadmap or Projects

In terms of the current roadmap and proposals executed, those are still being worked on and have priority.

An intellectually honest way to look at these long-term R&D projects is to see them as the upstream or “primordial soup” from which more baked projects emerge. With this lens, these proposals are akin to asking, “what kind of specialties or strengths do we want to make sure DFINITY foundation has built up?”

Most (if not all) projects that the DFINITY foundation has executed or is executing are born from long-running R&D threads. Even when community feedback tells the foundation, “we need X” or “Y does not work”, it is typically the team with the most relevant R&D area that picks up the short-term feature or project.

diegop · December 7, 2021, 4:47am

Please note:

Some folks have asked if they should vote to “reject” any of the Long Term R&D projects as a way to signal prioritization. The answer is simple: “No, please, ACCEPT”

These long-term R&D projects are the DFINITY foundation’s thesis on the R&D threads it should have across years (3 years is the number we sometimes use internally). We are asking the community to ACCEPT (pending 1-pager and more community feedback of course). Prioritization can come at a separate step.

Manu · December 8, 2021, 1:02pm

Hi all! I’m Manu, I’m the eng manager of the consensus team at the DFINITY foundation. I will soon post an outline for the plan of “subnet splitting”. Any questions, comments, or suggestions to improve the plan are very welcome! I look forward to the discussion.

Manu · December 8, 2021, 1:05pm

Subnet splitting (one pager)

Objective

The internet computer is designed to have unbounded capacity by scaling out to different subnet blockchains. Each subnet however has a limited capacity:

There is a bound on how large the replicated state (combined state of all canister smart contracts on subnet) can grow
There is a limited number of canister smart contracts that can be installed on the subnet
Every subnet has one blockchain, and blocks are of bounded size, so the bandwidth of accepting updates is bounded
Every replica of a subnet should process all update calls that reach a subnet, so it has bounded processing capacity.

The load on the different subnets varies wildly. For instance, subnet jtdsg has a replicated state size of 170 GB, while most other subnets hold less than 10 GB. As it stands today, there is no convenient way to balance load between subnets.

We propose to address this issue by introducing a new NNS proposal which would “split” a subnet into two subnets. The replicas are divided into two groups, each of which will become a separate subnet. The canisters are distributed over the two subnets in a similar fashion, such that each of the two “child” subnets only carries about half of the load that the parent subnet had before the split. Since all replicas already have the state of all canisters that were on the subnet, no slow transfer of state is required and the subnet downtime due to splitting should be minimal.

Both child subnets are half as large as the original subnet, which means they could be less decentralized / secure. To address this, it seems prudent to only split large subnets. A small subnet could be split by first adding new replicas to it.
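
To make the idea of distributing canisters "such that each child has about half of the load" concrete, here is a minimal sketch. It assumes a hypothetical per-canister load figure (for example state size) listed in canister ID order, and it assumes that each child keeps one contiguous ID range. Neither the function name nor the numbers come from the proposal.

```python
from itertools import accumulate

def balanced_split_index(loads):
    """Return i such that loads[:i] and loads[i:] are as balanced as possible.

    `loads` is a hypothetical per-canister load measure, ordered by canister ID,
    so that each child subnet keeps one contiguous range of canister IDs.
    """
    if len(loads) < 2:
        raise ValueError("need at least two canisters to split")
    total = sum(loads)
    prefix = list(accumulate(loads))  # prefix[i - 1] == sum(loads[:i])
    # Try every cut point that leaves both children non-empty and keep the one
    # with the smallest load difference between the two children.
    return min(range(1, len(loads)), key=lambda i: abs(total - 2 * prefix[i - 1]))

loads = [10, 30, 5, 40, 15, 50, 20]  # made-up numbers for illustration
cut = balanced_split_index(loads)
child_a, child_b = loads[:cut], loads[cut:]
```

With these made-up numbers the cut falls after the fourth canister and both children end up with the same total. With one very heavy canister, no contiguous cut can balance the load well, which is one reason a real split would need more careful planning.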

Why this is important

If there is too much load on a single subnet, the canister smart contracts on that subnet will suffer from degraded performance. This has already happened: subnet pjljw has been under significant computational load multiple times, which led to all canisters on that subnet experiencing higher latencies.

Outline of proposed technical solution

To reduce the implementation effort, the following design relies strongly on existing mechanisms such as replica upgrades. The overall design is a trade-off between canister downtime and ease of implementation; possible improvements that can be taken up in a later stage are mentioned in the text.

The following steps describe the procedure to split a parent subnet A into two equal-size child subnets A and B. Even though one of the child subnets inherits the identifier of the parent subnet, both subnets will operate under new threshold keys after the split.

Expand the parent subnet: Nodes are added to the parent subnet A using existing NNS proposals until it reaches at least twice the regular size for this type of subnet. The exact number of nodes will be determined such that a >⅔ threshold-signed certification in the expanded parent subnet guarantees that at least one honest node in each of the child subnets has a full copy of the state.
One or more NNS proposals will add new nodes to subnet A. Once accepted, the new nodes will fetch the full state from existing nodes as usual.
NNS proposal to split subnet: A new NNS proposal type will be created, where the proposal describes which nodes and which canisters are moved to which of the two child subnets after the split. Once accepted, the registry will mark the subnet with a new “splitting” flag that temporarily prevents concurrent changes to the subnet (e.g., moving nodes or canisters in and out of the subnet) until after the split.
DKGs for child subnets: The NNS subnet performs DKGs to generate fresh threshold keys for both child subnets A and B, assigning key shares to the nodes assigned to A and B.
Parent subnet stops and creates final CUP: Parent subnet A stops processing update calls and creates a final catch-up package (CUP), as is currently done before a replica upgrade. Note that the parent subnet continues to process query calls, so that canisters are essentially running in “read-only mode”.
NNS obtains the final CUP of the parent subnet. After the splitting proposal is executed, the NNS will look to obtain the final CUP of the parent subnet. Once it obtains this CUP, the NNS will execute the split and update the registry to replace the parent subnet with the two child subnets. For each of the child subnets, the NNS constructs a “genesis” CUP in the registry, instructing the subnet from which state to start. This genesis CUP contains the state from the final CUP from the parent subnet, meaning that the child subnet continues from the final pre-split state, and it contains the newly generated threshold key material. Additionally, the “routing table” (that maps canister ids to subnets) is updated to split the canisters over the two child subnets.
There are multiple approaches on how the NNS can obtain the final CUP of the parent subnet. Ideally, it securely fetches it itself from the parent subnet, but a simpler intermediate solution might be to deliver this CUP via a second NNS proposal that voters can verify.
Child subnets A and B restart: All replicas in the child subnets restart, without erasing their execution state (as they would during an upgrade), from the Genesis CUPs found in the registry. All replicas purge the state information of canisters that are not assigned to their subnet.
Special care needs to be taken of in-transit cross-subnet messages and responses from and to canisters that moved to child subnet B. Messages that were in the outgoing streams of these canisters at the moment the parent subnet was stopped continue to be offered by child subnet A. Messages and responses on incoming streams, however, will be met with a new REJECT signal, until the sending subnet updates its routing tables to child subnet B. To ensure ordering guarantees, subnet B initially runs canisters in “starting state”, meaning that all open call contexts for a canister have to be closed before it can accept new calls. Once all call contexts of a canister are closed, the canisters can transition into the running state and continue processing new calls as usual.
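
The REJECT behaviour for incoming messages described in the last step can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Subnet` class, the `deliver` function and the per-canister routing dictionary are hypothetical, and streams, ordering guarantees and call contexts are not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Subnet:
    name: str
    hosted: set    # canister ids currently hosted by this subnet
    routing: dict  # this subnet's (possibly stale) view: canister id -> subnet name
    inbox: list = field(default_factory=list)

def deliver(network, sender, canister, payload):
    """Route a message using the sender's routing view; return ACCEPT or REJECT."""
    target = network[sender.routing[canister]]
    if canister not in target.hosted:
        # The canister moved to the other child subnet and the sender's
        # routing table is stale, so the receiving subnet rejects the message.
        return "REJECT"
    target.inbox.append((canister, payload))
    return "ACCEPT"

# State right after the split: canister c2 moved from A to B, but subnet X
# still believes that c2 lives on A.
network = {
    "A": Subnet("A", hosted={"c1"}, routing={}),
    "B": Subnet("B", hosted={"c2"}, routing={}),
    "X": Subnet("X", hosted={"c9"}, routing={"c1": "A", "c2": "A"}),
}
x = network["X"]
first = deliver(network, x, "c2", "hello")   # stale routing table
x.routing["c2"] = "B"                        # X learns the new routing table
second = deliver(network, x, "c2", "hello")  # retried after the update
```

In this model the first attempt is rejected by subnet A and the retry is accepted by subnet B once the sender has picked up the new routing table.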

Discussion leads

The motion proposal is driven by @derlerd-dfinity1, @gregory, and @Manu; other team members will also be available for discussion.

Skills and Expertise necessary to accomplish this

Being able to split subnets is clearly a broad R&D effort. Despite the high-level design presented above, we expect that many open questions will need to be answered. Answering these questions will require the involvement of many teams all across the IC stack. In addition, it requires broad input from the community to ensure that we end up with a solution that is usable for both canister developers and end users and that meets the expectations of the community.

What are we asking the community

Review comments, ask questions, give feedback
Vote accept or reject on NNS Motion
Participate in technical discussions as the motion moves forward

diegop · December 20, 2021, 7:19pm

NNS Motion is live: Internet Computer Network Status

JaMarco · September 12, 2022, 8:19am

What would be the mechanism that satisfies this requirement? I think this is crucial to have at some point.

Zane · September 12, 2022, 5:59pm

Dfinity was working on that mechanism a few months ago, most likely some cfg where you define canisters you always want to be on the same subnet:

infu · September 12, 2022, 10:40pm

My situation:
Anvil’s canisters are in two ranges currently on mupz.
When you convert the Principal to Nat:
from 17828611 to 17830659 total 2048
from 17830671 to 17836454 total 5783
Registered long ago, before I knew I could register them on different subnets. Most of them are unused but part of the auto-scaling.
I am planning to register ranges in other subnets for auto-scaling and not use the registered canisters for now.
Also I am not sure I want the whole service to be in one subnet. As far as I understand being on more subnets will increase the throughput.

Since 98% of these canisters are empty → Perhaps moving them to other subnets will be a lot easier? If I could assign empty canisters to different subnets, that would solve my problem and give me the most flexibility.
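
For readers who want to reproduce a Principal to Nat conversion like the one mentioned above, here is a minimal sketch. It assumes the textual principal format (dash-grouped Base32 of a 4-byte CRC32 followed by the raw bytes) and assumes that a canister ID starts with a big-endian 64-bit integer. Whether this is the exact conversion used for the numbers in this post is an assumption, and the function name is hypothetical.

```python
import base64

def canister_index(principal_text):
    """Decode a textual canister principal into its leading 64-bit integer."""
    compact = principal_text.replace("-", "").upper()
    padded = compact + "=" * (-len(compact) % 8)  # b32decode requires padding
    raw = base64.b32decode(padded)
    body = raw[4:]  # drop the 4-byte CRC32 prefix (not verified in this sketch)
    return int.from_bytes(body[:8], "big")

# Two well-known NNS canister IDs, used here only as sample inputs.
governance = canister_index("rrkah-fqaaa-aaaaa-aaaaq-cai")
ledger = canister_index("ryjl3-tyaaa-aaaaa-aaaba-cai")
```

Under these assumptions, consecutive canister IDs on a subnet map to consecutive integers, which is what makes it possible to talk about "ranges" of canisters.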

free · April 6, 2023, 9:17am

[This is a follow-up to a 5 minute presentation made yesterday during the (internal) DFINITY Global R&D regarding the first, MVP version of subnet splitting that we’ve started working on. There were a bunch of follow-up questions during the presentation and a suggestion was made to move the discussion to the forum (cc @lastmjs, @ggreif). Please refer to Manu’s subnet splitting one-pager above as a baseline.]

The effort was kickstarted by a “full subnet splitting” design that defines a process driven entirely by a series of NNS proposals – doubling the size of the subnet; halting it; updating the routing table; etc. – with some synchronization points in-between (e.g. starting the two subnets must wait for the updated routing table to propagate to all subnets).

The MVP that we are building first provides for a slower, more manual process. It is still entirely controlled by more or less the same set of NNS proposals and, crucially, still allows for end-to-end verification. However, some of the verification has to be performed manually, e.g. by neuron holders before voting, rather than being automated by canister or protocol logic.

Why build an MVP

The full subnet splitting design calls for significant, coordinated changes across 3 out of 4 IC protocol layers (Consensus, Message Routing and Execution) before we have anything that works.

The MVP on the other hand, is mostly limited to Message Routing, with a bit of Consensus code thrown in. But:

it is a lot less effort, concentrated in one protocol layer;
it is an incremental step towards full subnet splitting (meaning that virtually all the code and process needed for the MVP can be reused as is by the full feature);
and importantly, it gives us a much needed tool for use in case of an emergency.

So far the State Manager team has been able to optimize aspects of the implementation and protocol, to allow for repeated subnet size increases. But this approach can only get us so far. And a tested, documented scaling solution to fall back on may come in handy.

How will the MVP work

This is a grossly simplified summary: “simply” ensuring message delivery guarantees throughout the process, and providing the mechanisms and tooling for straightforward verification, involves a lot of logic and process.

But, at a very high level:

We make a proposal to halt the subnet at a CUP height. The NNS votes it through.
When the subnet has halted (and certified its final state) we download the state and peel off some of the canisters into a brand new subnet state.
Finally, we perform subnet recoveries on both the original subnet and the new subnet, with enough information in the respective proposals (to certify the genesis states of the two subnets) to allow anyone to verify these genesis states against the final state of the original subnet (already certified by the subnet before halting).
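
The kind of check that a voter could run in the last step can be sketched as follows. This is a heavily simplified illustration, not the actual tooling: it assumes that states are plain `{canister_id: bytes}` dictionaries and that the certified value is a single SHA-256 hash, whereas the real certified state is a much richer structure. All names are hypothetical.

```python
import hashlib

def state_hash(canister_states):
    """Hash a {canister_id: state_bytes} mapping in canonical (sorted) order."""
    h = hashlib.sha256()
    for cid in sorted(canister_states):
        h.update(hashlib.sha256(cid.encode()).digest())
        h.update(hashlib.sha256(canister_states[cid]).digest())
    return h.hexdigest()

def verify_split(parent, child_a, child_b, certified_parent_hash):
    """Check that two genesis states exactly partition the certified parent state."""
    if state_hash(parent) != certified_parent_hash:
        return False  # the downloaded parent state is not the certified one
    if set(child_a) & set(child_b):
        return False  # a canister must not end up on both subnets
    # Every canister must appear exactly once, with unchanged state.
    return {**child_a, **child_b} == parent

parent = {"c1": b"state-1", "c2": b"state-2", "c3": b"state-3"}
certified = state_hash(parent)  # stands in for the hash certified before halting
ok = verify_split(parent, {"c1": b"state-1"},
                  {"c2": b"state-2", "c3": b"state-3"}, certified)
tampered = verify_split(parent, {"c1": b"changed"},
                        {"c2": b"state-2", "c3": b"state-3"}, certified)
dropped = verify_split(parent, {"c1": b"state-1"}, {"c2": b"state-2"}, certified)
```

The point of the sketch is that the check only needs the certified hash of the final parent state plus the two proposed genesis states, so anyone can repeat it independently before voting.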

The core differences to what Manu described above are:

No doubling of the number of subnet replicas before halting.
Manual verification (supported by tooling) of the two genesis states, instead of the protocol handling it automatically.

benji · April 6, 2023, 1:15pm

My main concern is cross-subnet behaviour change. Can you clarify what “peel off some canisters” entails?

Same-subnet/cross-subnet calls

If I have multiple canisters relying on the fact that they’re on the same subnet, how is that preserved?
If the NNS proposes to split up my canisters, what recourse do I have?
If my canister relies on it being on the same subnet as other canisters, how do I specify that?
If my canister relies on it being on the same subnet as another factory canister plus all canisters it spawns, how do I specify that?
If my canister relies on it being on the same subnet as all canisters on the same subnet which I’m a controller of, how do I specify that?

Atomicity

How does this interplay with atomic canister groups, when/if we will have that?

Prioritization

Why is this feature prioritized over atomic canister groups?

free · April 6, 2023, 3:02pm

I guess your question 2 is the answer to your question 1: canister groups (or something equivalent) will eventually be used to ensure that (to the extent possible) canisters belonging to the same dapp that want to stick together will stick together after a subnet split.

As for prioritization, canister groups have not been needed, because there is currently no way of migrating canisters across subnets. So your canisters will stay put wherever they were created. It only becomes a necessary feature once canisters can move around. Whereas subnet splitting is something that we might need tomorrow, for all we know.

Going deeper down the rabbit hole, there’s also the issue of routing table fragmentation: currently the routing table has one entry per subnet (plus one entry for the II canister, which was migrated last year). The routing table is basically a map of the form:

{
    canister_0..canister_9999: subnet_0,
    canister_10000..canister_19999: subnet_1,
    canister_20000..canister_29999: subnet_2,
    ...
}
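
A range-based map like this can be looked up with a binary search over the range starts. The following is a minimal sketch that reuses the illustrative numbers from the map above; the tuple layout and the function name are assumptions, not the replica's actual data structure.

```python
from bisect import bisect_right

# (first_id, last_id, subnet) entries sorted by first_id, mirroring the map above.
ROUTING_TABLE = [
    (0, 9_999, "subnet_0"),
    (10_000, 19_999, "subnet_1"),
    (20_000, 29_999, "subnet_2"),
]
STARTS = [first for first, _, _ in ROUTING_TABLE]

def lookup(canister_id):
    """Return the subnet hosting canister_id, or None if no range covers it."""
    i = bisect_right(STARTS, canister_id) - 1  # last range starting at or before the id
    if i < 0:
        return None
    first, last, subnet = ROUTING_TABLE[i]
    return subnet if first <= canister_id <= last else None
```

The cost of a lookup grows with the logarithm of the number of entries, but the table itself has to be stored and propagated to every subnet, so the number of entries still matters.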

If you now imagine a subnet with 50K canisters split among two dapps (or one large dapp and a whole lot of tiny ones), with every second canister created by large dapp D, then keeping all canisters for dapp D on one subnet while splitting all other canisters off onto another subnet would mean adding 50K entries to the routing table, which currently has around 50 entries. This is why the current proposal for canister groups is to have a canister group occupy a contiguous (say 100 canister ID) range, so as to keep fragmentation under control.

For the same reason (along with the fact that the subnet splitting MVP is only intended as an emergency tool), any subnet split performed via this process will likely not take into account heuristics such as keeping all canisters with the same controller together, on the same subnet.
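
The routing table fragmentation described above can be reproduced with a few lines. This sketch assumes that the routing table stores one entry per maximal contiguous ID range assigned to the same subnet; the helper name and the subnet labels are hypothetical.

```python
def to_ranges(assignment):
    """Compress [(canister_id, subnet), ...] sorted by id into contiguous ranges."""
    ranges = []
    for cid, subnet in assignment:
        if ranges and ranges[-1][2] == subnet and ranges[-1][1] == cid - 1:
            ranges[-1][1] = cid  # extend the current range by one canister
        else:
            ranges.append([cid, cid, subnet])  # start a new routing table entry
    return ranges

N = 50_000
# Every second canister belongs to dapp D and stays on subnet A.
interleaved = [(i, "A" if i % 2 == 0 else "B") for i in range(N)]
# The subnet is cut in the middle of its canister ID range.
contiguous = [(i, "A" if i < N // 2 else "B") for i in range(N)]
# Canister groups of 100 contiguous IDs, alternating between the two subnets.
grouped = [(i, "A" if (i // 100) % 2 == 0 else "B") for i in range(N)]

entries = {
    "interleaved": len(to_ranges(interleaved)),
    "contiguous": len(to_ranges(contiguous)),
    "grouped": len(to_ranges(grouped)),
}
```

Under these assumptions the interleaved split needs 50,000 entries, a single cut needs 2, and alternating groups of 100 IDs need 500, which illustrates why contiguous canister group ranges keep fragmentation under control.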

But once (or before) we have faster, more automated subnet splitting, we need to think hard about canister groups or other ways of specifying and preserving groups of co-located canisters. Or, conversely, sets of canisters that must be split across subnets, e.g. so they can each get full CPU cores at all times.
