Path forward for subnet splitting and protocol scaling

This post is about a plan that will let the internet-computer protocol auto-scale subnets when there is too much load.

A subnet can get full at any time: new canisters can be created on it and/or new traffic can be driven to it, so that more work is requested of the subnet than it can handle. As of now, this can cause canisters to become non-responsive and basically unusable as a result of something outside the canister-author’s control. When that happens, there must be a plan for what to do. Manually migrating data to a new canister on a new subnet is not a good plan: if the new subnet fills up shortly after, it would just be playing cat and mouse, and if the canister is controlling threshold keys or holds tokens in many accounts on many ledgers, it’s not even possible to try.

Subnet splitting has been hindered by the assumption that some canisters need to be able to live on the same subnet as each other. This post is about clarifying and calibrating that assumption.

This post clarifies that canisters need to be able to be on the same subnet-type as each other, but do not need to be on the same subnet.

A single subnet does not scale, but a single subnet-type can scale (across an infinite number of subnets).

The main idea is to get rid of any functionality between two canisters that relies on both canisters being on the same subnet. Specific subnet-ids will not be visible to canisters; only subnet-types will be visible to canisters. Any functionality between two canisters that currently relies on both canisters being on the same subnet will either be deprecated or made to work cross-subnet for subnets of the same subnet-type (it does not need to be made to work across different subnet-types).

  • Composite queries → deprecate or be made to work cross-subnet for subnets of the same subnet-type.
  • Management canister create_canister method → deprecate in favor of the CMC’s create_canister method, or just route the management canister’s create_canister method to the CMC’s create_canister method, passing the caller’s subnet-type as the subnet-type parameter (see the sketch after this list).
  • install_chunked_code → be made to work cross-subnet for subnets of the same subnet-type.
  • 10MiB message arg size limit on same-subnet calls → deprecate; the main use case for this was installing wasms bigger than 2MiB, but now we have chunked wasm installs so there is no need for it.
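
As a rough illustration of the create_canister routing idea above, here is a hedged Rust sketch. The argument and result shapes (`CmcCreateCanisterArg`, the `subnet_type` field, `CmcCreateCanisterResult`) are simplified assumptions for illustration, not the CMC’s confirmed Candid interface:

```rust
// Sketch only: routing a canister's create_canister request through the CMC
// while pinning the new canister to the caller's subnet-type.
// The argument/result types are assumptions, not the CMC's actual Candid types.
use candid::{CandidType, Deserialize, Principal};

#[derive(CandidType, Deserialize)]
struct CmcCreateCanisterArg {
    // Hypothetical field: ask the CMC for any subnet of this subnet-type.
    subnet_type: Option<String>,
}

#[derive(CandidType, Deserialize)]
enum CmcCreateCanisterResult {
    Ok(Principal),
    Err(String),
}

async fn create_canister_on_same_subnet_type(
    cmc: Principal,          // principal of the cycles-minting canister
    my_subnet_type: String,  // the subnet-type of the calling canister
    cycles: u128,            // cycles to attach for the new canister
) -> Result<Principal, String> {
    let arg = CmcCreateCanisterArg { subnet_type: Some(my_subnet_type) };
    let (result,): (CmcCreateCanisterResult,) =
        ic_cdk::api::call::call_with_payment128(cmc, "create_canister", (arg,), cycles)
            .await
            .map_err(|(code, msg)| format!("call rejected: {:?} {}", code, msg))?;
    match result {
        CmcCreateCanisterResult::Ok(canister_id) => Ok(canister_id),
        CmcCreateCanisterResult::Err(e) => Err(e),
    }
}
```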

Once that is done, the path is clear for how to auto-scale the subnets. Let’s say we start with subnet-A that has a single canister-range [0…10]. When subnet-A fills up and has too much load, the protocol can split subnet-A straight down the middle of its load, create a new subnet-B with the same subnet-type, give subnet-B half of subnet-A’s load, and possibly assign additional ranges to both subnets if the leftover ranges are too small to leave room for future canisters. So after the split we have the ranges: subnet-A: [0…5, 10…15] and subnet-B: [5…10, 15…20]. Now let’s say after some time subnet-A gets full again. The protocol can do the same thing again, but this time it can either give one of subnet-A’s two existing ranges to a new subnet-C, or, if >75% of the load is localized in one of the two ranges (let’s say the load is dense within the 2…4 range), subnet-A can split that range down the middle. So then we would have subnet-A: [0…3, 10…15], subnet-C: [3…5, 20…25], and subnet-B stays the same: [5…10, 15…20]; and of course, like last time, additional ranges can be assigned to subnet-A and subnet-C if the leftover ranges are too small to leave room for future canisters. The protocol can keep doing this whenever subnets get full, as long as there are enough node-machines available. The protocol can auto-scale forever.
A cool part of these dynamics is that the greater the number of ranges that a subnet has, the greater the probability that a load split can take place by giving one or more of its existing ranges to a new subnet and getting the same number (or fewer) of fresh replacement ranges, without having to increase the total number of ranges for that subnet. This happens because the greater the number of ranges on a subnet, the smaller each range is, and the smaller the probability that >75% of the load will be localized in one range. This means that the number of ranges per subnet will stabilize on average.
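
Here is a minimal, hypothetical Rust sketch of the split decision described above. The types and the load model are illustrative only (ranges as half-open u64 intervals, a single load number per range); the >75% heuristic and the choice between handing off whole ranges versus splitting the dense range are the parts taken from the text:

```rust
// Illustrative sketch (hypothetical types, not protocol code) of the split
// decision described above: canister-id ranges are half-open u64 intervals,
// and each range carries a load number.

#[derive(Clone, Copy, Debug, PartialEq)]
struct Range {
    start: u64,
    end: u64, // exclusive
}

/// Outcome of a split: ranges kept by the old subnet and ranges given
/// to the newly created subnet of the same subnet-type.
struct SplitPlan {
    keep: Vec<Range>,
    give: Vec<Range>,
}

fn plan_split(ranges: &[(Range, u64)]) -> SplitPlan {
    let total: u64 = ranges.iter().map(|(_, load)| *load).sum();

    // Case 1: >75% of the load is localized in one range. Split that range
    // down the middle of its load (approximated here by the range's midpoint).
    if let Some((dense, _)) = ranges.iter().find(|(_, load)| *load * 4 > total * 3) {
        let mid = dense.start + (dense.end - dense.start) / 2;
        let mut keep = vec![Range { start: dense.start, end: mid }];
        let give = vec![Range { start: mid, end: dense.end }];
        // The other, lightly loaded ranges stay where they are.
        keep.extend(ranges.iter().map(|(r, _)| *r).filter(|r| r != dense));
        return SplitPlan { keep, give };
    }

    // Case 2: load is spread out. Hand whole ranges to the new subnet until
    // roughly half of the load has moved.
    let mut moved = 0u64;
    let mut keep = Vec::new();
    let mut give = Vec::new();
    for (range, load) in ranges {
        if moved * 2 < total {
            give.push(*range);
            moved += load;
        } else {
            keep.push(*range);
        }
    }
    SplitPlan { keep, give }
}

fn main() {
    // The example from the post: subnet-A starts with the single range [0…10)
    // and splits in half, keeping [0…5) and giving [5…10) to subnet-B.
    let plan = plan_split(&[(Range { start: 0, end: 10 }, 100)]);
    println!("keep: {:?}, give: {:?}", plan.keep, plan.give);
    // Fresh ranges (e.g. 10…15 and 15…20) could then be appended to each
    // subnet if the leftover ranges are too small for future canisters.
}
```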

Size of the routing table: for 1,000,000 subnets (one million) with an average of 10 ranges per subnet, counting 29 bytes per principal (one subnet-id per subnet plus two canister-ids per range), the total size of the data in the routing table is 29 × 1,000,000 + 29 × 2 × 10 × 1,000,000 = 609,000,000 bytes ≈ 609 MB.
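
For reference, the same arithmetic as a tiny sketch (the 29-byte principal size, one subnet-id per subnet, and two canister-ids per range are the assumptions behind the estimate):

```rust
// Back-of-the-envelope check of the routing-table size estimate above.
fn main() {
    let principal_bytes: u64 = 29;
    let subnets: u64 = 1_000_000;
    let ranges_per_subnet: u64 = 10;

    let subnet_ids = principal_bytes * subnets;                       // 29 MB
    let range_bounds = principal_bytes * 2 * ranges_per_subnet * subnets; // 580 MB
    println!("{} bytes", subnet_ids + range_bounds); // 609000000, i.e. ~609 MB
}
```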

:Levi.

1 Like

Even if we got rid of all the APIs that are dependent on canisters being on the same subnet (and removing features is never easy), the issue remains that the messaging throughput and latency are going to be vastly different depending on whether the canisters are on the same subnet or on different subnets. Now we do want to improve the throughput of messaging across subnets in the future (and there are some very concrete ideas how this might be done in the medium term), but it’s not realistic to expect the difference to disappear completely. We also don’t have a clear plan on how we could improve the latency significantly (and it’s not clear that it’s at all possible to significantly improve it even in theory).

1 Like

Can you clarify what specific issue you think someone could run into?

In the past few weeks, calls between canisters on the same subnet have taken hours to complete, if they were able to complete at all, due to the single subnet being filled up with load. There is nothing there to try to save; the single-subnet architecture does not guarantee low latency, it does the opposite. It’s an illusion.

1 Like

We are glad other devs see eye to eye with us on this. Ultimately, it’s just compute. And it doesn’t matter if it’s all used up by a single big canister or millions of tiny ones.

We’ve already had this conversation millions of times in traditional cloud computing: monolithic fast VMs vs Functions as a Service.

The answer, like almost all other things in computing, is “It depends”. There are trade-offs when choosing one against the other.

Our past wisdom has shown us that massively distributed systems are easier to reason about, shard, and scale with the Functions as a Service approach.

I’m not thinking about a specific issue, I’m saying that there’s a fundamental performance difference between calls on the same subnet and across subnets, and that will not go away. Sending a message between two canisters on the same subnet requires just copying some memory around on the same machine. DDR5 memory has a max throughput of 64 GB/s, with a 5GHz bus frequency. Cross-subnet messages require network traffic and consensus; currently we require 1 gigabit/s (~125MB/s) links between nodes, so that’s the upper bound on the throughput for those messages; they also require something like 3 consensus rounds to get an answer, and there are also theoretical lower limits on how fast you can make consensus (assuming we have nodes distributed across the planet). So you’re looking at a ~400x difference in performance of the underlying communication mechanisms.

Of course there is also overhead in the system, so we will never reach those exact throughput numbers. Right now, as you point out, the overhead on intra-subnet messages is huge and it’s making the latency extremely high. But it’s something that can (and should!) be fixed and people are working on it. The fundamental differences above, however, can’t just be ironed out.

2 Likes

There is a limited number of messages that can execute within one round, so a canister call to another canister on the same subnet can still unpredictably require a consensus and networking round in between the two canisters’ message executions. This effect only gets bigger as the subnet’s load grows and canisters compete for round space. The stats you mentioned are only predictably achievable on a rented subnet executing a limited number of messages at a time. On the public subnets where anyone can create canisters, it’s not possible to predictably get those numbers.

Another factor here is that, for an end user’s ingress message, the end-to-end latency difference between calling a canister that calls another canister on the same subnet and calling a canister that calls another canister on a different subnet is much smaller (maybe around 3x max?).

For the plan in the original post, we can say that every canister call takes up to the time of a cross-subnet call, and sometimes it can be faster if the protocol is able to make it faster (if both canisters happen to be on the same subnet at the time of the call), but canister authors should keep in mind that a canister call can always take up to the time of a cross-subnet call. The cycles cost of a canister call can be made the same whether or not the canisters are on the same subnet; the cost can be set by looking at the average resource usage of canister calls throughout the whole network.

1 Like

Yes this is the case currently (and probably will be to some extent for the foreseeable future), but I don’t think that anyone really wants unpredictable execution latency and throughput as the norm? We won’t get to a predictable environment within a few months for sure, but we can improve gradually.

IIRC, with the sync call endpoint, an ingress call takes around 1.2 seconds to get a response, and the median cross-subnet call is around 6 seconds (though this is largely due to the NNS having a slower finalization rate - otherwise it is indeed probably around 3 seconds). But do note that this gets worse if there are multiple calls being made. Are people OK with a 3x increase? Famously, Amazon found that they’d lose 1% of income with every 100ms of latency added, but I can accept that maybe it’s not so bad for everyone.

This was the original philosophy behind the system. You couldn’t choose the subnet that you deploy canisters on. In principle, creating a canister from a canister could’ve created that canister anywhere, not just on the same subnet - but it didn’t in the implementation. But people have regardless noticed that there was a performance difference between local and cross-net calls and have come to rely on that, and I think that’d be quite difficult to change now. I mean one could of course now make local calls slower on purpose so that they don’t come to rely on the performance, but I think that would leave a lot of people unhappy. Similar with composite queries.

This is already the case today - every inter-canister call costs the same.

2 Likes

I for one would not be happy about deprecating composite queries; we use them a lot. Having them work across subnets would be cool though, even if it does take longer.

Hi @frederico02, one thing about composite queries as they are now is that they don’t work for querying data that lives across subnets, so they’re not compatible with any data that needs to be able to grow forever. Composite query methods are also not able to be called by other canisters when those canisters are running within an update call, further limiting their use to unreplicated non-consensus offchain query calls.

Here is a quote from the interface spec about composite query methods: “Composite query methods are EXPERIMENTAL and there might be breaking changes of their behavior in the future. Use at your own risk!”.

There is a common pattern that works great for being able to query data that can grow forever and can be queried cross-canister and cross-subnet, and can be queried by other canisters running in update mode, and the data can be queried from a frontend/agent using fast non-consensus unreplicated queries or consensus replicated queries. The mechanics are as follows. Let’s say the user queries canister-A which then figures out that the data that the user needs is on canister-B. Canister-A can return canister-B’s-principal-id back to the user, and then the user (or the user’s frontend) can call canister-B directly for the data. The ICRC-3 standard uses this mechanism and it works great. A number of dapps are using it too. Let me know if this wouldn’t work for you.
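
A minimal Rust sketch of those redirect mechanics, with hypothetical method and type names (get_data, GetDataResult, and the lookup helpers are made up for illustration; this is not the ICRC-3 interface itself):

```rust
// Sketch of the redirect pattern described above, with hypothetical names.
// Canister-A answers a query either with the data itself or with the
// principal of the canister (canister-B) that actually holds the data;
// the caller then queries canister-B directly.
use candid::{CandidType, Deserialize, Principal};

#[derive(CandidType, Deserialize)]
struct Entry {
    key: String,
    value: Vec<u8>,
}

#[derive(CandidType, Deserialize)]
enum GetDataResult {
    /// The data lives on this canister; here it is.
    Data(Vec<Entry>),
    /// The data lives on another canister; call it directly with the same args.
    OtherCanister(Principal),
}

#[ic_cdk::query]
fn get_data(user: Principal, page: u64) -> GetDataResult {
    if let Some(holder) = lookup_holder(&user) {
        if holder != ic_cdk::api::id() {
            // Redirect: the caller (frontend, agent, or another canister in
            // update mode) queries `holder` directly for this user's pages.
            return GetDataResult::OtherCanister(holder);
        }
    }
    GetDataResult::Data(load_local_page(&user, page))
}

// Placeholder lookups; a real canister would consult its own index here.
fn lookup_holder(_user: &Principal) -> Option<Principal> { None }
fn load_local_page(_user: &Principal, _page: u64) -> Vec<Entry> { Vec::new() }
```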

If they can make composite queries work cross-subnet that would be better, but given that there is a pattern that can be used instead to do the same thing, I don’t think it is as important as the need to make the protocol infinitely scalable.

Great, less than 2.5x. Looks like the recent reduction of notarization delays is working. And this is 2.5x in the context of the difference between 1.2 seconds and 3 seconds. The plan in the original post does not affect calls to the NNS subnet since they are already cross-subnet calls.

If there are multiple calls being made it gets worse for both same-subnet and cross-subnet calls so it doesn’t make a difference to the plan in the original post.

Let’s be clear on where the increase will happen. It will only increase from 1.2 seconds to 3 seconds for canister-methods that currently call other canisters on the same subnet. Canister-methods that currently call canisters on other subnets, and canister-methods that don’t make any downstream calls, will not be affected in any way by the plan in the original post. Most canister calls on the network are already cross-subnet; every canister call to the ICP or SNS ledgers is already cross-subnet and will not be affected in any way by this plan.

The internet-computer’s values of decentralization, security, replication, infinite-scalability, availability/liveness, and consensus, are worth more than millisecond optimizations in my view.

Cross-canister calls taking between 1.2 and 3.0 seconds is a predictable latency. It is much more predictable than what we have now, where anyone can install thousands of canisters on a subnet and run heartbeats on all of them and then the other canisters on that subnet can take minutes/hours to complete a cross-canister call.

In the plan in the original post, even if someone installs a million canisters on a single subnet, every canister in that subnet will maintain the consistent latency of up to 3 seconds max, since the protocol will just keep splitting the subnet till the load is balanced.

The protocol must breathe in and out for the balance.

Thanks for sharing, that is cool to know; it does align with the infinite scalability plans.

I don’t see how it is possible that people have come to “rely” on that unless it is a rented subnet. Anyone can install canisters on a public subnet and run heartbeats or any computation, and then all other canisters on the subnet can take much longer to complete a same-subnet canister call, since all the canisters on one subnet share the same block space.

Another factor is that the time it used to take to do a same-subnet call a few weeks ago, before the reduction in notarization delays, is about the same as the time it now takes to do a cross-subnet call.

No need to do that. Since same-subnet canister-calls on public subnets can already take much longer than usual when there is load on the subnet, it would be incorrect even now for canister authors to rely on any specific or consistent latency for the correctness of the code.

Thanks, my mistake. That makes this plan even smoother.

I also thought of this solution but it ended up being more tricky; we need it to work with pagination and with one filter (a user’s principal). Composite queries were the most optimal solution for us. Although they are experimental, there was no good way for us to scale our data solution without using either an update call or a composite query.

Is it the most optimal solution for your data to be limited to a single subnet that can be filled up at any time by anyone?

I am using the solution mechanics with pagination and with two filters, a user’s principal and another indexing key. It is awesome, the system can handle infinite data scalability, the frontend queries are super fast, and the data can also be queried by other canisters in update mode.

Do you have a github link to your code? I would be interested to study it.

The code’s license does not let it be shared as a learning material. The concept is the same, return the second canister-id back to the user, and then the user calls the second canister directly.

Ah, yes I think I understand now.

No; the latency of an intra-subnet call has generally (prior to the past month) been ~0. So a call that makes, say, 2 downstream calls in a sequence (not parallel), would still take ~1.2s (to account for ingress processing time) if the downstream calls are on the same subnet, but it would take around 1.2s + 2*3s = 7.2s if the downstream calls are going across subnets.

I don’t see how you’d get <3s latency just by doing subnet splitting today. According to the dashboard, the IC has ~500 assigned nodes, and ~1500 nodes total. So a bit less than 1,000 spares, which would allow us to spin up ~70 more 13-node subnets. ~10 subnets are suffering quite a bit now, so we could spread the load of those subnets over 7x more subnets. But if it’s taking dozens of minutes to execute a message right now, cutting that down by 7x doesn’t get you down to 3s. To get there we’d need to onboard a ton of new nodes, which would also take time. Besides, given that the IC is already heavily subsidizing the subnet costs through inflation, I don’t think this is wise. Also, keep in mind that a subnet split would currently likely take hours to execute, since there’s a lot of manual work needed. So just executing the split would take weeks.

1 Like

I thought you were talking about multiple calls as in a big load on the subnets; in that case it makes no difference. I know what you mean now: changing same-subnet calls into cross-subnet calls will create a bigger difference for ingress messages where multiple downstream calls are being made. It does take a little more time for each canister call that changes from same-subnet to cross-subnet, but I’ll quote my other point which addresses this.


That’s a great start. I heard there are node providers waiting in line for the chance to add more nodes to the network.

A possible bonus is that the performance could rise by even more than 7x on those 10 subnets, due to the current latency possibly being the result of some exponential latency behavior that happens when too many canisters are all trying to execute on a subnet at once. Either way though, whether it is 7x or more, the load must be spread throughout as many subnets as it takes for the protocol to be usable.

Let’s get the system usable and infinitely scalable first, then we can calibrate the costs.

The plan in this post is for the protocol to auto-split a subnet (maybe with a small manual confirmation step) sometime before it hits some threshold of sustained load that would make the subnet unusable. The auto-split follows the logic in the original post. We can’t be messing around with manual splits when there will be 50 subnets that need to split on the same day.

1 Like

If it comes down to this, we can take away (or lower) the subsidies, so that the canisters will cover their own costs, and then there will be no reason not to continue creating subnets.

Look I’m all for improving subnet splitting, it’s clear that, if the protocol is to succeed, it will need to scale automatically. However:

  1. To set expectations, having been involved in the existing implementation, I can tell you that getting it automated is something that will take months of additional implementation work. So we’ll need something in the meantime. Possibly including manual splitting, but also just perf improvements - we’ll need these anyways (or price increases) if the system is to be profitable.
  2. I don’t think it’s necessary to throw the baby out with the bathwater and not allow (small) co-located canister groups as well, if we’re doing the work on automated splitting.
  3. We will either have to be very conservative with splitting, or we will also need a way to scale the network down once the load subsides. Otherwise one ends up with unprofitable excess capacity. But AFAIK no one has started working on the economics of this.
3 Likes

Wonderful. It will be awesome.

Will take some months but will last forever.

Sure, sounds good.

A canister group would be limited to the resources of a single subnet, which could create a possible foot-gun for canister authors. Canisters in a canister-group would have to account for the resource usage of the other canisters in the group, to make sure that collectively they don’t exceed the resources of a single subnet. That could cause confusion if developers try to build their dapps with canister-groups but don’t realize they will be stuck (they can’t change canister-ids if they are holding threshold keys) with the storage and compute capacity of a single subnet. If you make it so that people can split their own canister-groups if they get too big, then we might as well let the protocol handle the splitting automatically.

Without canister-groups, a canister doesn’t need to keep track of the resource usage of other canisters, it can just keep track of the resource usage of itself which is a much better model for programming canisters, and the benefit is greater when there are more canisters in the mix.

If you make it small enough that a single subnet can always handle the max compute and storage demands of every canister in a canister-group at the same time, without lowering the default storage and compute capacity of any of the canisters, I don’t think it would be too bad. But it will need to be pretty small, and I’m not sure it will be useful or worth it considering the subnets’ overhead for keeping track of the groups. I wouldn’t use it, just because of the overhead of having to think about it, and there is not much benefit.

Either way, canister-groups are not urgent and not a necessity. At its best, the feature can save a few seconds here and there. Existing canisters wouldn’t be able to use it though.

I think at this point we are heading into a period of growth of the network for the next few years so it isn’t urgent, but it might be good to have in the long term. I think once the protocol can auto-split a subnet, it will be simple to auto-merge subnets. It will use most of the same parts: tracking load, moving canister ranges to different subnets, and updating the routing table.

1 Like