Suggested approach to make more compute capacity available on the IC

TL;DR

Given the rapidly increasing load on the IC, we propose adding additional application subnets to provide developers with greater flexibility and choice when selecting a deployment environment. This expansion will increase the capacity of ICP and accommodate continued growth in the coming months.

The following three-step approach is proposed:

  • Step 1: Expand the list of public subnets. This step has already been completed.
  • Step 2: Open the currently verified subnets to the community. Of the existing subnets, several so-called verified subnets are not open for deployment because they contain legacy canisters, some of which depend on a special replica configuration. These verified subnets - of which there are 11 - will be opened gradually.
  • Step 3: Extend the target topology by 20 additional application subnets and gradually submit proposals to create these subnets as capacity is needed.

Note that increasing the number of public subnets is proposed in parallel to the performance improvements for individual subnets, outlined in this forum post.

Step 1 - Expanding the list of public subnets

Currently, as can be seen on the IC dashboard, there are a total of 1469 available node machines and a total of 568 node machines active in the network. At the same time, the IC is seeing subnets with heavy compute load. Although the primary focus is to improve the performance of the protocol (in the scheduler, in the execution layer, and through subnet splitting), it is also important that developers have more flexibility in choosing on which subnet to deploy and run their canisters.

One step already taken is the expansion of the list of public subnets. There are now 17 application subnets publicly available, allowing all principals to install canisters and take advantage of the available resources. But as the number of canisters on the IC continues to grow, both the load and the state size of these subnets are increasing. The table below shows a snapshot of the approximate number of canisters and approximate state size per subnet, as of 22 October. It also shows the practical limit of each subnet in terms of number of canisters and state size.

As can be seen from the table, there is a lot of activity on each subnet, but each subnet operates well within the limits set for it. There is currently a limit of 120’000 canisters per subnet and a limit of 750 GiB for the subnet state size. Core protocol engineers are actively working on lifting these limits.
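To make the headroom concrete, here is a minimal sketch of the arithmetic, using the limits stated above. The per-subnet snapshot numbers in the sketch are placeholders, not the actual values from the 22 October table:

```python
# Rough headroom estimate for the public application subnets, based on the
# limits stated above (120'000 canisters and 750 GiB of state per subnet).
# The per-subnet snapshot values below are placeholders, not the actual
# figures from the 22 October table.

PUBLIC_SUBNETS = 17
CANISTER_LIMIT_PER_SUBNET = 120_000
STATE_LIMIT_GIB_PER_SUBNET = 750

# Hypothetical snapshot: (canister count, state size in GiB) per subnet.
snapshot = [(40_000, 300)] * PUBLIC_SUBNETS

total_canisters = sum(c for c, _ in snapshot)
total_state_gib = sum(s for _, s in snapshot)

canister_headroom = PUBLIC_SUBNETS * CANISTER_LIMIT_PER_SUBNET - total_canisters
state_headroom_gib = PUBLIC_SUBNETS * STATE_LIMIT_GIB_PER_SUBNET - total_state_gib

print(f"Canister headroom across public subnets: {canister_headroom:,}")
print(f"State headroom across public subnets:    {state_headroom_gib:,} GiB")
```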

Step 2 - Making verified subnets publicly available

As a second step, it is proposed that some of the existing subnets that are not yet available to developers are also made public. These so-called verified subnets were created at IC Mainnet launch time and currently have legacy canisters running on them, some of which depend on special functionality that is no longer considered generally safe and is thus deprecated. In alignment with the canister owners, this legacy functionality will be removed and the subnets will be opened up gradually.

The table below shows a snapshot of the verified subnets in terms of canisters and state size, as of 22 October. As can be seen, there is substantial additional capacity available on these subnets.

Step 3 - Adding more application subnets with existing spare node machines

In addition to opening up existing subnets for canister deployment, additional subnets can be added to make even more capacity available on the IC network. Using the target topology tooling, one can calculate that at least 30 application subnets can be added with the existing spare nodes (see also the diagram below, in which an optimization calculation was made, taking into account the recently approved approach for Gen1 node machines after 48 months).
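The actual target topology tooling solves this as a linear optimization over the real node registry, considering country, data center, data center provider, node provider and node operator simultaneously. As a much simplified illustration of the feasibility question, here is a greedy sketch that only enforces one hypothetical rule (at most one node per node provider per subnet) over a hypothetical spare-node pool:

```python
from collections import Counter

# Simplified greedy sketch: how many 13-node application subnets can be
# formed from a pool of spare Gen1 nodes, under the (hypothetical) rule
# that each subnet may use at most one node per node provider?  The real
# tooling uses linear optimization over all decentralization characteristics
# at once; the pool below is invented for illustration.

SUBNET_SIZE = 13

# Hypothetical spare-node pool: node provider id -> number of spare Gen1 nodes.
spare_gen1_nodes = Counter({f"np-{i}": 5 for i in range(40)})  # 200 nodes total

subnets_formed = 0
while True:
    # Pick the 13 providers that currently have the most spare nodes left.
    candidates = [np for np, n in spare_gen1_nodes.most_common(SUBNET_SIZE) if n > 0]
    if len(candidates) < SUBNET_SIZE:
        break
    for np in candidates:
        spare_gen1_nodes[np] -= 1
    subnets_formed += 1

print(f"Application subnets that can be formed from this pool: {subnets_formed}")
```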

Adding new subnets will of course increase capacity, but it also has several (smaller) drawbacks:

  • Subnets that have been created currently cannot be removed.
  • Consequently, there will be fewer spare node machines available that can be used to replace dead or degraded nodes on any subnet.

We propose to increase the number of application subnets in the target topology by 20 additional subnets, so that additional capacity can be provided as demand keeps growing. This is well within the maximum number of application subnets that can be added to the IC (which is at least 30, as described above).

The updated target topology is shown below. A motion proposal to update the target topology will be issued shortly.


Proposed target topology, with the updated line entry in grey/bold.

Feedback

As always, feedback from the community on this proposal is very much appreciated. The intention is to formally submit the motion proposal for the update of the target topology in the next few days.

20 Likes

Like the initiative! We migrated from the o3ow2 subnet to a less congested one, resulting in a 3–5x improvement in form responsiveness. The performance difference in loading times between our web3 forms and traditional web2 forms is now minimal - only around 1-2 seconds.

However, we’re noticing a gradual increase in data upload times to the canisters, which impacts overall efficiency. To stay competitive with web2 survey tools and forms, we would greatly benefit from access to additional, low-traffic subnets.

If necessary, we’re open to exploring the option of deploying our own node/subnet.

2 Likes

I’m not entirely sure why a developer should be responsible for choosing the subnet their application runs on.

  • Isn’t the Internet Computer (IC) designed to be infinitely scalable?
  • Isn’t the IC protocol supposed to simplify development by abstracting away the hardware limits of individual subnets?

4 Likes

Hi @Doudi, many thanks for your response.

To answer your first question: yes, the IC is infinitely scalable, with more subnets or larger subnets, but you have to consider the impact this has on performance (having more nodes in one subnet means more nodes have to reach consensus, and having many subnets means an increase in cross-subnet communication).
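As a purely illustrative back-of-envelope sketch (not the actual protocol cost model), one can capture this trade-off with two assumed terms: consensus traffic within a subnet that grows roughly quadratically with subnet size, and cross-subnet traffic that grows with the number of subnet pairs. All numbers and growth rates here are assumptions for illustration only:

```python
# Toy model (not the actual protocol cost model) of the trade-off described
# above: fewer, larger subnets increase per-subnet consensus traffic, while
# many smaller subnets increase cross-subnet (XNet) communication.

def relative_cost(num_subnets: int, nodes_per_subnet: int,
                  xnet_fraction: float = 0.1) -> float:
    """Arbitrary-unit cost: a quadratic consensus term per subnet plus a
    cross-subnet term proportional to the number of subnet pairs."""
    consensus = num_subnets * nodes_per_subnet ** 2
    xnet = xnet_fraction * num_subnets * (num_subnets - 1)
    return consensus + xnet

total_nodes = 520  # hypothetical fixed node budget
for nodes_per_subnet in (13, 26, 40):
    num_subnets = total_nodes // nodes_per_subnet
    print(f"{num_subnets:3d} subnets of {nodes_per_subnet:2d} nodes -> "
          f"relative cost {relative_cost(num_subnets, nodes_per_subnet):,.0f}")
```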

These trade-offs point directly to your second question: there are several trade-offs to be made in reaching the right balance between capacity and performance. I would suggest having a look at the discussion being picked up in this forum thread, and please feel free to post any additional questions there as well. As stated in that thread, load balancing - and, as a next step, abstracting this away in the protocol - is something that the IC could grow towards, but it of course takes development time.

Hope this helps answer your question.

1 Like

Dear all, the corresponding motion proposal 133841 to add application subnets to the IC Target Topology has been submitted for voting and can be found here.

1 Like

Are the nodes for these 20 new subnets already being paid monthly rewards, or will this add additional costs to the network @kylelangham? If they are already being paid, why aren’t we deflationary yet? If we are adding subnets because every existing subnet is at full capacity, the revenue generated by those nodes should cover their monthly cost, shouldn’t it? But we are not deflationary, which suggests that not all nodes are being used at full capacity and are therefore not profitable. So why add new subnets on the premise that the network is running at full capacity?

Does that make sense? Simply put, I don’t understand why we need to add new subnets if we are not deflationary yet. These subnets are being added to provide more compute capacity, which implies we are at full capacity today; so if we are at full capacity, why can’t we achieve deflation yet?

This is either because these nodes are already included in the monthly node reward payments, in which case we are just creating inflation without any revenue at all; or, if they are not included, the charging model is not accurate and subnets aren’t burning according to how much it costs to sustain a subnet; OR subnets are not really at full capacity, and that’s why we are still not deflationary. Thanks!!

Reserve nodes, which are not yet assigned to subnets, already receive node rewards.

Regarding your question on subnets at full capacity and pricing: see this thread for a review of compute pricing, which aims to ensure that subnet revenues adequately reflect operational costs.

2 Likes

How has this been verified? My understanding is that the referenced decentralisation optimisation calculation app provides rough, optimistic estimates. The problem of node-subnet constraint satisfaction is an NP-complete one.

There are surely not enough spare nodes with a diverse enough set of characteristics to satisfy the IC Target Topology with an additional 20 subnets (and no new nodes onboarded yet).

It’s already a struggle to satisfy the IC Target Topology (and it’s already not being met on every subnet).

Should subnet management proposals be rejected if they take subnets away from a state that meets the target topology? I can guarantee you that this will happen if this proposal is adopted. My current stance is that such proposals should be rejected (otherwise what’s the point of a target topology, particularly one that cannot practically be met?).

It feels like what is needed is an agreed set of tolerances for the target topology. On one end of the spectrum of these tolerances would be the desired state of decentralisation for a subnet. On the other end of that spectrum would be the worst state of decentralisation that is considered permissible (i.e. for a temporary period of time). Subnet Management proposals that sit within this range would then objectively not be violating the target topology mandate, so reviewers know where they stand in terms of doing the right thing (rejecting or accepting SM proposals).
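To make this concrete, here is a rough sketch of what such a tolerance band could look like for a reviewer checking a subnet management proposal. All characteristic names and limits are hypothetical, and the real target topology constraints are richer than this:

```python
from collections import Counter

# Sketch of the "tolerance band" idea: each decentralisation characteristic
# has a desired limit (target topology) and a worst permissible limit
# (temporarily acceptable).  A subnet management proposal would be fine as
# long as the resulting subnet stays within the permissible limits.
# All names and limits below are hypothetical.

DESIRED_LIMITS = {"node_provider": 1, "country": 2, "data_center": 1}
PERMISSIBLE_LIMITS = {"node_provider": 2, "country": 3, "data_center": 2}

def worst_counts(nodes: list[dict]) -> dict[str, int]:
    """Highest concentration of subnet nodes per characteristic."""
    return {c: max(Counter(n[c] for n in nodes).values()) for c in DESIRED_LIMITS}

def classify(nodes: list[dict]) -> str:
    counts = worst_counts(nodes)
    if all(counts[c] <= DESIRED_LIMITS[c] for c in counts):
        return "meets target topology"
    if all(counts[c] <= PERMISSIBLE_LIMITS[c] for c in counts):
        return "within tolerance (temporarily acceptable)"
    return "violates tolerance - reject proposal"

# Hypothetical 4-node toy subnet (real subnets have 13+ nodes).
subnet = [
    {"node_provider": "np-a", "country": "CH", "data_center": "zh1"},
    {"node_provider": "np-b", "country": "CH", "data_center": "zh2"},
    {"node_provider": "np-c", "country": "US", "data_center": "sf1"},
    {"node_provider": "np-a", "country": "DE", "data_center": "fr1"},
]
print(classify(subnet))
```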

I would love to see this, and under those conditions, would love to see 20 new subnets (assuming they’d be achievable within the agreed decentralisation tolerances).

1 Like

Hi @lorimer, the decentralization tool is a linear optimization tool and uses the available existing node machines, so it provides a good estimate of whether the target topology can be achieved. It is true that there can be dead or degraded nodes, and nodes in maintenance, and that spare nodes are not yet included in the linear optimization, but we can make informed decisions based on the tool. Application subnets (13-node subnets) require only Gen1 nodes, and as the tool shows (and also the IC dashboard), there is a huge pool of Gen1 nodes with which application subnets can be formed, while still keeping enough spare nodes. It also shows (as we discussed in this previous forum post) that the number of (spare) Gen2 node machines is rather limited, making it impossible to add any larger subnets at the moment (with 28 node machines or more).

Actually achieving the target topology requires numerous node replacement proposals (subnet management proposals), which, as agreed in the same thread, will be handled via the weekly subnet management proposals that replace dead nodes. This will of course happen over time, and the speed will depend on how many node replacement proposals are needed every week.

1 Like

@Lorimer it would be great if there were defined tolerance conditions for deviation from the target topology. Do you have any suggestions for these?

1 Like

Thanks @SvenF. Please correct me if I’m wrong, but doesn’t this tool only consider each decentralisation characteristic (e.g. country, node provider, data center) in isolation from all the others? In other words, sure, there appear to be enough spare nodes in a diverse enough set of data centers, but you have to consider the countries those data centers are in, and the associated node providers, and all the other decentralisation characteristics simultaneously (otherwise you’ll be basing decisions on massively over-optimistic/window-dressed stats).
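Here is a toy illustration of the concern, with entirely invented data and an assumed cap of 2 nodes per country: a pool can look fine per characteristic in isolation yet fall short once the characteristics are checked jointly:

```python
from collections import Counter

# Toy illustration (hypothetical data): per-characteristic counts look fine
# in isolation (16 distinct data centers, 16 distinct node providers), but a
# 13-node subnet cannot be formed once an assumed per-country cap is applied
# at the same time.

SUBNET_SIZE = 13
MAX_PER_COUNTRY = 2  # assumed cap, for illustration only

# Each spare node: (data_center, node_provider, country) - all hypothetical.
spare_nodes = [(f"dc-{i}", f"np-{i}", "US" if i < 10 else "CH") for i in range(16)]

print("distinct data centers:  ", len({dc for dc, _, _ in spare_nodes}))
print("distinct node providers:", len({np for _, np, _ in spare_nodes}))

# Greedy joint check: respect the per-country cap while picking nodes.
picked, per_country = [], Counter()
for dc, np, country in spare_nodes:
    if per_country[country] < MAX_PER_COUNTRY:
        picked.append(dc)
        per_country[country] += 1

print(f"nodes usable under the joint country cap: {len(picked)} of {SUBNET_SIZE} needed")
```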

Thanks, I’m glad you’re receptive to the idea :slightly_smiling_face: I’m not sure how to establish an acceptable lower bound for the tolerances. The upper bound should be the optimum that can possibly be achieved under ideal conditions, given the current nodes and their diversity across the relevant characteristics. I think a sensible lower bound would need to be calculated by taking a percentage of random nodes out of action (simulating failure conditions) and then calculating the optimum under those conditions.
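As a very rough sketch of what I mean (hypothetical pool, failure rate and scoring; the real calculation would re-run the full optimisation over all characteristics):

```python
import random

# Sketch of the lower-bound idea: knock out a random fraction of nodes
# (simulating failures/maintenance), recompute a decentralisation score,
# and take the worst result across many trials as the permissible lower
# bound.  Pool, failure rate and scoring below are all hypothetical.

SUBNET_SIZE = 13
FAILURE_FRACTION = 0.3
TRIALS = 1_000

# Hypothetical spare pool: node id -> node provider (15 providers, 2 nodes each).
pool = {f"node-{i}": f"np-{i % 15}" for i in range(30)}

def decentralisation_score(nodes: dict[str, str]) -> int:
    """Proxy score: distinct node providers that can be packed into one
    subnet (capped at the subnet size)."""
    return min(SUBNET_SIZE, len(set(nodes.values())))

scores = []
for _ in range(TRIALS):
    surviving = {n: p for n, p in pool.items() if random.random() > FAILURE_FRACTION}
    scores.append(decentralisation_score(surviving))

print("optimum with all nodes available: ", decentralisation_score(pool))
print("worst case across simulated failures:", min(scores))
```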

My understanding is that the tooling currently isn’t there for answering these questions in a rigorous way. I’m hoping to help get there with the optimisation tooling I’ve been working on. Until that’s further along, I guess the best option is to feel our way through the problem.

Perhaps something to discuss further over the next few weeks, based on what seems feasible in practice?

Hey @lorimer, the optimization tool considers all the metrics simultaneously (that’s why it is using linear optimization), hence the conclusion that it should be possible to add additional application subnets based on Gen1 node machines.

But as you pointed out in previous discussions, the dre-heal tool that is being used for node swapping uses a slightly different approach. It has an additional metric for region, and it also initially calculated the average decentralization. The latter has been corrected, so the difference between the two approaches is not that large anymore, although only dead/degraded nodes are still being swapped until there is agreement on one aligned approach for decentralization metrics.

1 Like

Sure, happy to discuss further!

1 Like

Thanks @SvenF, sounds like I need to revisit the details of the decentralisation tool implementation. I’ll do that as soon as I get a chance :slightly_smiling_face:

1 Like