Increasing DFINITY Node Count and NNS Topology Exception

In recent motion proposals it was agreed that DFINITY’s nodes should be treated equally to those of other Node Providers. In line with that decision, DFINITY reduced the number of data centers where it operates nodes to Stockholm (SH1) and Zürich (ZH2) and subsequently sold the nodes from BO1 and MR1 to new Node Providers (see this discussion).

However, some important operational challenges and likely future requirements were overlooked during this process. Currently, DFINITY operates a total of 42 nodes spread over 37 IC subnets. The current subnet recovery process expects DFINITY to have one node per subnet during recovery. While an additional node can be added during recovery, doing so would significantly extend recovery time and increase operational complexity. In recognition of these challenges, and given the exceptional importance of swift recovery, three nodes (rather than the usual one) are dedicated to the NNS subnet for recovery resilience.

This configuration effectively uses 39 nodes, leaving only 3 spares for all operational and redundancy needs. With one node in Stockholm currently showing degradation (and thus not being immediately usable), we effectively have only 2 healthy spare nodes. Moreover, the ongoing redeployment of the Zürich (ZH2) nodes using the HSM-less process requires that these nodes be removed from an active subnet and replaced by spare ones, further straining our spare capacity.
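
For reference, here is the arithmetic behind these figures as a small sketch based on the numbers in this post, assuming the three NNS nodes take the place of the single node that would otherwise sit in that subnet:

```python
# Node accounting for DFINITY's current fleet, using the figures quoted above.
TOTAL_NODES = 42        # nodes DFINITY currently operates
TOTAL_SUBNETS = 37      # IC subnets DFINITY participates in
NNS_NODES = 3           # nodes dedicated to the NNS subnet for recovery resilience

# One node in each non-NNS subnet, plus the three NNS nodes.
nodes_in_subnets = (TOTAL_SUBNETS - 1) + NNS_NODES   # 36 + 3 = 39

spares = TOTAL_NODES - nodes_in_subnets              # 42 - 39 = 3
healthy_spares = spares - 1                          # one Stockholm node is degraded

print(nodes_in_subnets, spares, healthy_spares)      # 39 3 2
```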

Additionally, as outlined in motion proposal 133841, there are plans to add up to 20 more application subnets, depending on IC growth. This expansion will not be feasible without an increased number of spare nodes.

To address these issues, we propose:

  1. Allowing DFINITY to operate more than the current 42 nodes. In particular, we would like to start a discussion about expanding DFINITY’s node allowance by 14 to 28 additional nodes. Considering motion proposal 133841, we believe it would be prudent to allow 28 more nodes, which would leave us with approximately 10 nodes for maintenance (see the sketch after this list). If the community prefers a lower number, we could also manage with 14 additional nodes, at least in the near term, and revisit further expansion when needed.
  2. Allowing an exceptional case for the NNS subnet whereby DFINITY may have 3 nodes that do not need to conform to the standard topology restrictions (which typically limit a subnet to 1 node per node provider, per data center, and per data center owner). This exception has raised several questions in community discussions (see this thread) and would benefit from a clear, official reference that subnet membership change proposals for the NNS subnet can point to.
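
To make the headroom estimate in point 1 concrete, here is a rough sketch under the assumptions stated above (one DFINITY node per planned application subnet, and the current 39 nodes staying committed to subnets); the split is illustrative, not part of the proposal:

```python
# Headroom under the proposed allowance, using the numbers from this post.
CURRENT_ALLOWANCE = 42
PROPOSED_EXTRA = 28            # upper end of the 14-28 range discussed in point 1
NODES_IN_SUBNETS_TODAY = 39    # from the accounting earlier in the post
PLANNED_NEW_APP_SUBNETS = 20   # per motion proposal 133841, assuming one DFINITY node each

new_allowance = CURRENT_ALLOWANCE + PROPOSED_EXTRA                            # 70
committed_after_expansion = NODES_IN_SUBNETS_TODAY + PLANNED_NEW_APP_SUBNETS  # 59
maintenance_headroom = new_allowance - committed_after_expansion              # 11

print(new_allowance, committed_after_expansion, maintenance_headroom)  # 70 59 11
```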

I invite all community members to discuss these points and share feedback on the proposed adjustments. Your input is essential to ensuring that our operational capacity meets both current and future network requirements.

6 Likes

Related forum discussion:

I have no concerns about the proposed additional nodes. I would prefer that DFINITY be allowed the full additional 28-node allocation so we don’t have to micromanage this decision in the future and so there are plenty of nodes for maintenance. I also have no concerns about making a topology exception for DFINITY. In my opinion, these deviations from the NNS-approved node count and topology would be in the long-term best interest of the Internet Computer.

If the community can have any input on the remuneration, then I would like to see the extra remuneration DFINITY would receive from these 28 extra nodes allocated directly to funding more known neurons to provide technical reviews in the Grants for Voting Neurons program. This means adding more individuals and/or teams for reviewing the IC-OS Version Election, Protocol Canister Management, Subnet Management, Node Admin, and Participant Management proposal topics. At this time, DFINITY only offers 2 grants per proposal topic. I would love to see 1-2 more teams per proposal topic, or whatever 28 nodes could afford.

@cryptoschindler @marc0olo

8 Likes

Is that because you need access to the latest CUP to initialize the recovery? Are there any initiatives to eventually lift this dependency on the foundation?

2 Likes

I would fully support that! We need more (funded) community involvement!
cc @katiep and @lara as well, since you are somewhat involved in the community grants.

2 Likes

@Zane Yes (read state from all nodes to compare data and ensure nothing malicious is happening on one, and then write the recovery CUP to one node), and also yes to the second question. The Consensus team is actively looking into ways to improve and automate recovery.
However, recovery is a very tough problem to solve in general, and for a fully general recovery (covering any case that may happen) it’s hard to come up with a fully automated solution. So improvements will come for particular cases first.
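
For readers unfamiliar with the flow described above, here is a highly simplified conceptual sketch. The function names, data shapes, and agreement threshold are invented for illustration; this is not the actual recovery tooling or its logic:

```python
# Conceptual sketch only: the real recovery procedure uses DFINITY's internal
# tooling and is considerably more involved. Names and data shapes are made up.
from collections import Counter

def read_state_hashes(nodes: list[str]) -> dict[str, str]:
    """Pretend to read the latest certified state hash from every node in the subnet."""
    return {node: "0xabc123" for node in nodes}

def recover_subnet(nodes: list[str]) -> None:
    # 1. Read state from all nodes and compare, to make sure no single node is
    #    serving tampered or diverged state.
    hashes = read_state_hashes(nodes)
    best_hash, agreeing = Counter(hashes.values()).most_common(1)[0]
    if agreeing <= 2 * len(nodes) // 3:
        raise RuntimeError("not enough agreement on the latest state; inspect manually")

    # 2. Build a recovery CUP (catch-up package) from the agreed-upon state and
    #    write it to one node; the rest of the subnet then catches up from it.
    recovery_cup = {"state_hash": best_hash, "height": "latest agreed height"}
    print(f"would write recovery CUP {recovery_cup} to {nodes[0]}")

recover_subnet([f"node-{i}" for i in range(13)])
```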

1 Like

The NNS proposal has been submitted. Thanks for the useful feedback in this thread!

https://dashboard.internetcomputer.org/proposal/135700

You may notice in the proposal summary that we propose not getting rewards for the added nodes. This was suggested by the DFINITY leadership in order to maintain fairness to other NPs.
This way DFINITY still gets the same node rewards as all other NPs, and we still get to have more nodes in case they are needed for subnet recoveries.

The way this would be implemented is by using the type1 (or similar) node reward type for these nodes, since type1 nodes already receive zero rewards.
It will be possible to check and confirm the reward configuration at any time with the dre registry or ic-admin tools. And, of course, anyone can keep track of all NNS proposals to confirm that none of them change the reward configuration for these nodes.
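
As a rough illustration of what such a check could look like, the sketch below filters a registry dump for each node operator's configured reward types. The dump command and JSON field names (`node_operators`, `node_operator_principal_id`, `rewardable_nodes`) are assumptions, not a documented schema, so adjust them to whatever the dre tool actually emits:

```python
# Sketch: list the node reward types for each node operator from a registry dump
# (e.g. a `dre registry` output saved as registry.json). Field names below are
# assumptions about the dump format, not a documented schema.
import json

def print_reward_types(registry_path: str) -> None:
    with open(registry_path) as f:
        registry = json.load(f)

    for operator in registry.get("node_operators", []):
        principal = operator.get("node_operator_principal_id", "<unknown>")
        # Expected to map a reward type (e.g. "type1") to a node count;
        # type1 nodes receive zero rewards, which is what this proposal relies on.
        rewardable = operator.get("rewardable_nodes", {})
        print(principal, rewardable)

if __name__ == "__main__":
    print_reward_types("registry.json")
```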

3 Likes

I was a bit hesitant about this proposal at first, but I think Dfinity not getting rewards for these additional nodes is the perfect solution. :+1:

4 Likes

Thanks @Sat, sounds good! I’ve voted to adopt.

Hopefully this will make it less likely that we see repeats of proposal 135540. Presumably once CUPs are verifiable, other NPs may be able to take part in subnet recovery (reducing the need for this DFINITY NP business rule)?

I guess this proposal will now serve as the latest IC Target Topology reference. I diffed the tables (the current reference being this one):

| Subnet Type | # Subnets | # Nodes in Subnet | Total Nodes | SEV | Subnet Limit (NP, DC, DC Provider) | Subnet Limit (Country) |
| --- | --- | --- | --- | --- | --- | --- |
| NNS | 1 | 43 | 43 | no | 1* (with exception for DFINITY nodes, see prior discussion) | 3 |
| SNS | 1 | 34 | 34 | no | 1 | 3 |
| Fiduciary | 1 | 34 | 34 | no | 1 | 3 |
| Internet Identity | 1 | 34 | 34 | yes | 1 | 3 |
| ECDSA Signing | 1 | 28 | 28 | yes | 1 | 3 |
| ECDSA Backup | 1 | 28 | 28 | yes | 1 | 3 |
| Bitcoin Canister | 1 | 13 | 13 | no | 1 | 2 |
| European Subnet | 1 | 13 | 13 | yes | 1 | 2 |
| Swiss Subnet | 1 | 13 | 13 | yes | 1 | 13 |
| Application Subnet | 31 → 51 | 13 | 403 → 663 | no | 1 | 2 |
| Reserve Nodes Gen1 | | | 100 | | | |
| Reserve Nodes Gen2 | | | 20 | | | |
| Total | | | 763 → 1023 | | | |
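
The changed cells in the diff are consistent with simple arithmetic on the table (each application subnet has 13 nodes):

```python
# Sanity check of the diffed totals in the target-topology table above.
APP_SUBNET_SIZE = 13

old_app_total = 31 * APP_SUBNET_SIZE   # 403
new_app_total = 51 * APP_SUBNET_SIZE   # 663

other_subnets = 43 + 34 + 34 + 34 + 28 + 28 + 13 + 13 + 13   # non-application subnets: 240
reserve = 100 + 20                                            # Gen1 + Gen2 reserve nodes

print(other_subnets + old_app_total + reserve)   # 763  (previous total)
print(other_subnets + new_app_total + reserve)   # 1023 (new total)
```
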
2 Likes

I think that’s a separate topic. Subnet recovery would have to be executed very quickly and by people who know what they are doing. Getting hold of different node providers could be tough: people are in different time zones and are not constantly available anyway. Plus, many node providers are not that technically inclined. There was a theoretical exercise about this with a crashed subnet some time ago, as I recall. I’m not sure how long it took for all node providers to execute those recovery instructions (it was not fast, to say the least), or whether they all even completed it in the end.

2 Likes

I agree this is a sensible and reasonable approach to handling subnet recovery coordination. I support implementing this solution.

1 Like

My understanding is that this is why DFINITY always has to submit a proposal to enable DFINITY engineers to SSH into the subnet nodes to facilitate recovery: no other NP could do it themselves, given that:

  • Many (perhaps all) lack the know-how
  • Even if they did, their actions wouldn’t be verifiable

In my opinion, these are problems that need solving longer term (there’s already a plan for the second point above). I’d expect that addressing the first point is on the agenda too (longer-term).

1 Like