Proposal to add capabilities for emergency upgrades of governance canister via node owner/provider proposals

Summary

All Blockchains work on consensus. The typical protocol upgrade mechanism (e.g. BTC, ETH) is that projects’ respective communities convince node providers to agree to perform a certain protocol upgrade at a certain point in time. This is typically done by a lot of conversations and off-chain agreement.

One of the things that makes the Internet Computer adaptable is that it can upgrade itself on-chain by allowing neuron holders to submit and vote on proposals submitted to the governance canister (which resides in the NNS subnet).

That means that changes to the Internet Computer happen via proposals submitted and executed via the governance canister - this requires over 50% of neuron voting power (token holders, essentially). This is completely on-chain consensus.

But what happens if the governance canister is broken or has a bug? How can it be upgraded to fix itself if it cannot accept or execute proposals? Who Watches the Watchmen?

In this case, the Internet Computer has to fall back to the typical blockchain mechanism for upgrades: the community has to agree on a new binary and convince a supermajority of node providers to deploy it. This can be a heavy and slow process and it is not auditable as cleanly as voting on proposals of the governance canister.

This proposed solution is to create an “emergency fallback” mechanism in case the governance canister is ever down, but not fall back all the way to “asking a supermajority node providers to deploy a new version of the binary manually.” The proposed solution is a mechanism to have an alternative proposal submission, voting, and execution exclusively to update just the Wasm code of the governance canister in case it is ever down that collects votes from node providers instead of neurons (the latter being impossible since the governance canister is down) and requires ⅔ + 1 of nodes to agree

Status

Discussing

Key people involved

Johan Granstrom (@johan ), David Ribeiro Alves, Lara Schmid

Pros and Cons

The pros and cons of the feature being proposed:

Pros:

  • There would be an auditable on-chain trail of emergency proposals that use this mechanism.
  • This solution would affect the upgrading of the governance canister only, not any other parts of the network
  • This solution would require the same Byzantine Fault Tolerance (BFT) security assumptions: at least ⅔ of nodes in the NNS subnet need to agree. This is a higher bar than the simple majority of regular proposals and the same BFT assumptions the IC holds for consensus.
  • This solution is meant to be used in case of emergencies so the IC community does not have to fall back all the way to the more basic method of convincing nodes and manual upgrading of nodes. So this is an easier, safer, and faster safety net than which currently exists.

Cons

  • This design solution would reduce the friction for node providers to update the governance canister. One could see the current pain of having node providers agree to update binaries as beneficial.

Basic Questions

1. Has the governance canister ever been down so nodes needed to manually agree to update the governance canister (⅔ node providers agreeing)?

a. Yes, only once. The first week after Genesis, there was a bug where NNS had reached a max number of outstanding proposals, and was not clearing the backlog. This led to an Incident report and various bugs reported in the forums and socials. The NNS was ultimately updated by asking node providers to manually upgrade the software running. Because it is a decentralized system, the Foundation could only ask node providers to do it for the sake and health of the network.

b. Because it would require at least ⅔ of the nodes in the NNS subnet to agree (which is the security assumption of the entire network), this manual upgrade fell within the existing security parameters.

2. Can nodes collude to take my ICP?

a. This change would update the Wasm of the governance canister, not any other canister (such as Ledger Canister) or part of the network.

b. The harsh truth about all consensus protocols is that all blockchains (including the Internet Computer) have the property that if a supermajority of nodes agree on the state of the network, then that is the state of the network. In practice, both game theory and the increasing number of independent parties make this extremely improbable. Though the Ethereum blockchain’s state was famously rolled back after the DAO hack, this has only happened once, and at a high cost. In this case, it would be a supermajority of nodes in the NNS subnet.

c. The IC minimizes this risk by maximizing the number of independent node providers. The IC is still young and will continue to add new independent node providers over time.

3. Why does the IC community need this now?

a. This is a relatively easy change to make and an action item from our lesson at Genesis where it was very hard to update the network. This change will make the network more resilient. The governance canister is too important to risk making it difficult to upgrade. If it is down, all other proposals and upgrades stop.

4. Will this feature take a lot of time better spent elsewhere?

No, it is fairly simple, and as part of the lesson from 3.5 months ago, DFINITY foundation engineers already have the implementation done, but not deployed yet.

5. What is the DFINITY Foundation asking the community?

a. The Foundation is asking the community to engage in this thread, ask questions and provide feedback. The DFINITY Foundation will also schedule a community conversation on the week of September 27 (which will be recorded). The intent is to submit an NNS motion proposal to gauge the community’s reaction. If the motion passes, then the Foundation will submit an NNS proposal with the actual code change. If the motion fails, it’s back to the drawing board.

b. Ask questions on the proposed mechanism

6. Has this feature been security vetted?

a. This feature has been vetted by the NNS team and the Security team at DFINITY.

8 Likes

We want lots of eyes on this, so I will tag some folks I have seen very actively involved on some recent threads I have made: @lastmjs @skilesare @paulyoung @nomeata @wang @jzxchiang @mparikh @dpdp

4 Likes

On the face of it, this is a worthwhile (and necessary) goal.

Given that

(i) our human history is replete with “man made” emergencies that usurp power and have given rise to not-nice “government” structures

(ii) decisions being made by “elitist” folks (such as the federal reserve) without regard of the pain on the ground

In my Givens, I am not trying to equate the CURRENT state at IC to the dire things that have happened in the real world. I am trying to see how to prevent these on IC in the future.

A. What constitutes an emergency? Who determines that it is an emergency? Ordinarily we can rely on the voting mechanism provided through NNS(& the governance canister ). But, obviously not in this case.

B. Not all node providers are created equally (different node providers have different number of nodes). What if certain malevolent node providers hold the votes so that the governance canister cannot be updated without “paying for update”?

C. The node providers have no inherent stake in ICP. Sure they get paid in ICP & sure they have to setup infrastructure. But that portion of cost is miniscule as compared to the current market cap of ICP (& will be even less in the future). Aren’t we providing a disproportionate influence to the node providers?

D. How does the voting (even with node providers) actually occurring? What do they need to know in order to vote?

These questions are tough. They are intertwining both process and security. I don’t claim to have answers. But only through thoughtful insights will we come out at the other end.

Perhaps the NNS team and the security team could also chime in?

5 Likes

I may have missed it but I didn’t see any details about implementation. I understand the end goal, but how would this mechanism actually work? If the governance canister is down then what canister is providing this fallback mechanism?

2 Likes

You need something like this. “Code as law” only goes so far. The free world generally operates with a layer of common law that allows for a lot of wiggle room when the real world breaks outside the boundaries of the written law. This has positive and negative results but is really effective in times of high volatility.

Axon might be a good solution here except that if the governance canister being down is affecting the whole network, that might not be a good solution. Maybe there is a plan A that uses Axon and plan B that uses Ethereum or some other chain in an emergency’s emergency.

2 Likes

@mparikh Hi, I’m the team lead for the NNS team. Hopefully I can shed some light on some of your questions:

A. What constitutes an emergency? Who determines that it is an emergency? Ordinarily we can rely on the voting mechanism provided through NNS(& the governance canister ). But, obviously not in this case.

An “emergency” in this case has a very strict definition: The governance canister is unavailable for receiving/voting/executing proposals due to some factor (bug, DOS attack etc) and thus we can’t rely on the normal, neuron based, upgrade procedure to solve the problem. It shouldn’t be used in any other case.

B. Not all node providers are created equally (different node providers have different number of nodes). What if certain malevolent node providers hold the votes so that the governance canister cannot be updated without “paying for update”?

The underlying security assumption for most BFT consensus protocols is that at most 1/3 of the nodes are malicious (and malicious has a broad definition, can mean buggy, unavailable, incorrect implementations or actively trying to subvert the protocol). This mechanism leverages the same exact assumptions: It will require 2/3 of the votes are for a proposal before pushing it through. The alternative of having each node manually upgraded by its node provider to run a new binary, which would be an off-chain event, has precisely the same security constraints: 2/3s of nodes would have to be upgraded.

C. The node providers have no inherent stake in ICP. Sure they get paid in ICP & sure they have to setup infrastructure. But that portion of cost is miniscule as compared to the current market cap of ICP (& will be even less in the future). Aren’t we providing a disproportionate influence to the node providers?

Couple of things here:

  • We’d like and will strive to keep increasing the amount of nodes that run the NNS so that this power is even more distributed that it is now.
  • Like for most chains those who run the protocol have the power to change it. This is already true, this mechanism just facilitates that under the precise same security assumptions for a very narrow case (governance canister upgrades) and only on extraordinary circumstances so that: 1) there is an on-chain record of the change 2) the change can be implemented faster and with less risk.

D. How does the voting (even with node providers) actually occurring? What do they need to know in order to vote?

One of the node providers that participates in the NNS subnetwork can submit a special proposal to upgrade the governance canister. Other providers can vote on that proposal. Both of these things are done through the command line.

6 Likes

For some reason, the forum is not letting me update the summary, but wanted to note that David Ribeiro Alves and Lara Schmid have handles @dralves @lara respectively

@skilesare if I understand correctly Axon is meant for distributing the power of one or more neurons across multiple stake holder. It’s great in what it does, but it does still require neurons to be working. The case where neurons are not working, though, is precisely the problem this proposal aims at solving.

1 Like

@LightningLad91 the controllership layout of the NNS canisters is roughly the following.

There is a root canister that controls all the other canisters, thus only this canister can upgrade them. However, currently, this canister does not accept proposals. NNS canister upgrade proposals are submitted to the governance canister which, upon acceptance through neuron voting, tells the root canister to upgrade the given canister. This poses a problem as the governance canister is a single point of failure, if it breaks nothing can be upgraded.

This proposal is to add a special type of proposal with a very limited scope (only to upgrade the governance canister and only submittable by a node provider running a node on the NNS subnet), that can be submitted directly to the root canister. Since governance is broken in this scenario, we can’t rely on neurons to vote, so node providers would vote instead, with the exact same parameters that would be required if they were to force upgrade the ic replica binary instead, but with much less risk and much higher speed. Additionally this is completely auditable because there would be an on-chain record of the proposal and any community member can monitor these proposals since there will be a method to return the current set of pending ones that is open to all.

9 Likes

Thank you for the response. This makes a lot more sense now.

1 Like

The questions are good. It helps emerge the ambiguous or fuzzy areas in my writing. Thank you!

2 Likes

I’m curious, how many independent NNS node providers do we have today, and are nodes machines distributed equally between them? If node providers are the ones voting then I’d like to know the minimum number of node providers required to represent 2/3 of the NNS nodes.

Also, once this special proposal is implemented what prevents NNS node providers from submitting a proposal when there isn’t an emergency?

3 Likes

According to dashboard.internetcomputer.org, cuurently there are 233 nodes with 53 node providers. You would need 78 nodes (33% of 233 nodes) to vote no to prevent an update. Based on my analysis, largest 4 node providers can collude to prevent an update on the governance canister. This should be a matter of concern because the power is concentrated in relatively few hands( ~ 8%).

3 Likes

Does this take into account the fact that not all of those nodes are running the NNS subnet? Would this 2/3 vote only be 2/3 of the nodes running the NNS subnet? That’s much less node providers, if so.

2 Likes

That’s a good point. My current calculations do not take that factor into consideration. It was meant as rough order of magnitude (ROM) calculation.

Also several node providers provide 0(zero) nodes.

2 Likes

If node providers can update the governance canister, they can override NNS decisions and take control of the Internet Computer. If we reduce the friction to do this, we effectively give node providers a veto over NNS decisions that are not in their favour. They probably wouldn’t exercise this “nuclear” option because it would destroy the whole project, but they fact that they could would shift power dynamics. Am I misunderstanding something?

6 Likes

I have a similar question.

Can these emergency proposals only be submitted when the governance canister is down? Will this be enforced in code?

I’m also curious why only node providers that participate in the NNS subnet can submit these emergency proposals. Is it totally random which node providers happen to run nodes assigned to the NNS subnet? Or is there some special requirement / privilege?

6 Likes

Also, what happens when the root canister goes down?

3 Likes

@dralves

Would it be possible to maintain a backup governance canister that runs alongside the primary one? Perhaps the backup canister maintains a known good state so that if the primary fails the node providers can vote to switch control to the backup canister. Then we can use the normal voting process to bring the system back to where it needs to be.

If we have to put faith in the node providers during this sort of emergency I’m okay with that; but, I’d like their options to be very limited. Giving them the ability to propose and implement a change to the governance canister seems extremely risky.

2 Likes

@MalcolmMurray you raise a good point. However, like most blockchains, by being the ones that run the protocol it is already the case that node providers control it, including the NNS. One advantage of doing it this way, for the rest of us that have a stake in the system, is that there is a fully auditable trail of whatever happens, both on-chain, and with a dedicated API that all users can query.

3 Likes