Thank you @wpb, I very much appreciate you bringing this to our attention. We don’t have the bandwidth to follow all community discussions on all platforms, so discussing the main points in this thread is a good idea.
Background: How the Governance team prepares SNS framework upgrades
Let me first point out that the Governance team does extensive, systematic testing to qualify each new SNS Wasm for a release. Afterwards, the NNS community carefully reviews the respective proposal before deciding if it should be blessed. This ensures that most categories of bugs don’t make it to production. Of course, testing depends on the data, not just the code, which makes it inherently incomplete; as a result, bugs might occur in production.
Before releasing a feature, the team also assesses the worst-case scenario (e.g., how would the system react to failures in each of its components?) and devises a recovery plan to mitigate such unlikely events if they still occur, so we’re well equipped to support you when necessary. In particular, we have tests that ensure that upgrades are never permanently stuck, so even if a bug were to slip through, it could be resolved via a hotfix release (roll forward), which is generally safer than rolling back.
What do you think are the risks of automating updates to the SNS framework for each SNS?
in what ways does the opt-in for automatic SNS framework updates help protect the SNS instead of putting it at risk?
True, it’s possible that a newly introduced bug discovered in one SNS would help another SNS avoid it by not upgrading for some time.
On the other hand, postponing upgrades also delays the delivery of potentially vital fixes to your SNS. For example, last year a subtle, high-impact bug causing memory leaks was discovered in ic-cdk (a library used by the SNS framework), and some SNSs could not quickly upgrade to the latest version that contained the fix. DFINITY could not really help them either, as the community needed to pass a large number of legacy upgrade proposals, each of which upgrades only one SNS framework canister at a time. These days, just one AdvanceSnsTargetVersion proposal would need to pass, but even that adds up to 4 days to the hotfix delivery time.
… every SNS that opts in will be canary release targets that enables DFINITY to find flaws in the upgrade.
I don’t think this is accurate. If a hypothetical flaw isn’t caught during release testing, there’s no guarantee that it would be observed within just a few days of some SNS’s upgrade. For example, a subtle bug was recently triggered in Alice SNS by the interleaved execution of two UpgradeSnsControlledCanister proposals, which blocked SNS Root upgrades. I believe this bug has been around since the beginning, but it took years before someone stumbled upon it in production. In other words, waiting a limited number of days before an SNS decides it’s safe to upgrade does not rule out bugs, and it does not enable DFINITY to reliably discover new flaws.
Is there going to be a deployment cadence to SNS upgrades or will they all happen in parallel?
Not quite in parallel. All SNSs check for available upgrades every hour and, if they have opted in to the automation, advance their SNS target version to the latest available one. After that, each upgrade step is expected to take 2 to 10 minutes.
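To make the sequence above concrete, here is a minimal sketch of the behavior as described in this post, not the actual SNS Governance code. It assumes versions can be modeled as increasing integers, and the names `Sns`, `auto_advance`, `hourly_check`, and `upgrade_step` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sns:
    name: str
    deployed: int       # currently deployed version (simplified to an integer)
    target: int         # target version this SNS is upgrading toward
    auto_advance: bool  # hypothetical flag: opted in to automatic target advancement


def hourly_check(sns: Sns, latest: int) -> None:
    """Runs once per hour: only opted-in SNSs move their target to the latest version."""
    if sns.auto_advance and sns.target < latest:
        sns.target = latest


def upgrade_step(sns: Sns) -> bool:
    """Takes one upgrade step toward the target; returns True if a step was taken."""
    if sns.deployed < sns.target:
        sns.deployed += 1
        return True
    return False


def steps_to_target(sns: Sns) -> int:
    """Runs upgrade steps until the target is reached and counts them."""
    steps = 0
    while upgrade_step(sns):
        steps += 1
    return steps


def estimated_minutes(steps: int) -> Tuple[int, int]:
    """Expected duration range, using the 2 to 10 minutes per step quoted above."""
    return (2 * steps, 10 * steps)
```

For example, an opted-in SNS at version 3 with version 5 available would take two steps, expected to last somewhere between 4 and 20 minutes in total, while an SNS that has not opted in keeps its target until an AdvanceSnsTargetVersion proposal passes.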
concerns about WaterNeuron being first in line for these upgrades
There is concern about automatic updates when WaterNeuron controls more than 2MM ICP.
I understand that this SNS handles a large flow of financial transactions and is thus especially cautious about risks associated with upgrading. Luckily, the WaterNeuron SNS community is not required to opt in; that’s why we made this automation configurable when designing the new upgrade feature.
The new AdvanceSnsTargetVersion proposals enable streamlined upgrades without giving up any control over which upgrades are installed when. Even without opting in for the full automation, the WaterNeuron folks are already using this proposal to simplify their operations.
Perhaps if WaterNeuron could go last in automated updates it would alleviate concerns.
Is there any way that WaterNeuron can opt for an automatic upgrade that happens 4 days after all other SNS projects are automatically updated?
This is not currently supported. In addition, I don’t think it would be a sustainable way to set things up: if SNS-A triggered upgrades only, say, 4 days after SNS-B, nothing would stop SNS-B from setting up a similar rule, resulting in a cyclic dependency.
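The cyclic-dependency concern can be illustrated with a small sketch. This is purely hypothetical, since no such "wait for another SNS" rule exists today. It assumes each SNS could name at most one other SNS it waits for, and `find_cycle` is an illustrative helper, not part of any SNS API.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return a list of SNS names that wait on each other in a loop, or None.

    `waits_for[a] == b` means (hypothetically) "a upgrades only after b has upgraded".
    """
    for start in waits_for:
        seen: List[str] = []
        node = start
        # Follow the chain of "waits for" rules until it ends or repeats.
        while node in waits_for:
            if node in seen:
                return seen[seen.index(node):]
            seen.append(node)
            node = waits_for[node]
    return None
```

With only `{"SNS-A": "SNS-B"}`, there is no cycle and SNS-A simply follows SNS-B. Once SNS-B adds the mirror rule, `find_cycle` returns `["SNS-A", "SNS-B"]`: each waits for the other, so neither would ever upgrade automatically.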
devs involved in the conversation seem to think automatic updates are blind updates that carry high risk.
Please refer them to the Background section above. To summarize:
- Not being on the latest version is risky, as that may slow down the delivery of hotfixes.
- SNS framework upgrades are proposed and blessed by the NNS only after meeting strict security requirements, including code reviews, holistic security reviews for new features, and unit and integration testing.
- Special care is taken to keep upgrades backward compatible; breaking changes are rare, and they should always be discussed with the community developers upfront.
Out of 34 total SNS … 7 have not started considering it (probably because they are too far behind to implement).
This by itself puts those 7 SNSs at risk: if a critical vulnerability were discovered (e.g., in one of the libraries used to implement SNS), we would have no estimate of how long it would take to deliver the hotfix to those SNSs.
Therefore, I believe that the SNSs that are currently behind are those that would likely benefit from automatic target SNS version advancement the most, ironically.
Personally, I don’t really understand why there is even an opt-in option.
While I agree that the automatic target SNS version advancement feature is very useful for most SNSs (otherwise we wouldn’t have prioritized building it), it makes sense that different SNS communities may have different opinions on how best to deliver upgrades for their project. In particular, WaterNeuron seems like a very active voting community, and having to vote on a single additional AdvanceSnsTargetVersion proposal per week likely won’t add significant overhead for their voters.
If the current ways of upgrading the SNS framework are still too inconvenient for someone, I invite the stakeholders to start a separate forum thread describing the exact problem they would like to solve.
In the meantime, I recommend that all SNS teams closely follow the SNS Upgrade Aggregation Thread, in which the Governance team announces the SNS framework upgrades being proposed.
Please let me know if I missed any questions!