The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical, and must match the SHA256 from the payload of the NNS proposal.
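For anyone who wants to script that comparison, here’s a minimal sketch in Rust (the image path and proposal hash are placeholders, and the `sha2` and `hex` crates are assumed; in practice this is normally done with command-line tools):

```rust
// Illustrative only: compare the SHA256 of a downloaded/locally built
// update image against the hash from the NNS proposal payload.
// Assumes the `sha2` and `hex` crates; the path and hash are placeholders.
use sha2::{Digest, Sha256};
use std::fs;

fn sha256_hex(path: &str) -> std::io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(hex::encode(Sha256::digest(&bytes)))
}

fn main() -> std::io::Result<()> {
    let proposal_hash = "<release_package_sha256_hex from the proposal payload>";
    let local = sha256_hex("update-img/update-img.tar.gz")?;
    println!("local:    {local}");
    println!("proposal: {proposal_hash}");
    assert_eq!(local, proposal_hash, "hashes must match before voting to adopt");
    Ok(())
}
```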
At the time of this comment on the forum there are still 2 days left in the voting period, which means there is still plenty of time for others to review the proposal and vote independently.
We had several very good reviews of the Release Notes on these proposals by @Zane, @cyberowl, @ZackDS, @massimoalbarello, @ilbert, @hpeebles, and @Lorimer. The IC-OS Verification was also performed by @jwiegley, @tiago89, and @Gekctek. I recommend folks take a look and see the excellent work that was performed on these reviews by the entire CodeGov team. Feel free to comment here or in the thread of each respective proposal in our community on OpenChat if you have any questions or suggestions about these reviews.
Are you able to share any further information about the event that occurred that brought the II and XRC canisters down and/or required them to be taken offline?
I gather that the schnorr commit that made MasterPublicKey mandatory was involved in this issue, given that it’s reverted in this latest release as the only change. Are you able to provide details about how the issue unfolded?
I also have a couple of side questions if that’s okay:
I see that uzr34 is often the first subnet to receive a particular version of the IC-OS releases (the version that has the new storage layer feature disabled - for now). Given that this is a critical system subnet, I suspect there’s a good reason that new IC-OS releases roll out to it as a priority (rather than testing the waters first with some less critical subnets). Presumably this is due to the dependency that other subnets have on the uzr34 subnet? If you’re able to clarify, this information would be useful.
I noticed that the II canister was upgraded shortly after the incident was resolved (having already deployed a recovery catch-up package). The canister upgrade featured numerous changes. Has that II canister upgrade played a part in getting the uzr34 subnet back on track, or is the timing of that upgrade a coincidence?
I’ve noticed that there’s not a second IC-OS hotfix release (one that has the storage layer feature enabled). Presumably that’s because the issue experienced by the uzr34 subnet is not applicable to any of the subnets that have had the new storage layer feature enabled. Is that correct? Any elaboration you’re able to provide about why those other subnets are not susceptible would be useful.
A more general question about the relationship between the new storage layer feature on/off proposal pairs: can I ask if anyone knows why the storage layer feature flag has been implemented as a compile-time setting (requiring two different binaries, and therefore two separate IC-OS election proposals for the on/off settings), rather than a runtime or install-time flag? My understanding is that the latter approach would require just one binary and fewer IC-OS election proposals (along with some other benefits).
I can take the two questions regarding the storage layer.
We decided to only do one hotfix because only a few subnets are affected, and each of the affected subnets either 1) wasn’t supposed to get the feature yet, 2) can wait another week, or 3) has 0 canisters, so the feature doesn’t actually do anything. So there’s no point going through the hassle of a 4th replica version.
As for why it’s a compile-time flag: we do indeed generally prefer runtime flags. In particular, many things can be changed via an NNS proposal. However, that means the flag can flip at any point in a running replica, which in this case would have complicated matters too much. The advantage of a compile-time flag is that it is constant from the start of the replica process onwards. This is a complex feature that affects multiple stateful components of the IC and their interaction, and adding the complexity of the flag changing at any point was not worth the risk versus the awkwardness of having 2 builds for a bit. Note that the fact that we have multiple weeks of the feature branch is another consequence of us being mindful of the complexity of the feature and its associated risks.
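To make the trade-off concrete, here is a rough sketch (not the actual replica code; the "lsmt" feature name and the config struct are made up for illustration):

```rust
// Rough sketch, not the actual replica code. The "lsmt" feature name and
// SubnetConfig struct are made up for illustration.

/// Compile-time flag: fixed when the binary is built, so switching it on or
/// off means building (and electing) two different GuestOS images.
const LSMT_ENABLED_AT_BUILD: bool = cfg!(feature = "lsmt");

/// Runtime flag: a single binary reads the setting from configuration, but the
/// value could in principle change while the replica is running, which is the
/// extra complexity described above.
struct SubnetConfig {
    lsmt_enabled: bool,
}

fn lsmt_enabled_at_runtime(config: &SubnetConfig) -> bool {
    config.lsmt_enabled
}

fn main() {
    let config = SubnetConfig { lsmt_enabled: false };
    println!("compile-time: {LSMT_ENABLED_AT_BUILD}");
    println!("runtime:      {}", lsmt_enabled_at_runtime(&config));
}
```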
Are you able to share any further information about the event
Yes, we’ll do a proper post mortem as always.
I see that uzr34 is often the first subnet to receive a particular version of the IC-OS releases (the version that has the new storage layer feature disabled - for now). Given that this is a critical system subnet, I suspect there’s a good reason that new IC-OS releases roll out to it as a priority (rather than testing the waters first with some less critical subnets). Presumably this is due to the dependency that other subnets have on the uzr34 subnet? If you’re able to clarify, this information would be useful.
It’s often one of the first because that way we have more time between upgrading ECDSA subnets.
I noticed that the II canister was upgraded shortly after the incident was resolved (having already deployed a recovery catch-up package). The canister upgrade featured numerous changes. Has that II canister upgrade played a part in getting the uzr34 subnet back on track, or is the timing of that upgrade a coincidence?
This was unrelated.
I’ve noticed that there’s not a second IC-OS hotfix release (one that has the storage layer feature enabled). Presumably that’s because the issue experienced by the uzr34 subnet is not applicable to any of the subnets that have had the new storage layer feature enabled. Is that correct? Any elaboration you’re able to provide about why those other subnets are not susceptible would be useful.
The problem only appears on subnets that hold a threshold ECDSA key, and it’s fine to not have those subnets run the new storage layer this week, so we only propose a single hotfixed replica version without the new storage layer for those four subnets.
Thanks Manu, I really appreciate your answer. I’m mostly following. Are you able to elaborate a little bit about maximising the time between upgrading ECDSA subnets? For example, assuming Dfinity avoided deploying IC-OS updates to system subnets until later in the rollout schedule (instead, deploying to less critical application subnets first), what would the downsides be? Currently I’m not sure how many application subnets hold a threshold ECDSA key, but I’m actively working on acquiring a working knowledge of this sort of info. Thanks again.
The two ECDSA keys that ICP holds (Secp256k1:key_1 and Secp256k1:test_key_1) are each on two different subnets: one is actively signing and one is just holding a copy. This means that if something bad happens to one subnet, the other subnet would still have a copy and is able to reshare the key to other subnets. But suppose there were some very bad upgrade and we upgraded both the backup and the signing subnet at the same time; then we would jeopardize the ECDSA key. So to be super cautious we typically spread those upgrades out.
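If it helps to see that constraint written down, here’s a rough sketch (the subnet identifiers are made up) of the rule described above, i.e. never put the signing subnet and the backup subnet for the same key in the same upgrade window:

```rust
// Rough sketch of the scheduling rule described above; subnet IDs are made up.
use std::collections::HashMap;

fn violates_key_safety(
    window: &[&str],                        // subnets planned for one rollout window
    key_holders: &HashMap<&str, [&str; 2]>, // key -> [signing subnet, backup subnet]
) -> bool {
    key_holders
        .values()
        .any(|[signing, backup]| window.contains(signing) && window.contains(backup))
}

fn main() {
    let mut key_holders = HashMap::new();
    key_holders.insert("Secp256k1:key_1", ["subnet_a", "subnet_b"]); // illustrative IDs
    let window = ["subnet_a", "subnet_c"];
    println!("unsafe window: {}", violates_key_safety(&window, &key_holders));
}
```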
Thanks a lot for going into further details. This is really helpful info. It looks like fuqsr (an application subnet) typically receives IC-OS deployments after the same version has already been deployed to uzr34 (a system subnet, which you’d expect to have a bigger impact if its canisters go offline). Out of curiosity, why not deploy to the less critical subnet first (to reduce the blast radius of unanticipated issues)? It doesn’t seem like this would prevent the deployments from being spaced out as you mentioned (which maximises the chances of spotting an issue before the IC-OS version has been deployed to both a backup and the corresponding signing subnet). I suspect I’m missing something still, but that’s why I’m asking.
Okay, thanks Manu. Out of interest, does the rollout schedule for new IC-OS releases tend to be planned up front?
It would be great if the rollout schedule (or just the planned subnet release order) could be included in the IC-OS election proposal description and/or the associated forum post (or at least a list of the subnets that are planned to receive the new release first). This would allow more eyes to be cast on the specifics of how an IC-OS release will be handled. Providing this information with the IC-OS election proposals means it’s available in advance of the deployment proposals appearing; those are numerous and aren’t typically open for long, sometimes as little as a few seconds.
Thanks for all the info shared relating to this incident. I have a general question about IC-OS version testing. Does Dfinity have a mechanism in place to sync their testnet with mainnet, to ensure representative testing? If so, are there limitations with how much the two environments can be synced and/or how frequently? I’m just asking mostly out of curiosity.
That’s a great idea @Lorimer . We sadly don’t do that, and are unlikely to do it. Here is an explanation why.
The first reason is that the IC Mainnet state is considered sensitive and confidential. Like any node provider, DFINITY could break into the nodes (since we don’t have SEV-SNP enabled yet) to access subnet state(s) and copy them to some other machines. However, this wouldn’t be seen in a positive light either internally or in the community, since we should be setting an example here.
The second reason is that the subnet state can be pretty big (e.g. hundreds of GB), and copying over subnet states could take a very long time, from hours to days, additionally slowing down the rollout process for new versions.
As a part of the post mortem discussion, we will review some other options for improving the success rate of subnet upgrades and we’ll share that with the community and ask for feedback before we start the work.
Thanks @sat. Presumably the privacy concerns don’t apply to system subnets, which also have significantly fewer canisters (5 or 13, instead of many thousands), so size constraints shouldn’t be too prohibitive either. Am I still missing something?
We sadly don’t do that, and are unlikely to do it
Would Dfinity not consider doing this just for system subnets (which are obviously the most critical)? Also, what about subnet config/payload state (as opposed to canister state) - presumably this could be synced without running into size or privacy concerns regarding application subnets?
Anything that can be done to automatically keep tests relevant and effective seems worth pursuing. I hope you don’t mind all the questions. I learn a lot from the answers.
Actually, the privacy concerns apply even more to the system subnets. For instance, on the Internet Identity subnet a malicious actor could examine the canister heap and make some analysis and guesses about the authentication of some user accounts. It’s not extremely dangerous, but it wouldn’t be desirable either. Internet Identity is used by many other canisters and subnets, so I suppose a fair amount of data could be mined.
For the NNS subnet, one could be concerned about the neuron info. Again not overly concerning, but it’s a risk.
What’s being discussed now as potential strategic activities is automatic subnet rollback in case an upgrade fails, and maybe read-only (canary) nodes that could be upgraded to the new version before the rest of the subnet. Both approaches have pros and cons and both have development cost, so we’ll have to carefully evaluate them, considering other high-priority development that engineers are working on.
Hi @stefan.schneider, thanks again for the information that you’ve shared regarding the LSMT feature. The NNS incident yesterday made me revisit this post. I gather that the NNS subnet (tdb26) was the only subnet running GuestOS version b9a0f18 when the incident occurred, and that the hotfix rollout (election and deployment) is essentially the same GuestOS version but with the LSMT feature switched off (which the NNS is now running).
Getting the hotfix in obviously required two proposals due to a suitable GuestOS binary not already existing in the registry. Every time a GuestOS election gets pushed as a hotfix without community scrutiny and voting, it adds fuel to the fire that the FUDers like to feed (i.e. that IC isn’t really decentralised and it’s all just theatre). But of course, this needed fixing right away. Theoretically, if the LSMT feature had been a runtime flag, I gather there would have been no need for an election proposal, nor a GuestOS deployment proposal. There would only have needed to be a subnet config update proposal as a hotfix (the scope of which seems smaller).
My understanding from your previous response is that a runtime flag was considered, but avoided due to the dangers of it flipping during the course of a replica’s execution. Doesn’t this depend entirely on how the GuestOS reads the config? Couldn’t it be implemented such that the GuestOS reads certain runtime config only at startup? An NNS function that supports updating the subnet config and restarting the replica could then have been used to deliver this hotfix (without any danger of the LSMT feature flipping during the course of execution).
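Something along these lines is what I have in mind; just a sketch, with a hypothetical load_subnet_config() standing in for however the replica would actually fetch its configuration:

```rust
// A minimal sketch of the "read once at startup" pattern described above:
// the flag is fetched when the replica process starts and is immutable
// afterwards, so it cannot flip mid-execution even if the underlying config
// changes. load_subnet_config() and the flag name are hypothetical.
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy)]
struct StartupFlags {
    lsmt_enabled: bool,
}

static STARTUP_FLAGS: OnceLock<StartupFlags> = OnceLock::new();

fn load_subnet_config() -> StartupFlags {
    // Hypothetical: in reality this might come from the registry or a config file.
    StartupFlags { lsmt_enabled: false }
}

fn startup_flags() -> StartupFlags {
    *STARTUP_FLAGS.get_or_init(load_subnet_config)
}

fn main() {
    // Changing the config after this point has no effect until the replica restarts.
    println!("LSMT enabled: {}", startup_flags().lsmt_enabled);
}
```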
I’m asking only to learn a little more about IC infrastructure, and hopefully to stimulate conversation about how to make certain hotfixes smoother and more palatable for a wider audience.
On a related note, I gather that if the storage limit had been increased already then switching off LSMT could have been riskier (resulting in performance issues). Are you able to share when you’re planning to increase the storage limit (hopefully not for a while, so there’s time for any other lurking issues to surface)? Will it be increased in one fell swoop, or in small increments?
Thanks in advance, your insights are always appreciated.