Proposal to elect new release rc--2024-05-09_23-02

Hello there!

We are happy to announce that voting is now open for a new IC release.
The NNS proposal is here: IC NNS Proposal 129696.

Here is a summary of the changes since the last release:

Release Notes for release-2024-05-09_23-02-base (2c4566b7b7af453167785504ba3c563e09f38504)

Changelog since git revision bb76748d1d225c08d88037e99ca9a066f97de496

Features:

  • 1bc0c49ae Consensus(ecdsa): Add height to EcdsaArtifactId
  • c8788db4a Consensus(schnorr): Make MasterPublicKey in EcdsaReshareRequest mandatory
  • 0bc85d7c3 Consensus(schnorr): Generalize pre-signatures in EcdsaPayload
  • 8dcf4b5b9 Consensus(ecdsa): Replace a singular key_transcript with a collection of key_transcripts in the EcdsaPayload
  • 97fac2a61 Consensus: Add instant-based fallback to adjusted notarization delay
  • 65b4a7fcc Execution: Load canister snapshot management types
  • 196a91925 Execution,Message Routing: More accurate canister invariant check
  • 43b540d14 Execution,Runtime: Update canister log metrics
  • 627e4bb56 Execution,Runtime: Add canister log metric for log size
  • 7e94e17cb NNS,Execution(registry): Add generalized ChainKeySigningSubnetList
  • 2ed3ae984 Node: Organize and consolidate rootfs utils #4
  • 699519844 Node: Organize and consolidate rootfs utils #3
  • a7f37fc3b Runtime(fuzzing): differential wasmtime fuzzer for wasm simd determinism
  • de56c1391 Runtime: SIMD: Enable WebAssembly SIMD support

Bugfixes:

  • 9b7f96a26 Crypto: Fix running threshold Schnorr benchmarks using cargo
  • cf407fa2d Node(guest-os): start unprivileged ports from 80 instead of 79 as it is inclusive

Performance improvements:

  • a2461f6f8 Crypto: add fine-grained metrics for private and public IDKG dealing verification

Chores:

  • e4314ca27 Consensus(ecdsa): Remove masked kappa creation for quadruples
  • 84d0e6da5 Consensus: Reduce boilerplate for consensus message conversions
  • 214d3654d Crypto: don’t use the TlsPublicKeyCert internal struct when deriving the node from rustls certs
  • ebe8231ae Crypto: remove obsolete specification to tECDSA
  • 6806c655b Execution,Runtime: remove excessive total_canister_log_memory_usage metric
  • 36d617df7 Interface: Remove duplicate code for consensus message hash
  • f98222194 Message Routing: Add signals end metric
  • ffe09593d Message Routing,Runtime: Some more fine-grained metrics for merges
  • 3bcc668d3 Networking: tracing instrumentation for quic transport and consensus p2p
  • 89d400d7c Networking(http_endpoints): use axum server

Refactoring:

  • 6d36a6b5c Consensus: Merge imports of ic-recovery
  • 9e27a9e72 Consensus: remove unused priorities
  • b60a3024b Crypto: use directly rustls instead of using it via tokio_rustls
  • 6f316c9a6 Crypto: remove CSP layer for sig creation in tECDSA

Tests:

  • c5439d886 Consensus: Check consensus bounds in test framework
  • d0e6ec77b Consensus: Display the state hash of the latest CUP on the orchestrator dashboard
  • 72fe43f3c Consensus: Support named fields in enum variants for exhaustive derive
  • cec83184e Crypto: make some crypto test utils and dependents testonly
  • 9dfa09421 Crypto: split tECDSA/IDKG integration tests
  • 5228aafd8 Interface: Add protobuf round trip tests for some structured enums
  • 3d3850c3c Runtime: SIMD: Add NaN canonicalization test

Other changes:

  • 4e30ba322 Boundary Nodes,Node: Update relocated rootfs references
  • dcabf8592 Node: Updating container base images refs [2024-05-09-0815]
  • 1048c55ae Node: Updating container base images refs [2024-05-08-0632]
  • 235b099e0 Node: Updating container base images refs [2024-05-02-0815]

IC-OS Verification

To build and verify the IC-OS disk image, run:

# From https://github.com/dfinity/ic#verifying-releases
sudo apt-get install -y curl && curl --proto '=https' --tlsv1.2 -sSLO https://raw.githubusercontent.com/dfinity/ic/2c4566b7b7af453167785504ba3c563e09f38504/gitlab-ci/tools/repro-check.sh && chmod +x repro-check.sh && ./repro-check.sh -c 2c4566b7b7af453167785504ba3c563e09f38504

The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical, and both must match the SHA256 sum from the payload of the NNS proposal.

Hello there!

We are happy to announce that voting is now open for a new IC release.
The NNS proposal is here: IC NNS Proposal 129697.

Here is a summary of the changes since the last release:

Release Notes for release-2024-05-09_23-02-storage-layer (9866a6f5cb43c54e3d87fa02a4eb80d0f159dddb)

Changelog since git revision 2c4566b7b7af453167785504ba3c563e09f38504

Features:

  • 9866a6f5c Interface: Enable new storage layer

IC-OS Verification

To build and verify the IC-OS disk image, run:

# From https://github.com/dfinity/ic#verifying-releases
sudo apt-get install -y curl && curl --proto '=https' --tlsv1.2 -sSLO https://raw.githubusercontent.com/dfinity/ic/9866a6f5cb43c54e3d87fa02a4eb80d0f159dddb/gitlab-ci/tools/repro-check.sh && chmod +x repro-check.sh && ./repro-check.sh -c 9866a6f5cb43c54e3d87fa02a4eb80d0f159dddb

The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical, and both must match the SHA256 sum from the payload of the NNS proposal.

Reviewers for the CodeGov project have completed their review of these replica updates.

Proposal ID: 129696
Vote: ADOPT
Full report on OpenChat

Proposal ID: 129697
Vote: ADOPT
Full report on OpenChat

At the time of this comment on the forum there are still 2 days left in the voting period, which means there is still plenty of time for others to review these proposals and vote independently.

We had several very good reviews of the Release Notes on these proposals by @Zane, @cyberowl, @ZackDS, @massimoalbarello, @ilbert, @hpeebles, and @Lorimer. The IC-OS Verification was also performed by @jwiegley, @tiago89, and @Gekctek. I recommend folks take a look at the excellent work performed on these reviews by the entire CodeGov team. Feel free to comment here, or in the thread for each respective proposal in our community on OpenChat, if you have any questions or suggestions about these reviews.

Hello there!

We are happy to announce that voting is now open for a new IC release.
The NNS proposal is here: IC NNS Proposal 129706.

Here is a summary of the changes since the last release:

Release Notes for release-2024-05-09_23-02-hotfix-tecdsa (30bf45e80e6b5c1660cd12c6b554d4f1e85a2d11)

Changelog since git revision 2c4566b7b7af453167785504ba3c563e09f38504

Bugfixes:

  • 30bf45e80 Consensus(schnorr): Revert "Make MasterPublicKey in EcdsaReshareRequest mandatory"

IC-OS Verification

To build and verify the IC-OS disk image, run:

# From https://github.com/dfinity/ic#verifying-releases
sudo apt-get install -y curl && curl --proto '=https' --tlsv1.2 -sSLO https://raw.githubusercontent.com/dfinity/ic/30bf45e80e6b5c1660cd12c6b554d4f1e85a2d11/gitlab-ci/tools/repro-check.sh && chmod +x repro-check.sh && ./repro-check.sh -c 30bf45e80e6b5c1660cd12c6b554d4f1e85a2d11

The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical, and both must match the SHA256 sum from the payload of the NNS proposal.

Hi,

Are you able to share any further information about the event that brought the II and XRC canisters down and/or required them to be taken offline?

I gather that the schnorr commit that made MasterPublicKey mandatory was involved in this issue, given that it’s reverted in this latest release as the only change. Are you able to provide details about how the issue unfolded?

I also have a couple of side questions if that’s okay:

  • I see that uzr34 is often the first subnet to receive a particular version of the IC-OS releases (the version that has the new storage layer feature disabled - for now). Given that this is a critical system subnet, I suspect there’s a good reason that new IC-OS releases roll out to it as a priority (rather than testing the waters first with some less critical subnets). Presumably this is due to the dependency that other subnets have on the uzr34 subnet? If you’re able to clarify, this information would be useful.
  • I noticed that the II canister was upgraded shortly after the incident was resolved (having already deployed a recovery catch-up package). The canister upgrade featured numerous changes. Has that II canister upgrade played a part in getting the uzr34 subnet back on track, or is the timing of that upgrade a coincidence?
  • I’ve noticed that there’s not a second IC-OS hotfix release (one that has the storage layer feature enabled). Presumably that’s because the issue experienced by the uzr34 subnet is not applicable to any of the subnets that have had the new storage layer feature enabled. Is that correct? Any elaboration you’re able to provide about why those other subnets are not susceptible would be useful.
  • A more general question about the relationship between the new storage layer feature on/off proposal pairs: can I ask if anyone knows why the storage layer feature flag has been implemented as a compile-time setting (requiring two different binaries, and therefore two separate IC-OS election proposals for the on/off settings), rather than a runtime or install-time flag? My understanding is that the latter approach would require just one binary and fewer IC-OS election proposals (along with some other benefits).

Thanks

I can take the two questions regarding the storage layer.

We decided to only do one hotfix because only a few subnets are affected, and all of the affected subnets either 1) weren’t supposed to get the feature yet, 2) can wait another week, or 3) have 0 canisters, so the feature doesn’t actually do anything. So there was no point in going through the hassle of a 4th replica version.

As for why it’s a compile-time flag: we do indeed generally prefer runtime flags; in particular, many things can be changed via an NNS proposal. However, that means the flag can flip at any point in a running replica, which in this case would have complicated matters too much. The advantage of a compile-time flag is that it is constant from the start of the replica process onwards. This is a complex feature that affects multiple stateful components of the IC and their interaction, and adding the complexity of it changing at any point was not worth the risk versus the awkwardness of having two builds for a bit. Note that the fact that we keep the feature branch going across multiple weeks is another consequence of us being mindful of the feature’s complexity and its associated risks.
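
For readers curious what this trade-off looks like in code, here is a minimal Rust sketch (purely illustrative: the lsmt feature name and Config type are hypothetical, not the replica’s actual code). A compile-time flag is fixed when the image is built, so flipping it requires building and electing a second image, while a runtime flag needs only one binary but is read from mutable configuration:

// Illustrative sketch only; names are hypothetical, not the replica's actual code.

// Compile-time flag: baked into the binary at build time
// (e.g. cargo build --features lsmt), so enabling or disabling it means
// producing a separate image, but the value can never flip mid-run.
#[cfg(feature = "lsmt")]
const LSMT_ENABLED: bool = true;
#[cfg(not(feature = "lsmt"))]
const LSMT_ENABLED: bool = false;

// Runtime flag: a single binary reads the setting from configuration
// (for example, something delivered via an NNS proposal), which means the
// value could in principle change at any point while the replica is running.
struct Config {
    lsmt_enabled: bool,
}

fn lsmt_enabled_at_runtime(config: &Config) -> bool {
    config.lsmt_enabled
}

fn main() {
    let config = Config { lsmt_enabled: false };
    println!(
        "compile-time flag: {}, runtime flag: {}",
        LSMT_ENABLED,
        lsmt_enabled_at_runtime(&config)
    );
}

This is also why the storage-layer on/off versions come as proposal pairs: each setting of the compile-time flag is a distinct image that has to be elected separately.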

Are you able to share any further information about the event

Yes, we’ll do a proper post-mortem as always

  • I see that uzr34 is often the first subnet to receive a particular version of the IC-OS releases (the version that has the new storage layer feature disabled - for now). Given that this is a critical system subnet, I suspect there’s a good reason that new IC-OS releases roll out to it as a priority (rather than testing the waters first with some less critical subnets). Presumably this is due to the dependency that other subnets have on the uzr34 subnet? If you’re able to clarify, this information would be useful.

It’s often one of the first because that way we have more time between upgrading ECDSA subnets.

  • I noticed that the II canister was upgraded shortly after the incident was resolved (having already deployed a recovery catch-up package). The canister upgrade featured numerous changes. Has that II canister upgrade played a part in getting the uzr34 subnet back on track, or is the timing of that upgrade a coincidence?

This was unrelated

  • I’ve noticed that there’s not a second IC-OS hotfix release (one that has the storage layer feature enabled). Presumably that’s because the issue experienced by the uzr34 subnet is not applicable to any of the subnets that have had the new storage layer feature enabled. Is that correct? Any elaboration you’re able to provide about why those other subnets are not susceptible would be useful.

The problem only appears on subnets that hold a threshold ECDSA key, and it’s fine to not have those subnets run the new storage layer this week, so we only propose a single hotfixed replica version without the new storage layer for those four subnets.

Thanks Stefan, this makes sense. I appreciate the time you took to break it down for me. Thanks again

Thanks Manu, I really appreciate your answer. I’m mostly following. Are you able to elaborate a little on maximising the time between upgrading ECDSA subnets? For example, assuming Dfinity avoided deploying IC-OS updates to system subnets until later in the rollout schedule (instead deploying to less critical application subnets first), what would the downsides be? Currently I’m not sure how many application subnets hold a threshold ECDSA key, but I’m actively working on acquiring a working knowledge of this sort of info. Thanks again.

The two ECDSA keys that ICP holds (Secp256k1:key_1 and Secp256k1:test_key_1) are each held on two different subnets: one is actively signing and one is just holding a copy. This means that if something bad happens to one subnet, the other subnet would still have a copy and is able to reshare the key to other subnets. But suppose there were some very bad upgrade and we upgraded both the backup and the signing subnet at the same time; then we would jeopardize the ECDSA key. So to be super cautious we typically spread those upgrades out.

The subnets holding ECDSA keys are:

  • uzr34 (backup key_1)
  • pzp6e (signing key_1)
  • fuqsr (backup test_key_1)
  • 2fq7c (signing test_key_1)

Thanks a lot for going into further details. This is really helpful info. It looks like fuqsr (an application subnet) typically receives IC-OS deployments after the same version has already been deployed to uzr34 (a system subnet, which you’d expect to have a bigger impact if its canisters go offline). Out of curiosity, why not deploy to the less critical subnet first (to reduce the blast radius of unanticipated issues)? It doesn’t seem like this would prevent the deployments from being spaced out as you mentioned (which maximises the chances of spotting an issue before the IC-OS version has been deployed to both a backup and corresponding signing subnet). I suspect I’m missing something still, but that’s why I’m asking :slightly_smiling_face:

No, you’re not missing anything. I think this was just an oversight and we’ll change that now

Okay, thanks Manu. Out of interest, does the rollout schedule for new IC-OS releases tend to be planned up front?

It would be great if the rollout schedule (or just the planned subnet release order) could be included in the IC-OS election proposal description and/or the associated forum post (or at least a list of the subnets that are planned to receive the new release first). This would allow more eyes to be cast on the specifics of how an IC-OS release will be handled. Providing this information with the IC-OS election proposals would also make it available in advance of the deployment proposals appearing; those are numerous and typically aren’t open for long, sometimes as little as a few seconds.

Thanks for all the info shared relating to this incident. I have a general question about IC-OS version testing. Does Dfinity have a mechanism in place to sync their testnet with mainnet, to ensure representative testing? If so, are there limitations on how much the two environments can be synced and/or how frequently? I’m asking mostly out of curiosity.

That’s a great idea @Lorimer. We sadly don’t do that, and are unlikely to do it. Here is an explanation why.

The first reason is that the IC Mainnet state is considered sensitive and confidential. Like all node providers, DFINITY could break into the nodes (since we don’t have SEV-SNP enabled yet) to access subnet state(s) and copy the subnet state to some other machines. However, this wouldn’t be seen in a positive light either internally or in the community, since we should be setting an example here.

The second reason is that the subnet state can be pretty big (e.g. hundreds of GB), and copying over subnet states could take a very long time, from hours to days, additionally slowing down the rollout process for new versions.

As part of the post-mortem discussion, we will review some other options for improving the success rate of subnet upgrades, and we’ll share those with the community and ask for feedback before we start the work.

Thanks @sat. Presumably the privacy concerns don’t apply to system subnets, which also have significantly fewer canisters (5 or 13, instead of many thousands), so size constraints shouldn’t be too prohibitive either. Am I still missing something?

We sadly don’t do that, and are unlikely to do it

Would Dfinity not consider doing this just for system subnets (which are obviously the most critical)? Also, what about subnet config/payload state (as opposed to canister state) - presumably this could be synced without running into size or privacy concerns regarding application subnets?

Anything that can be done to automatically keep tests relevant and effective seems worth pursuing. I hope you don’t mind all the questions. I learn a lot from the answers :slightly_smiling_face:

Actually the privacy concerns apply even more to the system subnets. For instance, on the Internet Identity subnet a malicious actor could examine the canister heap and make some analyses and guesses about the authentication of some user accounts. It’s not extremely dangerous, but it wouldn’t be desirable either. Internet Identity is used by many other canisters and subnets, so I suppose a fair amount of data could be mined.

For the NNS subnet, one could be concerned about the neuron info. Again, not overly concerning, but it’s a risk.

What’s being discussed now as potential strategic activities are automatic subnet rollback in case an upgrade fails, and maybe read-only (canary) nodes that could be upgraded to the new version before the rest of the subnet. Both approaches have pros and cons, and both have a development cost, so we’ll have to evaluate them carefully, considering the other high-priority development that engineers are working on.

Thanks for the additional explanation @sat, I’m following now. I like the sound of the other ideas that are floating around. Thanks for sharing

Hi @stefan.schneider, thanks again for the information that you’ve shared regarding the LSMT feature. The NNS incident yesterday made me revisit this post. I gather that the NNS subnet (tdb26) was the only subnet running GuestOS version b9a0f18 when the incident occurred, and that the hotfix rollout (election and deployment) is essentially the same GuestOS version but with the LSMT feature switched off (which the NNS is now running).

Getting the hotfix in obviously required two proposals, because a suitable GuestOS binary didn’t already exist in the registry. Every time a GuestOS election gets pushed through as a hotfix without community scrutiny and voting, it adds fuel to the fire that the FUDers like to feed (i.e. that the IC isn’t really decentralised and it’s all just theatre). But of course, this needed fixing right away. Theoretically, if the LSMT feature had been a runtime flag, I gather there would have been no need for an election proposal, nor a GuestOS deployment proposal. Only a subnet config update proposal would have been needed as a hotfix (the scope of which seems smaller).

My understanding from your previous response is that a runtime flag was considered, but avoided due to the dangers of it flipping during the course of a replica’s execution. Doesn’t this depend entirely on how the GuestOS reads the config? Couldn’t it be implemented such that the GuestOS reads certain runtime config only at startup? An NNS function that supports updating the subnet config and restarting the replica could then have been used to deliver this hotfix (without any danger of the LSMT feature flipping during the course of execution).
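
To illustrate the "read certain runtime config only at startup" idea, here is a minimal Rust sketch (the names and the config-loading step are hypothetical, not the replica’s actual code or the actual NNS mechanism):

use std::sync::OnceLock;

// Flags captured exactly once, when the replica process starts. Later
// changes to the underlying config (e.g. via an NNS proposal) would only
// take effect after a restart, so the value cannot flip mid-run.
struct StartupFlags {
    lsmt_enabled: bool,
}

static FLAGS: OnceLock<StartupFlags> = OnceLock::new();

fn init_flags_at_startup(lsmt_enabled: bool) {
    // Ignore the error if the flags were somehow already initialised.
    let _ = FLAGS.set(StartupFlags { lsmt_enabled });
}

fn lsmt_enabled() -> bool {
    FLAGS.get().map(|f| f.lsmt_enabled).unwrap_or(false)
}

fn main() {
    // In a real replica this value would come from the subnet config or
    // registry at boot; it is hard-coded here purely for illustration.
    init_flags_at_startup(false);
    println!("LSMT enabled for this process lifetime: {}", lsmt_enabled());
}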

I’m asking only to learn a little more about IC infrastructure, and hopefully to stimulate conversation about how to make certain hotfixes smoother and more palatable for a wider audience.

On a related note, I gather that if the storage limit had been increased already then switching off LSMT could have been riskier (resulting in performance issues). Are you able to share when you’re planning to increase the storage limit (hopefully not for a while, so there’s time for any other lurking issues to surface)? Will it be increased in one fell swoop, or in small increments?

Thanks in advance, your insights are always appreciated :pray:
