Proposal to elect new release rc--2024-08-15_01-30

Hello there!

We are happy to announce that voting is now open for a new IC release.
The NNS proposal is here: IC NNS Proposal 131757.

Here is a summary of the changes since the last release:

Release Notes for release-2024-08-15_01-30-base (6968299131311c836917f0d16d0b1b963526c9b1)

This release is based on changes since release-2024-08-08_07-48-base (94fd38099f0e63950eb5d5673b7b9d23780ace2d).

Please note that some commits may be excluded from this release if they are not relevant or do not modify the GuestOS image. Additionally, descriptions of some changes might have been slightly modified to fit the release notes format.

To see the full list of commits added since the last release, compare the revisions on GitHub.

Features:

  • c03395b6a Consensus,Interface: Add zst compatibility for NNS subnet recovery (#711)
  • a106c0cf3 Crypto,Interface(crypto): Add BIP340 support to secp256k1 utility crate (#844)
  • 85f58e976 Execution,Interface: Add charging for take and load canister snapshot (#811)
  • 7d83b8b09 Execution,Interface: Include snapshot memory usage in canister’s memory usage (#857)
  • af55769b6 Execution,Interface(IDX): run execution tests on namespace (#848)
  • f04035051 Execution,Interface: migrate replica to no-op LogVisibilityV2 (#768)
  • d33903cb1 Interface(PocketIC): query statistics in the management canister (#677)
  • 696829913 Interface,Message Routing: Add functionalities of creating and removing unverified checkpoint markers (#657)
  • 4a6cdcc47 Node: Consolidate manageboot.sh (#819)

Bugfixes:

  • ce460e975 Consensus,Execution,Interface(consensus): Use priority in the payload builder to decide to trigger edge-case (#799)
  • 7fce52639 Consensus,Interface: Make the orchestrator recover from having a broken local CUP (#572)
  • 8d4b09372 Consensus,Interface(IDX): don’t cargo build rocksdb on Linux (#861)
  • fb3b4e0c1 Execution,Interface(execution): Update doc comment (#840)
  • af7db79a8 Execution,Interface: Store the version of the snapshot that was loaded in canister history (#837)
  • f44cbb6bf Interface,Message Routing: Subnet splitting with canister snapshots (#800)
  • 703c513ae Node: disable node exporter netlink metrics collection (#826)

Performance improvements:

  • 21b0354a4 Execution,Interface: Add canister snapshot benchmarks (#849)

Chores:

  • 3a58c7f7d General: tokio upgrade (#889)
  • 40fc451f5 Consensus,Interface: sort dependencies in consensus Cargo.toml files (#832)
  • 58beadb5c Consensus,Interface,Networking(consensus): [Con-1226] only validate open request context (#723)
  • 61508b045 Consensus,Interface,Networking(consensus): only validate one http outcall share per node per request (#700)
  • 9a210aa85 Crypto,Interface(crypto): Address comments from Schnorr security review (#845)
  • 983e13053 Crypto,Interface: sort dependencies in crypto Cargo.toml files (#835)
  • 0f2c72cc5 Execution,Interface: Execute the consensus queue even if the limits are reached (#881)
  • 1bfbe3156 Execution,Interface: Upgrade wasmtime to v23 (#825)
  • 875532045 Execution,Interface: Remove hypervisor_wasm_num_tables metric (#823)
  • 2e21c9674 Execution,Interface: sort dependencies in execution Cargo.toml files (#831)
  • 69a4cee60 Interface: introduce boundary-node-pkg package group (#640)
  • 935615127 Interface: actix upgrade (#896)
  • ca24b5d66 Interface: sort dependencies in Cargo.toml files (#828)
  • 723a554ab Interface,Message Routing: sort dependencies in message routing Cargo.toml files (#834)
  • 971d43772 Interface,Networking: improve consensus manager update handler errors (#790)
  • e2462f7fc Interface,Networking: sort dependencies in networking Cargo.toml files (#833)
  • 32b662e5f Interface,Networking(http_endpoint): put compute-intensive code in /read_state into spawn_blocking (#587)
  • 2225ef32e Node: Reduplicate boundary-guestos files (#940)
  • 51010fbef Node: Remove SetupOS SELinux policy (#900)
  • 9d10a36c4 Node: Update Base Image Refs [2024-08-08-0807] (#827)

Refactoring:

  • 0e0f146d2 Interface: move reserved fields to the bottom of the message definitions (#496)
  • e6d756942 Interface,Networking(quic-transport): Create constants for quic parameters (#909)

Tests:

  • 957689f55 Interface,Message Routing: New test setup for stream handler tests (#564)
  • 206a22fd1 Consensus,Interface(consensus): Only consider summary blocks in CUP exhaustive test (#797)
  • 317257904 Execution,Interface: Enable canister snapshots in tests (#918)
  • 4fa5156b1 Execution,Interface: Make memory test more robust (#812)

IC-OS Verification

To build and verify the IC-OS disk image, run:

# From https://github.com/dfinity/ic#verifying-releases
sudo apt-get install -y curl && \
  curl --proto '=https' --tlsv1.2 -sSLO https://raw.githubusercontent.com/dfinity/ic/6968299131311c836917f0d16d0b1b963526c9b1/gitlab-ci/tools/repro-check.sh && \
  chmod +x repro-check.sh && \
  ./repro-check.sh -c 6968299131311c836917f0d16d0b1b963526c9b1

The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical and must match the SHA256 sum in the payload of the NNS proposal.
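
To double-check the sums by hand, here is a minimal sketch (the file names below are placeholders, not the actual artifact names; use the paths printed by repro-check.sh):

# placeholder file names for illustration; repro-check.sh prints the real paths
sha256sum disk-img.tar.zst        # locally built GuestOS image
sha256sum cdn-disk-img.tar.zst    # image downloaded from the CDN
# both sums must be identical and must equal the hash in the NNS proposal payload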

3 Likes

Hello there!

We are happy to announce that voting is now open for a new IC release.
The NNS proposal is here: IC NNS Proposal 131787.

Here is a summary of the changes since the last release:

Release Notes for release-2024-08-15_01-30-canister-snapshots (1ac5439c6da1aafe8156c667c313344c0245fea3)

This release is based on changes since release-2024-08-15_01-30-base (6968299131311c836917f0d16d0b1b963526c9b1).

Please note that some commits may be excluded from this release if they are not relevant or do not modify the GuestOS image. Additionally, descriptions of some changes might have been slightly modified to fit the release notes format.

To see the full list of commits added since the last release, compare the revisions on GitHub.

Features:

  • 1ac5439c6 Execution,Interface,Networking: Enable canister snapshots

IC-OS Verification

To build and verify the IC-OS disk image, run:

# From https://github.com/dfinity/ic#verifying-releases
sudo apt-get install -y curl && \
  curl --proto '=https' --tlsv1.2 -sSLO https://raw.githubusercontent.com/dfinity/ic/1ac5439c6da1aafe8156c667c313344c0245fea3/gitlab-ci/tools/repro-check.sh && \
  chmod +x repro-check.sh && \
  ./repro-check.sh -c 1ac5439c6da1aafe8156c667c313344c0245fea3

The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical and must match the SHA256 sum in the payload of the NNS proposal.

3 Likes

Thanks for this release @DRE-Team!

I’ve found this to be a particularly interesting commit → fix: CON-1321 Make the orchestrator recover from having a broken loca… · dfinity/ic@7fce526 (github.com)

Not sure what Leo Eichhorn’s handle is on this forum (if anyone knows, it would be great if you could tag him). I love that this commit minimises the number of CUP fields that the orchestrator needs to deserialise in order to remain functional. I wonder whether depending on NiDkgId for the subnet ID is absolutely necessary. It would be great if the orchestrator only needed the CUP block height (which is a primitive type, unlike NiDkgId).

Are there not other ways for the orchestrator to obtain the subnet ID that it’s running on (such as asking the registry)? It seems avoiding a dependency on the CUP for retrieving the subnet ID would help with being able to update SSH access in disaster recovery scenarios.

2 Likes

Presumably this was picked up as a GuestOS change due to the change to rs/consensus/src/consensus/payload_builder.rs? This is an example of one of the reasons I think it would be useful for committers to get into the habit of labelling their commits with the affected deployment targets, so that they’re mindful of the change logs their commits will show up in, and hopefully consider splitting their changes into multiple commits to reduce unnecessary noise in certain change logs. For readers’ reference.

2 Likes

Looks like there are some branch naming issues causing the above ^

This was noticed by @ZackDS. We’ve been discussing it on OC (and he’s okay with me posting on his behalf). Well spotted Zack! :heart_on_fire:

2 Likes

Hi @berestovskyy, are you able to share some of the considerations that went into making this a compile time flag? In particular, is there a rough idea of how long this flag will be needed?

See here for a recent discussion I had with @stefan.schneider about the LSMT feature flag (regarding concerns I have with compile-time feature flags, particularly during disaster recovery scenarios) → Proposal to elect new release rc--2024-05-09_23-02 - Governance - Internet Computer Developer Forum (dfinity.org)

2 Likes

Hey @Lorimer,

Unfortunately, that’s the only option we currently have to enable a feature on a specific subnet. Changing the flag at runtime very rarely makes sense, as backward/forward compatibility is only considered during state serialization/deserialization.

Additionally, selecting a new set of flags for a subnet should probably also require a vote. So, IMO, the process of voting for new flags and restarting all subnet nodes doesn’t offer much advantage over simply voting for a specific replica version.

In the linked thread, you mention the difficulty of the hotfix review process. IMO we should always have versions ready with and without the flag, so there should be no need for hotfixes to disable a flag. I might be missing something, though. @sat is the right person to ask release questions.
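
To illustrate, here’s a minimal sketch of what “versions with and without the flag” could look like at build time (the feature name is hypothetical, and the actual IC replica is built with Bazel rather than invoked like this):

# hypothetical feature name for illustration only
cargo build --release                                  # replica without the feature
cargo build --release --features canister-snapshots    # replica with the feature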

Regarding the timeline, canister snapshots are internally limited by LSMT performance. We’ll gradually roll out the flag from less busy to busier subnets while monitoring performance. We’ll see how it goes, but in the best-case scenario it’ll take a few release cycles, i.e., a few weeks.

4 Likes

Thanks for the quick response @berestovskyy

The proposal scope needed to switch a feature flag on/off shouldn’t need to cover altering anything and everything about the GuestOS. A lower Nakamoto coefficient is much easier to justify and accept for proposals that have a smaller scope.

I think it’s the scope of the needed proposals (election and deployment) that’s an issue during disaster recovery, where time is of the essence and a fix needs to be executed as fast as possible (without unnecessarily sacrificing governance security).

At the moment DFINITY rely on having sufficient voting power to force a replica version into the registry in order to recover from disasters promptly. I don’t think this will or should always be the case in the future. This relates to a recent discussion regarding subnet recovery.

But for how long? Following this approach, shouldn’t we still be electing LSMT-disabled copies of every replica version? Wouldn’t the number of proposals needed quickly explode (with numerous combinations of on/off feature compilations)? Note that in the hotfix discussed in the thread I linked to, if memory serves, the LSMT feature had already been enabled on all subnets (and this is the only reason an LSMT-disabled replica was no longer being proposed and elected each week).

Thanks @berestovskyy, this info is much appreciated :pray:

2 Likes

I’ve noticed something weird with proposal 131757.

The title mentions release-2024-08-15_01-30-base, but that’s not the tag corresponding to the commit 6968299131311c836917f0d16d0b1b963526c9b1 included in the title. It looks like the tag release-2024-08-15_01-30-base was mistakenly created on the same commit as the previous release-2024-08-08_07-48-base tag. In this sense, the proposal’s summary title is wrong.

However, if we keep the commit 69682991 as the reference (which only has the head of branch rc--2024-08-15_01-30 associated with it, without any tag), the changes listed seem to be OK.

Is this an error in the CI?

3 Likes

I just saw this comment.

My comment adds a few more details to this issue: Proposal to elect new release rc--2024-08-15_01-30 - #9 by ilbert

4 Likes

will fix the tag

was caused by a copy/paste error: chore: Fix typo in the release-index version · dfinity/dre@7e17729 · GitHub

but ultimately it’s a problem that this tag is not getting repushed after a change in config.

2 Likes

you’re almost certainly right that this is what caused the change. it would be nice for devs to split this up into multiple commits. we could certainly let the devs know which deployments their change is affecting by running some CI. still, enforcing the splitting up of commits would be difficult, but i strongly agree in principle that such changes should be split. even so, for this particular change, i’d still probably not want to make anyone go through CI twice just because of a 2-line change :smiley:

1 Like

You are completely right @ilbert. The copy&paste bug was actually my bad; I apologise. Humans make mistakes. But we also fix some. :smiley:

Alright, the tag should be fixed now on GitHub. PTAL.

2 Likes

Great observation, thanks for the question @Lorimer!

Are there not other ways for the orchestrator to obtain the subnet ID that it’s running on (such as asking the registry)?

For assigned nodes, the source of truth for subnet membership is the latest CUP. Membership changes are part of the protocol and do not happen immediately when the registry changes. In particular, when a node is unassigned by the registry, it might still need to take part in consensus for a while, until all the tasks where its membership is expected are finished and enough new nodes have caught up to the latest state of the subnet. During this time the orchestrator shouldn’t stop the replica or delete its state yet, even if the node is already unassigned according to the registry.

It seems avoiding a dependency on the CUP for retrieving the subnet ID would help with being able to update SSH access in disaster recovery scenarios.

In disaster recovery scenarios where the orchestrator isn’t broken, updating SSH access shouldn’t be a problem. If the orchestrator is broken due to an incompatibility between the latest CUP and the new replica version, then we shouldn’t need SSH read-only access, as the correct subnet state is already present on all nodes after the failed upgrade. In this case we can immediately propose to revert to the previous replica version.

1 Like

Agreed :smiley: But they don’t need to go through CI to establish the deployment targets, do they? Just a bazel build, which I assume is incremental.

Thanks @Luka. Based on our discussions relating to this, and also discussions with @sat, I’m not seeing a significant downside to asking committers to provide a note in the commit message about what the deployment targets are for their commit (other than a little bit of extra time). If they don’t do it, we’re no worse off than now. If they do, they’ll be more likely to be mindful of the change logs they’ll be affecting, the proposals their changes will show up in, and the systems their change will be affecting. The only downside is that it may cost them an extra few minutes to confirm their assumptions about deployment targets (surely they should be mindful of these anyway).
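
For illustration, a hypothetical sketch of such a commit-message convention (the “Deployment-targets” trailer is made up for this example, not an existing DFINITY convention):

git commit -m "feat: add canister snapshot benchmarks" \
           -m "Deployment-targets: GuestOS"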

It almost sounds like there’s an organisational culture change that would need to take place for DFINITY to get its engineers into the habit of doing this. @basvandijk do you have thoughts about this?

Thanks a lot for explaining this further @eichhorl, I really appreciate it!

Would it be worth adding a dedicated subnet ID field to the CUP (as a primitive type, much like the block height)? This would further reduce the fields that need deserialising (and therefore the fields that the orchestrator is critically dependent on between upgrades) to a minimum. Obviously complex types have their own nested fields that require further deserialisation.

Principals like the subnet ID are currently represented as raw bytes, so the conversion would still be fallible even if we had a dedicated field. I think that, in contrast to the block payload as a whole, breaking changes to subnet IDs or NiDKG IDs are unlikely and more easily controlled.

1 Like

it’s not so simple, unfortunately. it would still likely take much longer than a single PR, especially if someone else merges a change in the meantime; then a lot still needs to be rebuilt. and the most expensive parts are the tests, which would most likely trigger in both cases.

it’s not that it would be hard to change this or that it would disrupt developers’ workflows so much. i just doubt it would be valuable enough. i think it would be worse than what we have now because it’s non-enforceable. devs can do “chore” style commits that do not modify deployments. they can modify their PRs throughout their lifecycle so that the deployment targets change from what they initially put. it’s not about developers not knowing what they’re changing, it’s about the UX of putting this information into their change. it’s easy to make a mistake in this workflow. if we had 99% accuracy per commit in this workflow (which i think is already extremely high), then for a release with 50 commits there’s only about a 60% chance (0.99^50 ≈ 0.605) that we get fully accurate release notes.

i’m confident that the current approach we’re trying to implement will result in highly accurate release notes. we can reliably detect if a commit changed the guestOS build (with target-determinator, as opposed to what we’re doing now, which has some drawbacks), and then we’ll be able to label some known packages as not relevant to guestOS changes. finally, we’ll make all the commits available in the release notes in one way or another so that they can all be reviewed. i think labelling might still not be as accurate as we’d want at times (i.e. whether a change is relevant or not), but we’ll 100% know which commits indeed modify the guestOS deployment.
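
as a rough sketch, detection along these lines could use a bazel reverse-dependency query (the target labels below are assumptions for illustration, not the exact IC repo layout):

# which guestOS image targets transitively depend on a crate touched by a commit?
bazel query 'rdeps(//ic-os/guestos/..., //rs/consensus/...)'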

i think if we still do not get good results from this, we can try your suggestion and see if we can get some improvements with it.

2 Likes

Thanks @Luka. All sounds good, but in this instance I’m just talking about how to encourage appropriate splitting of commits (to avoid fat commits that add unnecessary noise to proposal change logs).

What I’m really asking is: how do we encourage developers to think about this when they’re committing? Once they’ve committed, it’s committed…