Please note that some commits may be excluded from this release if they’re not relevant or do not modify the GuestOS image. Additionally, descriptions of some changes may have been slightly modified to fit the release notes format.
To see a full list of commits added since the last release, compare the revisions on GitHub.
Features:
c03395b6a Consensus,Interface: Add zst compatibility for NNS subnet recovery (#711)
a106c0cf3 Crypto,Interface(crypto): Add BIP340 support to secp256k1 utility crate (#844)
85f58e976 Execution,Interface: Add charging for take and load canister snapshot (#811)
7d83b8b09 Execution,Interface: Include snapshot memory usage in canister’s memory usage (#857)
af55769b6 Execution,Interface(IDX): run execution tests on namespace (#848)
f04035051 Execution,Interface: migrate replica to no-op LogVisibilityV2 (#768)
d33903cb1 Interface(PocketIC): query statistics in the management canister (#677)
696829913 Interface,Message Routing: Add functionalities of creating and removing unverified checkpoint markers (#657)
The two SHA256 sums printed above, from (a) the downloaded CDN image and (b) the locally built image, must be identical, and must match the SHA256 sum in the payload of the NNS proposal.
Not sure what Leo Eichhorn’s handle is on this forum (if anyone knows, it would be great if you could tag him). I love that this commit minimises the number of CUP fields that the orchestrator needs to deserialise in order to remain functional. I wonder whether depending on NiDkgId for the subnet ID is absolutely necessary. It would be great if the orchestrator only needed the CUP block height (which is a primitive type, unlike NiDkgId).
Are there not other ways for the orchestrator to obtain the subnet ID that it’s running on (such as asking the registry)? It seems avoiding a dependency on the CUP for retrieving the subnet ID would help with being able to update SSH access in disaster recovery scenarios.
Presumably this was picked up as a GuestOS change due to the change to rs/consensus/src/consensus/payload_builder.rs? This is one of the reasons I think it would be useful for committers to get into the habit of labelling their commits with the affected deployment targets: it would make them mindful of the change logs their commits will show up in, and hopefully prompt them to consider splitting a change into multiple commits to reduce unnecessary noise in certain change logs. For the reader’s reference.
Hi @berestovskyy, are you able to share some of the considerations that went into making this a compile-time flag? In particular, is there a rough idea of how long this flag will be needed?
Unfortunately, that’s the only option we have available now to enable a feature on a specific subnet. Changing the flag at runtime very rarely makes sense, as backward/forward compatibility is only considered during state serialization/deserialization.
Additionally, selecting a new set of flags for a subnet should probably also require a vote. So, IMO, the process of voting for new flags and restarting all subnet nodes doesn’t offer much advantage over simply voting for a specific replica version.
In the linked thread, you mention the difficulty in the hotfix review process. IMO we should always have versions ready with and without the flag, so there should be no need for hotfixes to disable a flag. I might be missing something, though. @sat is the right person to ask release questions.
Regarding the timeline, canister snapshots are internally limited by LSMT performance. We’ll gradually roll out the flag from less busy to more busy subnets while monitoring performance. We’ll see how it goes, but in the best-case scenario it’ll take a few release cycles, i.e. a few weeks.
The proposal scope needed to switch a feature flag on or off shouldn’t need to cover altering anything and everything about the GuestOS. A lower Nakamoto coefficient is much easier to justify and accept for proposals that have a smaller scope.
I think it’s the scope of the needed proposals (election and deployment) that’s an issue during disaster recovery, where time is of the essence and a fix needs to be executed as fast as possible (but without unnecessarily sacrificing governance security).
At the moment DFINITY relies on having sufficient voting power to force a replica version into the registry in order to recover from disasters promptly. I don’t think this will, or should, always be the case in the future. This relates to a recent discussion regarding subnet recovery.
But for how long? Following this approach, shouldn’t we still be electing LSMT-disabled copies of every replica version? Wouldn’t the number of proposals needed quickly explode (with numerous combinations of on/off feature compilations)? Note that in the hotfix discussed in the thread that I linked to, if memory serves, the LSMT feature had already been enabled on all subnets (and this is the only reason an LSMT-disabled replica was no longer being proposed and elected each week).
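To make the combinatorial worry concrete: if every on/off combination of independently toggleable compile-time flags had to stay elected, the number of builds per release grows exponentially. A quick illustrative sketch (the function name is mine, purely for illustration, and this is not how IC release builds actually work):

```python
# Hypothetical illustration: with k independent compile-time feature flags,
# keeping an elected replica build for every on/off combination means
# 2**k builds per release.
def builds_needed(num_flags: int) -> int:
    return 2 ** num_flags

for k in range(1, 6):
    print(f"{k} flags -> {builds_needed(k)} builds per release")
```

Even a handful of flags would multiply the weekly election and deployment proposals well beyond what’s practical.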
However, if we keep the commit 69682991 as reference (which only has the head of the branch rc--2024-08-15_01-30 associated with it, without any tag), the changes listed seem to be OK.
you’re almost certainly right that that’s what caused the change. it would be nice for devs to split this up into multiple commits. we could certainly let devs know which deployments their change affects by running some CI. still, enforcing the splitting up of commits would be difficult, but i strongly agree in principle that such changes should be split. even so, for this particular change, i’d still probably not want to make anyone go through CI twice just because of a 2-line change
Great observation, thanks for the question @Lorimer!
Are there not other ways for the orchestrator to obtain the subnet ID that it’s running on (such as asking the registry)?
For assigned nodes, the source of truth for subnet membership is the latest CUP. Membership changes are part of the protocol and do not happen immediately when the registry changes. In particular, when a node is unassigned by the registry, it might still need to take part in consensus for a while, until all the tasks where its membership is expected are finished and enough new nodes have caught up to the latest state of the subnet. During this time the orchestrator shouldn’t stop the replica or delete its state yet, even if the node is already unassigned according to the registry.
It seems avoiding a dependency on the CUP for retrieving the subnet ID would help with being able to update SSH access in disaster recovery scenarios.
In disaster recovery scenarios where the orchestrator isn’t broken, updating SSH access shouldn’t be a problem. If the orchestrator is broken due to an incompatibility between the latest CUP and the new replica version, then we shouldn’t need SSH readonly access, as the correct subnet state is already present on all nodes after the failed upgrade. In this case we can immediately propose to revert to the previous replica version.
Agreed. But they don’t need to go through CI to establish the deployment targets, do they? Just a Bazel build, which I assume is incremental.
Thanks @Luka. Based on our discussions relating to this, and also discussions with @sat, I’m not seeing a significant downside to requesting that committers provide a comment in the commit message about the deployment targets for their commit (other than a little bit of extra time). If they don’t do it, we’re no worse off than now. If they do, they’ll be more likely to be mindful of the change logs they’ll be affecting, the proposals their changes will show up in, and the systems their change will affect. The only downside is that it may cost them an extra few minutes to confirm their assumptions about deployment targets (surely they should be mindful of these anyway).
It almost sounds like there’s an organisational culture change that would need to take place for DFINITY to get its engineers into the practice of doing this. @basvandijk, do you have thoughts about this?
Thanks a lot for explaining this further @eichhorl, I really appreciate it!
Would it be worth adding a dedicated subnet ID field to the CUP (as a primitive type, much like the block height)? This would further reduce the number of fields that need deserialising (and therefore the fields that the orchestrator is critically dependent on between upgrades) to a minimum. Obviously complex types have their own nested fields that require further deserialisation.
Principals like the subnet ID are currently represented as raw bytes so the conversion would still be fallible even if we had a dedicated field. I think, in contrast to the block payload as a whole, breaking changes to subnet IDs or NiDKG IDs are unlikely and more easily controlled.
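The point about fallibility can be sketched in a few lines. This is an illustrative mock in Python, not the actual IC types or parsing code; the only hard fact it relies on is that IC principals are variable-length byte strings of at most 29 bytes, so even a dedicated raw-bytes field still has to be validated:

```python
MAX_PRINCIPAL_LEN = 29  # IC principals are variable-length, at most 29 bytes

def parse_subnet_id(raw: bytes) -> bytes:
    """Hypothetical parser for a dedicated raw-bytes subnet ID field.

    Even with a dedicated field, the bytes still have to be validated
    against the principal length bounds, so the conversion remains
    fallible rather than a plain primitive read.
    """
    if not 0 < len(raw) <= MAX_PRINCIPAL_LEN:
        raise ValueError(f"invalid principal length: {len(raw)}")
    return raw
```

A block height, by contrast, is a fixed-width integer and deserialises infallibly, which is what makes it attractive as the orchestrator’s minimal dependency.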
it’s not so simple, unfortunately. it would still likely take much longer than a single PR, especially if someone else merges a change in the meantime. then a lot still needs to be rebuilt, and the most expensive parts are the tests, which would most likely trigger for both cases.
it’s not that it would be hard to change this or that it would disrupt developers’ workflows so much. i just doubt it would be valuable enough. i think it would be worse than what we have now because it’s non-enforceable. devs can do “chore”-style commits that don’t modify deployments. they can modify their PRs throughout their lifecycle so that the deployment targets change from what they initially put. it’s not about developers not knowing what they’re changing, it’s about the UX of putting this information into their change. it’s easy to make a mistake in this workflow. if we had 99% accuracy per commit in this workflow (which i think is already extremely high), for a release with 50 commits there’s only a 60% chance that we get accurate release notes.
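The back-of-envelope figure above checks out, assuming each commit is labelled correctly independently with 99% probability. A quick sketch of the arithmetic:

```python
# Probability that every commit in a release is labelled correctly,
# assuming independent per-commit labelling accuracy of 99%.
per_commit_accuracy = 0.99
commits_per_release = 50

p_all_correct = per_commit_accuracy ** commits_per_release
print(f"{p_all_correct:.3f}")  # roughly 0.605, i.e. ~60% as stated above
```

In other words, even a very small per-commit error rate compounds quickly across a whole release.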
i’m confident that the current approach we’re trying to implement will result in highly accurate release notes. we can reliably detect whether a commit changed the GuestOS build (with target-determinator, as opposed to what we’re doing now, which has some drawbacks), and then we’ll be able to label some known packages as not relevant to GuestOS changes. finally, we’ll make all the commits available in the release notes in one way or another so that they can all be reviewed. i think labelling might still not be as accurate as we’d want at times (i.e. whether a change is relevant or not), but we’ll 100% know which commits indeed modify the GuestOS deployment.
i think if we still do not get good results from this, we can try your suggestion and see if we can get some improvements with it.
Thanks @Luka. All sounds good, but in this instance I’m just talking about how to encourage appropriate splitting of commits (to avoid fat commits that add unnecessary noise to proposal change logs).
What I’m really saying here is how do we encourage developers to think about this when they’re committing? Once they’ve committed, it’s committed…