c2ae94708 Execution: Cleanup formatting iDKG keys in tests
ce2222b6c IDX,Networking: add testonly to crypto test utils and adjust the dependents
ce82b5e26 Message Routing,Crypto(crypto): make tests and benchmarks in //rs/certification/… reproducible
9e637ff67 Networking: disable jaeger outside system tests
839b98b82 Networking(http-endpoint): Test that the http/1.1 ALPN header is set.
a5e0b84b2 Node: Update bare-metal-test IP addresses
618441d6b Node,DRE: remove the unused directory /testnet/tests/
Other changes:
1a83813dc Boundary Nodes,Crypto,Execution,Runtime,Networking,Message Routing: add quinn-udp external dep in preparation of the rustls upgrade and bump the versions of some core external deps
The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical, and must match the SHA256 from the payload of the NNS proposal.
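The comparison step itself is mechanical; purely as an illustration (this is not the verification procedure described in the proposal), a minimal Rust sketch of comparing the two image digests could look like the following, assuming the `sha2` crate and placeholder file names:

```rust
// Illustrative sketch only: compute and compare the SHA-256 digests of the
// downloaded CDN image and the locally built image. File names are placeholders.
use sha2::{Digest, Sha256};
use std::{fs::File, io};

fn sha256_hex(path: &str) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    // Sha256 implements std::io::Write, so the file can be streamed into it.
    io::copy(&mut file, &mut hasher)?;
    Ok(hasher
        .finalize()
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect())
}

fn main() -> io::Result<()> {
    let cdn = sha256_hex("update-img-cdn.tar.gz")?;     // placeholder path
    let local = sha256_hex("update-img-local.tar.gz")?; // placeholder path
    println!("CDN image:   {cdn}");
    println!("local build: {local}");
    // Both must also match the SHA256 in the NNS proposal payload.
    println!("match: {}", cdn == local);
    Ok(())
}
```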
Can I ask if you’re able to provide an estimate on how long it’s likely to take to address the XNET issue (and when we should expect the temporary increase to the max number of allowed connections to be reverted)? It seems this increases potential for DoS / resource exhaustion in the meantime.
On a separate note, the changes to improve test reproducibility seem good, but this proposal also seems to include a change that could theoretically reduce test reproducibility: 5183b96ee switches some test code from the fixed, deterministic UNIX_EPOCH to the non-deterministic SystemTime::now(). I'm not sure why this is being done (the commit message, 'change some test code', doesn't make it clearer, and the merge commit is clearer but still focused on the what rather than the why), so I thought I'd go ahead and ask.
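To illustrate the concern in the abstract (this is not the code touched by 5183b96ee, just a hedged sketch of the pattern): a test anchored to a fixed offset from UNIX_EPOCH can assert exact values and behaves identically on every run, while one built on SystemTime::now() depends on when it executes:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical function under test: something derived from a timestamp.
fn expiry(now: SystemTime) -> SystemTime {
    now + Duration::from_secs(300)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn deterministic_fixed_epoch() {
        // Same input on every run, so the expected value is exact and stable.
        let now = UNIX_EPOCH + Duration::from_secs(1_700_000_000);
        assert_eq!(expiry(now), UNIX_EPOCH + Duration::from_secs(1_700_000_300));
    }

    #[test]
    fn wall_clock_based() {
        // Depends on when the test runs; the assertion has to be looser and
        // the run is no longer bit-for-bit reproducible.
        let now = SystemTime::now();
        assert!(expiry(now) > now);
    }
}
```

A common middle ground is to inject the timestamp (as the `expiry` parameter above does), so that production code can use the wall clock while tests stay deterministic.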
You're right that there's not a lot of context provided in the original commit message. The corresponding merge commit that applies the changes to the master branch tends to provide more context (in this case it's this one). As I understand it, there's an issue with inter-subnet communication hitting the current connection limit, so the limit is being temporarily increased until the issue can be resolved. Some further questions:
In a worst-case scenario, what happens if the 1000 limit is also exhausted by the XNET issue? What are the downsides of leaving the limit as it is and fast-tracking the XNET fix (rather than accommodating the issue in the meantime)?
More generally, would it be worth adjusting proposal summaries so that they reference the merge commits instead of the original commits, given that these tend to have far more informative commit messages that the community would benefit from having easy access to (as evidenced by DGDG's question)?
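On the worst-case question, the generic shape of the trade-off (a sketch only, not the actual XNET or HTTP endpoint code) is that an admission cap bounds how many connections, and therefore how many sockets and buffers, a node holds at once; raising the cap raises that worst-case bound accordingly:

```rust
// Generic sketch of a per-node connection cap; the names and numbers are
// illustrative, not taken from the replica code.
use std::sync::atomic::{AtomicUsize, Ordering};

struct ConnectionLimiter {
    max: usize,          // the (temporarily increased) cap
    active: AtomicUsize, // currently admitted connections
}

impl ConnectionLimiter {
    fn try_accept(&self) -> bool {
        // Admit a new connection only while under the cap.
        let prev = self.active.fetch_add(1, Ordering::SeqCst);
        if prev >= self.max {
            self.active.fetch_sub(1, Ordering::SeqCst);
            return false; // cap reached: reject rather than consume more resources
        }
        true
    }

    fn release(&self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    let limiter = ConnectionLimiter { max: 1000, active: AtomicUsize::new(0) };
    // Worst case: something (the XNET issue, or an attacker) keeps `max`
    // connections open at once, each one pinning a socket and buffers
    // until released.
    assert!(limiter.try_accept());
    limiter.release();
}
```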
At the time of this comment on the forum there are still 2 days left in the voting period, which means there is still plenty of time for others to review the proposal and vote independently.
We had several very good reviews of the Release Notes on these proposals by @Zane, @cyberowl, @ZackDS, @massimoalbarello, @ilbert, @hpeebles, and @Lorimer. The IC-OS Verification was also performed by @tiago89. I recommend folks take a look and see the excellent work that was performed on these reviews by the entire CodeGov team. Feel free to comment here or in the thread of each respective proposal in our community on OpenChat if you have any questions or suggestions about these reviews.
The two SHA256 sums printed above, from a) the downloaded CDN image and b) the locally built image, must be identical, and must match the SHA256 from the payload of the NNS proposal.
I’m not even sure what happened here, other than that the LSMT feature (new storage layer) is involved. Here’s some related discussion if you’re interested.
I expect DFINITY will provide a detailed post-mortem when they have time.
(this post was in response to a question from DGDG that has since been deleted, asking if I’d anticipated this issue)
Reviewers for the CodeGov project have completed our review of these replica updates.
Proposal ID: 130399
Vote: REJECT
Full report on OpenChat
NOTE: The CodeGov team followed the lead of DFINITY regarding which of these identical proposals to reject. Thank you for letting us know your plans.
At the time of this comment on the forum there are still 2 days left in the voting period, which means there is still plenty of time for others to review the proposal and vote independently.
Apologies for the weak communication here. There will be a post-mortem, yes, and the communication will be an important part of the discussion in the post-mortem (internally at least).
My view is basically that the communication, especially in cases like this, needs to be superb, and yet the very same engineers who are involved in the technical process of the recovery cannot also be responsible for the communication, since they will be occupied by the recovery.
So we clearly need to assign a separate person to do communication. We’ll have to discuss who can do this and how.
Note that this is my personal view, not the official view of the Foundation.
Communication aside, what happened here was that the performance of the NNS subnet was periodically degrading. It started happening once per day from Tuesday, for a short time each day; then on Thursday it got bad for maybe 20 minutes (technical details will be in the post-mortem); and then on Friday we again thought it would be short, but it got so bad that the subnet couldn't recover for multiple hours because it couldn't even handle regular ingress traffic.
We prepared a new build with the new storage layer disabled, and then I tried to submit the proposal, but the NNS subnet was not responding after the submission, so I ended up submitting two proposals. Only one of them can be executed, due to the invariants on the registry canister, but we preferred to be on the safe side (we haven't recently exercised this invariant in real life, only in tests), so we voted to reject one of them.
Disabling the new storage layer solved the problem. The engineers have already improved the performance of the new storage layer, but since that hasn't been thoroughly tested yet, we'll probably skip the NNS subnet in the next rollout, to be on the safe side. We'll share more updates.
In parallel, we have been having internal discussions about NNS subnet fault tolerance over the last couple of months. This event might result in that work being prioritized. We'll see.