Blessing/Electing Replica Binary (NNS proposal #26833)

Hey folks,

I told the community that I would try to bring more attention (and explanations) to blessing/electing replica binaries.

I want to draw attention to the latest one submitted: Internet Computer Network Status

A few notes:

  1. Once a binary is blessed/elected, there are subsequent NNS proposals to update each subnet, since each subnet requires its own NNS proposal to update its running code. The last subnet updated is the NNS subnet. We usually spread the proposals over the course of days so that things can be rolled back in case of any issue with a binary. Typically a new binary is released every 1-2 weeks.

  2. Currently, the code in an NNS proposal is not visible before the NNS vote, but we are very close to addressing the technical issue that blocks this. Very soon, all NNS proposals will include the code changes along with the binary. The team has been working on this for a while as a high-priority item.

  3. The intent is that whenever a blessing/electing proposal is submitted, I surface it so people can take a look or ask questions. Just because the code is not visible beforehand does not mean I should not try to get as much visibility on these proposals as possible. Indeed, the foundation typically waits to vote on them in case people bring up issues.

  4. I am human, so I am sometimes slow to create these threads. I am trying to get better at this.

Changelog of what is in this new version:

  • Boundary nodes: TLS cert update fixes
  • Boundary nodes: Cache query calls in nginx
  • Consensus: Update consensus ECDSA payload types
  • Consensus: Fix problems when a node is joining a busy subnet
  • Crypto: Implement TLS client handshake using rustls
  • Crypto: Improve client and server certificate verification
  • Crypto: Refactor IDKG API
  • Crypto: Improve Threshold Signature benchmarks
  • Crypto: Initial implementation of MEGa encryption for IDKG
  • Crypto: Update zkcrypto/pairing dependencies
  • Execution: Introduce a per-canister heap delta limit to share the heap delta capacity more fairly between canisters.
  • Execution: Reduce the instruction limit for executing install_code messages on dedicated subnets. It was initially set to a higher value to support specific use cases, but data from recent install_code requests shows it can safely be lowered to the same value as on non-dedicated subnets.
  • Execution: Adjust cost of various system API calls to better reflect the actual amount of work done in these calls.
  • Execution: Use threadpool::ThreadPool instead of rayon for query handling. Rayon shares thread pools under the hood, which can cause deadlocks (see the sketch after this list).
  • Execution: Use the same instruction limit for executing queries as for update messages. Before, queries used a separate constant that could become out of date whenever the limit for update messages was updated.
  • Execution: Reclaim allocated memory on failed message execution. When a message execution fails, we undo the changes it made; previously, we did not adjust the memory accounting accordingly, so subsequent message executions in the round had less memory available than they should have.
  • Execution: Disable the new signal handler on Windows WSL as it is not properly supported.
  • Execution: When executing queries on wasm modules that have not yet been compiled to native code, prevent multiple concurrent compilation processes.
  • Messaging: Track memory consumed by in-flight canister messages
  • Net: Remove legacy api/v1 HTTP endpoints
  • Net: Add rate limiting per connection
  • Net: Add buffering, rate limiting, and a concurrency limit to the ingress ingestion service
  • Net: Share the threadpool between query execution and the ingress message filter
  • Node: Rework wasm compilation cache
  • Node: GuestOS SELinux policy
  • Node: Basic CPU profiling with pprof
  • P2P: ECDSA message updates, artifact pool improvements
  • Various bugfixes and test updates
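
To make the rayon item above more concrete, here is a minimal sketch of running query jobs on a dedicated threadpool::ThreadPool instead of rayon's shared global pool; everything in it (pool size, job shape) is illustrative and not the replica's actual code. The point is isolation: a dedicated pool runs only the query workload, so blocking in one workload cannot starve or deadlock another.

```rust
// Minimal sketch (illustrative, not the replica's actual code): queries run
// on their own dedicated pool instead of rayon's shared global pool.
use std::sync::mpsc::channel;
use threadpool::ThreadPool;

fn main() {
    // Dedicated pool: only query jobs run here, so they cannot be blocked
    // by (or block) unrelated work sharing a global pool.
    let query_pool = ThreadPool::new(4);
    let (tx, rx) = channel();

    for query_id in 0..8 {
        let tx = tx.clone();
        query_pool.execute(move || {
            // Stand-in for executing a query against canister state.
            tx.send(format!("result of query {}", query_id))
                .expect("receiver should be alive");
        });
    }
    drop(tx); // close the channel so the loop below terminates

    for result in rx {
        println!("{}", result);
    }
}
```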
5 Likes
> Boundary nodes: Cache query calls in nginx

Awesome.

2 Likes

Thanks @diegop, I enjoyed reading a technical proposal for the first time. Keep up the good work.

1 Like

Thanks for sharing, and also thanks to the team for the great work!

Query calls have parameters and callers. Does the boundary node take them into account when caching? (If not, I see a risk that one could trick the boundary node into returning a “wrong” response.)

Or is this really about the HTTP gateway on the boundary nodes (i.e. “cache HTTP requests to canisters”)?

Does this introduce a new way for a canister to trap at possibly any point, similar to but distinct from running out of cycles? More details here would be good: for carefully crafted applications it is important to understand where they can possibly trap, and how to avoid that.

1 Like

I am deferring the first question about query caching to @PaulLiu

Regarding the heap delta rate limiting: individual message execution remains the same as before, so there are no new traps. A message can still modify several GiB of memory and thus exceed the limit; in that case the canister will skip subsequent rounds until the rate-limited heap delta catches up with the actual usage.
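
To illustrate with made-up numbers (the constant and the exact catch-up rule below are hypothetical, not the replica's actual configuration): with an allowance of 256 MiB of heap delta per round, a message that dirties 3 GiB puts the canister roughly 12 rounds' worth of delta ahead, so it sits out about that many rounds before executing again.

```rust
// Illustrative sketch only: the constant and the catch-up rule are made up
// for this example; the real values live in the replica's configuration.
const HEAP_DELTA_PER_ROUND: u64 = 256 * 1024 * 1024; // hypothetical allowance

/// Rounds until the rate-limited heap delta catches up with `delta_bytes`
/// dirtied by a single (uninterrupted) message execution.
fn rounds_to_catch_up(delta_bytes: u64) -> u64 {
    // Ceiling division: a partial round of debt still has to be waited out.
    (delta_bytes + HEAP_DELTA_PER_ROUND - 1) / HEAP_DELTA_PER_ROUND
}

fn main() {
    let dirtied = 3u64 * 1024 * 1024 * 1024; // message modified 3 GiB
    println!("{} rounds to catch up", rounds_to_catch_up(dirtied)); // 12
}
```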

3 Likes

The caching is for query responses. It uses most fields in the request body, except the expiry and the nonce, to build the cache key, which avoids giving wrong responses.
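
For readers curious what that could look like, here is a rough sketch of deriving a cache key from a query request (the field names follow the IC HTTP interface's query call format; the hashing scheme is my own illustration, not the boundary node's actual implementation):

```rust
// Sketch only: hash the fields that identify a query, leaving out
// ingress_expiry and nonce so retries of the same query hit the same
// cache entry. Uses the `sha2` crate.
use sha2::{Digest, Sha256};

struct QueryRequest {
    canister_id: Vec<u8>,
    method_name: String,
    arg: Vec<u8>,
    sender: Vec<u8>,
    ingress_expiry: u64,    // deliberately NOT part of the cache key
    nonce: Option<Vec<u8>>, // deliberately NOT part of the cache key
}

fn cache_key(req: &QueryRequest) -> [u8; 32] {
    let mut hasher = Sha256::new();
    // Length-prefix each field so concatenated fields cannot collide.
    for field in [
        &req.canister_id[..],
        req.method_name.as_bytes(),
        &req.arg[..],
        &req.sender[..],
    ] {
        hasher.update((field.len() as u64).to_be_bytes());
        hasher.update(field);
    }
    hasher.finalize().into()
}

fn main() {
    let req = QueryRequest {
        canister_id: vec![0x01],
        method_name: "get_balance".into(),
        arg: vec![],
        sender: vec![0x04],
        ingress_expiry: 0, // ignored by cache_key
        nonce: None,       // ignored by cache_key
    };
    println!("cache key: {:02x?}", cache_key(&req));
}
```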

2 Likes

Thanks! What is the cache expiry time? (If it is too long, clients validating certificates in the response, e.g. certified variables, might think they are the victim of a replay attack.)

2 Likes

It is 1 second. I wrote more about this as an update in another thread: High User Traffic Incident Retrospective - Thursday September 2, 2021 - #49 by PaulLiu
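
On the replay concern from earlier in the thread: with such a short cache lifetime, a client can simply bound how stale a certified response may be. A minimal sketch of such a freshness check (the names and the threshold are hypothetical, not taken from any IC SDK):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical client-side check: reject certified responses whose
/// certificate timestamp (nanoseconds since the epoch) is older than
/// `max_age_ns`. With a 1-second boundary-node cache, a threshold of a
/// few seconds leaves room for clock skew without accepting stale replays.
fn is_fresh(cert_time_ns: u64, max_age_ns: u64) -> bool {
    let now_ns = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before 1970")
        .as_nanos() as u64;
    // saturating_sub: a certificate slightly "in the future" due to clock
    // skew counts as age zero instead of underflowing.
    now_ns.saturating_sub(cert_time_ns) <= max_age_ns
}

fn main() {
    let now_ns = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64;
    let cert_from_half_a_second_ago = now_ns - 500_000_000;
    println!("{}", is_fresh(cert_from_half_a_second_ago, 5_000_000_000)); // true
}
```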

2 Likes

Great, thanks, that sounds good and harmless 🙂

1 Like