Hello everyone,
As we get closer to switching over to the new boundary node architecture and proposing to spin up more API boundary nodes under the control of the NNS, we also need to think about disaster recovery and incident handling. Back in March 2022, the NNS adopted the Security Patch Policy and Procedure, which describes how the DFINITY foundation handles security patches to the Internet Computer Platform and the system canisters. This policy also covers the API boundary nodes, but it does not address two important aspects of incident handling:
- How API boundary nodes can be recovered, given that submitting and voting on proposals happens through them;
- How rate-limits and filters can be applied at the API boundary nodes to protect vulnerable parts of the system during an incident response.
In the following, we provide some context on how these aspects have been handled until now and outline the measures we propose going forward for the new boundary node architecture. We look forward to a constructive discussion and to your questions and feedback.
Aspect 1: API Boundary Node Recovery
Context
Fortunately, the IC has so far never experienced a situation in which all boundary nodes were unavailable. Still, a mitigation for this scenario needs to be ready: the IC needs to be able to roll back a broken API boundary node release without having to rely on the API boundary nodes for submitting the proposal and voting on it.
Until now, this has not been an issue, as the boundary nodes were operated by DFINITY and not yet NNS-managed. Hence, any outage of the boundary nodes could be remediated without an NNS proposal and vote. During all incidents so far, it remained possible to vote through the boundary nodes. Going forward, this might not be the case.
Proposed Solution: Allowlisting specific IPs
To retain the ability to update API boundary nodes without relying on the API boundary nodes themselves, we propose allowlisting certain IP addresses in the firewall of the NNS replicas, allowing direct connections to submit proposals and vote independently of the API boundary nodes. DFINITY will submit a proposal to allowlist a few of its IP addresses. This would allow DFINITY to spin up an API boundary node that enables voting in the case of an emergency.
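Purely for illustration, such an allowlist entry could look roughly like the following Rust sketch. All field names and values (including the IP range and the port) are placeholders and do not necessarily match the actual firewall rule format in the IC registry:

```rust
// Illustrative sketch of a firewall allowlist entry as it might be
// proposed to the NNS. All names are simplified assumptions and do not
// necessarily match the registry's actual firewall rule definition.
#[derive(Debug)]
enum Action {
    Allow,
    Deny,
}

#[derive(Debug)]
struct FirewallRule {
    /// Source prefixes permitted to connect directly to NNS replicas.
    ipv4_prefixes: Vec<String>,
    /// Replica ports the allowlisted addresses may reach.
    ports: Vec<u16>,
    /// Whether matching traffic is accepted or dropped.
    action: Action,
    /// Human-readable justification, visible to voters on the proposal.
    comment: String,
}

fn main() {
    // Placeholder values: 192.0.2.0/28 is a documentation-only range and
    // 4444 is a hypothetical replica port, not the real one.
    let rule = FirewallRule {
        ipv4_prefixes: vec!["192.0.2.0/28".to_string()],
        ports: vec![4444],
        action: Action::Allow,
        comment: "Emergency API boundary node operated by DFINITY".to_string(),
    };
    println!("{rule:#?}");
}
```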
Note that, to further minimize the risk of API boundary node updates resulting in a loss of IC access, we also propose rolling upgrades, similar to subnet upgrades (e.g., the NNS subnet is only upgraded a week after the very first subnet).
Aspect 2: Rate-Limiting as part of incident response
Context
In the past, several incidents required applying rate limiting and filtering rules on certain requests at the boundary nodes to protect vulnerable canisters or to reduce the load hitting a subnet. These rules were applied temporarily until the incident was resolved.
Two examples of such incidents are the ic_cdk memory leak and the low finalization rate of the NNS subnet:
- During the memory leak incident, we introduced rate-limits to protect the vulnerable canisters and “buy time” to roll out a fix. In particular, we added rate-limits for all calls to the `get_sns_canisters_summary` method of all SNS root canisters.
- During the low finalization rate incident, we temporarily blocked all ingress messages to the NNS subnet in order to reduce the load for a quick rollback.
Since the boundary nodes have so far been controlled by the DFINITY foundation, we applied these rate-limits and filters directly. Now that the API boundary nodes will be under the control of the NNS, this can no longer be done the same way and requires a new mechanism. This mechanism should:
- allow for fast response;
- not directly rely on NNS governance and the NNS subnet (as governance could be affected by the incident);
- keep the applied measures private during the incident;
- allow the community to inspect the applied measures after the incident is resolved.
Proposed solution: Rate-limiting canister
We propose to create a rate-limiting canister residing on the `uzr34` subnet, which also hosts other protocol canisters (e.g., Internet Identity, the cycles ledger). The rate-limiting canister contains the rate-limiting configuration, which the API boundary nodes regularly fetch and apply. The canister is managed through the NNS, and specific authorized principals can update the rate-limiting configuration. After an incident, the rules are disclosed.
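As a rough sketch of how this could work, the canister could keep an append-only list of configurations, gate updates to authorized principals, and expose the latest configuration for the API boundary nodes to fetch. All method names and types below are illustrative assumptions made for this post, not the final interface:

```rust
// Minimal sketch of the rate-limiting canister, assuming the ic-cdk crate.
// Method names, types, and access-control details are illustrative only.
use candid::{CandidType, Deserialize, Principal};
use std::cell::RefCell;

#[derive(CandidType, Deserialize, Clone)]
struct ConfigVersion {
    version: u64,
    /// Serialized rate-limiting rules; confidential until disclosed.
    rules: Vec<u8>,
}

thread_local! {
    static CONFIGS: RefCell<Vec<ConfigVersion>> = RefCell::new(Vec::new());
    static AUTHORIZED: RefCell<Vec<Principal>> = RefCell::new(Vec::new());
}

/// Authorized principals push a complete new configuration.
/// The store is append-only: existing versions are never modified.
#[ic_cdk::update]
fn add_config(rules: Vec<u8>) {
    let caller = ic_cdk::caller();
    let authorized = AUTHORIZED.with(|a| a.borrow().contains(&caller));
    assert!(authorized, "caller is not authorized");
    CONFIGS.with(|c| {
        let mut configs = c.borrow_mut();
        let version = configs.len() as u64 + 1;
        configs.push(ConfigVersion { version, rules });
    });
}

/// API boundary nodes regularly poll for the latest configuration.
/// In the real system, access would be restricted while rules are private.
#[ic_cdk::query]
fn latest_config() -> Option<ConfigVersion> {
    CONFIGS.with(|c| c.borrow().last().cloned())
}
```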
Control
A few authorized principals are given the capability to add and enforce rate-limit and filtering rules. This centralized access control is necessary to handle incidents quickly without increasing the attack surface. The goal is to limit this capability as much as possible. Hence, we propose to put the canister under NNS control, such that the NNS sets bounds through the canister code and the authorized principals can only configure rules within these bounds. The authorized principals are decided on by the NNS.
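To make the split between NNS-set bounds and principal-set rules concrete, consider the following sketch. Which bounds the NNS would actually encode is open for discussion; the cap on the number of rules and the floor on rate-limits below are purely hypothetical examples:

```rust
// Hypothetical bounds baked into the canister code by the NNS. Authorized
// principals can only submit configurations that pass this validation;
// changing the bounds themselves would require an NNS proposal that
// upgrades the canister.
const MAX_RULES: usize = 100; // hypothetical cap
const MIN_REQUESTS_PER_MINUTE: u32 = 1; // hypothetical floor for rate-limits

#[derive(Clone, Copy)]
enum RuleAction {
    /// Allow at most this many requests per minute.
    Limit(u32),
    /// Block matching requests entirely.
    Block,
}

struct Rule {
    action: RuleAction,
    // Condition fields omitted here; see the rule sketch further below.
}

/// Reject configurations that exceed the NNS-set bounds.
fn validate_config(rules: &[Rule]) -> Result<(), String> {
    if rules.len() > MAX_RULES {
        return Err(format!("at most {MAX_RULES} rules are allowed"));
    }
    for rule in rules {
        if let RuleAction::Limit(per_minute) = rule.action {
            if per_minute < MIN_REQUESTS_PER_MINUTE {
                return Err("rate-limit below the allowed minimum".to_string());
            }
        }
    }
    Ok(())
}

fn main() {
    let config = vec![Rule { action: RuleAction::Limit(1) }];
    assert!(validate_config(&config).is_ok());
}
```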
Rate-limiting and filtering rules
The rate-limiting rules consist of two parts: a condition and an action. Currently, the condition is made up of four request features: a canister ID, a subnet ID, a method name (given as a regular expression), and a request type. An omitted feature is treated as a wildcard. The action can either be a rate-limit or a block.
For example, during the memory leak incident, we used the following rule, which limited all update calls to the `get_sns_canisters_summary` method of the OpenChat root canister to 1 request per minute:
```
- canister_id: 3e3x2-xyaaa-aaaaq-aaalq-cai
  methods: ^get_sns_canisters_summary$
  request_type: call
  limit: 1/1m
```
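To illustrate how such a rule could be represented and evaluated on an API boundary node, here is a sketch in Rust, using the `regex` crate for method matching. All type and field names are illustrative; optional fields act as wildcards when omitted:

```rust
// Sketch of condition matching on an API boundary node, assuming the
// `regex` crate. Field and type names are illustrative only.
use regex::Regex;

struct Request {
    canister_id: String,
    subnet_id: String,
    method: String,
    request_type: String, // e.g. "call" or "query"
}

struct Condition {
    // `None` acts as a wildcard for the respective feature.
    canister_id: Option<String>,
    subnet_id: Option<String>,
    methods: Option<Regex>,
    request_type: Option<String>,
}

impl Condition {
    fn matches(&self, req: &Request) -> bool {
        self.canister_id.as_ref().map_or(true, |c| c == &req.canister_id)
            && self.subnet_id.as_ref().map_or(true, |s| s == &req.subnet_id)
            && self.methods.as_ref().map_or(true, |m| m.is_match(&req.method))
            && self.request_type.as_ref().map_or(true, |t| t == &req.request_type)
    }
}

fn main() {
    // The rule from the memory leak incident example above.
    let condition = Condition {
        canister_id: Some("3e3x2-xyaaa-aaaaq-aaalq-cai".to_string()),
        subnet_id: None, // wildcard
        methods: Some(Regex::new("^get_sns_canisters_summary$").unwrap()),
        request_type: Some("call".to_string()),
    };
    let request = Request {
        canister_id: "3e3x2-xyaaa-aaaaq-aaalq-cai".to_string(),
        subnet_id: "example-subnet-id".to_string(), // placeholder
        method: "get_sns_canisters_summary".to_string(),
        request_type: "call".to_string(),
    };
    assert!(condition.matches(&request)); // the 1/1m rate-limit would apply
}
```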
Auditability
During an incident, the applied rules are hidden and are only accessible to the API boundary nodes. After the incident is over and all vulnerabilities have been patched, one of the authorized principals will disclose the rules in the canister and anybody can check what rules have been applied (also retrospectively). The canister is append-only such that the authorized principals can only enforce new configurations, but not remove/modify existing ones. In order to “remove” currently active rate-limiting rules, a new configuration has to be pushed that doesn’t contain these rules anymore. The canister code will be publicly available (part of the IC repo) and the canister history can be checked by anyone.
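A minimal sketch of this disclosure flow follows, again with illustrative names and with the authorization checks omitted for brevity: configurations carry a disclosed flag that an authorized principal flips once the incident is resolved, and a public query exposes the disclosed history:

```rust
// Sketch of the append-only disclosure flow, assuming the ic-cdk crate.
// Names are illustrative; authorization checks are omitted for brevity.
use candid::{CandidType, Deserialize};
use std::cell::RefCell;

#[derive(CandidType, Deserialize, Clone)]
struct ConfigVersion {
    version: u64,
    rules: Vec<u8>,
    /// Set to true once the incident is resolved and patched.
    disclosed: bool,
}

thread_local! {
    static CONFIGS: RefCell<Vec<ConfigVersion>> = RefCell::new(Vec::new());
}

/// Mark a configuration as disclosed. The entry itself is never removed
/// or rewritten, so the full history stays available for audits.
#[ic_cdk::update]
fn disclose_config(version: u64) {
    // In the real canister, only authorized principals could call this.
    CONFIGS.with(|c| {
        if let Some(cfg) = c.borrow_mut().iter_mut().find(|cfg| cfg.version == version) {
            cfg.disclosed = true;
        }
    });
}

/// Anyone can retrospectively inspect all disclosed configurations.
#[ic_cdk::query]
fn disclosed_configs() -> Vec<ConfigVersion> {
    CONFIGS.with(|c| {
        c.borrow().iter().filter(|cfg| cfg.disclosed).cloned().collect()
    })
}
```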
Approach
We propose to proceed in three steps:
- First, DFINITY will create a motion proposal to get approval from the community to pursue this approach and to develop such a canister and the required tooling.
- After finishing the development of the canister, DFINITY will submit the necessary proposals to install the canister on the `uzr34` subnet and put it under NNS control.
- Finally, the DFINITY foundation will submit a proposal to authorize its own principal to update the rate-limiting configuration. With that proposal, the DFINITY foundation also reaffirms that it will apply the Security Patch Policy and Procedure analogously to the rate-limiting canister: in particular, the DFINITY foundation will use the rate-limits and filters only for incident handling. Due to the sensitive nature of these measures, it will keep them private until the entire system has been patched. The DFINITY foundation will disclose the details of the applied rate-limits and filters directly in the canister within 10 days of the successful application of the patches to the entire system.
Summary and Next Steps
We propose two measures to enable incident handling with the new boundary node architecture:
- allowlist IP addresses of the DFINITY foundation, enabling DFINITY to spin up an API boundary node in case of an incident such that proposals and votes can be submitted directly to the NNS subnet;
- create a rate-limiting canister that enforces rate-limits and filtering on the API boundary nodes, and authorize a principal of the DFINITY foundation to set these rate-limiting and filtering configurations.
Based on the discussion in this thread, the plan is to first submit a motion proposal for the rate-limiting canister. If that proposal is approved and we are ready to switch to the new architecture, we will submit the other proposals to allowlist the DFINITY IP addresses and to authorize DFINITY’s principal.
We are very much looking forward to your thoughts, feedback and suggestions!