Incident Handling with the New Boundary Node Architecture

Hello everyone,

As we get closer to switching over to the new boundary node architecture and propose spinning up more API boundary nodes under the control of the NNS, we also need to think about disaster recovery and incident handling. Back in March 2022, the NNS adopted the Security Patch Policy and Procedure, which covers how the DFINITY foundation handles security patches to the Internet Computer platform and the system canisters. This also covers the API boundary nodes, but it does not cover two important aspects of incident handling:

  1. How API boundary nodes can be recovered, given that submitting and voting on proposals happens through them;
  2. How rate-limits and filters can be applied to the API boundary nodes to protect vulnerable parts of the system during an incident response.

In the following, we provide some context on how these aspects have been handled until now and outline the measures we propose going forward for the new boundary node architecture. We look forward to a constructive discussion and to your questions and feedback.

Aspect 1: API Boundary Node Recovery

Context

Fortunately, the IC has so far never experienced a situation in which all boundary nodes were unavailable. Still, a mitigation for this scenario needs to be ready, i.e., the IC needs to be able to roll back a broken API boundary node release without having to rely on the API boundary nodes for submitting the proposal and voting on it.

Until now, this has not been an issue, as the boundary nodes were operated by DFINITY and not yet NNS-managed. Hence, any outage of the boundary nodes could be remediated without an NNS proposal and vote. During all incidents so far, it was still possible to vote through the boundary nodes. Going forward, this might not be the case.

Proposed Solution: Allowlisting specific IPs

To retain the ability to update API boundary nodes without relying on the API boundary nodes themselves, we propose allowlisting certain IP addresses in the firewall so that they can directly connect to NNS replicas in order to submit proposals and vote independently of the API boundary nodes. DFINITY will submit a proposal to allowlist a few of its IP addresses. This would allow DFINITY to spin up an API boundary node through which votes can be cast in case of an emergency.
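For illustration, here is a minimal sketch of what such an allowlist entry could carry. The field names below are assumptions for this sketch and do not reflect the actual firewall rule types used in NNS proposals:

// Illustrative only: these types are assumptions, not the registry's actual
// firewall-rule schema used in NNS proposals.
#[derive(Debug, Clone)]
struct FirewallAllowlistEntry {
    /// CIDR prefixes that may connect directly to the NNS replicas.
    ipv4_prefixes: Vec<String>,
    /// Ports opened for submitting ingress messages (proposals, votes).
    ports: Vec<u16>,
    /// Human-readable justification recorded alongside the proposal.
    comment: String,
}

fn main() {
    let entry = FirewallAllowlistEntry {
        // Documentation prefix used as a placeholder, not a real DFINITY address.
        ipv4_prefixes: vec!["192.0.2.0/28".to_string()],
        ports: vec![443],
        comment: "Emergency access for a recovery API boundary node".to_string(),
    };
    println!("{entry:?}");
}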

Note that to further minimize the risk of API boundary node updates resulting in a loss of IC access, we also propose rolling upgrades, similar to the subnet upgrades (e.g., the NNS subnet is only upgraded a week after the very first subnet).

Aspect 2: Rate-Limiting as part of incident response

Context

In the past, several incidents required applying rate-limiting and filtering rules to certain requests at the boundary nodes to protect vulnerable canisters or to reduce the load hitting a subnet. These rules were applied temporarily until the incident was resolved.

Two examples of such incidents are the ic_cdk memory leak and the low finalization rate of the NNS subnet:

  • During the memory leak incident, we introduced rate-limits to protect the vulnerable canisters and “buy time” to roll out a fix. In particular, we added rate-limits for all calls to the get_sns_canisters_summary method of all SNS root canisters.
  • During the low finalization rate incident, we temporarily blocked all ingress messages to the NNS subnet in order to reduce the load for a quick rollback.

Since the boundary nodes have so far been controlled by the DFINITY foundation, we applied these rate-limits and filters directly. Now that the API boundary nodes will be under the control of the NNS, this can no longer be done the same way and requires a new mechanism. This mechanism should:

  • allow for fast response;
  • not directly rely on NNS governance and the NNS subnet (as governance could be affected by the incident);
  • keep the applied measures private during the incident;
  • allow the community to inspect the applied measures after the incident is resolved.

Proposed solution: Rate-limiting canister

We propose to create a rate-limiting canister residing on the uzr34 subnet, which also hosts other protocol canisters (e.g., Internet Identity, the cycles ledger, etc.). The rate-limiting canister contains the rate-limiting configuration, which the API boundary nodes regularly fetch and apply. The canister is managed through the NNS, and specific authorized principals can update the rate-limiting configuration. After an incident, the rules are disclosed.
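A minimal sketch of the fetch-and-apply loop an API boundary node might run, assuming a hypothetical fetch function and a simplified configuration type (the real implementation talks to the canister through an IC agent):

use std::{thread, time::Duration};

// Simplified stand-in for the canister's configuration; the real schema is
// defined by the rate-limiting canister.
#[derive(Debug, Clone, Default)]
struct RateLimitConfig {
    version: u64,
    rules: Vec<String>, // serialized rules, see the rule format below
}

// Placeholder for the query to the rate-limiting canister.
fn fetch_latest_config() -> RateLimitConfig {
    RateLimitConfig { version: 1, rules: vec![] }
}

fn main() {
    let mut applied = RateLimitConfig::default();
    for _ in 0..3 {
        let latest = fetch_latest_config();
        if latest.version > applied.version {
            // Reload the local rate-limiter with the newly fetched rules.
            println!("applying configuration version {}", latest.version);
            applied = latest;
        }
        // Poll interval shortened for the sketch; the real value is a design choice.
        thread::sleep(Duration::from_secs(1));
    }
}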

Control

A few authorized principals are given the capability to add and enforce rate-limit and filtering rules. This centralized access control is necessary to handle incidents quickly without increasing the attack surface. The goal is to limit this capability as much as possible. Hence, we propose to put the canister under NNS control, such that the NNS sets bounds through the canister code and the authorized principals can only configure rules within these bounds. The authorized principals are decided on by the NNS.
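A sketch of how the canister code could confine configuration updates to NNS-approved bounds; the caller check and the concrete bound below are assumptions for illustration only:

// Illustrative sketch: the bound below is baked into the canister code, so
// changing it requires an NNS canister-upgrade proposal. The value is made up.
const MAX_RULES_PER_CONFIG: usize = 100;

#[derive(Debug)]
struct ProposedRule {
    // None means "block"; Some(n) means "limit to n requests per minute".
    limit_per_minute: Option<u32>,
}

fn validate_config(caller_is_authorized: bool, rules: &[ProposedRule]) -> Result<(), String> {
    if !caller_is_authorized {
        return Err("caller is not an authorized principal".to_string());
    }
    if rules.len() > MAX_RULES_PER_CONFIG {
        return Err("configuration exceeds the NNS-approved rule budget".to_string());
    }
    Ok(())
}

fn main() {
    let rules = vec![ProposedRule { limit_per_minute: Some(1) }];
    assert!(validate_config(true, &rules).is_ok());
    assert!(validate_config(false, &rules).is_err());
    println!("bounds check passed");
}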

Rate-limiting and filtering rules

The rate-limiting rules consist of two parts: a condition and an action. Currently, the condition is made up of four request features: a canister ID, a subnet ID, a method name (regular expression), and a request type. An omitted feature is treated as a wildcard. The action can either be a rate-limit or a block.

For example, during the memory leak incident, we used the following rule, which limited all update calls to the get_sns_canisters_summary method of the OpenChat SNS root canister to 1 request per minute:

- canister_id: 3e3x2-xyaaa-aaaaq-aaalq-cai
  methods: ^get_sns_canisters_summary$
  request_type: call
  limit: 1/1m
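
The sketch below models this rule format in Rust, treating omitted features as wildcards. The type names are illustrative and not the canister's actual schema; it assumes the regex crate for the method pattern:

// Cargo.toml dependency assumed: regex = "1"
use regex::Regex;

#[derive(Debug)]
enum Action {
    Limit { requests: u32, per_seconds: u64 }, // e.g., 1 request per 60 seconds
    Block,
}

struct Rule {
    canister_id: Option<String>,  // None = wildcard
    subnet_id: Option<String>,    // None = wildcard
    methods: Option<Regex>,       // regular expression; None = wildcard
    request_type: Option<String>, // e.g., "call" or "query"; None = wildcard
    action: Action,
}

struct Request<'a> {
    canister_id: &'a str,
    subnet_id: &'a str,
    method: &'a str,
    request_type: &'a str,
}

impl Rule {
    // A rule matches a request when every specified feature matches.
    fn matches(&self, req: &Request) -> bool {
        self.canister_id.as_deref().map_or(true, |c| c == req.canister_id)
            && self.subnet_id.as_deref().map_or(true, |s| s == req.subnet_id)
            && self.methods.as_ref().map_or(true, |re| re.is_match(req.method))
            && self.request_type.as_deref().map_or(true, |t| t == req.request_type)
    }
}

fn main() {
    // The rule from the memory leak incident shown above.
    let rule = Rule {
        canister_id: Some("3e3x2-xyaaa-aaaaq-aaalq-cai".to_string()),
        subnet_id: None, // omitted feature acts as a wildcard
        methods: Some(Regex::new("^get_sns_canisters_summary$").unwrap()),
        request_type: Some("call".to_string()),
        action: Action::Limit { requests: 1, per_seconds: 60 },
    };
    let req = Request {
        canister_id: "3e3x2-xyaaa-aaaaq-aaalq-cai",
        subnet_id: "placeholder-subnet-id", // matched by the wildcard
        method: "get_sns_canisters_summary",
        request_type: "call",
    };
    assert!(rule.matches(&req));
    println!("request matched, action: {:?}", rule.action);
}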

Auditability

During an incident, the applied rules are hidden and are only accessible to the API boundary nodes. After the incident is over and all vulnerabilities have been patched, one of the authorized principals will disclose the rules in the canister, and anybody can check what rules have been applied (also retrospectively). The canister is append-only, such that the authorized principals can only enforce new configurations, but not remove or modify existing ones. In order to “remove” currently active rate-limiting rules, a new configuration has to be pushed that no longer contains these rules. The canister code will be publicly available (as part of the IC repo) and the canister history can be checked by anyone.
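A sketch of the append-only semantics with post-incident disclosure, using hypothetical types (the real canister interface may differ):

#[derive(Debug)]
struct ConfigVersion {
    version: u64,
    rules: Vec<String>, // serialized rules
    disclosed: bool,    // flipped to true once the incident is resolved
}

#[derive(Default)]
struct RateLimitState {
    versions: Vec<ConfigVersion>,
}

impl RateLimitState {
    /// Authorized principals can only append new configurations; nothing is
    /// ever removed or modified in place.
    fn push_config(&mut self, rules: Vec<String>) -> u64 {
        let version = self.versions.len() as u64 + 1;
        self.versions.push(ConfigVersion { version, rules, disclosed: false });
        version
    }

    /// Disclosure makes a version publicly readable (also retrospectively).
    fn disclose(&mut self, version: u64) {
        if let Some(v) = self.versions.iter_mut().find(|v| v.version == version) {
            v.disclosed = true;
        }
    }

    /// Public view: only disclosed versions are visible to everyone.
    fn public_history(&self) -> Vec<&ConfigVersion> {
        self.versions.iter().filter(|v| v.disclosed).collect()
    }
}

fn main() {
    let mut state = RateLimitState::default();
    let v = state.push_config(vec!["limit get_sns_canisters_summary to 1/1m".to_string()]);
    assert!(state.public_history().is_empty()); // hidden during the incident
    state.disclose(v);
    assert_eq!(state.public_history().len(), 1); // auditable afterwards
    println!("{:?}", state.public_history());
}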

Approach

We propose to proceed in three steps:

  • First, DFINITY will submit a motion proposal to get the community’s approval to pursue this approach and to develop such a canister and the required tooling.

  • After finishing the development of the canister, DFINITY will submit the necessary proposals to install the canister on the uzr34 subnet and put it under NNS control.

  • Finally, the DFINITY foundation will submit a proposal to authorize its own principal to update the rate-limiting configuration. With that proposal, the DFINITY foundation also reaffirms that it will apply the Security Patch Policy and Procedure analogously to the rate-limiting canister: in particular, the DFINITY foundation will use the rate-limits and filters only for incident handling. Due to the sensitive nature of these measures, it will keep them private until the entire system has been patched. The DFINITY foundation will disclose the details of the applied rate-limits and filters directly in the canister at most 10 days after the patches have been successfully applied to the entire system.

Summary and Next Steps

We propose two measures to enable incident handling with the new boundary node architecture:

  1. allowlist IP addresses of the DFINITY foundation, enabling DFINITY to spin up an API boundary node in case of an incident such that proposals and votes can be submitted directly to the NNS subnet.
  2. create a rate-limiting canister to enforce rate-limits and filtering on the API boundary nodes and to authorize a principal of DFINITY to set these rate-limits and filtering configurations.

Based on the discussion in this thread, the plan is to first submit a motion proposal for the rate-limiting canister. If that proposal is approved and we are ready to switch to the new architecture, we will submit the other proposals to allowlist the DFINITY IP addresses and to authorize DFINITY’s principal.

We are very much looking forward to your thoughts, feedback and suggestions!

14 Likes

Does this mean hardcoding some of DFINITY’s IP addresses in the replicas’ firewall settings via a proposal?

Does this imply that each API BN has an identity whose principal is stored in an allow-list on the rate-limiting canister? If yes, are these principals updated via a proposal whenever a BN is spun up or recovered?

It would be nicer if these settings were automatically disclosed by the canister after a certain time. This “disclosure datetime” could be set together with the new rate-limiting settings.
This way, we don’t have to rely on DFINITY, or whoever in the future will be allowed to apply these settings, to disclose them.

Considering the fact that node providers can potentially read the state of a canister, in case of an incident they could take advantage of this by reading the canister’s state and disclosing the rate-limiting settings. Maybe this is another valid use case for the long-awaited vetKeys?

1 Like

Thanks @ilbert for all your excellent questions!

Does this mean hardcoding some of DFINITY’s IP addresses in the replicas’ firewall settings via a proposal?

Yes, this means making a proposal just like this one, for example. I wouldn’t call it hardcoding though as the rules can be removed with just another proposal. And since it is a proposal, any other party can make a similar proposal with their IP address.

Does this imply that each API BN has an identity whose principal is stored in an allow-list on the rate-limiting canister? If yes, are these principals updated via a proposal whenever a BN is spun up or recovered?

Exactly, each API BN runs on a normal node with a node ID (its principal), and all of these are stored in the registry. The rate-limiting canister regularly fetches the updated list from the registry canister (we have added an endpoint called get_api_boundary_node_ids). Hence, this is automatically updated whenever a new API BN is deployed.
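A hedged sketch of how such a periodic refresh could look inside the rate-limiting canister, using ic-cdk timers and an inter-canister call. The registry canister ID is left as a placeholder, and the return type of get_api_boundary_node_ids is simplified here to a plain list of principals:

use candid::Principal;
use ic_cdk_timers::set_timer_interval;
use std::{cell::RefCell, time::Duration};

thread_local! {
    static API_BN_IDS: RefCell<Vec<Principal>> = RefCell::new(Vec::new());
}

// Placeholder: the registry canister's well-known principal would go here.
fn registry_canister_id() -> Principal {
    Principal::from_text("aaaaa-aa").unwrap()
}

async fn refresh_api_bn_ids() {
    // Inter-canister call; error handling is omitted for brevity. The real
    // endpoint's Candid types are richer than a plain vector of principals.
    let (ids,): (Vec<Principal>,) =
        ic_cdk::call(registry_canister_id(), "get_api_boundary_node_ids", ())
            .await
            .unwrap_or((Vec::new(),));
    API_BN_IDS.with(|cell| *cell.borrow_mut() = ids);
}

#[ic_cdk::init]
fn init() {
    // Refresh roughly every hour; the actual interval is an implementation detail.
    set_timer_interval(Duration::from_secs(3600), || ic_cdk::spawn(refresh_api_bn_ids()));
}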

The DFINITY foundation will disclose the details of the applied rate-limits and filters directly in the canister at most 10 days after the patches have been successfully applied to the entire system.

We have also thought about this, but we didn’t find a good way for the canister to “know” when the vulnerability has been patched and when the “wait for quiet” period (the 10 days) starts. At the time the rate-limits are put in place, it is hard to estimate how long the resolution will take. For example, the memory leak affected, among others, the SNS canisters, and we didn’t know how long it would take for them to upgrade once the fix was out. I would be curious if you have an idea on how to go about that.

Considering the fact that node providers can potentially read the state of a canister, in case of an incident they could take advantage of this by reading the canister’s state and disclosing the rate-limiting settings. Maybe this is another valid use case for the long-awaited vetKeys?

Yes, this could indeed be a nice use case for vetKeys. We decided to propose the pragmatic solution: the node providers of the API BNs can read the rate-limiting rules anyway, and most node providers that run a node in the subnet will most likely also run an API boundary node, hence protecting the rules in the canister wouldn’t really hide them. Also, any client can observe the effects of the rate-limiting rules, as the API BNs return a 429. Of course, clients do not know exactly what is being rate-limited.

1 Like

Thanks for the explanations, @rbirkner! Now things are clearer.

Just throwing out another idea: instead of using pre-defined timers, the rate-limiting canister could periodically poll the status of the NNS proposal containing the fix and disclose the rate-limiting settings once the proposal is fully executed and rolled out. The proposal ID to watch could be set via a specific method (something like set_proposal_id : (rate_limiting_settings_id: nat64, proposal_id: nat64) -> ()) by the same principal that created the rate-limiting settings.
A similar approach could be taken when the security patches involve the SNS canisters, but that’s way more complex to monitor.
I know this introduces a lot of complexity in the rate-limiting canister, but it looks better in light of making the IC more and more autonomous.
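
For illustration, a minimal sketch of this idea under stated assumptions: how proposal status is obtained (e.g., by querying NNS governance) is left abstract, and the names and example IDs are placeholders:

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum ProposalStatus {
    Open,
    Executed,
}

#[derive(Default)]
struct DisclosureWatcher {
    // rate_limiting_settings_id -> proposal_id to watch
    watched: HashMap<u64, u64>,
}

impl DisclosureWatcher {
    // Corresponds to the proposed set_proposal_id(settings_id, proposal_id) method.
    fn set_proposal_id(&mut self, settings_id: u64, proposal_id: u64) {
        self.watched.insert(settings_id, proposal_id);
    }

    // Called periodically; `status_of` stands in for a query to NNS governance,
    // and `disclose` marks the referenced settings as publicly readable.
    fn tick(&mut self, status_of: impl Fn(u64) -> ProposalStatus, disclose: impl Fn(u64)) {
        self.watched.retain(|settings_id, proposal_id| {
            if status_of(*proposal_id) == ProposalStatus::Executed {
                disclose(*settings_id);
                false // stop watching once disclosed
            } else {
                true
            }
        });
    }
}

fn main() {
    let mut watcher = DisclosureWatcher::default();
    watcher.set_proposal_id(1, 123_456); // placeholder IDs
    watcher.tick(|_| ProposalStatus::Executed, |id| println!("disclosed settings {id}"));
}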

2 Likes

Hey @ilbert,

That’s a neat idea: let me repeat it just to make sure I understood :slight_smile:
In the course of the incident, once the remediation is clear and the proposal(s) have been submitted, one could manually “attach” a proposal to the rate-limiting rules. Once that proposal passes (and potentially after some delay), the canister would then automatically disclose the rules.

We would still need a way to manually disclose the rules in case the incident resolution does not involve an NNS proposal (or the solution would have to be generic enough to also “listen” to SNS proposals).

Indeed, this introduces quite a bit more complexity and also a reliance on other canisters (NNS governance and maybe also SNS governance). For a first version, I would stick to disclosure through one of the authorized principals and then look into a more automated way in a next iteration. What do you think?

2 Likes

You got it!

I agree the disclosure automation can be added at a later stage.

1 Like

Hello everyone,

The proposal is live now: https://dashboard.internetcomputer.org/proposal/134031
Please spread the word and vote!

1 Like

Good morning everyone,

The proposal was adopted yesterday evening. This means that the boundary node team will now proceed with the work on the necessary infrastructure and follow up with the proposals to set everything up.

1 Like

Hello everyone,

The development of the rate-limiting canister is nearly complete, and we are preparing to set it up. To proceed, we submitted proposal #134256 to authorize the boundary node team’s principal (2igsz-4cjfz-unvfj-s4d3u-ftcdb-6ibug-em6tf-nzm2h-6igks-spdus-rqe) to create the canister on the uzr34 system subnet. Once the canister is created, we will submit another proposal to revoke that authorization.

Initially, the boundary node team will retain control of the canister for final testing purposes. Before deploying the new boundary node architecture, the canister will be handed over to the NNS root and fully reinstalled to ensure a clean and secure transition.

3 Likes

Hi @rbirkner,

Thanks for this announcement, and the proposal. Can I ask where the source code for this canister is?

In an ideal world, my feeling is that an NNS proposal should be the mechanism for adding a canister to a private subnet (such as the II subnet). The WASM would be the payload and the canister would be reviewed, built and verified by members of the NNS DAO.

Would you be able to offer any comments on why this sort of approach is not feasible?

2 Likes

Hey @Lorimer

I fully agree with you that ideally we could make a proposal to directly install a canister in a subnet under the control of the NNS. Unfortunately, this is currently only supported if the canister is installed on the NNS subnet: in that case, one can use the NNS Canister Install proposal under the System Canister Management topic.

If one wants to set up a NNS controlled canister on another subnet, one has to follow this indirect approach:

  1. Authorize a principal, via a proposal, to create a canister on the target subnet.
  2. Use the principal to create a canister on the subnet. At this point that principal is the controller of the canister.
  3. Remove the authorization using a proposal.
  4. Add NNS root as a controller to the created canister and remove oneself.
  5. Make a proposal to upgrade the canister in reinstall mode, such that the canister state is completely wiped and the code is reinstalled.

After these five steps, one has an NNS-controlled canister with a clean state. This approach is a bit cumbersome but, unfortunately, currently the only way, as there hasn’t been time to extend the NNS-subnet canister installation approach to other subnets. The same approach we intend to take for the rate-limiting canister was also taken when the cycles ledger canister was installed (refer to this thread).

Let me know if that makes sense.

2 Likes

Thanks for confirming the situation @rbirkner. This is very useful context.

I really value the work that has gone into this proposal, and I’m also very much looking forward to decentralisation of boundary node control (awesome work!). However, I can’t adopt a proposal that unnecessarily removes the ability for voters to review and vote on what takes place within a system subnet. Given that the trust requirement is not essential, I see it as unjustified.

There’s no way for me to validate what a principal will do once they have privileges to deploy to a system subnet (there are certainly malicious things that could be done). This is problematic (after all, verifiability and absence/minimisation of trust is what this whole enterprise is about).

I’ve rejected Subnet Management proposals in the past for this sort of reason, after agreeing steps forward to improve verifiability. Here’s an example for reference.

Can I ask why this hasn’t been seen as a priority?

If I do reject this proposal, my main intention will be to escalate the need for a more robust and suitable deployment process (not necessarily to try and block this specific proposal from executing).

1 Like

I’ll hold off voting until later today, but for the moment I’m planning to reject this proposal.

I don’t know why it hasn’t been a priority so far, as it is not in my area and I never encountered it until we started to plan the setup of the rate-limiting canister. I personally think it is mostly due to the fact that it was never really needed (apart from the cycles ledger canister).

As we were planning to set up the rate-limiting canister, we asked for this change, but the NNS team is currently busy with other work (e.g., periodic following confirmation, public and private neurons, etc.). Hence, the boundary node team decided to move forward using the old approach.

I do understand your position and personally sympathize: as you explain in the other thread, rejecting such proposals can help keep some visibility on the topic. From the perspective of the boundary node team, I obviously disagree and wish that you would vote to accept :wink:

1 Like

Review from subnet thread:

I voted to reject, for the reasons mentioned.

Thanks for explaining further @rbirkner. Have you considered whether there are any alternative approaches that would allow the NNS to maintain oversight and control over what happens on the II subnet?

What about using the NNS Canister Install proposal under the System Canister Management topic to deploy the canister to the NNS subnet (once reviewed and approved by the NNS, as part of adopting that proposal)? Would it then be possible to migrate the canister from the NNS subnet to the II subnet? Is there no type of proposal that can action this?

If not, can I ask why the II subnet has been chosen as the target subnet?

1 Like

@rbirkner will probably still answer this, but it is likely a similar thought process as for the cycles ledger. For the cycles ledger, we wanted a system subnet so that it would not consume cycles (looking back, this should have been a much more public conversation…). That left us with two subnets: the NNS and II subnets. The NNS subnet is already often under pretty heavy load (remember, e.g., the inscription events?) and the cycles ledger itself risks being a high-traffic target. We decided that the II subnet has more capacity to handle extra traffic and chose to deploy it there.

2 Likes

Why the II subnet: The reasons are the same as the ones @Severin laid out for the cycles ledger canister. In short: system subnet, independent of NNS governance and ICP ledger.

Other approaches: In my opinion, there are only two viable approaches:

  1. Creating a new proposal type that allows installing canisters on non-NNS subnets.
  2. The proposed approach of authorizing a principal, installing the canister, removing the authorization of the principal and handing the canister over to the NNS.

Installing the canister first on the NNS subnet and then migrating it is a bit of overkill in my opinion. It has been done once, for the Internet Identity canister: that canister used to be on the NNS subnet and was moved to the II subnet on October 26, 2022 (see the corresponding forum thread). Migrating a canister requires altering the routing table, which is a mapping of subnet IDs to canister ranges. When the II canister moved, the NNS canister range [start, end] was split into two ranges, [start, II canister ID - 1] and [II canister ID + 1, end], and the II canister ID was added as a new range to the II subnet’s ranges. This leads to “fragmentation”, which I would try to avoid if not absolutely necessary.
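To make the split concrete, here is a small illustrative sketch using made-up numeric IDs in place of real canister IDs:

// Illustrative arithmetic of the split described above.
fn split_range(start: u64, end: u64, moved: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    if moved > start {
        ranges.push((start, moved - 1));
    }
    if moved < end {
        ranges.push((moved + 1, end));
    }
    ranges
}

fn main() {
    // One canister moved out of the middle turns one range into two.
    assert_eq!(split_range(0, 99, 40), vec![(0, 39), (41, 99)]);
    // The moved canister ID becomes a new single-canister range on the II subnet.
    let ii_subnet_new_range = (40, 40);
    println!("{ii_subnet_new_range:?}");
}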

Thanks @rbirkner. Do you see this as overkill simply because of the routing table fragmentation? What are the tangible downsides of that fragmentation?

This doesn’t sound like overkill to me. Many consider Web3 in general to be overkill. If DAOs offer themselves up to be willingly subverted whenever there’s a need to expedite some dev work, or simply to avoid splitting a routing table, this is unlikely to do much to convince the Web3 naysayers that this is anything but theatre.

I don’t see it that way, but I can’t disagree that they’d have a point.

1 Like

Yes, from the point of view of the boundary nodes, routing table fragmentation is not good, as it makes routing lookups more complicated. Of course, a single moved canister doesn’t really matter, but it should not become a habit if one can avoid it.

“Overkill” is probably a bit of an exaggeration, and I didn’t articulate my opinion properly. I just feel that it is a complex approach (among other things, it also requires multiple proposals) and has the downside of fragmentation. Moving a canister hasn’t been done since that II move and, if I remember correctly, it was quite some work: the involved teams tested the move multiple times until they felt confident that everything would work smoothly. Since then, the IC has changed quite a bit, and we would probably have to invest significant resources again into preparing everything for the move.

Hence, instead of investing the resources into such a workaround, I would much rather invest them into the solution that you proposed and that is needed.

1 Like