6/8/22’s Multi-Continent IC Outage - Are Boundary Nodes the top attack vector for the IC?

Today’s multi-continent outage on the IC has inspired me ask more questions about the resiliency of the IC’s Boundary Nodes.

If the failure of a single Boundary Node, that was not under attack of any kind can disrupt users in half of the US, Latin America, and Western Europe (and that is just what was reported), then…

I would argue that reinforcing the boundary node infrastructure is a P1 issue (all hands on deck), and more important than any other work happening on the IC at this very moment (yes, even Bitcoin Integration and the SNS).

These statements don’t seem to be consistent - or if they are, I’m a bit disappointed in the resilience of the Boundary Node infrastructure.

Here’s a few questions that come to mind:

  • How do all IC requests coming from users in Minneapolis to Latin America to Western Europe go down because a single Boundary Node goes down in US-east?

  • What are the regions that boundary nodes are running in?

  • How many Boundary Nodes are being run in total? 1-2 in each continent? https://dashboard.internetcomputer.org shows only 1 boundary node in all of North America.

  • Why should developers feel like their applications and the IC is “infinitely scalable” and “more resilient than AWS” if there is only one boundary node in the US?

  • Are there plans to scale out boundary nodes (centralized through DFINITY or through external parties)?

  • Where are the Boundary Nodes being run? What is the infrastructure that they are running on? AWS? GCP? Independent data centers? DFINITY owned hardware?

Tagging DFINITY team members that have spoken before regarding Boundary Nodes for visibility:
@Daniel-Bloom @faraz.shaikh @Jan @rrkapitz @yotam

11 Likes

Your eyes do not deceive you. This should not happen. We agree. We can describe the state of the world, but you should know that we have been diving deep into BNs and their future. We believe we have a good plan (many engineering managers and directors and even Jan and Dom studied the issue together for hours in Zürich last week), but we are working out the kinks before publishing it. Boundary nodes are a work in progress and we welcome feedback like this even while we are working on improving things.

TL;DR: Operations and availability of a single boundary node in no way dictate the availability or operations of the IC. Boundary infra-structure is stateless and fault-tolerant.

Today’s incident was the fallout of an issue with the boundary node fail-over mechanism.
Two recently added boundary nodes refused to failover, after a misconfigured deployment.
We understand why failover wasn’t triggered and have our work cut out.

Some quick answers to your question. Please expect a detailed postmortem once we have it in a form ready for public consumption:

Q How do all IC requests coming from users in Minneapolis to Latin America to Western Europe go down because a single Boundary Node (BN) goes down in US-east?
A. Every region is backed by a pool (3+) of boundary nodes. Each pool has a backup pool of last resort, and eventually, there is a boundary node of last resort. You are right in questioning why requests went to a single node. In today’s incident, flaws in heartbeat detection logic prevented us from failing over to the next BN. The fix deployed was to trigger a manual failover by taking the offending boundary node offline.

Q What are the regions that boundary nodes are running in?
A. Region is an “optional” abstraction, i.e. division into region doesn’t divide the availability and fault tolerance of boundary nodes. i.e. you can access any boundary node at any time from anywhere. DNS resolution of the ic0.app can actually be controlled by the end-user to point to the desired boundary node. The dashboard should give you a list of boundary nodes irrespective of their parent regions; if you set the name resolution of ic0.app to the IP of one of the boundary nodes you will always go to the same BN.

Q How many Boundary Nodes are being run in total? 1-2 in each continent? shows only 1 boundary node in all of North America.
A. Please see the answer above on regions. We have 20+ boundary nodes (20 are active and more are on standby). There is ongoing work to reduce the onboarding overhead for new boundary nodes such that community members can run boundary nodes.

Q Why should developers feel like their applications and the IC is “infinitely scalable” and “more resilient than AWS” if there is only one boundary node in the US?
There are many boundary nodes – please see the answer above.
Dfinity is invested and committed to the reliability, availability and scalability of the boundary nodes infrastructure.
The final goal is community-owned infrastructure.

Q Are there plans to scale out boundary nodes (centralized through DFINITY or through external parties)?
A See answer above

Q Where are the Boundary Nodes being run? What is the infrastructure that they are running on? AWS? GCP? Independent data centers? Dfinity-owned hardware?
A Independent data center + Dfinity owned data-centers and hardware

Cheers,
The Boundary Nodes team

14 Likes

@martin_DFN1

Thank you for the prompt response.

Looking forward to reading the postmortem - glad to hear yesterday’s issue was an implementation bug and not a design flaw.

Also, very interested to hear more about the future of boundary nodes on the IC. Many developers are wondering how they can scale up and balance load, as well as throttle and rate-limit principals to prevent DDOS/cycle drain attacks.

I believe the boundary nodes have an incredibly important part to play in the future of the IC, and I look forward to DFINITY revealing more about what that part will look like.

2 Likes

For visibility @martin_DFN1 is a senior engineering manager at Dfinity and leads boundary node team.

6 Likes

@martin_DFN1

Why does https://dashboard.internetcomputer.org/ (as of the timestamp of this post) then show a count of 13 boundary nodes?

Is the public facing dashboard out of date?

I assumed all of the numbers coming from the Internet Computer Dashboard were powered by a live internal API and not hardcoded.

1 Like

There is a visionary plan under Proposal 35671

We are working on a concrete roadmap, that is too early to share at the moment. We have to balance the powers that node providers have with the security interests of the IC. The crypto people are working on how this can be achieved; and we have to work with all the other teams’ schedules. It’s like threading a needle while changing the engines on a flying airplane.

1 Like

Ooh, good question. There is a lot of deployment going on at the moment and while we have many machines sitting around it wasn’t quite clear at the time of writing how many are actually processing requests. Once we have the new BN-VMs fully tested there will be more appearing as quickly as finance allows and we have Ops capacity.

1 Like

Thanks for raising this issue. One boundary node serving all of North America… I live in NA and it’s impressive my requests to mainnet get resolved as quickly as they do, given that a single node is handling all of that.

Once we add more BNs, I’m hoping we can also increase the per-subnet, per-BN rate limits we apply right now.

Looking forward to more updates on this front, including decentralization of BNs.

2 Likes

It’s actually more like three nodes serving the NA US. And if those fail or slow down the traffic is directed to other nodes. The problem this post is about, occurred because one of the BNs started to fail in a component that affects processing but not the heartbeat that the failover is based upon. A heartbeat should quickly execute the main components and deliver a healthy/sick verdict. So in this case the failover mechanism kept routing to the failed node We’ll improve the heartbeat functionality.

4 Likes

Update:

@martin_DFN1 has posted an incident retrospective here:

1 Like