Your eyes do not deceive you. This should not happen, and we agree. We can describe the current state of the world, but you should know that we have been taking a deep look at boundary nodes (BNs) and their future. We believe we have a good plan (many engineering managers and directors, and even Jan and Dom, studied the issue together for hours in Zürich last week), but we are working out the kinks before publishing it. Boundary nodes are a work in progress, and we welcome feedback like this even while we are improving things.
TL;DR: Operations and availability of a single boundary node in no way dictate the availability or operations of the IC. Boundary infrastructure is stateless and fault-tolerant.
Today’s incident was the fallout of an issue with the boundary node fail-over mechanism.
Two recently added boundary nodes did not fail over after a misconfigured deployment.
We understand why failover wasn’t triggered, and we have our work cut out for us.
Here are some quick answers to your questions; please expect a detailed postmortem once we have it in a form ready for public consumption:
Q How do all IC requests coming from users in Minneapolis to Latin America to Western Europe go down because a single Boundary Node (BN) goes down in US-east?
A. Every region is backed by a pool (3+) of boundary nodes. Each pool has a backup pool of last resort, and eventually, there is a boundary node of last resort. You are right to question why requests went to a single node. In today’s incident, flaws in the heartbeat detection logic prevented us from failing over to the next BN. The fix we deployed was to trigger a manual failover by taking the offending boundary node offline.
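To make the intended behaviour concrete, here is a minimal sketch of tiered failover driven by heartbeats: regional pool first, then the backup pool, then the node of last resort. It is an illustration only, not our actual implementation; the names `BoundaryNode`, `is_healthy`, `pick_node` and the 10-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BoundaryNode:
    name: str
    last_heartbeat: float  # time (in seconds) the last heartbeat was received


def is_healthy(node: BoundaryNode, now: float, timeout: float = 10.0) -> bool:
    """A node counts as healthy if its last heartbeat is recent enough.

    The 10-second default is a made-up value for this sketch.
    """
    return now - node.last_heartbeat <= timeout


def pick_node(
    tiers: Sequence[Sequence[BoundaryNode]],
    now: float,
    timeout: float = 10.0,
) -> Optional[BoundaryNode]:
    """Return the first healthy node, searching the tiers in priority order.

    `tiers` is ordered: regional pool, backup pool, node of last resort.
    Returns None only if no node in any tier is healthy.
    """
    for tier in tiers:
        for node in tier:
            if is_healthy(node, now, timeout):
                return node
    return None
```

With a regional pool whose heartbeats are all stale, `pick_node` falls through to the backup pool. A bug in the equivalent of `is_healthy` (for example, treating a misconfigured node as healthy) would keep traffic pinned to that node, which is the class of failure described above.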
Q What are the regions that boundary nodes are running in?
A. Region is an “optional” abstraction: dividing boundary nodes into regions doesn’t partition their availability or fault tolerance. You can access any boundary node at any time from anywhere. DNS resolution of ic0.app can actually be controlled by the end user to point to the desired boundary node. The dashboard should give you a list of boundary nodes irrespective of their parent regions; if you set the name resolution of ic0.app to the IP of one of the boundary nodes, you will always go to the same BN.
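As a rough illustration of pinning name resolution to one boundary node from client code, the sketch below temporarily overrides Python's resolver for a single hostname. The helper name `pin_host` is hypothetical, and `192.0.2.10` is a documentation-range placeholder address, not a real boundary node IP; substitute an IP taken from the dashboard. An entry in your system's hosts file achieves the same effect without any code.

```python
import socket
from contextlib import contextmanager


@contextmanager
def pin_host(hostname: str, ip: str):
    """Within the `with` block, resolve `hostname` to `ip` for this process.

    Only name resolution changes: libraries still see the original hostname,
    so TLS certificate checks and HTTP Host headers keep using it.
    """
    original = socket.getaddrinfo

    def pinned(host, *args, **kwargs):
        # Swap in the pinned IP only for the chosen hostname.
        target = ip if host == hostname else host
        return original(target, *args, **kwargs)

    socket.getaddrinfo = pinned
    try:
        yield
    finally:
        # Always restore the real resolver, even if the block raises.
        socket.getaddrinfo = original
```

Any request to `https://ic0.app/` made inside `with pin_host("ic0.app", "192.0.2.10"):` by a library that resolves names through `socket.getaddrinfo` (such as `urllib.request`) would then connect to that one address.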
Q How many Boundary Nodes are being run in total? 1-2 in each continent? shows only 1 boundary node in all of North America.
A. Please see the answer above on regions. We have 20+ boundary nodes (20 are active and more are on standby). There is ongoing work to reduce the onboarding overhead for new boundary nodes so that community members can run boundary nodes.
Q Why should developers feel like their applications and the IC is “infinitely scalable” and “more resilient than AWS” if there is only one boundary node in the US?
A. There are many boundary nodes; please see the answer above.
DFINITY is invested in and committed to the reliability, availability, and scalability of the boundary node infrastructure.
The final goal is community-owned infrastructure.
Q Are there plans to scale out boundary nodes (centralized through DFINITY or through external parties)?
A. See the answer above.
Q Where are the Boundary Nodes being run? What is the infrastructure that they are running on? AWS? GCP? Independent data centers? Dfinity-owned hardware?
A. Independent data centers, plus DFINITY-owned data centers and hardware.
Cheers,
The Boundary Nodes team