Summary
On December 3, 2025, between 15:15 UTC and 15:43 UTC, a critical incident occurred during the upgrade of the API boundary nodes. None of the API boundary nodes were able to accept incoming connections or serve requests. This blocked access to all services hosted on the Internet Computer (e.g., Internet Identity, NNS dapp, OpenChat), even though the core network continued to function normally.
The root cause of the incident was a change in the semantics of the ACME-ALPN module, which the API boundary nodes use to obtain a valid certificate. Due to this change, the API boundary nodes failed to load the certificate chain and therefore could not establish TLS connections with any client.
Once the issue was discovered, the API boundary nodes were rolled back to the previous release and were operational again.
Actions Taken
The immediate action was to roll back the API boundary nodes to the previous version. This involved:
- Identifying the issue and confirming that a rollback would resolve it.
- Submitting an NNS proposal to roll back the first 5 API boundary nodes.
- Submitting an NNS proposal to roll back the remaining 15 API boundary nodes.
After the incident was mitigated, further actions were taken to prevent the issue from reappearing:
- A fix was proposed (see fe32744), which was included in the next release (d2a13f0).
- Additional alerts were added to catch issues before they cause an IC-wide outage.
- A new test was added to actively test the ACME-ALPN module (see 7279867).
- Active probing has been added to the rollout monitoring.
Timeline of Events
(all times in UTC)
- 07:28: Proposal 139611 is submitted to upgrade the first API BN to version 724ae4101bfdd8d4443126a6a8b1ec5ca9b68a12.
- 15:15: Proposal 139645 is submitted to upgrade the final API BN to version 724ae4101bfdd8d4443126a6a8b1ec5ca9b68a12.
- 15:15: The HTTP gateways have no healthy upstream left.
- 15:15: Several alerts fire.
- 15:18: The first issues are reported internally.
- 15:20: The Node team starts the investigation.
- 15:22: The first report appears on the forum.
- 15:38: The decision is made to roll back the upgrade.
- 15:42: Proposal 139646 is submitted to roll back the first 5 API BNs to version 948d5b9260494ec3e6c9bc9db499f34d52ba6c7f.
- 15:43: The API BNs are rolled back and the IC is reachable again; no more errors are observed.
- 15:59: Proposal 139647 is submitted to roll back the remaining 15 API BNs to version 948d5b9260494ec3e6c9bc9db499f34d52ba6c7f.
What went well
- The issue was quickly identified.
- The relevant teams were ready and available to quickly submit the rollback proposals.
Lessons learned
- The only alerts that triggered came too late, when the outage was already in effect. The issue could have been detected much earlier (e.g., after the first API BN upgrade).
- At the time of the incident, no tests covered the ACME-ALPN module in the API BN. Testing relied on a different code path.
Technical Details
The guestOS release 724ae4101bfdd8d4443126a6a8b1ec5ca9b68a12 contained a change to ic-boundary to update the ic-bn-lib dependencies to the latest version (3285193).
The update included a change to the semantics of the ACME-ALPN module, which went unnoticed. As a result, the API boundary node could not load its TLS certificate chain and failed to accept incoming connections, effectively blocking access for all clients.
API BN guestOS releases are usually rolled out on Wednesdays in a rolling fashion, distributed over the entire day. After upgrading a single API BN, the rollout automation stops and waits for any alerts for that specific node to clear. However, no alert or active probe was in place to catch this issue.
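To illustrate the kind of gate that active probing adds to a rolling upgrade, here is a minimal sketch in Python. It is not the actual rollout automation: the function names `tls_probe` and `rolling_upgrade`, the port, and the timeout are assumptions chosen for this example. The idea is that after each node is upgraded, a real TLS handshake is attempted against it, and the rollout halts at the first node that fails the probe.

```python
import socket
import ssl
from typing import Callable, Iterable, List, Optional, Tuple


def tls_probe(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with certificate validation succeeds.

    Hypothetical probe: a node that cannot load its certificate chain
    would fail here, because the handshake cannot complete.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError and socket timeouts are subclasses of OSError.
        return False


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    probe: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Upgrade nodes one at a time and stop at the first failed probe.

    Returns the nodes that were upgraded and passed the probe, plus the
    node that failed (or None if every node passed).
    """
    upgraded: List[str] = []
    for node in nodes:
        upgrade(node)
        if not probe(node):
            return upgraded, node  # halt: do not touch the remaining nodes
        upgraded.append(node)
    return upgraded, None
```

With a gate like this, a release that breaks TLS on the first upgraded node would stop the rollout after one node instead of continuing through the whole fleet. In practice `probe` would wrap `tls_probe` with the node's public hostname; in the sketch it is injected so the gate logic can be exercised without network access.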
The HTTP gateways fetch the list of all API BNs from the registry. They then health check all of them and select the 5 best API BNs from their perspective to route requests to. As the upgrade was rolled out to more and more API BNs, the HTTP gateways had fewer and fewer healthy API BNs available, until the very last API BN was upgraded. This is clearly visible in the following graph that shows the number of requests handled by each API BN:
From that moment on, the HTTP gateways had no healthy upstream left. This lasted for almost 30 minutes. The following graph shows the status codes of the responses served by the HTTP gateways:
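Separately from the graphs, the upstream selection described above can be illustrated with a small Python sketch. It is a simplified model, not the gateway implementation: the `Upstream` type, the use of latency as the ranking criterion, and the function names are assumptions for this example. The source only states that the gateways health check all API BNs and pick the 5 best from their perspective.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Upstream:
    name: str
    healthy: bool
    latency_ms: float  # assumed ranking criterion, not confirmed by the source


def select_upstreams(candidates: List[Upstream], limit: int = 5) -> List[Upstream]:
    """Keep only healthy upstreams and return the `limit` lowest-latency ones."""
    healthy = [u for u in candidates if u.healthy]
    return sorted(healthy, key=lambda u: u.latency_ms)[:limit]


def selected_counts_during_rollout(total: int = 20, limit: int = 5) -> List[int]:
    """Simulate a rolling upgrade in which every upgraded node becomes unhealthy.

    Returns the number of selectable upstreams after each upgrade step.
    """
    nodes = [Upstream(f"api-bn-{i}", True, 10.0 + i) for i in range(total)]
    counts: List[int] = []
    for i in range(total):
        # The upgraded node can no longer complete TLS handshakes.
        nodes[i] = Upstream(nodes[i].name, False, nodes[i].latency_ms)
        counts.append(len(select_upstreams(nodes, limit)))
    return counts
```

In this model, with 20 API BNs and a limit of 5, the gateways still find 5 healthy upstreams after each of the first 15 upgrades. Only the last five upgrades shrink the pool, ending at zero. This is one reason why a gateway-level view alone gives little warning during such a rollout: the selection masks the loss of individual nodes until almost none are left.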



