Post Mortem: Faulty API BN upgrade on December 3, 2025

Summary

On December 3, 2025, between 15:15 UTC and 15:43 UTC, a critical incident occurred during the upgrade of the API boundary nodes. None of the API boundary nodes was able to accept incoming connections or serve requests. This blocked access to all services hosted on the Internet Computer (e.g., Internet Identity, NNS dapp, OpenChat), even though the core network continued to function normally.

The root cause of the incident was a change in the semantics of the ACME ALPN module, which is used by the API boundary nodes to obtain a valid certificate. Due to this change, the API boundary nodes failed to load the certificate chain and could therefore not establish TLS connections with any client.

Once the issue was discovered, the API boundary nodes were rolled back to the previous release and were operational again.

Actions Taken

The immediate action was to roll back the API boundary nodes to the previous version. This process involved:

  • Identifying the issue and confirming that a rollback would resolve it.

  • Submitting an NNS proposal to roll back the first 5 API boundary nodes.

  • Submitting an NNS proposal to roll back the remaining 15 API boundary nodes.

After the incident was mitigated, further actions were taken to prevent the issue from reappearing:

  • A fix was proposed (see fe32744), which was included in the next release (d2a13f0).

  • Additional alerts were added to catch issues before they cause an IC-wide outage.

  • A new test was added to actively test the ACME-ALPN module (see 7279867).

  • Active probing has been added to the rollout monitoring.

Timeline of Events

(all times in UTC)

What went well

  • The issue was quickly identified.

  • The relevant teams were ready and available to quickly submit the rollback proposals.

Lessons learned

  • The only alerts that triggered came too late, when the issue was already in effect. The issue could have been observed much earlier (e.g., after the first API BN upgrade).

  • At the time of the incident, no tests covered the ACME-ALPN module in the API BN. Testing relied on a different code path.

Technical Details

The guestOS release 724ae4101bfdd8d4443126a6a8b1ec5ca9b68a12 contained a change to ic-boundary to update the ic-bn-lib dependencies to the latest version (3285193).

The update included a change to the semantics of the ACME-ALPN module, which slipped through. Because of that, the API boundary node could not load its TLS certificate chain and failed to accept incoming connections, effectively blocking access to all clients.

API BN guestOS releases are usually rolled out on Wednesdays in a rolling fashion distributed over the entire day. After upgrading a single API BN, the rollout automation stops and waits for any alerts for that specific node to disappear. However, no alert or probe was in place that would have caught this issue, so the rollout continued.
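
The gap described above is that the absence of alerts was treated as a sign of health. A minimal sketch of a rollout gated on a positive signal instead might look as follows. It is not the actual rollout automation: `rolling_upgrade`, `upgrade`, and `probe` are hypothetical names, and `tls_handshake_ok` is just one possible probe, built only on Python's standard library.

```python
import socket
import ssl
from typing import Callable, Iterable, List


def tls_handshake_ok(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a verified TLS handshake with host:port succeeds."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:  # covers refused connections, timeouts, and ssl.SSLError
        return False


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],  # hypothetical: upgrades one node
    probe: Callable[[str], bool],    # hypothetical: True if the node serves traffic
) -> List[str]:
    """Upgrade nodes one at a time and stop as soon as a probe fails.

    Returns the nodes that were upgraded, including the one whose probe
    failed, since it is already running the new version.
    """
    upgraded: List[str] = []
    for node in nodes:
        upgrade(node)
        upgraded.append(node)
        # Absence of alerts is not proof of health: require a positive signal.
        if not probe(node):
            break
    return upgraded
```

With `probe=tls_handshake_ok`, a node that cannot complete a TLS handshake after its upgrade would halt the rollout at that node instead of letting it proceed to the rest of the fleet.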

The HTTP gateways fetch the list of all API BNs from the registry. Then, they health check all of them and select the 5 best API BNs from their perspective to route the requests to. As the upgrade was rolled out to more and more API BNs, the HTTP gateways had fewer and fewer healthy API BNs available until the very last API BN was upgraded. This is clearly visible in the following graph that shows the number of requests handled by each API BN:

From that moment on, the HTTP gateways had no healthy upstream left. This lasted for almost 30 minutes. The following graph shows the status codes of the responses served by the HTTP gateways:

image
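
The gateway behavior described above can be sketched roughly as follows. This is an illustration only, not the HTTP gateway code: `select_upstreams` is a hypothetical name, and ranking "best" by measured latency is an assumption of this sketch.

```python
from typing import Dict, List, Optional


def select_upstreams(health: Dict[str, Optional[float]], k: int = 5) -> List[str]:
    """Pick up to k healthy nodes, lowest latency first.

    `health` maps a node name to its measured latency in seconds, or to
    None if the health check failed. Ranking by latency is an assumption.
    """
    healthy = [
        (latency, node) for node, latency in health.items() if latency is not None
    ]
    return [node for _, node in sorted(healthy)[:k]]
```

As more nodes fail their health check, the returned list shrinks. Once every node is unhealthy it is empty, which corresponds to the point where the gateways had no upstream left.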

Thank you for this impressive deep dive. Very much appreciated, Rüdiger.

P.S.

Coincidence? :slight_smile:
Thanks for keeping an eye out @Lorimer and @timk11. And also huge thanks to @alexu for following up.
https://forum.dfinity.org/t/proposal-to-elect-new-release-rc-2025-12-04-03-28/60956/18?u=zackds
Hopefully future commits related to BN will be double checked.

Thanks for the report.
If we were to draw conclusions, it would seem that the moral of the story is to check Grafana while deploying updates to BN, aha. This could have been caught by looking at the first image you shared, right?
Well done for the quick response time in any case, and thanks for the post mortem.

LOL the screenshots of the three dapps down have one developer in common, I feel seen :rofl:

Thanks for the post-mortem and the super quick resolution!

Indeed, there are multiple metrics/dashboards that would have shown the issue. That’s why we have extended the alerting to trigger if:

  • an HTTP gateway sees fewer than 3 healthy API BNs
  • more than 10 API BNs do not receive any traffic for some time
  • the observability tool fails to establish a connection to the public endpoints of an API BN
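
As a rough illustration, the three conditions above could be evaluated over a metrics snapshot like this. The `Snapshot` fields, the alert names, and `firing_alerts` are hypothetical. Only the thresholds (3 healthy API BNs, 10 idle API BNs) come from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Snapshot:
    healthy_bns_per_gateway: Dict[str, int]  # gateway -> healthy API BNs it sees
    requests_per_bn: Dict[str, int]          # API BN -> requests in the window
    probe_ok_per_bn: Dict[str, bool]         # API BN -> external probe result


def firing_alerts(s: Snapshot, min_healthy: int = 3, max_idle: int = 10) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts: List[str] = []
    # 1. Some gateway sees fewer than `min_healthy` healthy API BNs.
    if any(n < min_healthy for n in s.healthy_bns_per_gateway.values()):
        alerts.append("gateway_low_healthy_upstreams")
    # 2. More than `max_idle` API BNs received no traffic in the window.
    idle = sum(1 for r in s.requests_per_bn.values() if r == 0)
    if idle > max_idle:
        alerts.append("too_many_idle_api_bns")
    # 3. The external probe failed for at least one API BN.
    if not all(s.probe_ok_per_bn.values()):
        alerts.append("public_endpoint_probe_failed")
    return alerts
```

In this sketch, a single upgraded node that fails its probe already fires the third alert, well before the first two thresholds are reached.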

Nice! And a double nice because this forum needs at least 20 chars! :joy:

Thanks for this writeup @rbirkner

Can I ask why rollouts are not distributed over the entire week (a week being the typical cadence of new GuestOS versions)?

The deployment of new GuestOS versions to subnets is distributed over a week or more. This seems much more appropriate, and also stands a better chance of taking advantage of vigilant members of the community before an issue with a particular release can lead to system-wide failure (something a decentralised system should be designed and/or managed to guard against).

Can I ask why rollouts are not distributed over the entire week (a week being the typical cadence of new GuestOS versions)?

Good point/question!

The main constraint for API BNs is that they act as the public interface for the IC. To ensure a smooth developer/user experience, we have prioritized quick rollouts, trying to make new features available everywhere almost simultaneously to avoid breaking changes or inconsistent states for clients.

That said, your point about mitigation and stability is spot on. We are looking into adapting the rollout pipeline to include a canary phase: updating one or two nodes on Monday or Tuesday and then letting them “bake” for 24 hours. If they remain healthy, we propose to proceed with the rest. It is actually similar to how we handle the DFINITY-operated HTTP Gateways, and it might solve exactly the issue you are highlighting. Thanks for the push! We will look into this for the rollouts in the new year.

Mr. Rüdiger Birkner,

First, thanks for explaining; the DFINITY team recovered quite fast from this crash. It seems the Automated Certificate Management Environment (ACME) TLS Application‑Layer Protocol Negotiation (ALPN) is an RFC, which I assume you implemented in Rust, and the way it was supposed to be called was changed.

Is it correct for me to assume that given your explanation the whole IC crash was due to a change in the way a function was called in the Rust code? It seems that function is the one at lines 734 and 735 in the screenshot, the other lines seem to be related to this one:

If that was not the case, could you please go over the code you referred to us as the fix here:

And explain exactly the fix. I think it is quite important to notice the small change that triggered the IC crash.

Thank you.

Hey @josephgranata

Sorry for the late reply! I was out for some days. Now to your actual post:

First of all, I want to address this point and the wording:

Is it correct for me to assume that given your explanation the whole IC crash was due to a change in the way a function was called in the Rust code?

It was not the whole IC that crashed and nothing “crashed”. All the subnets, all the replicas were working fine as usual. “Only” the API BNs were affected and prevented access to the IC. So, while the IC was not accessible, it didn’t crash.

Now to your actual question: This is the change in ic-bn-lib that changed the behavior of the ACME ALPN module. Before that change, whenever you instantiated an AcmeAlpn, it would run on its own. With that change, it switched to the task pattern where AcmeAlpn implements Run and needs to be handed to a task manager, which actually runs it.
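
The difference between the two patterns can be sketched in Python. The real code is Rust in ic-bn-lib; this is only an analogy, and every class name here (`SelfStartingWorker`, `RunnableWorker`, `TaskManager`) is hypothetical.

```python
import threading


class SelfStartingWorker:
    """Old pattern: creating the object is enough, it starts its own work."""

    def __init__(self) -> None:
        self.ran = threading.Event()
        self._thread = threading.Thread(target=self._work, daemon=True)
        self._thread.start()

    def _work(self) -> None:
        self.ran.set()  # stand-in for "obtain and install the certificate"


class RunnableWorker:
    """New pattern: the object only does work when something calls run()."""

    def __init__(self) -> None:
        self.ran = threading.Event()

    def run(self) -> None:
        self.ran.set()  # same stand-in work as above


class TaskManager:
    """Hypothetical task manager that owns and starts registered tasks."""

    def __init__(self) -> None:
        self._tasks = []
        self._threads = []

    def add(self, task) -> None:
        self._tasks.append(task)

    def start(self) -> None:
        for task in self._tasks:
            thread = threading.Thread(target=task.run, daemon=True)
            thread.start()
            self._threads.append(thread)
```

In this sketch, a caller written for the old pattern that merely constructs a `RunnableWorker` gets no error and no work: nothing runs until the object is passed to `TaskManager.add` and `start` is called. That kind of silent behavior change is easy to miss in a dependency update.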

Thanks, indeed it was not a system crash, but the effect of what happened looked like a crash to an observer outside of DFINITY.

I appreciate the explanation of the code that was behind the issue as well, certainly not an easy thing to notice.

Happy upcoming New Year!

It’s great to see the quick identification and rollback, and that the team already added monitoring and tests for the ACME-ALPN module. The incident really highlights how critical early alerts and targeted testing are, especially for components like API BNs that can impact the whole network. Valuable lessons here for future rollouts.