Postmortem: lhg73 subnet incident on Dec 26th 2024

Hello everyone,

Below you can find a detailed postmortem for the lhg73 subnet incident from the 26th of December.

Summary

The subnet lhg73 stalled twice (2024-12-26 14:58 UTC and 2024-12-27 17:00 UTC) after nodes failed to agree on a block.

This happened because of a bug in the logic that reconstructs blocks from so-called “stripped blocks”. Stripped blocks are blocks in which the actual signed ingress messages are replaced by short IDs uniquely identifying them, to avoid having to send around too much data. On receiving a stripped block, a node attempts to reconstruct the full block by filling in the ingress messages based on their IDs: it first looks in its local artifact pools and only requests from peers the ingress messages it does not have locally.
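To illustrate that reconstruction flow, here is a rough Rust sketch. The type and function names (StrippedBlock, IngressId, fetch_from_peers) are made up for illustration and are not the replica’s actual types; the sketch only captures the “use what is in the local pool, fetch the rest from peers” logic described above.

```rust
use std::collections::HashMap;

// Hypothetical types, not the actual replica data structures.
#[derive(Clone, PartialEq, Eq, Hash)]
struct IngressId(String); // short ID uniquely identifying an ingress message

#[derive(Clone)]
struct SignedIngress {
    id: IngressId,
    bytes: Vec<u8>, // full signed message (payload + signature)
}

/// A block in which the ingress messages were replaced by their IDs.
struct StrippedBlock {
    ingress_ids: Vec<IngressId>,
    // ... other block fields (height, rank, payload hash, ...)
}

/// Reconstruct the full ingress payload: take what is already in the local
/// pool and request only the missing messages from peers.
fn reconstruct(
    stripped: &StrippedBlock,
    local_pool: &HashMap<IngressId, SignedIngress>,
    fetch_from_peers: impl Fn(&IngressId) -> Option<SignedIngress>,
) -> Option<Vec<SignedIngress>> {
    stripped
        .ingress_ids
        .iter()
        .map(|id| local_pool.get(id).cloned().or_else(|| fetch_from_peers(id)))
        .collect() // None if any message could not be found anywhere
}

fn main() {
    let id = IngressId("msg-1".to_string());
    let msg = SignedIngress { id: id.clone(), bytes: b"signed ingress".to_vec() };
    let local_pool = HashMap::from([(id.clone(), msg)]);
    let block = StrippedBlock { ingress_ids: vec![id] };

    // No peer fetches needed here: the single message is already local.
    let full = reconstruct(&block, &local_pool, |_| None).expect("reconstruction failed");
    println!(
        "reconstructed {} ingress message(s), {} bytes total",
        full.len(),
        full.iter().map(|m| m.bytes.len()).sum::<usize>()
    );
}
```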

The root cause of the bug was an incorrect assumption about the IDs being used. It was assumed that they would cover both the ingress message and the signature, while in fact they only covered the message itself. So for two ingress messages with the same payloads and signers but different signatures, the IDs would be identical. Consequently, it could happen that some nodes reconstructed the blocks with the correct ingress messages, but with different (though still valid) signatures from the ones the block maker used in the original block proposal. As a result, such blocks did not pass the block integrity validation, and the respective nodes discarded them as invalid. If enough nodes decided to discard such blocks, the subnet stalled.
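To make the failure mode concrete, the following sketch (again with made-up types, and a plain standard-library hasher standing in for the real cryptographic hashes) shows how an ID that ignores the signature collides for two differently signed copies of the same message, while a content hash that does cover the signature no longer matches:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical types; the real IDs and block hashes are cryptographic.
#[derive(Clone)]
struct SignedIngress {
    payload: Vec<u8>,
    signer: Vec<u8>,
    signature: Vec<u8>,
}

/// The ID used for stripping/reconstruction: covers payload and signer,
/// but NOT the signature (this was the incorrect assumption).
fn message_id(m: &SignedIngress) -> u64 {
    let mut h = DefaultHasher::new();
    m.payload.hash(&mut h);
    m.signer.hash(&mut h);
    h.finish()
}

/// The block integrity check, in contrast, also covers the signature.
fn content_hash(m: &SignedIngress) -> u64 {
    let mut h = DefaultHasher::new();
    m.payload.hash(&mut h);
    m.signer.hash(&mut h);
    m.signature.hash(&mut h);
    h.finish()
}

fn main() {
    // Same payload and signer, two different (both valid) signatures.
    let original = SignedIngress {
        payload: b"call".to_vec(),
        signer: b"alice".to_vec(),
        signature: b"sig-1".to_vec(),
    };
    let duplicate = SignedIngress { signature: b"sig-2".to_vec(), ..original.clone() };

    // The IDs collide, so a reconstructing node may pick `duplicate` from its
    // local pool even though the block maker proposed `original` ...
    assert_eq!(message_id(&original), message_id(&duplicate));

    // ... and the reconstructed block then fails the integrity validation.
    assert_ne!(content_hash(&original), content_hash(&duplicate));
    println!("same ID, different content hash: block discarded as invalid");
}
```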

Scope and severity of impact

From an end user perspective, the incident affected subnet lhg73; in particular, update calls were not working for the duration of the incident.

Known affected services:

The incident also affected subnet pzp6e, where several nodes also stopped notarizing blocks, but not enough of them to stall the whole subnet, so things there recovered automatically.

Subnet lhg73 had (unrelated) issues earlier last year, and some community members asked why it was the most affected subnet again for this incident. Unfortunately, it seems that it was just bad luck that this subnet received the messages triggering the issue. Our current best guess is that the messages which triggered the stall came from some application on that subnet, e.g., some user clicking multiple times in some frontend and thereby sending off multiple instances of the same ingress messages with different signatures.

Timeline of events (UTC)

2024-12-26 14:58 Subnet lhg73 stalls.
2024-12-26 ~18:00 By analyzing the logs, DFINITY engineers establish that the likely problem lies somewhere in consensus and that it manifests itself very rarely; the decision is therefore made to recover the subnet to the same replica version and continue debugging the following day.
2024-12-26 18:39 The recovery process begins.
2024-12-26 20:43 The recovery process finishes.
2024-12-26 20:59 Subnet lhg73 recovers.

2024-12-27 09:40 A likely root cause is identified.
2024-12-27 13:02 The stall is reproduced on a testnet and the fix is verified to work.
2024-12-27 13:49 Proposals 134608 and 134609 to elect hotfixed replica versions are created.
2024-12-27 15:21 Proposals 134608 and 134609 are executed.
2024-12-27 15:39 Gradual roll-out starts.

2024-12-27 17:00 Subnet lhg73 stalls again.
2024-12-27 18:05 The recovery process begins.
2024-12-27 19:12 The recovery process finishes.
2024-12-27 19:20 Subnet lhg73 recovers and runs on the hotfixed version.

2024-12-27 19:53 NNS Subnet is updated to the hotfix version.
2024-12-28 11:58 pzp6e is updated to the hotfix version.

What went well

  • Disabling the feature was easy and worked as designed - the whole hashes-in-blocks feature was hidden behind a flag, so switching it off only required flipping that flag (see the sketch after this list). The scenario of downgrading to a version without the feature had been tested multiple times (including on a live subnet).
  • Relevant experts quickly arrived on the scene and began investigating, despite it being a public holiday - it took only a few minutes for the relevant expert to join the Zoom call.
  • The recovery was relatively fast - once it was clear that a subnet recovery was necessary, it took 2 hours to recover the subnet on the first day, and 1 hour on the second day.
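As a rough illustration of the first bullet above, this is roughly what such a feature gate can look like; the flag and type names below are hypothetical and do not correspond to the actual replica configuration:

```rust
// Hypothetical flag and payload types; the real replica configuration differs.
struct ConsensusFlags {
    hashes_in_blocks: bool,
}

enum IngressSection {
    Full(Vec<Vec<u8>>), // legacy: full signed ingress messages inside the block
    Stripped(Vec<u64>), // hashes-in-blocks: only short IDs, filled in on receipt
}

fn build_ingress_section(
    flags: &ConsensusFlags,
    messages: Vec<Vec<u8>>,
    ids: Vec<u64>,
) -> IngressSection {
    if flags.hashes_in_blocks {
        IngressSection::Stripped(ids)
    } else {
        IngressSection::Full(messages)
    }
}

fn main() {
    // Flipping the flag back to `false` restores the legacy block format,
    // which is essentially what disabling the feature amounted to.
    let flags = ConsensusFlags { hashes_in_blocks: false };
    match build_ingress_section(&flags, vec![b"msg".to_vec()], vec![42]) {
        IngressSection::Full(msgs) => println!("legacy block with {} full messages", msgs.len()),
        IngressSection::Stripped(ids) => println!("stripped block with {} ids", ids.len()),
    }
}
```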

What went wrong

  • Bug in the design of the hashes-in-blocks feature - there was a bug in the way the hashes (IDs) of ingress messages were computed: they did not cover the signatures, which resulted in different peers reconstructing blocks with different hashes and, ultimately, in a conflict that stalled the subnet.
  • Alerts - the alert that was supposed to notify the on-call engineer of a slow subnet did not fire because it was shadowed by an older alert about a slow finalization rate on the same subnet that had accidentally been left open. Additionally, there is no alert that fires when a subnet is completely stalled.
  • No ssh read-only access to other nodes from the recovery node - as part of a recovery, the ic-recovery tool sometimes needs to download certifications from the other peers. Luckily, the recoveries here did not require downloading anything, so this step could be skipped. The download step was initially started nevertheless, which uncovered a firewall misconfiguration: it would have blocked read-only ssh access from the recovery node to the other nodes and made any recovery that does require downloading certifications more difficult.
  • Should have deployed the fix to the affected subnet quicker - the rollout of a replica version proceeds in batches of NNS proposals, and only after one batch completes successfully does the system send out the NNS proposals for the next batch. The second stall (on 2024-12-27) could have been avoided if lhg73 had been part of (one of) the first batches. Note, however, that being part of an early batch is a risk in itself, as there is less data on how already upgraded subnets handle the new version.
  • There are no alerts for invalid consensus artifacts - when a replica deems a consensus artifact it received from a peer invalid, it removes it from its pool, emits a warning message, and increments a metric: consensus_invalidated_artifacts. A replica should receive an invalid artifact in only two cases: either there is a bug in the implementation that causes a peer to produce an invalid artifact, or the peer acts maliciously. Currently there are no consumers of consensus_invalidated_artifacts and no alert fires when the counter increments. Having such an alert would help spot issues like this one earlier (a sketch follows this list).
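For the last bullet, here is a sketch of what wiring up such a counter could look like, assuming the prometheus Rust crate; the registration code and the PromQL condition in the comment are illustrative, and only the metric name consensus_invalidated_artifacts is taken from the postmortem itself:

```rust
use prometheus::{IntCounter, Registry};

// Sketch only: the real replica wires metrics through its own registry.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();
    let invalidated = IntCounter::new(
        "consensus_invalidated_artifacts",
        "Consensus artifacts received from peers and deemed invalid",
    )?;
    registry.register(Box::new(invalidated.clone()))?;

    // Wherever validation rejects a peer's artifact: log a warning,
    // drop it from the pool, and bump the counter.
    invalidated.inc();

    // An alert could then fire on, e.g. (PromQL, illustrative):
    //   increase(consensus_invalidated_artifacts[15m]) > 0
    Ok(())
}
```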

Where we got lucky

  • The primary on-call was familiar with the ic-recovery tool and was ready to perform the recovery.
  • The primary on-call had worked on the feature which caused the issue, so they could start debugging it straight away.

Lessons learned & next steps

  • An alert should trigger when consensus_invalidated_artifacts increases.
  • An alert should trigger when a subnet completely stalls.
  • The node firewall rules need to be adjusted to allow nodes to ssh into other nodes (with read-only access).
  • The design process failed - the problem should have been identified in the design phase.

Besides the actions resulting from the lessons learned above, the most important next step is to fix the implementation so that the hashes-in-blocks feature can be re-enabled on mainnet. We anticipate the fix to be ready within the coming weeks.

12 Likes

Thanks @Kamil, as always this is a very informative postmortem. I really value that time is allocated to putting these together and making them available to the community.

1 Like

For anyone who’s interested →

Nice write-up, thanks Kamil! The alert on incorrect artifacts is indeed a great idea.

Is there a reason why the standard IC message IDs were not used? (the ones that are used to check the message status against the boundary API)

3 Likes

Hi Christian! Long time no see, I hope you’re doing well 🙂

Is there a reason why the standard IC message IDs were not used?

The standard IC message ID is precisely what we used. The problem was that these IDs don’t cover the signature, because deduplication should not care about signatures: if we have two identical messages with different signatures, we want only one of them to be accepted into a block and executed.
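A tiny sketch of that deduplication rationale, with made-up types and strings instead of real hashes: keying only on (payload, signer), the second copy of an otherwise identical message is dropped before it can be included in a block.

```rust
use std::collections::HashSet;

// Hypothetical types; real message IDs are cryptographic hashes, not strings.
struct SignedIngress {
    payload: &'static str,
    signer: &'static str,
    signature: &'static str,
}

// The message ID deliberately ignores the signature.
fn message_id(m: &SignedIngress) -> (&'static str, &'static str) {
    (m.payload, m.signer)
}

fn main() {
    // The same call signed twice, e.g. a user clicking a frontend button twice.
    let msgs = vec![
        SignedIngress { payload: "transfer 1 ICP", signer: "alice", signature: "sig-A" },
        SignedIngress { payload: "transfer 1 ICP", signer: "alice", signature: "sig-B" },
    ];

    // Deduplicate by message ID before building the block payload.
    let mut seen = HashSet::new();
    let selected: Vec<SignedIngress> = msgs
        .into_iter()
        .filter(|m| seen.insert(message_id(m)))
        .collect();

    // Only one of the two copies is accepted and will be executed.
    assert_eq!(selected.len(), 1);
    println!("kept the copy signed with {}", selected[0].signature);
}
```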

2 Likes

So if nodes in a subnet collude (see https://forum.dfinity.org/t/sybiling-nodes-exploiting-ic-network-community-attention-required), not even DFINITY will know.

Imagine the IC having a TVL of billions in DeFi apps. It would be too tempting for nodes not to try, having no stake to lose.

1 Like

Meanwhile there are alerts for invalid artifacts.

1 Like

As David said, we implemented the alerts shortly after the postmortem was published.

1 Like

Are those alerts public?

1 Like

Are those alerts public?

They are not