Hey everyone, here’s the post-mortem for the incident on lhg73 on September 15th 2024. If you have any questions, let us know in the comments.
Summary
On September 15, 2024, between 15:16 UTC and 23:14 UTC, the IC Mainnet subnet lhg73 was stalled after an assert was hit during cleanup callback handling. This resulted in replicas crash-looping and being unable to make progress. The assert was outdated: it checked for a condition that was no longer guaranteed after other code changes. The fix was simple: the outdated assert was removed, and after the subnet was recovered it started making progress again.
Scope and severity of impact
From an end user perspective, the incident affected subnet lhg73; in particular, update calls were not working for the duration of the incident. Query calls started working again once the subnet was halted and the replicas stopped restarting.
Timeline of events (UTC)
Key Events (delay since incident start in parenthesis)
- 15:16: Incident start
- 16:24: Issue detected (+1h)
- 16:31-16:48: Escalated to stakeholders (+1h15min)
- 18:03: Fix ready (+3h)
- 20:11: Subnet recovery CUP ready (+5h)
- 21:26: Fix passed qualification tests (+6h)
- 22:19: Replica version elected (+7h)
- 23:03: Fix deployed (+7h45min)
- 23:14: Subnet is running normally again (+7h58min)
Detailed timeline
2024-09-15:
- 15:16: Subnet enters crash-loop (goes unnoticed until about 1 hour later).
- 16:24: Primary on-call gets several XNetStreamStalled alerts for lhg73. Investigation begins.
- 16:34: Checking the replica logs reveals that a panic is hit in the execution environment.
- 16:35-16:48: More people get paged and start helping with investigation.
- 16:51: The theory that the assert is outdated and checks for the wrong condition is raised.
- 16:55: A fix is being prepared that removes the outdated assert
- 16:57: Consensus team members are paged to help with subnet recovery
- 17:26: Internal tests confirm that calling cycles_burn also triggers the condition
- 18:03: PRs are created with the hotfix for both RC branches (with and without snapshots).
- 18:22: CI tests and the release qualification on CI fail. This delays the qualification process somewhat.
- 18:53: Subnet is halted.
- 19:31: More people are paged to help with CI issues.
- 19:39: Finished copying the state and the consensus pool to the recovery working directory via cp.
- 20:11: Successfully finished replaying blocks up to and including height 116106237.
- 20:33-20:53: The remaining recovery steps are executed manually (to avoid rsyncing the state).
- 21:26: CI qualification passed
- 21:46: GuestOS upgrade image was manually built and uploaded to S3 by engineers to unblock proposal submission
- 22:19: Replica version with hotfix elected.
- 22:26: Subnet lhg73's replica version is set to the hotfix version.
- 22:35: HostOS and SetupOS images were manually built and uploaded to S3 by engineers to unblock repro-check completion.
- 22:49: Set recovery CUP for subnet lhg73.
- 23:03: Unhalted subnet lhg73.
- 23:14: Subnet lhg73 starts making progress again.
What went well
- Automated monitoring caught the issue and the primary on-call engineer was paged
- Relevant experts were gathered and acted quickly (even though it was a Sunday evening)
- The community was kept informed on the forum
Lessons learned
Incident response
- The on-call engineer didn’t have access to update the status page. We managed to work around it because we had another person on the call with access.
- The XNetStreamStalled alert was received an hour after the subnet had already entered the crash-loop state. The team needs to be alerted faster in similar situations.
Subnet recovery
- IC recovery has a fairly strict timeout for computing the manifest of the starting state. If no old manifest is available in states_metadata.pbuf, the state manager will start hashing the entire state, which will likely not fit within the timeout.
- Copying via cp instead of rsync worked well; it was much faster, although large states still take quite a long time to copy. We should change the ic-recovery tool to use cp instead of rsync when the recovery is performed directly on a node. Hardlinking instead of copying (and its implications) might also be a good direction to look into; a minimal sketch of that idea follows below.
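The following is a hedged sketch of the hardlinking idea, not ic-recovery code: it assumes the checkpoint and the recovery working directory are on the same filesystem, and the function name hardlink_tree is made up for this illustration.

```rust
use std::{fs, io, path::Path};

// Sketch: link the files of a checkpoint into a recovery working directory
// instead of copying their contents.
fn hardlink_tree(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            hardlink_tree(&entry.path(), &target)?;
        } else {
            // Hard links are created near-instantly and use no extra space,
            // but both paths must live on the same filesystem and the linked
            // files must not be mutated in place afterwards.
            fs::hard_link(entry.path(), &target)?;
        }
    }
    Ok(())
}
```

The trade-off is exactly the implication mentioned above: a hardlinked state shares inodes with the original checkpoint, so any in-place modification of the recovered state would also affect the source.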
Security hotfix process
- This was the first time the security hotfix process was followed since switching to the public GitHub repo. We realized that the process is quite long and difficult to follow during an incident. It should be streamlined further to help with quick remediation of such issues.
- We should be able to override automation quickly when it fails, e.g. to upload artifacts or to submit election proposals with 1 download link instead of 2.
Prevention of bugs
- Extend fuzz tests around the System API to also exercise the replicated state machine logic (instead of only running the Wasm code).
- Review the code around storage reservation and decide if existing test coverage is enough.
- A clear set of best practices for the usage of asserts/panics in the Deterministic State Machine (DSM) would help ensure they are only used when necessary; see the sketch after this list for the kind of guideline this could capture.
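As a hedged illustration of the kind of guideline meant here (not actual replica code; the function name and error type are invented for this sketch): in deterministic, replicated code, surfacing a violated expectation as an error is usually preferable to an assert that crash-loops every replica on the subnet.

```rust
// Hypothetical sketch: return an error instead of asserting, so a broken
// expectation fails a single message instead of halting the whole subnet.
fn subtract_cycles(balance: u128, removed: u128) -> Result<u128, String> {
    balance.checked_sub(removed).ok_or_else(|| {
        format!("cycles underflow: balance={balance}, removed={removed}")
    })
}
```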
Action items
- Add alert for subnet in a crash-loop (DONE)
- Provide access to all on-call engineers to update the status page (DONE)
- Extend System API fuzz tests to cover the execution environment
- Review storage reservation code and test coverage
- Simplify instructions for preparing a security hotfix
- Write best practices on usage of asserts/panics in DSM
Technical details
Commit 1d4692f (merged October 4th, 2022) introduced an assert checking that no cycles were removed from a canister's cycle balance during a cleanup callback execution. At the time, the assert was accurate: there was indeed no case where the canister's cycle balance could decrease while executing the cleanup callback, so it made sense to have the assert to ensure the invariant held.
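For illustration, here is a minimal sketch of the invariant the assert enforced; this is not the actual replica code and the names are hypothetical.

```rust
/// Sketch of the outdated invariant check: after a cleanup callback finishes,
/// the canister's cycle balance must not have decreased.
fn check_cleanup_invariant(balance_before: u128, balance_after: u128) {
    // This held when the assert was introduced, but it no longer does once
    // cycles can legitimately leave the balance during cleanup.
    assert!(
        balance_after >= balance_before,
        "cycles removed during cleanup callback"
    );
}
```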
However, over time, a couple of cases were added that could reduce the canister's cycle balance in exactly this situation and thus broke the invariant: the ic0.cycles_burn128 API, which is a direct way to remove cycles from the canister's balance, and the storage reservation mechanism, which reserves cycles from the canister's balance when memory allocations happen above a certain threshold.
Subnet lhg73 is one of the biggest subnets in terms of state size (around 500 GiB), which is above the 450 GiB threshold at which the storage reservation mechanism kicks in. It so happened that a cleanup callback was executed that needed to perform a memory allocation; this triggered a reservation of cycles and consequently hit the (outdated) assert.
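Putting the pieces together, the following self-contained sketch shows the failure path under the assumptions above; again, the names are hypothetical and this is not the replica's actual code.

```rust
/// Sketch of the failure path on lhg73: the cleanup callback allocates memory
/// on a subnet above the storage reservation threshold, cycles are reserved
/// out of the balance, and the outdated invariant check panics.
fn run_cleanup_callback(mut balance: u128, reserved_for_storage: u128) {
    let balance_before = balance;

    // Storage reservation (or an explicit ic0.cycles_burn128 call) reduces
    // the balance while the cleanup callback executes.
    balance -= reserved_for_storage;

    // The outdated assert assumed the balance can never decrease during
    // cleanup; once it fires, the replicas crash-loop.
    assert!(
        balance >= balance_before,
        "cycles removed during cleanup callback"
    );
}
```

Removing this assert, as the hotfix did, makes the reduced balance an accepted outcome of cleanup execution rather than a fatal invariant violation.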