Hey everyone, here’s the post-mortem for the incident on lhg73 on September 15th 2024. If you have any questions, let us know in the comments.
Summary
On September 15, 2024, between 15:16 UTC and 23:14 UTC, the IC Mainnet subnet lhg73 was stalled after an assert was hit during cleanup callback handling. This resulted in replicas crash-looping and being unable to make progress. The assert was outdated: it checked for a condition that was no longer guaranteed after other code changes. The fix was simple: the outdated assert was removed, and after the subnet was recovered it started making progress again.
Scope and severity of impact
From an end user perspective, the incident affected subnet lhg73; in particular, update calls were not working for the duration of the incident. Query calls started working again once the subnet was halted and the replicas stopped restarting.
Timeline of events (UTC)
Key Events (delay since incident start in parenthesis)
- 15:16: Incident start
- 16:24: Issue detected (+1h)
- 16:31-16:48: Escalated to stakeholders (+1h15min)
- 18:03: Fix ready (+3h)
- 20:11: Subnet recovery CUP ready (+5h)
- 21:26: Fix passed qualification tests (+6h)
- 22:19: Replica version elected (+7h)
- 23:03: Fix deployed (+7h45min)
- 23:14: Subnet is running normally again (+7h58min)
Detailed timeline
2024-09-15:
- 15:16: Subnet enters crash-loop (goes unnoticed until about 1 hour later).
- 16:24: Primary on-call gets several XNetStreamStalled alerts for lhg73. Investigation begins.
- 16:34: Checking the replica logs reveals that a panic is hit in the execution environment.
- 16:35-16:48: More people get paged and start helping with investigation.
- 16:51: The theory that the assert is outdated and checks for the wrong condition is raised.
- 16:55: A fix is being prepared that removes the outdated assert
- 16:57: Consensus team members are paged to help with subnet recovery
- 17:26: Internal tests confirm that calling cycles_burn also triggers the condition
- 18:03: PRs are created with the hotfix for both RC branches (with and without snapshots).
- 18:22: CI tests and the release qualification on CI fail. This delays the qualification process somewhat.
- 18:53: Subnet is halted.
- 19:31: More people are paged to help with CI issues.
- 19:39: Finished copying the state and the consensus pool to the recovery working directory via cp.
- 20:11: Successfully finished replaying blocks up to and including height 116106237.
- 20:33-20:53: The remaining recovery steps are executed manually (to avoid rsyncing the state).
- 21:26: CI qualification passed
- 21:46: GuestOS upgrade image was manually built and uploaded to S3 by engineers to unblock proposal submission
- 22:19: Replica version with hotfix elected.
- 22:26: Subnet lhg73's replica version is set to the hotfix version.
- 22:35: HostOS and SetupOS images were manually built and uploaded to S3 by engineers to unblock repro-check completion.
- 22:49: Set recovery CUP for subnet lhg73.
- 23:03: Unhalted subnet lhg73.
- 23:14: Subnet lhg73 starts making progress again.
What went well
- Automated monitoring caught the issue and the primary on-call engineer was paged
- Relevant experts were gathered and acted quickly (even though it was a Sunday evening)
- The community was kept informed on the forum
Lessons learned
Incident response
- The on-call engineer didn’t have access to update the status page. We managed to work around it because we had another person on the call with access.
- The XNetStreamStalled alert was received an hour after the subnet had already entered the crash-loop state. The team needs to be alerted faster in similar situations.
Subnet recovery
- IC recovery has a fairly strict timeout for computing the manifest of the starting state. If no old manifest is available in states_metadata.pbuf, the state manager will start hashing the entire state, which will likely not fit within the timeout.
- Copying via cp instead of rsync worked well; it was much faster, although large states still take quite a long time to copy. We should change the ic-recovery tool to use cp instead of rsync when the recovery is performed directly on a node. Hardlinking instead of copying (and its implications) might also be a good direction to look into; a minimal sketch of that idea follows below.
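The following is a hedged sketch of the hardlinking idea, not ic-recovery code: it assumes the checkpoint and the recovery working directory are on the same filesystem, and the function name hardlink_tree is made up for this illustration.

```rust
use std::{fs, io, path::Path};

// Sketch: link the files of a checkpoint into a recovery working directory
// instead of copying their contents.
fn hardlink_tree(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            hardlink_tree(&entry.path(), &target)?;
        } else {
            // Hard links are created near-instantly and use no extra space,
            // but both paths must live on the same filesystem and the linked
            // files must not be mutated in place afterwards.
            fs::hard_link(entry.path(), &target)?;
        }
    }
    Ok(())
}
```

The trade-off is exactly the implication mentioned above: a hardlinked state shares inodes with the original checkpoint, so any in-place modification of the recovered state would also affect the source.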
Security hotfix process
- This was the first time the security hotfix process was followed since switching to the public GitHub repo. We realized that the process is quite long and difficult to follow during an incident. It should be streamlined further to help with quick remediation of such issues.
- We should be able to override automation quickly when it fails, e.g. to upload artifacts or to submit election proposals with 1 download link instead of 2.
Prevention of bugs
- Extend fuzz tests around the System API to also exercise the replicated state machine logic (instead of only running the Wasm code).
- Review the code around storage reservation and decide if existing test coverage is enough.
- A clear set of best practices for the usage of asserts/panics in the Deterministic State Machine (DSM) would help ensure they are only used when necessary; see the sketch after this list for the kind of guideline this could capture.
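As a hedged illustration of the kind of guideline meant here (not actual replica code; the function name and error type are invented for this sketch): in deterministic, replicated code, surfacing a violated expectation as an error is usually preferable to an assert that crash-loops every replica on the subnet.

```rust
// Hypothetical sketch: return an error instead of asserting, so a broken
// expectation fails a single message instead of halting the whole subnet.
fn subtract_cycles(balance: u128, removed: u128) -> Result<u128, String> {
    balance.checked_sub(removed).ok_or_else(|| {
        format!("cycles underflow: balance={balance}, removed={removed}")
    })
}
```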
Action items
- Add alert for subnet in a crash-loop (DONE)
- Provide access to all on-call engineers to update the status page (DONE)
- Extend System API fuzz tests to cover the execution environment
- Review storage reservation code and test coverage
- Simplify instructions for preparing a security hotfix
- Write best practices on usage of asserts/panics in DSM
Technical details
Commit 1d4692f (merged October 4th, 2022) introduced an assert checking that no cycles were removed from a canister's cycle balance during a cleanup callback execution. At the time, the assert was accurate: there was indeed no case where the canister's cycle balance could decrease while executing the cleanup callback, so it made sense to have the assert to ensure the invariant held.
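For illustration, here is a minimal sketch of the invariant the assert enforced; this is not the actual replica code and the names are hypothetical.

```rust
/// Sketch of the outdated invariant check: after a cleanup callback finishes,
/// the canister's cycle balance must not have decreased.
fn check_cleanup_invariant(balance_before: u128, balance_after: u128) {
    // This held when the assert was introduced, but it no longer does once
    // cycles can legitimately leave the balance during cleanup.
    assert!(
        balance_after >= balance_before,
        "cycles removed during cleanup callback"
    );
}
```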
However, over time, a couple of cases were added that could reduce the canister's cycle balance in exactly this situation and thus broke the invariant: the ic0.cycles_burn128 API, which is a direct way to remove cycles from the canister's balance, and the storage reservation mechanism, which reserves cycles from the canister's balance when memory allocations happen above a certain threshold.
Subnet lhg73 is one of the biggest subnets in terms of state size (around 500 GiB), which is above the 450 GiB threshold at which the storage reservation mechanism kicks in. It so happened that a cleanup callback was executed that needed to perform a memory allocation; this triggered a reservation of cycles and consequently hit the (outdated) assert.
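Putting the pieces together, the following self-contained sketch shows the failure path under the assumptions above; again, the names are hypothetical and this is not the replica's actual code.

```rust
/// Sketch of the failure path on lhg73: the cleanup callback allocates memory
/// on a subnet above the storage reservation threshold, cycles are reserved
/// out of the balance, and the outdated invariant check panics.
fn run_cleanup_callback(mut balance: u128, reserved_for_storage: u128) {
    let balance_before = balance;

    // Storage reservation (or an explicit ic0.cycles_burn128 call) reduces
    // the balance while the cleanup callback executes.
    balance -= reserved_for_storage;

    // The outdated assert assumed the balance can never decrease during
    // cleanup; once it fires, the replicas crash-loop.
    assert!(
        balance >= balance_before,
        "cycles removed during cleanup callback"
    );
}
```

Removing this assert, as the hotfix did, makes the reduced balance an accepted outcome of cleanup execution rather than a fatal invariant violation.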