Folks,
As some of you may have noticed, the bitcoin integration on the IC was unavailable for a few hours on April 22nd. In this post, I’ll include some details on the incident itself, why it happened, and how we resolved it. Don’t hesitate to follow up if you have any questions.
Impact
The bitcoin integration was down for several hours on April 22nd. The outage lasted from 03.56UTC until 19.20UTC.
Root cause
- The bitcoin integration fetches bitcoin blocks continuously from the bitcoin network.
- Transporting blocks from external bitcoin nodes to the bitcoin integration relies on a special queue known as the “consensus queue”.
- The IC maintains an invariant that, after executing an IC block, the consensus queue is empty.
- An edge case was discovered that breaks this invariant, causing the replicas on the bitcoin subnet to crash.
- The edge case relates to a concept introduced back in 2021 known as the “heap delta limit”, which limits the amount of writes that canisters on a subnet can do within a given point in time.
- On Apr 22, 2024, due to a surge in the bitcoin testnet blocks, the bitcoin subnet made a substantial amount of writes, causing the subnet to hit the heap delta limit.
- Hitting the heap delta limit is conceptually not a problem, but in that case the code path in the execution layer exited early without clearing the consensus queues, hence breaking the invariant and crashing.
Mitigation
A fix has been deployed that prioritizes processing the consensus queue before evaluating the subnet heap delta limit. This ensures the consensus queue invariant holds.
Timeline of events (UTC)
Incident Unfolding
- 02:15: The finalization rate of the bitcoin subnet drops from 1.3 to 0.2 blocks/s
- 03:54: The replicas of the bitcoin subnet panic with “Consensus queue not empty”
Investigating and Fixing
- 05:09: The issue was identified
- 07:51: Subnet recovery plan was proposed
- 07:53: The hotfix was prepared
- 08:06: The bitcoin subnet was halted
- 09:50: The hotfix was merged
Recovering the subnet:
- 10:46: The RC qualification pipeline has passed
- 12:51: The proposal to upgrade the bitcoin subnet to the hotfix was executed (Dashboard)
- 14:10: The proposal to update recovery cup was created (Dashboard)
- 14:34: The bitcoin subnet was unhalted (Dashboard)
Post-incident monitoring:
- 15:07: The subnet was recovered and the bitcoin integration resumed syncing blocks
- 19.20: The bitcoin mainnet reached the tip and was functional again.
Action Items
- Fix the issue to ensure subnets correctly handle the case when the heap delta limit is reached [DONE]
- Explore how we can remove brittle invariants that can easily be violated.