Canister Output Queue Overflow Incident Retrospective - Tuesday July 12, 2022

Summary

Due to a bug in the replica code a canister sending messages to the IC management canister could potentially overflow its output message queue and cause the replica to crash. Exploiting the bug would allow an attacker to stall any subnet. The bug is a recent regression introduced on 2022-05-16.

A new replica version proposed in proposal 69804 contains a fix for the bug.

Timeline

2022-07-07: A DFINITY engineer observes a crash on a testnet while stress testing ECDSA signatures.

2022-07-08: The team submits a proposal to disable ECDSA signing (69117) which is adopted.

2022-07-11: The team finds that the root cause is not specific to ECDSA and starts working on a fix.

2022-07-12 (All times in UTC):

  • 12:00: The fix is ready.
  • 12:42: Quiet period starts: Mirroring to the public github repository is disabled according to the IC’s security patch policy until the issue has been fixed.
  • 14:23: The hotfix rollout process starts.
  • 17:02: NNS proposal 69804 to elect new replica binary revision.
  • 17:55: NNS proposal to upgrade subnet io67a.
  • 19:30: NNS proposals to upgrade subnets shefu, 2fq7c, w4asl.
  • 21:03: NNS proposals to upgrade subnets pjljw, x33ed, uzr34, snjp4, qxesv, gmq5v.

2022-07-13 (All times in UTC):

  • 08:59: NNS proposals to upgrade subnets 4zbus, ejbmu, lspz2.
  • 09:37: NNS proposals to upgrade subnets pae4o, 5kdm2, csyj4.
  • 10:31: NNS proposals to upgrade subnets brlsh, cv73p, 4ecnw.
  • 11:48: NNS proposals to upgrade subnets lhg73, opn46, 3hhby.
  • 12:36: NNS proposals to upgrade subnets 6pbhf, e66qm, qdvhd, fuqsr.
  • 13:14: NNS proposals to upgrade subnets k44fs, yinp6, mpubz, o3ow2.
  • 14:25: NNS proposals to upgrade subnets w4rem, eq6en, jtdsg, nl6hn.
  • 15:38: NNS proposal to upgrade subnets tdb26.
  • 16:13: Quiet period ends and the github source for the elected replica version with the fix is now publicly available on github.

What went wrong

  • The bug slipped through the tests and code review.

Where did we get lucky?

  • The bug was found by a DFINITY engineer while stress testing ECDSA signatures.

What went right?

  • The rollout of the fix was smooth.

Action items

  • Add regression tests for cross-subnet IC management calls.

Technical details

Each subnet has its own IC management canister. However all of them share the same canister id: aaaaa-aa. This means that the receiver of an IC management message needs to be resolved based on the network topology and the message payload in order to push the message to the right output queue. Previously the resolution was done by the Wasm executor in the handler of ic0.call_perform(). This was inefficient and complex due to the dependency of the Wasm executor on network topology. A simplification change moved the receiver resolution out of the Wasm executor. This however broke the available message slot checks in ic0.call_perform() because the check started using aaaaa-aa as the canister id instead of the actual receiver. The hotfix adds proper handling of aaaaa-aa in ic0.call_perform().

11 Likes