Summary
The state of subnet pjljw
diverged causing the subnet to stall. The subnet became effectively read-only for ~12 hours until a disaster recovery with a hotfix was performed.
The state divergence was in the counter that keeps track of modified memory pages. The actual contents of the memory pages did not diverge. Non-determinism caused some pages to be marked as modified even though they were actually clean. The hotfix ensures that the list of modified pages is computed based on the actual bytes in memory.
Impact
The subnet was in read-only mode for ~12 hours.
Timeline (UTC)
- 12:10: The finalization rate of pjljw drops to 0.
- 12:15: Code red is declared. Investigation starts.
- 12:50: It is clear that the incident is caused by state divergence.
- 12:59: The incident is added to the status page.
- 13:00 - 15:00: The team is trying to find out which bits of the state diverged.
- 15:00: The state divergence is narrowed down to the heap delta counter in the canister metadata.
- 15:00 - 17:00: The team is trying to reproduce the divergence locally using
ic-replay
. - 17:00: A hotfix is proposed after auditing code related to the heap delta counter. It cannot be verified because there is no local reproduction yet.
- 17:00 - 20:30: The team continues to try to reproduce the divergence locally using
ic-replay
, which is very slow due to the large state. - 20:30: The team decides to go ahead with the hotfix because there is no progress with the local replay.
- 21:00 - 23:05: The team performs disaster recovery on the subnet.
- 23:21: The incident is marked as resolved in the status page.
What went wrong?
- Missing outstanding replies can block a canister from stopping and upgrading.
The incident happened just before the deadline of proposal 30496 to upgrade the ledger canister and enable canisters to transfer ICP, which was aborted due to this risk. - The execution stack had a source of non-determinism, which was not caught by tests and production until now.
- The
ic-replay
tool had a deadlock bug that the team had to debug and fix while investigating the main incident. - The disaster recovery environment was not optimized for large states.
What went right?
- The team quickly found that the root cause of the finalization rate drop is state divergence.
- The team narrowed down the divergence to specific bits in the canister state.
- Disaster recovery of the subnet went smoothly.
Action items
- Add tests and instrumentation to ensure that the signal handler is deterministic.
- Audit all system calls that access the WebAssembly memory and ensure that they are deterministic.
- Fix the discovered bugs in the
ic-replay
tool. - Improve replaying and disaster recovery for large states.
Technical details
The state divergence was caused by a combination of two issues:
- The implementation of the
ic0.msg_arg_data_copy()
system call usescopy_from_slice()
to copy the message argument data to the WebAssembly memory. Internally that function uses__memmove_avx_unaligned()
oflibc
, which writes into memory either in the increasing or decreasing order of addresses depending on the addresses of the arguments. Thus, the order of writes is not deterministic and may differ from node to node. - The signal handler relies on the deterministic order of writes for keeping track of modified pages. If that doesn’t hold, then the signal handler may erroneously mark some clean pages as modified. Because of that, the counter that keeps track of the modified pages becomes non-deterministic. The actual bytes in memory are not affected and remain deterministic.
The hotfix removes the dependency of the signal handler on the order of writes and ensures that the list of modified pages is computed based on the actual bytes in memory. As a follow-up action item, the non-deterministic order of writes in system calls should be fixed, but that is not critical and doesn’t affect the overall determinism of execution anymore.