Subnet `pjljw` Incident Retrospective - Tuesday, November 23

ulan · November 26, 2021, 12:35pm

Summary

The state of subnet pjljw diverged, causing the subnet to stall. The subnet was effectively read-only for ~12 hours until disaster recovery with a hotfix was performed.

The state divergence was in the counter that keeps track of modified memory pages. The actual contents of the memory pages did not diverge. Non-determinism caused some pages to be marked as modified even though they were actually clean. The hotfix ensures that the list of modified pages is computed based on the actual bytes in memory.

Impact

The subnet was in read-only mode for ~12 hours.

Timeline (UTC)

12:10: The finalization rate of pjljw drops to 0.
12:15: Code red is declared. Investigation starts.
12:50: It is clear that the incident is caused by state divergence.
12:59: The incident is added to the status page.
13:00 - 15:00: The team is trying to find out which bits of the state diverged.
15:00: The state divergence is narrowed down to the heap delta counter in the canister metadata.
15:00 - 17:00: The team is trying to reproduce the divergence locally using ic-replay.
17:00: A hotfix is proposed after auditing code related to the heap delta counter. It cannot be verified because there is no local reproduction yet.
17:00 - 20:30: The team continues to try to reproduce the divergence locally using ic-replay, which is very slow due to the large state.
20:30: The team decides to go ahead with the hotfix because there is no progress with the local replay.
21:00 - 23:05: The team performs disaster recovery on the subnet.
23:21: The incident is marked as resolved on the status page.

What went wrong?

Missing outstanding replies can block a canister from stopping and upgrading.
The incident happened just before the deadline of proposal 30496 to upgrade the ledger canister and enable canisters to transfer ICP, which was aborted due to this risk.
The execution stack had a source of non-determinism that had not been caught by tests or in production until this incident.
The ic-replay tool had a deadlock bug that the team had to debug and fix while investigating the main incident.
The disaster recovery environment was not optimized for large states.

What went right?

The team quickly found that the root cause of the finalization rate drop was state divergence.
The team narrowed down the divergence to specific bits in the canister state.
Disaster recovery of the subnet went smoothly.

Action items

Add tests and instrumentation to ensure that the signal handler is deterministic.
Audit all system calls that access the WebAssembly memory and ensure that they are deterministic.
Fix the discovered bugs in the ic-replay tool.
Improve replaying and disaster recovery for large states.

Technical details

The state divergence was caused by a combination of two issues:

The implementation of the ic0.msg_arg_data_copy() system call uses copy_from_slice() to copy the message argument data to the WebAssembly memory. Internally that function uses __memmove_avx_unaligned() of libc, which writes into memory either in the increasing or decreasing order of addresses depending on the addresses of the arguments. Thus, the order of writes is not deterministic and may differ from node to node.
The signal handler relies on the deterministic order of writes for keeping track of modified pages. If that doesn’t hold, then the signal handler may erroneously mark some clean pages as modified. Because of that, the counter that keeps track of the modified pages becomes non-deterministic. The actual bytes in memory are not affected and remain deterministic.
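
The interaction between the two issues can be illustrated with a small simulation. This is only a sketch, not the replica's code: `copy_and_trace`, `Direction`, and the 4096-byte `PAGE_SIZE` are hypothetical names and values chosen for illustration, and the byte-by-byte loop stands in for the vectorized libc routine. It shows that copying in increasing or decreasing address order produces identical bytes while pages are first touched in a different order, which is the kind of difference an order-sensitive tracker would observe.

```rust
// Hypothetical page size used only for this illustration.
const PAGE_SIZE: usize = 4096;

/// Direction in which a copy routine walks the destination buffer.
#[derive(Clone, Copy)]
pub enum Direction {
    Increasing,
    Decreasing,
}

/// Copies `src` into `dst[offset..]` one byte at a time in the given
/// direction and returns the page indices in the order they were first
/// written. A real memmove copies in wider chunks; this is a simplification.
pub fn copy_and_trace(
    dst: &mut [u8],
    offset: usize,
    src: &[u8],
    dir: Direction,
) -> Vec<usize> {
    let mut first_touch: Vec<usize> = Vec::new();
    let indices: Box<dyn Iterator<Item = usize>> = match dir {
        Direction::Increasing => Box::new(0..src.len()),
        Direction::Decreasing => Box::new((0..src.len()).rev()),
    };
    for i in indices {
        dst[offset + i] = src[i];
        let page = (offset + i) / PAGE_SIZE;
        // Record each page only the first time it is written.
        if !first_touch.contains(&page) {
            first_touch.push(page);
        }
    }
    first_touch
}
```

Calling `copy_and_trace` twice with the same source and offset but opposite directions leaves both destination buffers byte-for-byte equal, while the returned page orders are reversed.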

The hotfix removes the dependency of the signal handler on the order of writes and ensures that the list of modified pages is computed based on the actual bytes in memory. As a follow-up action item, the non-deterministic order of writes in system calls should be fixed, but that is not critical and doesn’t affect the overall determinism of execution anymore.
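
One way to compute modified pages from the actual bytes is to compare each page against a copy taken before execution. The sketch below assumes that approach; it is not the hotfix itself, and `modified_pages` and `PAGE_SIZE` are hypothetical names. Because the result depends only on the final contents of memory, it is the same regardless of the order in which the writes happened.

```rust
// Hypothetical page size used only for this illustration.
const PAGE_SIZE: usize = 4096;

/// Returns the indices of pages whose bytes differ between `before` and
/// `after`. The result depends only on memory contents, not on the order
/// in which the writes were performed.
pub fn modified_pages(before: &[u8], after: &[u8]) -> Vec<usize> {
    assert_eq!(before.len(), after.len(), "snapshots must be the same size");
    before
        .chunks(PAGE_SIZE)
        .zip(after.chunks(PAGE_SIZE))
        .enumerate()
        // Keep only pages whose contents actually changed.
        .filter(|(_, (old, new))| old != new)
        .map(|(index, _)| index)
        .collect()
}
```

A page that was written with the same bytes it already held is reported as clean here, which matches the goal of not counting clean pages as modified.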

mparikh · November 27, 2021, 2:14am

Thanks! A quick question: what were the RTO and RPO for the DR?

jzxchiang · November 27, 2021, 4:59am

What is ic-replay used for? Is that open source yet?

GLdev · November 27, 2021, 7:18am

I've wanted to ask this for a while, and I guess this topic kind of fits… Is there a way to choose which subnet a new canister is deployed to? I’m considering having some functionality that can be handled by multiple canisters, based on a “processing queue”. I was thinking that having canisters in multiple subnets could help.

ulan · November 29, 2021, 3:27pm

It is a tool used in disaster recovery and it was open sourced today: ic/rs/replay at master · dfinity/ic · GitHub

According to this thread, it is not currently possible, but that may change in the future.

I am not an expert in that area, so I am going to defer this question to others.

northman · November 29, 2021, 8:53pm

That was a very informative incident retrospective.

Is there a subnet status page using the Traffic Light Protocol (Red, Yellow, Green)?

I did see finalization rates drop, but is there something succinct, like a status dashboard, where one does not need to understand the complexities of the underlying tech?

If not, it could be an opportunity for improvement.

ulan · November 30, 2021, 2:33pm

Thanks @northman! That’s a good idea. I’ll suggest that to the team maintaining the IC dashboard.
