Subnets `mpubz`, `brlsh`, and `pjljw` Incident Retrospective - Friday October 15, 2021

diegop · October 15, 2021, 4:29pm

Hi folks,

On October 15 2021 at 01:23 UTC, there was an incident where multiple subnets had low finalization rates and were inaccessible.

Status page: InternetComputer Status - Partial outage: Subnets mpubz and brlsh are currently down. Subnet pjljw is currently read-only, recovery underway.

The issue has been resolved. We are working on an incident report and investigating. Our current trajectory is that we publish more early next week (as we investigate the issue deeper).

We will publish the incident report in this dev forum thread.

So please stay tuned.

diegop · October 19, 2021, 7:36pm

Summary

The maximum heap-delta limit, a configuration parameter that guards against accumulating too many modified pages in memory, had an incorrect value of 1TiB set before genesis. The parameter was set higher than available RAM on replicas. The increased workload on multiple subnets jtdsg, pjljw, mpubz, brlsh caused the heap delta to grow to 512GiB and crash with out-of-memory. Two subnets pjljw, mpubz had to be disaster-recovered.

Impact

Affected subnets could not process update or query calls while the replicas were crash-looping. After halting, and before the subnets were disaster recovered, queries could be processed, but update calls would not work. Once the subnets were disaster recovered, the subnets were fully operational. No state was lost.

Timeline

All times are on UTC:

2021-10-14 15:00: pjljw finalization rate starts gradually dropping
2021-10-14 17:20: pjljw starts crashing with out-of-memory (OOM).
2021-10-14 22:34: jtdsg starts crashing with OOM.
2021-10-14 22:45: jtdsg has zero finalization, then recovers on its own.
2021-10-15 00:40: mpubz suffers the same symptoms as jtdsg and pjljw.
2021-10-15 01:00: mpubz starts crashing with OOM.
2021-10-15 01:23: Status Page: Investigating this issue (pjljw)
2021-10-15 01:30: DFINITY team predicts mpubz will recover once it reaches CUP, as was observed in jtdsg.
2021-10-15 01:30: DFINITY team decides to stop ingress messages to affected subnets.
2021-10-15 01:35: DFINITY team removes nodes for pjljw and mpubz from boundary node routes.
2021-10-15 01:44: Finalization rate in mpubz rises a little before dropping to zero.
2021-10-15 02:05: Status Page: We are continuing to investigate this issue (pjljw + mpubz)
2021-10-15 02:45: NNS proposal 25326: Reduce DKG interval from 499 to 200 on mpubz.
2021-10-15 03:00: Team prepares an NNS proposal as a mitigation for the heap delta limit. Team starts building the binary, which is complicated by subnets running different replica versions.
2021-10-15 03:20: NNS proposal 25331: Halt subnet pjljw.
2021-10-15 03:44: Team notes that brlsh is showing early symptoms by looking at the finalization rate and memory.
2021-10-15 04:00: Team adds nodes back for pjljw to boundary_nodes routes
2021-10-15 04:09: Status Page: issue identified, fix being implemented.
2021-10-15 04:45: NNS proposal 25338: Bless a replica version for a new release candidate.
2021-10-15 04:48: NNS proposal 25339: Upgrade subnet pjljw.
2021-10-15 04:45: NNS proposal 25343: Bless replica version with a mitigation.
2021-10-15 05:10: brlsh starts crashing with OOM.
2021-10-15 05:32: Status Page: continuing to work on a fix (brlsh is read-only)
2021-10-15 05:38: NNS proposal 25344: Upgrade subnet brlsh.
2021-10-15 05:38: Status Page: continuing to work on a fix (brlsh down).
2021-10-15 06:00-11:30: Team is

is doing disaster recovery of pjljw
doing disaster recovery of mpubz.
updating all remaining subnets.
re-enabling traffic on the boundary nodes.

2021-10-15 07:02: NNS proposal 25351: Halt subnet mpubz.
2021-10-15 07:31: NNS proposal 25354: Update subnet mpubz.
2021-10-15 07:40: NNS proposal 25359: Recover subnet pjljw.
2021-10-15 08:09: NNS proposal 25363: Unhalt subnet pjljw. pjljw is now back online.
2021-10-15 10:08: NNS proposal 25382: Recover subnet mpubz.
2021-10-15 11:05: NNS proposal 25385: Set dkg interval back to 499 on mpubz.
2021-10-15 11:05: NNS proposal 25393: Unhalt subnet mpubz.
2021-10-15 11:15: mpubz is back online. Incident is resolved.
2021-10-15 11:30: Status Page: This incident has been resolved.

What went wrong?

There were no alerts for replica restarts. The only alert that fired was one for slow finalization, which also fired frequently in the past.
Heap delta limit was configured incorrectly. It was assumed that under high memory pressure some memory pages would be flushed to disk but this did not happen.
One small instance of OOM crash happened two weeks ago but wasn’t investigated and followed up on.
Manual steps in the process (e.g. copy-paste hash versions and subnet names) led to a couple of mistakes (wrong NNS proposals) while making NNS proposals to fix the remaining subnets.
Besides telemetry and monitoring, it would be incredibly useful if the DFINITY team had canisters on each subnet that allows the team to test how each subnet is behaving (there was already a project working towards this already).

What went right?

Halting the affected subnets allowed them to go to read-only mode instead of both reads and writes being broken.
Disaster recovery of affected subnets went smoothly.
The proposals to mitigate the incident were quickly adopted by the community.

Technical details

High-level version:

The execution layer needs to keep track of memory changes made by canisters during message execution until the next checkpoint when they can be flushed to disk. In this case, the increased workload meant that there were too many such changes, and the incorrect limit for those changes meant that replicas started running out of memory and crashing.

Action items

Add alerts for replicas restarting (outside normal operations, like a subnet upgrade).
Add performance tests for the memory workload on multiple canisters in parallel.
Ability to stop message processing via an NNS proposal without halting the subnet. Catch Up Packages (CUPs) are still generated and nodes that are behind can still catch up.
Alerts for when a replica is under memory pressure

weedpatch2 · October 20, 2021, 4:58pm

Awesome AAR, @diegop! I really appreciate this sort of report. It brought up a couple of questions in my mind.

The “subnet debug” canisters you mentioned in the final bullet of ‘What went wrong?’ sound like a good idea. I worry if these subnets would need to have special permissions to gain access to some subnet metadata of interest during debugging, these canisters could become a target for attackers. (Feels similar to the way k8s pods have access to the k8s runtime, which is a significant attack surface). If no special permissions/API will be needed for these canisters, then I’m ALL FOR the “subnet debug canister” idea. Offering it as an OIS would be SUPER interesting. I can see ic.rocks or the dashboard displaying granular real-time subnet metrics. I can also see dApp devs using that OIS to do application scaling, based on dev-defined QoS metrics.
In thinking long into the future, with a fully decentralized Internet Computer operating outside of the control of the Foundation or ICA, I wonder about the “decentralization” of the IC triage team that handles these sort of abstract technical issues. I presume that node operators will each have their own IT staff for hardware related issues. But, down the line, how will the team of specialists that deal with issues such as this be managed? Sorry if this is explained elsewhere. Just send me a link, if you will.

Again, GREAT work on the AAR, DFINITY Team!!!

diegop · October 20, 2021, 5:38pm

This is a fair point. This is very much the intent and description on this has been spread out over many communications and not very clear, to be honest. We are working on more communication and explanation on the road towards getting us there (to be clear, this is on my pile and something I have been touching here and there, but need to finally help ship/publish with the aid of the teams who are doing the real hard work)

diegop · October 20, 2021, 5:39pm

Let me ask the team to see who has an intelligent and coherent opinion on this.

diegop · October 20, 2021, 7:22pm

The team working on this confirmed that the intent is that these canisters are not expected to have any privileges.

jplevyak · October 20, 2021, 7:24pm

We will be setting up a designated canary canister on each subnet which can be used for external probing (very soon) that can be used for alerts and status by anyone. We don’t anticipate that any special permissions will be required (except in so far as some subnets e.g. which are being qualified, may require special permission to create the canary canister).

lastmjs · October 20, 2021, 7:36pm

Is this possibly a mischaracterization? Maybe I missed it, but I wasn’t aware of any of these proposals (I wasn’t watching but even if I were watching I’m not sure how much information I would have had). Seems more like DFINITY took control of the whole process, had all of the information, created all of the proposals, and simply voted on them.

I don’t see any community involvement here, and right now it seems DFINITY effectively has full control over the NNS.

Perhaps this is a discussion for another thread, but how can we actually decentralize the NNS? Even the community roadmap proposals are a nice gesture, but we don’t have any real power. The number of votes from community members is abysmal compared to when DFINITY comes in to finish off the roadmap votes.

And maybe I misunderstand the relationship between DFINITY and the ICA, but seems there is a very very small group of people that are actually clicking buttons to push massive changes through the NNS.

diegop · October 20, 2021, 7:41pm

I guess a more real accurate is that enough of the community trusted DFINITY by following the foundation so that these things could move quickly since the foundation does not have anywhere enough votes by itself.

ironically, this is something we are actively trying to water down by having more people follow other neurons, but we were lucky in this instant

diegop · October 20, 2021, 7:42pm

To be clear the perceived gap is because of following, not the foundations themselves. The most obvious low-hanging fruit is to update both UI and economics around following.

lastmjs · October 20, 2021, 7:42pm

Ah, so DFINITY doesn’t have enough votes alone to push proposals through? How can I verify this?

If that’s true I’m happy with that since yes you would assume those with the votes have chosen to delegate to DFINITY or the ICA.

I think my problem is that I’ve only seen DFINITY say in blog posts or Dom say in tweets that DFINITY and/or the ICA have fewer than 50% of ICP. But can I somehow verify this or get close by querying some canisters/neurons?

diegop · October 20, 2021, 7:45pm

This is a great question. There are articles and other writings, but speaking from a trustless crypto POV, maybe the best way would be to see how much voting power for the proposals came from following. I am not sure this is exposed in an easy way tbh.

I will push this feedback up to the NNS team and others.

lastmjs · October 20, 2021, 7:47pm

This would be fantastic if it doesn’t already exist

diegop · October 20, 2021, 7:49pm

Huge sidenote… if this is surprising to you, Jordan… then we have done A TERRIBLE job communicating this. I am honestly embarrassed by my communication skills if you thought the foundation had enough voting power to push motions through by its mere votes and without following. I definitely accept responsibility here.

lastmjs · October 20, 2021, 7:52pm

Well don’t feel bad, I know it’s been mentioned in blog posts and elsewhere, but honestly I have a hard time believing it without some more verification.

It feels like there is a very small group of people that have more than 50% voting power, so even if DFINITY has 25% and the ICA has 25% (not sure what the numbers are), who even composes these organizations? In my mind DFINITY and the ICA are one and the same.

I just don’t feel like I have enough information to trust the distributions right now.

diegop · October 20, 2021, 7:55pm

I am not surprised that the inertia of trust/UI/economics means lots of people blindly follow the foundation (vs say CycleDao). Accelerating more diverse neuron following is definitely important to the foundation and ecosystem, and I think it is rapidly becoming more urgent.

what is surprising to me (and great feedback) is to at least break down where voting power from proposals comes from. Maybe NNS team has it in its backlog, who knows. but i think that is a very good idea.

lastmjs · October 20, 2021, 7:56pm

That would be an excellent addition

LightningLad91 · October 20, 2021, 8:53pm

I would also like to better understand the degree of separation between Dfinity and the ICA.

weedpatch2 · October 20, 2021, 9:18pm

I have thought about how this diversification of neuron following may occur, or what it may look like ‘in the wild?’ My initial thoughts are that it would likely be achieved through a sort of political party/politician sort of mechanism. An individual could stake a neuron (II is thought to be achieving PoH through people parties, right?), and claim they are a ‘politician’ or ‘governor’ and that they will be voting on and submitting proposals surrounding a specific topic/community that they feel empowered about. That would likely form out of groups of devs that use the IC, and/or core team members of large dApp teams. This same sort of ecosystem would likely form and exist beneath each SNS DAO with DAO Ambassadors, as well. $0.02

weedpatch2 · October 20, 2021, 9:20pm

I am also curious about the difference between these orgs, both in administrative, and technical (which votes for what proposals) sense.

Topic		Replies	Views
High User Traffic Incident Retrospective - Thursday September 2, 2021 Developers	50	8973	October 30, 2021
Subnet `pjljw` Incident Retrospective - Tuesday, November 23 Developers	6	1329	November 30, 2021
Uploading many assets crashes subnets Developers	12	808	November 3, 2021
Question regarding RE EXC-1168: add non-subsidised storage cost on 20+ node subnets (behind the flag) Developers	16	2074	October 30, 2022
Subnets with heavy compute load: what can you do now & next steps Developers	174	4125	November 26, 2024