The maximum heap-delta limit, a configuration parameter that guards against accumulating too many modified pages in memory, had been set to an incorrect value of 1TiB before genesis, higher than the available RAM on replicas. Increased workload on the subnets jtdsg, pjljw, mpubz, and brlsh caused the heap delta to grow to 512GiB, at which point replicas crashed with out-of-memory errors. Two subnets, pjljw and mpubz, had to be disaster-recovered.
Affected subnets could not process update or query calls while their replicas were crash-looping. After the subnets were halted, and before they were disaster-recovered, queries could be processed but update calls could not. Once disaster recovery completed, the subnets were fully operational again. No state was lost.
2021-10-15 11:15: mpubz is back online. Incident is resolved.
2021-10-15 11:30: Status Page: This incident has been resolved.
What went wrong?
There were no alerts for replica restarts. The only alert that fired was one for slow finalization, which had also fired frequently in the past and was therefore easy to dismiss.
The heap-delta limit was configured incorrectly. It was assumed that, under high memory pressure, some memory pages would be flushed to disk, but this did not happen.
A smaller instance of an OOM crash had occurred two weeks earlier but was not investigated or followed up on.
Manual steps in the process (e.g. copy-pasting version hashes and subnet names) led to a couple of mistakes (incorrect NNS proposals) while making the NNS proposals to fix the remaining subnets.
Besides telemetry and monitoring, it would be incredibly useful if the DFINITY team had canisters on each subnet that allow the team to test how each subnet is behaving (a project working towards this already exists).
What went right?
Halting the affected subnets put them into read-only mode instead of leaving both reads and writes broken.
Disaster recovery of affected subnets went smoothly.
The proposals to mitigate the incident were quickly adopted by the community.
The execution layer needs to keep track of memory changes made by canisters during message execution until the next checkpoint when they can be flushed to disk. In this case, the increased workload meant that there were too many such changes, and the incorrect limit for those changes meant that replicas started running out of memory and crashing.
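The mechanism described above can be illustrated with a small sketch: dirty pages accumulate between checkpoints, and a limit on their total size is what prevents memory from being exhausted. This is not the replica's actual implementation; all names and numbers are hypothetical.

```python
# Illustrative sketch (not replica code) of heap-delta tracking:
# modified pages accumulate between checkpoints, and a size limit
# guards against exhausting memory. Names/numbers are hypothetical.

PAGE_SIZE = 4096  # bytes per memory page

class HeapDeltaTracker:
    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.dirty_pages = set()  # (canister_id, page_index) pairs

    def record_write(self, canister_id, page_index):
        """Track a page modified during message execution."""
        self.dirty_pages.add((canister_id, page_index))

    def delta_bytes(self):
        return len(self.dirty_pages) * PAGE_SIZE

    def over_limit(self):
        # When True, execution should be throttled rather than letting
        # dirty pages accumulate until the replica runs out of memory.
        return self.delta_bytes() > self.limit_bytes

    def checkpoint(self):
        """Flush dirty pages to disk and reset the delta; returns bytes flushed."""
        flushed = self.delta_bytes()
        self.dirty_pages.clear()
        return flushed
```

In this incident the limit played the role of `limit_bytes`, but it was set to 1TiB, above the replicas' physical RAM, so the `over_limit` guard never engaged before memory was exhausted.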
Add alerts for replicas restarting (outside normal operations, like a subnet upgrade).
Add performance tests for the memory workload on multiple canisters in parallel.
Ability to stop message processing via an NNS proposal without halting the subnet. Catch Up Packages (CUPs) are still generated and nodes that are behind can still catch up.
Add alerts for when a replica is under memory pressure.
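The restart and memory-pressure alerts above could, for example, be expressed as Prometheus-style alerting rules. This is only a sketch: the metric names below are hypothetical placeholders, not actual replica metrics.

```yaml
groups:
  - name: replica-health
    rules:
      # Hypothetical metric names; the replica's real metrics may differ.
      - alert: ReplicaRestarting
        expr: increase(replica_process_starts_total[15m]) > 1
        labels:
          severity: page
        annotations:
          summary: "Replica restarted more than once in 15m (outside normal operations, e.g. subnet upgrades)"
      - alert: ReplicaMemoryPressure
        expr: process_resident_memory_bytes / node_memory_MemTotal_bytes > 0.85
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "Replica using more than 85% of node memory"
```

An explicit restart alert like the first rule would have fired during the two weeks between the first small OOM crash and this incident.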
Awesome AAR, @diegop! I really appreciate this sort of report. It brought up a couple of questions in my mind.
The “subnet debug” canisters you mentioned in the final bullet of ‘What went wrong?’ sound like a good idea. My worry is that if these canisters needed special permissions to access subnet metadata of interest during debugging, they could become a target for attackers. (It feels similar to the way k8s pods have access to the k8s runtime, which is a significant attack surface.) If no special permissions/API will be needed for these canisters, then I’m ALL FOR the “subnet debug canister” idea. Offering it as an OIS would be SUPER interesting. I can see ic.rocks or the dashboard displaying granular real-time subnet metrics. I can also see dApp devs using that OIS to do application scaling based on dev-defined QoS metrics.
Thinking long into the future, with a fully decentralized Internet Computer operating outside the control of the Foundation or the ICA, I wonder about the “decentralization” of the IC triage team that handles these sorts of abstract technical issues. I presume that node operators will each have their own IT staff for hardware-related issues. But, down the line, how will the team of specialists who deal with issues such as this be managed? Sorry if this is explained elsewhere. Just send me a link, if you will.
This is a fair point. That is very much the intent, but the description of it has been spread out over many communications and is not very clear, to be honest. We are working on more communication and explanation on the road towards getting there (to be clear, this is on my pile and something I have been touching here and there, but I need to finally help ship/publish it with the aid of the teams doing the real hard work).
We will (very soon) be setting up a designated canary canister on each subnet for external probing, which anyone can use for alerts and status. We don’t anticipate that any special permissions will be required (except insofar as some subnets, e.g. those still being qualified, may require special permission to create the canary canister).
Is this possibly a mischaracterization? Maybe I missed it, but I wasn’t aware of any of these proposals (I wasn’t watching but even if I were watching I’m not sure how much information I would have had). Seems more like DFINITY took control of the whole process, had all of the information, created all of the proposals, and simply voted on them.
I don’t see any community involvement here, and right now it seems DFINITY effectively has full control over the NNS.
Perhaps this is a discussion for another thread, but how can we actually decentralize the NNS? Even the community roadmap proposals are a nice gesture, but we don’t have any real power. The number of votes from community members is abysmal compared to when DFINITY comes in to finish off the roadmap votes.
And maybe I misunderstand the relationship between DFINITY and the ICA, but seems there is a very very small group of people that are actually clicking buttons to push massive changes through the NNS.
I guess a more accurate framing is that enough of the community trusted DFINITY by following the foundation’s neuron that these things could move quickly; the foundation does not have anywhere near enough votes by itself.
Ironically, this is something we are actively trying to dilute by having more people follow other neurons, but we were lucky in this instance.
Ah, so DFINITY doesn’t have enough votes alone to push proposals through? How can I verify this?
If that’s true I’m happy with that since yes you would assume those with the votes have chosen to delegate to DFINITY or the ICA.
I think my problem is that I’ve only seen DFINITY say in blog posts or Dom say in tweets that DFINITY and/or the ICA have fewer than 50% of ICP. But can I somehow verify this or get close by querying some canisters/neurons?
This is a great question. There are articles and other writings, but speaking from a trustless crypto POV, maybe the best way would be to see how much voting power for the proposals came from following. I am not sure this is exposed in an easy way tbh.
I will push this feedback up to the NNS team and others.
Huge sidenote… if this is surprising to you, Jordan… then we have done A TERRIBLE job communicating this. I am honestly embarrassed by my communication skills if you thought the foundation had enough voting power to push motions through by its mere votes and without following. I definitely accept responsibility here.
Well don’t feel bad, I know it’s been mentioned in blog posts and elsewhere, but honestly I have a hard time believing it without some more verification.
It feels like there is a very small group of people that have more than 50% voting power, so even if DFINITY has 25% and the ICA has 25% (not sure what the numbers are), who even composes these organizations? In my mind DFINITY and the ICA are one and the same.
I just don’t feel like I have enough information to trust the distributions right now.
I am not surprised that the inertia of trust/UI/economics means lots of people blindly follow the foundation (vs say CycleDao). Accelerating more diverse neuron following is definitely important to the foundation and ecosystem, and I think it is rapidly becoming more urgent.
What is interesting to me (and great feedback) is the idea of breaking down where the voting power on proposals comes from. Maybe the NNS team has it in its backlog, who knows, but I think that is a very good idea.
I have thought about how this diversification of neuron following may occur, or what it may look like ‘in the wild.’ My initial thoughts are that it would likely be achieved through a political party/politician sort of mechanism. An individual could stake a neuron (II is thought to be achieving PoH through people parties, right?) and claim they are a ‘politician’ or ‘governor’ who will vote on and submit proposals around a specific topic/community they feel empowered about. That would likely form out of groups of devs that use the IC, and/or core team members of large dApp teams. The same sort of ecosystem would likely form beneath each SNS DAO, with DAO Ambassadors as well. $0.02