NNS Voting Rewards Distribution Stalled

msumme · February 11, 2025, 7:29pm

Hello everyone.

Today, it was noticed that no new reward event was created after the neurons were migrated to stable memory.

This is due to the increased instructions that it takes to make changes to neurons stored in Stable Structures.

Unfortunately, our testing missed this. We apologize for any inconveniences or problems this may have caused.

Impact

No neuron holders are receiving voting rewards.

The canister is less responsive, because it is using maximum instructions in each heartbeat while unsuccessfully attempting to distribute voting rewards.

Next Steps

We are preparing a hotfix to rollback the migration, which we were ready for in case on incidents, and we will be deploying it in the next 24 hours.

Followup

We will followup with a fix for the root cause, and do an additional review of the code before re-enabling migration to stable memory.

dfxjesse · February 11, 2025, 9:09pm

Thanks for sharing this and the proposed solution, I hope the fixes go okay. I found the forum post of this update that caused this issue: NNS Updates 2025-02-07

From looking at the thread it looks like it was reviewed by a few different people, I wanted to ask, how come someone didn’t highlight the increased instructions in the upgrade? Maybe I don’t understand what the reviews are for, but I browsed over them and they just looked like copies of the messages in the commit, one reviewer even wrote this:

The MIGRATE_ACTIVE_NEURONS_TO_STABLE_MEMORY flag has been set to true, initiating the migration of active neurons to stable memory. This change will enhance the stability and reliability of the system.

Looks like the upgrade done the exact opposite of this.

Not pointing fingers at anyone and I of course know that bugs slip through i’m just trying to understand the testing and reviewing of the things that impact so many users on ICP.

ZackDS · February 11, 2025, 9:22pm

As far as reviews go they try to highlight issues found between the commit summary and the actual code change, currently the Grants for reviews did not ask or cover running the canisters locally and testing them, that takes time and effort to complete, and would have surely missed this as well not having access to many things required to do so. Feel free to try out a review.

Reference : Verifying proposals.

dfxjesse · February 11, 2025, 9:24pm

Thanks for the explanation, sounds like you could just pop the code and commit summary into ChatGPT and it would check that for you in 30 seconds, are the reviewers paid to do this?

ZackDS · February 11, 2025, 9:35pm

If one knows how the different layers of the IC work AI can be used to speed things up, but would you trust AI on this ? I am not defending anyone just saying that if it was missed by the team that works on this reviewers would have a hard time finding this that obviously is not just part of one commit ?
How would you review commit a3ef9a5 ?
We welcome every constructive criticism and try our best to learn and improve.

Using AI to speed up calculating the actual increase for eg. add_neuron_active_maximum

dfxjesse · February 11, 2025, 9:40pm

Im thinking off the top of my head, but with each of these upgrades an NNS subnet state wasm could be provided that reviewers could pop into pocket IC, and they can then run a set of pre-made test suites (neuron staking, creation, modifying and creating proposals etc etc).

Things like instruction increases and stable memory can be monitored this way. I’m sure it would require more effort by all parties

ZackDS · February 11, 2025, 9:52pm

I agree with you 100% , and we should definitely use PocketIC, it was just that usually there are a lot of changes done in one commit, again not defending anyone but looking at the canbench yml file we can clearly see the increase in instructions that looked fine (LGTM) without any other context, with the upgrade to version: 0.1.9 and the only other change reflected here beside the obvious flag, was asserting that the governance heap contains zero neurons while the stable memory contains one in the migrations test.

Thank you for your input very much appreciated.

msumme · February 12, 2025, 1:10am

The hotfix proposal has been adopted. We are monitoring to make sure it gets back to normal. We expect it to take multiple hours to finish the migration of some neurons back to heap, and then voting rewards will be distributed.

https://dashboard.internetcomputer.org/proposal/135265

wpb · February 12, 2025, 2:29am

I’ve advocated before and will advocate again for funding more individuals and teams to become grant recipients for proposal reviews. With more eyes on code, perhaps these issues would be identified. All it takes is for one person to spot an issue. Maybe you could apply to become a reviewer in the future if the opportunity arises. Others certainly made a compelling case to review this Protocol Canister Management proposal topic in their application for the Grants for Voting Neurons. I’d like to see you and others have an opportunity to participate by increasing the number of grant recipients. Perhaps that will happen some time soon.

msumme · February 12, 2025, 5:42am

To give a quick update - the migration of active neurons back to heap memory is complete, and we will investigate the cause and put out a fix before re-enabling the migration.

Additionally we will be figuring out how to make our tests more robust to catch this and things like this in the future before they go into production.

1eo · February 12, 2025, 9:24am

Approve 135265

Proposal

Canister id rrkah-fqaaa-aaaaa-aaaaq-cai is indeed the governance canister.
The upgrade args correspond to the empty args.
The install mode is indeed upgrade.
The wasm hash is reproducible.

Code Review

This proposal sets the flag MIGRATE_ACTIVE_NEURONS_TO_STABLE_MEMORY back to false. I would follow up by incorporating instruction measurements and processing neurons in batches. Any operation that iterates over all the neurons will have the same problem.

For example, the index canister profiles cycle consumption for certain operations due to previous issues with cycle usage. We utilized the following crate to measure actual cycle consumption: Canister Profiler.

Zane · February 12, 2025, 3:56pm

Proposal 135265 - Zane | CodeGov

Vote: ADOPT

Reason: Build completes successfully, both hashes and reviewed commits match their descriptions.

[345790def6] Modified should_distribute_rewards to early return until almost all active neurons have been moved back to the heap.

[b6578aca3b] Revert a3ef9a55e4 so that schedule_adjust_neurons_storage migrates active neurons back to the heap.

About CodeGov

CodeGov has a team of developers who review and vote independently on the following proposal topics: IC-OS Version Election, Protocol Canister Management, Subnet Management, Node Admin, and Participant Management. The CodeGov NNS known neuron is configured to follow our reviewers on these technical topics. We also have a group of Followees who vote independently on the Governance and the SNS & Neuron’s Fund topics. We strive to be a credible and reliable Followee option that votes on every proposal and every proposal topic in the NNS. We also support decentralization of SNS projects such as WaterNeuron, KongSwap, and Alice with a known neuron and credible Followees.

Learn more about CodeGov and its mission at codegov.org.

Zane · February 12, 2025, 4:24pm

Maybe I don’t understand what the reviews are for

The main purpose of the reviews is to make it much harder to sneak in changes that would compromise the credible neutrality of the network, so that’s where our focus is spent mostly.
Of course we also try to catch potential issues/edge cases, which has happened in the past, but that is a time consuming process as the code is written and reviewed by very competent engineers and is often shipped with tests covering the most obvious cases, so doing both type of verifications for all commits isn’t feasible with the current resources at our disposal.

Im thinking off the top of my head, but with each of these upgrades an NNS subnet state wasm could be provided that reviewers could pop into pocket IC, and they can then run a set of pre-made test suites (neuron staking, creation, modifying and creating proposals etc etc).

That’d be nice and not just for reviewers, but I doubt DFINITY would ever publicly provide the NNS subnet state for privacy reasons. Perhaps they could scramble some of the data to prevent leaking sensitive information.

msumme · February 13, 2025, 10:12pm

Hi all. I wanted to give a quick update on what happened.

The issue was not something that was in the code that was reviewed, but rather something that was missed during development, and therefore never would have showed up in the set of diffs. This would make it doubly difficult for a reviewer outside of engineers working actively on the feature to catch it.

Our analysis of the code focused on use cases where many neurons would be accessed, and we systematically traced the structure of the code to find those places that accessed the neuron iterators. One such place is ballot creation, and of course we benchmarked this code path to ensure we would be able to continue creating proposals.

Unfortunately due to the indirection, we missed the other place where that same list of neuron ids was used in rewarding neurons, as it used the information stored in the ballots, instead of directly accessing the collection of neurons.

This resulted in a situation where heartbeats ran out of instructions attempting to provide rewards, which would only have been identified as a problem if we hadn’t made the initial mistake of not collecting it in our list of code paths that needed to be measured and optimized.

Given the difficulty of this migration, we planned ahead with fallback mechanisms, and we thoroughly tested to ensure that we would be able to roll back in case something was discovered.

These critical paths worked, which I believe is the most important thing. We did not lose the ability to pass proposals. We did not lose data. And we were able to roll back safely. Not all mistakes can be prevented, but we can make them less dangerous and impactful.

We work continuously to improve our processes and practices, and we use these incidents as opportunities for growth and reflection about how we work. To that end, beyond just fixing the issue we discovered, we are looking at ways to further isolate the effects of bugs and failures, so that they are even less impactful to our user community.

Thanks for your patience during the time it took to resolve this incident. It is great to have a community this engaged and passitionate about the IC.

wpb · February 14, 2025, 12:03am

Thank you so much for keeping the community updated regarding this issue @msumme. I’m always impressed with how professionally DFINITY handles communication on issues that arise. I appreciate the heads up as soon as the issue was identified as well as the thorough review and explanation regarding what happened. Well done!

cyberowl · February 14, 2025, 9:16am

Proposal 135265 - Cyberowl | CodeGov

Vote: ADOPT

Reason:
Revert data migration .

A significant amount of effort was invested in migrating this data to stable memory. I commend the team for implementing robust mechanisms that allow changes to be reversed without any data loss. It’s reassuring to know that you were able to roll back safely, and I’m interested to learn about the modifications planned for the next migration release.

ZackDS · February 16, 2025, 4:25pm

Proposal #135265 for Governance — Zack | CodeGov

Vote: Adopted

Reason: Builds fine and the wasm hash is a match.

HOTFIX to revert neuron migration to stable memory along with pausing the rewards calculation, in order to speed up the migration back to heap.

About CodeGov

CodeGov has a team of developers who review and vote independently on the following proposal topics: IC-OS Version Election, Protocol Canister Management, Subnet Management, Node Admin, and Participant Management. The CodeGov NNS known neuron is configured to follow our reviewers on these technical topics. We also have a group of Followees who vote independently on the Governance and the SNS & Neuron’s Fund topics. We strive to be a credible and reliable Followee option that votes on every proposal and every proposal topic in the NNS. We also support decentralization of SNS projects such as WaterNeuron, KongSwap, and Alice with a known neuron and credible Followees.

Learn more about CodeGov and its mission at codegov.org.

LaCosta · February 19, 2025, 9:10am

Proposal 135265 – LaCosta | CodeGov

Vote: ADOPT

Revert data migration from the heap memory to stable memory due in order to fix a rewards calculation issue.

About CodeGov

CodeGov has a team of developers who review and vote independently on the following proposal topics: IC-OS Version Election, Protocol Canister Management, Subnet Management, Node Admin, and Participant Management. The CodeGov NNS known neuron is configured to follow our reviewers on these technical topics. We also have a group of Followees who vote independently on the Governance and the SNS & Neuron’s Fund topics. We strive to be a credible and reliable Followee option that votes on every proposal and every proposal topic in the NNS. We also support decentralization of SNS projects such as WaterNeuron, KongSwap, and Alice with a known neuron and credible Followees.

Learn more about CodeGov and its mission at codegov.org.

Topic		Replies	Views
NNS Updates: June 5, 2023 Developers nns	0	481	June 5, 2023
NNS Updates 2025-03-07 NNS proposal discussions nns , Protocol-canister-management	5	96	March 11, 2025
NNS Updates: Nov 10, 2023 (Matched Funding for SNS Swaps, Stable Memory Migration for NNS Neurons) NNS Governance nns , release , sns	2	1343	November 14, 2023
NNS Update: July 17, 2023 Developers nns	0	453	July 14, 2023
Rewards fell%50 General Discussing , community-consideration , nns	19	1823	October 5, 2022

NNS Voting Rewards Distribution Stalled

Impact

Next Steps

Followup

Approve 135265

Proposal

Code Review

Proposal 135265 - Zane | CodeGov

Proposal #135265 for Governance — Zack | CodeGov

Proposal 135265 – LaCosta | CodeGov

Related topics