Hi all. I wanted to give a quick update on what happened.
The issue was not in the code that was reviewed. It was something missed during development, so it would never have shown up in the set of diffs. That made it doubly difficult for any reviewer other than the engineers actively working on the feature to catch it.
Our analysis of the code focused on use cases where many neurons would be accessed, and we systematically traced the structure of the code to find those places that accessed the neuron iterators. One such place is ballot creation, and of course we benchmarked this code path to ensure we would be able to continue creating proposals.
Unfortunately, because of an indirection, we missed the other place where that same list of neuron IDs is used: rewarding neurons. That code reads the information stored in the ballots instead of directly accessing the collection of neurons.
As a result, heartbeats ran out of instructions while attempting to provide rewards. We would have identified this as a problem only if that code path had been on our list of paths that needed to be measured and optimized, and leaving it off that list was our initial mistake.
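To make the indirection concrete, here is a minimal sketch. It is an illustration only, not the actual governance code: `NeuronStore`, `create_ballots`, `distribute_rewards`, and the counters are all hypothetical names. It shows how tracing uses of the neuron iterator finds ballot creation but not reward distribution, even though the cost of both grows with the number of neurons.

```python
class NeuronStore:
    """Hypothetical stand-in for the neuron collection; counts how it is accessed."""

    def __init__(self, voting_power_by_id):
        self._neurons = dict(voting_power_by_id)
        self.full_scans = 0  # accesses that go through the iterator
        self.lookups = 0     # single-neuron accesses by ID

    def iter_ids(self):
        self.full_scans += 1
        return iter(list(self._neurons))

    def voting_power(self, neuron_id):
        self.lookups += 1
        return self._neurons[neuron_id]


def create_ballots(store):
    # Direct scan: searching the code for iterator usage finds this path.
    return {neuron_id: "unspecified" for neuron_id in store.iter_ids()}


def distribute_rewards(store, ballots):
    # Indirect path: no iterator call here, yet there is one lookup per
    # ballot, so the work still scales with the number of neurons.
    return sum(store.voting_power(neuron_id) for neuron_id in ballots)


store = NeuronStore({neuron_id: 1 for neuron_id in range(1000)})
ballots = create_ballots(store)
total_rewarded_power = distribute_rewards(store, ballots)
print(store.full_scans, store.lookups, total_rewarded_power)
```

In this sketch the iterator is used exactly once, during ballot creation, while reward distribution performs one lookup per ballot. A review that only follows iterator call sites would therefore never reach the reward path.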
Given the difficulty of this migration, we planned ahead with fallback mechanisms, and we tested thoroughly to ensure that we would be able to roll back if a problem was discovered.
These critical paths worked, which I believe is the most important thing. We did not lose the ability to pass proposals. We did not lose data. And we were able to roll back safely. Not all mistakes can be prevented, but we can make them less dangerous and impactful.
We work continuously to improve our processes and practices, and we use these incidents as opportunities for growth and reflection about how we work. To that end, beyond just fixing the issue we discovered, we are looking at ways to further isolate the effects of bugs and failures, so that they are even less impactful to our user community.
Thanks for your patience during the time it took to resolve this incident. It is great to have a community this engaged and passionate about the IC.