Prelude: This post covers an issue with a real financial impact (I think potentially on the order of 100k+ ICP of maturity) across a number of people, which can raise the temperature of the room. These kinds of situations are inevitable as the IC grows, and I’d encourage everyone to approach this as a learning experience that raises some important questions about how we approach interoperability and upgradeability.
TLDR: ICDevs missed some votes because the interface and types for NNS proposals changed. Below I make suggestions for how to keep that from happening in the future, both with the NNS and in the hot IC utility you are building for your DAO.
Between September 8th and September 29th, ICDevs missed a number of votes. This post explains what happened and how to keep it from happening in the future. It is also meant to start a discussion about a world of global system interoperability and the demands that it places on software architecture.
Background: In March of 2023 ICDevs wrote and deployed an “eventually reject” canister (see “An update to ICDevs Named Neuron Voting”). This canister operated on a timer and would call the NNS system APIs asking for a list of open proposals. The timer went off every 8 hours. If the canister found that a proposal was within 12 hours of closing, it would vote to reject the proposal.
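As a rough illustration of that rule (this is not the canister's actual Motoko code), the check can be sketched in Python. The class, field, and function names are hypothetical; only the 8-hour interval and the 12-hour window come from the description above.

```python
from dataclasses import dataclass

HOUR = 60 * 60
TIMER_INTERVAL_SECONDS = 8 * HOUR   # how often the timer fires (from the post)
REJECT_WINDOW_SECONDS = 12 * HOUR   # vote once a proposal is this close to closing


@dataclass
class OpenProposal:
    proposal_id: int
    deadline_seconds: int  # time at which voting closes (hypothetical field name)


def proposals_to_reject(open_proposals, now_seconds):
    """Return the ids of open proposals that close within the reject window."""
    return [
        p.proposal_id
        for p in open_proposals
        if 0 <= p.deadline_seconds - now_seconds <= REJECT_WINDOW_SECONDS
    ]


# Three open proposals closing in 6, 13, and 96 hours from "now" (time 0).
open_now = [
    OpenProposal(1, 6 * HOUR),
    OpenProposal(2, 13 * HOUR),
    OpenProposal(3, 96 * HOUR),
]
print(proposals_to_reject(open_now, now_seconds=0))  # prints [1]
```

Because the timer interval is shorter than the reject window, every proposal gets checked at least once while it is inside the window, but only as long as the timer keeps firing.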
This was set up for a few practical reasons. Voting takes a lot of attention. As a one-person org, I like to take vacations, and sometimes these last longer than 4 days. I try to keep up with voting even when on vacation, but the thought of one slipping through the cracks was always a bit anxiety-inducing.
The second reason was that, because we have to reject SNS proposals and we want to give the community as much time to cast their votes as possible, we wanted to wait a reasonable amount of time before rejecting those. 3.5 days seemed like enough time. This system worked great for a number of months, and you can go back in time on https://nmiv5-haaaa-aaaam-abgaa-cai.raw.ic0.app/ and see the number of times the canister saved my bacon (or that we pocket vetoed an SNS vote).
A couple of weeks ago, @wpb pinged me to ask why ICDevs hadn’t voted on one of the SNS proposals. We did not abstain from voting on purpose. I began to investigate and it looks like the canister stopped resetting the timer around September 8th.
This was a bit frustrating as several votes had been missed. Missed votes mean missed rewards. I created the following thread to try to figure out what went wrong: “Reason/Possibility for timer being canceled by replica?” (post #6 by skilesare). My determination at the time was that a network request must have timed out and that it was likely a one-time thing. I reset the canister and decided to pay a bit more attention to proactively voting.
When I went to check again, I found that the timer was broken again and had not been running for almost 7 days. I only noticed because I had harvested the maturity from the ICDevs treasury and it was about 60% of what I had expected it to be.
First of all, I know a lot of people follow ICDevs and this has a real, material financial impact on people’s maturity when something like this happens. I owe everyone an apology that this was not caught sooner. My time has been stretched pretty thin and the grand plans of having a multi-person organization that provides broad coverage for things like this have been significantly impaired by the price action of ICP and what it has done to the 100% ICP maturity-based treasury that ICDevs operates on. I’d encourage everyone to make sure they set a broad set of followees (at least 3) so that if one does not vote, your votes still get cast.
Root Cause: Upon further investigation, I found that the canister failed because of the implementation of the One Step SNS proposal, together with changes to the list_proposals function that were published sometime after August 28th, after the NNS canister upgrade was passed.
These changes to interfaces without maintaining backward compatibility are a bit head-scratching. It was my understanding, especially for system-based APIs, that any significant changes to APIs should require a new, versioned endpoint. When you change APIs you invalidate the interoperability that is generally promised by the IC’s architecture. If our canister had been blackholed we would never have been able to upgrade it to the new API. Fortunately, we had not done so and I’ve updated the canister with the new Governance Types. The full list of changes can be found at Updating Governance Types · icdevs/eventually_reject@1d4d74d · GitHub. It looks like several other additions and changes may affect other canisters that participate in governance (I’ll have to go back, but I suspect that axon is likely broken now as well).
So there are two issues at play:
- How to upgrade function signatures
- How to upgrade types
Either of them can break the compatibility of an automation canister.
First, let’s talk about function signatures. If you change a function signature that another application depends on, you break that application. That is bad enough if the application is a web app. You’ll have to rally and push out application changes. But what about Internet Computer services or utilities? Especially blackholed canisters or canisters under the control of DAOs? Maybe DAOs can vote to upgrade, but if the change is unexpected, your entire DAO app may be broken while you wait for the upgrade to pass. Blackholed canisters are just out of luck. You’re going to have to deploy an alternate and rally the community to consider the new canister as canonical, and if it is an important canister you need to hope that any dependent systems have a configuration variable so that they can change the canister id of the service.
To fix this you need to version your functions. For example, in the latest version of OGY Governance, we wanted to change the way stake balances were exposed. Instead of changing the get_balances function signature, we created a get_balances_v1_5 endpoint and made sure that the old endpoint continued to work.
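A minimal sketch of that pattern, in Python for brevity. The ledger, the return shapes, and the account names are invented for illustration; the real OGY Governance signatures are not shown in this post, only the two endpoint names are.

```python
# Hypothetical in-memory ledger: account -> (liquid, staked) balances.
_LEDGER = {"alice": (70, 30), "bob": (5, 0)}


def get_balances(account):
    """Original endpoint: still returns a single total, exactly as old callers expect."""
    liquid, staked = _LEDGER.get(account, (0, 0))
    return liquid + staked


def get_balances_v1_5(account):
    """New endpoint: exposes the stake separately without touching the old shape."""
    liquid, staked = _LEDGER.get(account, (0, 0))
    return {"liquid": liquid, "staked": staked, "total": liquid + staked}


# An old caller and a new caller can coexist against the same service.
print(get_balances("alice"))       # prints 100
print(get_balances_v1_5("alice"))  # prints {'liquid': 70, 'staked': 30, 'total': 100}
```

The cost is that the old endpoint has to be kept alive and tested, but nothing that depends on it breaks on upgrade day.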
Upgrading Types is another matter and I’m less confident in my opinion, but also far more opinionated here. Strongly typed programming is well-loved and cherished by programmers and it certainly can lead to cleaner, easier-to-debug code. But…it has serious consequences when it makes contact with the real world.
Sometimes a lawyer walks into the room and tells you that you have to do X, Y, and Z, and no debating about architectural integrity or difficulty of type refactoring is going to change his position. Sometimes you want to add a proposal type. What to do with all the dependent code and systems that rely on knowing the strong type of the systems they rely on? It is one thing to try to manage these when you are a singular organization; a different beast when you have interdependent defi being deployed, run, and operated by DAOs.
We’ve come up with some migration patterns to deal with type upgrades inside our canisters, but these don’t extend to dependent services. Our Eventually Reject canister doesn’t know what to do with a #CreateServiceNervousSystem request, and worse, it traps in a place where we can’t handle it…when the Motoko code is parsing the response from the NNS canister. (Not sure what Rust would do here, but I’m guessing you don’t get to trap and handle…if I’m wrong, then maybe a solution here is to find a way for Motoko to trap candid mismatches.)
IC Axiom: If you are going to use strong typing and variants in a service you want other people to use and consume from automated canisters (think canister bots in defi), you better be sure you have all the possibilities baked in before you deploy, because adding new variant types may not crash your upgrades, but they sure will crash the canister ecosystem that emerges around your services.
What is the solution? When extensibility may be needed, we might want to strongly consider something like ICRC16 (ICRC-16 CandyShared - Standardizing Unstructured Data Interoperability · Issue #16 · dfinity/ICRC · GitHub). This likely needs to be extended with things like schemas and transform libraries (similar to how we had xml, xsd, and xslt). By using dynamic types, dependent services can program what happens when a node in a data structure shows up that they are not expecting. Our eventually reject canister didn’t really even care about the type of proposal; it just needed to pay attention to the expiration date and proposal ID. Everything else was just noise to that particular use case. You don’t get to decide what use cases the users of your IC service will have, and if you use strong typing that may change in the future you are tying one hand behind their backs.
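To make that concrete, here is a small Python sketch of a consumer that reads loosely typed data and only looks at the two fields it needs. Plain dicts stand in for ICRC16-style values, and the field names ("id", "deadline_seconds", "action") are hypothetical, not the NNS schema.

```python
def summarize(proposal):
    """Pull out only the fields this consumer cares about.

    Returns None when they are missing or malformed instead of raising,
    so one odd record cannot take the whole run down.
    """
    if not isinstance(proposal, dict):
        return None
    proposal_id = proposal.get("id")
    deadline = proposal.get("deadline_seconds")
    if isinstance(proposal_id, int) and isinstance(deadline, int):
        return (proposal_id, deadline)
    return None


response = [
    {"id": 1, "deadline_seconds": 1000, "action": {"Motion": {"text": "hello"}}},
    # A proposal kind this consumer has never heard of: it is simply ignored noise.
    {"id": 2, "deadline_seconds": 2000, "action": {"CreateServiceNervousSystem": {}}},
    # A malformed record is skipped rather than trapping.
    {"id": "oops"},
]

usable = [s for s in map(summarize, response) if s is not None]
print(usable)  # prints [(1, 1000), (2, 2000)]
```

The unknown action never gets inspected, so it cannot break the parts of the response the consumer actually depends on.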
We’ve gone a good way down the road of using ICRC16-like Values in the ICRC3/7/8 working groups. (As an aside, we keep getting poked in the forums about ‘how can it take so long to come up with a standard’….this post is the answer. You have to try to think of EVERYTHING and you get to meet for an hour every 2 weeks.)
These Values(especially if extended to ICRC16) allow us to represent almost any data structure in an extensible way….and if we want to add a transaction type in the future, the added item won’t break the function call from an indexing canister before the canister even gets its hands on the data. Don’t get me wrong, if you don’t know what you are doing, you can still break your canister and end up trapping somewhere because you don’t handle each possibility….but if you do know what you are doing you can fail gracefully and keep your canister up and running even if it can’t handle the newest hotness(which you likely don’t care about because your service existed before it did).
These two things cross over a bit as well. Note that adding a new variant to a function’s input type won’t break other applications calling your service, because they will never know to include the new variant. Since they never send it, nothing breaks. But reading a response will break as soon as one of the new variants shows up in your data. Don’t let ‘kinda safe’ into your code or it will bite you later on for a reason you aren’t thinking of now.
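The asymmetry can be shown with a toy model in Python. This is only an analogy for how a strictly typed decoder behaves, not how candid decoding is actually implemented, and the helper names are made up for the example.

```python
# Variants the older consumer was built against (used here for illustration).
KNOWN_ACTIONS = {"Motion", "ManageNeuron"}
# The upgraded service understands one more.
SERVICE_ACTIONS = KNOWN_ACTIONS | {"CreateServiceNervousSystem"}


def service_accepts(action):
    """Input side: the service accepts every variant it knows, old ones included."""
    return action in SERVICE_ACTIONS


def old_consumer_can_decode(response):
    """Output side: mimic a strict decoder, where one unknown variant
    fails the whole response rather than just the offending item."""
    return all(item["action"] in KNOWN_ACTIONS for item in response)


before = [{"id": 1, "action": "Motion"}]
after = before + [{"id": 2, "action": "CreateServiceNervousSystem"}]

print(all(service_accepts(a) for a in KNOWN_ACTIONS))  # prints True
print(old_consumer_can_decode(before))                 # prints True
print(old_consumer_can_decode(after))                  # prints False
```

Everything the old consumer sends is still valid, but the first response containing the new variant is unreadable to it, including the items it would have understood.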
So in the future, I’d propose the following suggestions for the NNS team (and anyone else who is building anything resembling a utility or service on the IC).
- Version your functions and maintain backward compatibility. (Dom said as much: Open internet services, Williams explained, “can share functionality and data with other services using APIs that they cannot later revoke or degrade, enabling services to build on top of each other without having to trust one another, and providing for a real programmable web with incredible network effects.” - Dominic Williams on DFINITY’s Plan to Redesign the Internet | DFINITY | The Internet Computer Review)
- Consider type-unsafe endpoints for interacting with NNS governance that won’t break in the future when you add new governance types. (Maybe this has already been considered for SNSs, which have dynamic and different proposal types. It looks like Generic Proposal types are used: SNS proposals | Internet Computer … I’m curious why this type wasn’t used for the single shot SNS call?)
This would look like adding a list_proposals_type_unsafe_v1 function that returns Values according to defined schemas, as we’ve done with ICRC3: https://github.com/dfinity/ICRC-1/tree/icrc-3/standards/ICRC-3#account-schema (But I think the full ICRC16 provides some more ICy goodies that would be good long term.)
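A rough sketch of what the service side of such an endpoint might do, again in Python. The tags (Nat, Int, Text, Blob, Array, Map) follow the spirit of the ICRC3 Value type; the encoding rules and the proposal fields are assumptions for illustration, and list_proposals_type_unsafe_v1 is only the name proposed above, not an existing NNS function.

```python
def to_value(obj):
    """Encode plain data as a tagged tree in the spirit of the ICRC3 Value type."""
    if isinstance(obj, bool):
        # bool is a subclass of int in Python; this sketch has no boolean tag.
        raise TypeError("booleans are not supported in this sketch")
    if isinstance(obj, int):
        return ("Nat", obj) if obj >= 0 else ("Int", obj)
    if isinstance(obj, str):
        return ("Text", obj)
    if isinstance(obj, bytes):
        return ("Blob", obj)
    if isinstance(obj, list):
        return ("Array", [to_value(x) for x in obj])
    if isinstance(obj, dict):
        return ("Map", [(key, to_value(val)) for key, val in obj.items()])
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def list_proposals_type_unsafe_v1(proposals):
    """Hypothetical endpoint: return every proposal as a generic Value."""
    return [to_value(p) for p in proposals]


print(list_proposals_type_unsafe_v1([{"id": 7, "deadline_seconds": 1000}]))
# prints [('Map', [('id', ('Nat', 7)), ('deadline_seconds', ('Nat', 1000))])]
```

A new proposal kind only adds new keys or nested maps to the tree, so the set of tags a consumer has to decode stays fixed.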
I’m happy to hear any suggestions/comments/criticisms. Now that I know to look for these kinds of updates, I’ll attempt to be more diligent in watching for changes to the schema, as they will break the eventually reject canister again.