SNS failed canister upgrade proposal

Hey all, we’ve run into an error while completing a Nuance DAO canister upgrade proposal.

We completed voting on proposal 148, but afterwards the canister stayed stopped for longer than usual and returned this error:
Caused by: The replica returned a rejection error: reject code CanisterError, reject message IC0509: Canister 353ux-wqaaa-aaaaf-qakga-cai is not running, error code Some("IC0509")

Then after ~5 minutes the canister started again, but it had not upgraded to the expected version. It remained on the previous version, while the proposal status displayed “executed”.

Next we tried again with proposal 154. This time the upgrade succeeded and the canister moved to the correct version.

Even though we’re now on the correct version with proposal 154, we’re curious to get any insights on why this would happen, and to make it a known issue if it’s not already.

Here is the repo for 353ux-wqaaa-aaaaf-qakga-cai if needed:

Hi @Mitch,
thanks for raising the question!

First, note that at the moment an upgrade proposal might say “executed” even if the upgrade wasn’t actually successful. The status “executed” basically only means that SNS root received the request to upgrade a canister from SNS governance. So this part does not concern me, but we know it can be confusing and we are currently working on improving SNS upgrades.
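
Since “executed” alone does not confirm the upgrade, one way to check is to compare the module hash reported for the canister with the SHA-256 of the Wasm you proposed. The snippet below is only a sketch: `upgrade_took_effect` is a hypothetical helper, and it assumes you already have the Wasm bytes exactly as submitted and the reported hash as a hex string (with or without a `0x` prefix).

```python
import hashlib


def upgrade_took_effect(wasm_bytes: bytes, reported_module_hash: str) -> bool:
    """Return True if the reported module hash matches the proposed Wasm.

    Hypothetical helper for illustration; how you obtain the reported
    hash (dashboard, CLI, canister status call) is up to you.
    """
    # The module hash is the SHA-256 of the installed Wasm bytes.
    expected = hashlib.sha256(wasm_bytes).hexdigest()

    # Normalise the reported value: trim, lowercase, drop an optional "0x".
    reported = reported_module_hash.strip().lower()
    if reported.startswith("0x"):
        reported = reported[2:]

    return reported == expected
```

If this returns `False` after a proposal shows “executed”, the canister is most likely still on the previous version.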

I still need to look into the error code to be sure what happened in detail.
My suspicion is that the upgrade failed because the canister could not be stopped: it may have had open call contexts, or some of its queues may have been full, or something similar. If SNS root cannot get the upgrade through for too long (which I think could be the 5 minutes you described), it gives up and starts the canister again. This is because retrying later makes more sense than waiting forever for an answer, as waiting would leave SNS root stuck as well.

I assume that when you tried again later with a new proposal, the underlying problem was gone (the queues were empty, or all of the target canister’s open call contexts had returned so it could be stopped), and that is why it worked the second time.
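
To make that sequence concrete, here is a toy Python simulation of the behaviour described above. It is a sketch under the assumptions in this post, not the actual SNS root code: `Canister`, `try_upgrade`, and the tick-based timeout are all made up for illustration, and each tick simply lets one open call context return.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"


@dataclass
class Canister:
    """Toy model of a canister; not a real IC API."""
    version: str
    open_call_contexts: int = 0
    status: Status = Status.RUNNING

    def handle_message(self) -> str:
        # Only a running canister accepts new messages.
        if self.status is not Status.RUNNING:
            return "IC0509: canister is not running"
        return "ok"

    def request_stop(self) -> None:
        # The canister stays in "stopping" until open calls have returned.
        self.status = Status.STOPPING

    def tick(self) -> None:
        # One time step: at most one outstanding call returns.
        if self.open_call_contexts > 0:
            self.open_call_contexts -= 1
        if self.status is Status.STOPPING and self.open_call_contexts == 0:
            self.status = Status.STOPPED


def try_upgrade(canister: Canister, new_version: str, max_ticks: int) -> bool:
    """Stop, upgrade, restart. Give up and restart if stopping takes too long."""
    canister.request_stop()
    for _ in range(max_ticks):
        canister.tick()
        if canister.status is Status.STOPPED:
            canister.version = new_version
            canister.status = Status.RUNNING
            return True
    # Timed out: restart on the old version so the caller is not stuck.
    canister.status = Status.RUNNING
    return False
```

With ten open call contexts and `max_ticks=5`, the first `try_upgrade` call gives up and leaves the old version running. A second call after the remaining calls have drained succeeds, which matches what happened with proposals 148 and 154.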

I hope this helps as a starting point!


Just to confirm, what Lara says is correct. The error code that you saw when you attempted to send a message to the canister is consistent with that explanation, as it mentions the canister was not running (it was in a Stopping state, waiting to be able to be fully stopped so it can be upgraded).

Canisters have to be stopped in order to be safely upgraded. Upgrading a canister without stopping it can have some dire consequences.
