Prioritize safe instantaneous canister upgrades

The current incident, in which a problem on one subnet means that the ledger possibly cannot be upgraded, puts a spotlight on a long-standing fundamental issue of our system, and is hopefully a wake-up call. I’d say it’s crucial that developers of an important canister like the ledger be able to program it so that it upgrades safely even with outstanding calls.

And we know how to do it:

  1. Change (or extend, for compat) the system API to have named entry points for the response callbacks. A spec proposal for that has been floating around the relevant repository for a good while.
  2. Change the ledger canister to use the new interface. This will require not using await and instead implementing callback handlers explicitly, but for a canister like the ledger this is not a significant hurdle. In fact, it may make the code cleaner and easier to understand, and reduce the risk of await-related pitfalls.
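
To make the two steps above concrete, here is a rough sketch in plain Rust of what "named entry points plus explicit callback handlers" could look like. It is a toy model, not the system API and not the ledger's code: `StableState`, `PendingTransfer`, `start_transfer`, `deliver`, `exports_v1`/`exports_v2` and the entry-point name `on_transfer_response` are all made up for illustration. The point is only that an outstanding call is recorded as a name plus plain data, so a response can be handled by whatever code version is installed when it arrives.

```rust
use std::collections::BTreeMap;

/// State that survives an upgrade (think: stable memory). It holds only
/// plain data, no closures and no function-table indices.
#[derive(Debug, Default, Clone, PartialEq)]
struct StableState {
    balances: BTreeMap<String, u64>,
    /// Outstanding calls: call id -> (entry-point name, context data).
    pending: BTreeMap<u64, (String, PendingTransfer)>,
    next_call_id: u64,
}

#[derive(Debug, Clone, PartialEq)]
struct PendingTransfer {
    account: String,
    amount: u64,
}

#[derive(Debug, Clone, Copy)]
enum Response {
    Reply,
    Reject,
}

type Handler = fn(&mut StableState, PendingTransfer, Response);

/// Hypothetical: debit the account and record an outstanding call whose
/// response goes to the *named* entry point "on_transfer_response".
fn start_transfer(state: &mut StableState, account: &str, amount: u64) -> Result<u64, String> {
    let balance = state.balances.get_mut(account).ok_or("unknown account")?;
    if *balance < amount {
        return Err("insufficient funds".to_string());
    }
    *balance -= amount;
    let call_id = state.next_call_id;
    state.next_call_id += 1;
    let ctx = PendingTransfer { account: account.to_string(), amount };
    state.pending.insert(call_id, ("on_transfer_response".to_string(), ctx));
    Ok(call_id)
}

/// Code version 1: refund the debit if the call was rejected.
fn on_transfer_response_v1(state: &mut StableState, ctx: PendingTransfer, resp: Response) {
    if let Response::Reject = resp {
        *state.balances.entry(ctx.account).or_insert(0) += ctx.amount;
    }
}

/// Code version 2: same entry-point name, different internals.
fn on_transfer_response_v2(state: &mut StableState, ctx: PendingTransfer, resp: Response) {
    match resp {
        Response::Reply => {}
        Response::Reject => {
            let balance = state.balances.entry(ctx.account).or_insert(0);
            *balance = balance.saturating_add(ctx.amount);
        }
    }
}

/// Each code version exports its handlers under stable names.
fn exports_v1(name: &str) -> Option<Handler> {
    match name {
        "on_transfer_response" => Some(on_transfer_response_v1 as Handler),
        _ => None,
    }
}

fn exports_v2(name: &str) -> Option<Handler> {
    match name {
        "on_transfer_response" => Some(on_transfer_response_v2 as Handler),
        _ => None,
    }
}

/// Deliver a response by looking the handler up by name in the code that
/// is installed *now*, which may be newer than the code that made the call.
fn deliver(
    state: &mut StableState,
    exports: fn(&str) -> Option<Handler>,
    call_id: u64,
    resp: Response,
) -> Result<(), String> {
    let (name, _) = state.pending.get(&call_id).ok_or("unknown call id")?;
    let handler = exports(name).ok_or("no such entry point")?;
    let (_, ctx) = state.pending.remove(&call_id).expect("presence checked above");
    handler(state, ctx, resp);
    Ok(())
}

/// The call is made under v1, the "upgrade" happens while it is still
/// outstanding, and the reject is handled by v2. Returns alice's balance.
fn demo() -> u64 {
    let mut state = StableState::default();
    state.balances.insert("alice".to_string(), 100);
    let id = start_transfer(&mut state, "alice", 30).unwrap();
    deliver(&mut state, exports_v2, id, Response::Reject).unwrap();
    state.balances["alice"]
}
```

In this model the upgrade never has to wait for the callee: the only contract between old and new code is the entry-point name and the shape of the stored context, and both are things the developer controls and can review.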

It will likely not be possible to use this from Motoko, or from Rust when using await, and that’s okay - it just must be possible to have safe instantaneous (i.e. no stopping) upgrades for those who can’t afford to have their upgradability at the whims of possibly malicious other canisters.

Yes, it’s not a sexy feature, and it’s unsatisfying that it’s not compatible with async/await. But it’s still sexier than getting stuck with an un-upgradeable ledger (and then having to resort to patching the replica to synthesize the responses, as happened before).

12 Likes

Hear, hear. Seems like something that we should decide to make time for. If it would reduce drag on the process of upgrading the ledger, the sooner it’s done the more time that could be saved. Also, reducing the likelihood of a foundation-backed “full stop” event on a proposal seems quite valuable. Such occurrences are bad PR, and I’m curious to see if the voting turnout is as good the second time around.

1 Like

Can you explain why named entry points for response callbacks will solve the problem of upgrading canisters with outstanding call contexts?

My understanding is that right now inter-canister updates execute reply and reject callback functions that are stored in a WebAssembly Table, which itself is stored inside the callee canister’s wasm module. These callback functions are looked up using known table entry indexes (I think).
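
To spell out why index-based lookup is fragile across upgrades, here is a small toy model in Rust. It is not the real system API: the `Callback` type, the two tables and `deliver_reply` are invented for illustration, and the field name `reply_fun` is only meant to suggest "an index into a function table". The assumption being illustrated is that the new module may lay out its table differently from the old one.

```rust
type Callback = fn(&mut Vec<String>);

fn credit_refund(log: &mut Vec<String>) {
    log.push("refund credited".to_string());
}

fn send_receipt(log: &mut Vec<String>) {
    log.push("receipt sent".to_string());
}

/// Function table of the module before the upgrade.
fn table_v1() -> Vec<Callback> {
    vec![credit_refund as Callback, send_receipt as Callback]
}

/// Function table after the upgrade: same functions, different order
/// (assumed here; a compiler is free to lay the table out differently).
fn table_v2() -> Vec<Callback> {
    vec![send_receipt as Callback, credit_refund as Callback]
}

/// An outstanding call remembers only a table index.
struct OutstandingCall {
    reply_fun: usize,
}

/// Look the callback up by index in whatever table is installed now.
fn deliver_reply(
    table: &[Callback],
    call: &OutstandingCall,
    log: &mut Vec<String>,
) -> Result<(), String> {
    let callback = *table.get(call.reply_fun).ok_or("table index out of bounds")?;
    callback(log);
    Ok(())
}

/// The call was registered under v1 with index 0 (meaning `credit_refund`).
/// Returns what runs when the reply arrives before and after the upgrade.
fn demo_index_mismatch() -> (String, String) {
    let call = OutstandingCall { reply_fun: 0 };

    let mut before = Vec::new();
    deliver_reply(&table_v1(), &call, &mut before).unwrap();

    let mut after = Vec::new();
    deliver_reply(&table_v2(), &call, &mut after).unwrap();

    (before[0].clone(), after[0].clone())
}
```

In this sketch the same index silently runs a different function after the upgrade, and an index beyond the new table's length fails outright. A lookup by name does not depend on table layout.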

How does using entry points (i.e. ic0.reply_callback and ic0.reject_callback) actually help?

The core problem of:

Some canisters may not be able to make sense of callbacks after upgrades

still isn’t addressed.

Or are you saying that the upcoming work in safe canister upgrades will fix this because now the callback function is part of the callee canister’s Candid interface and can thus be statically analyzed for breaking changes?

1 Like

Thanks for the note @nomeata! I agree. I will take an action item to pull up the relevant PR and post it on the forum. I am still hoping that the ic-ref repo will be open sourced soon and then publishing such PRs will be easier.

2 Likes

Yes, there are lots of answers to @jzxchiang’s questions in that repo. Maybe I’ll wait to see if that becomes available soon, instead of typing long texts on the phone :slight_smile:

Have the ability for a controller to download the entire state of a canister and analyze/process it.

5 Likes

^ This please. Can’t stress how important this will eventually be.

3 Likes

Pre upgrade can fail, post upgrade can fail, chunking can fail… many ways to lose everything.

2 Likes

We have a feature request for that. I am hoping to be able to prioritise it next year.

6 Likes

Slightly related: Motoko canisters can currently easily be rendered un-upgradeable by a single trap in a callback (Motoko issue).

False alarm, sorry.