The current incident, in which a problem on one subnet may mean that the ledger cannot be upgraded, puts a spotlight on a long-standing fundamental issue of our system, and is hopefully a wake-up call. I’d say it’s crucial that developers of an important canister like the ledger be able to program it so that it can be upgraded safely even with outstanding calls.
And we know how to do it:
Change (or extend, for compatibility) the system API to have named entry points for the response callbacks. A spec proposal for that has been floating around the relevant repository for a good while.
Change the ledger canister to use the new interface. This will require implementing callback handlers explicitly instead of using await, but for a canister like the ledger this is not a significant hurdle. In fact, it may make the code cleaner and easier to understand, and reduce the risk of await-related pitfalls.
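To make the second step more concrete, here is a minimal Rust sketch of a ledger-like canister written with explicit callback handlers instead of await. It is a plain-Rust model, not code for any real CDK: `Ledger`, `PendingTransfer`, `start_transfer`, `on_transfer_reply` and `on_transfer_reject` are hypothetical names, and the actual inter-canister call is omitted. What it tries to show is that everything a callback needs lives in ordinary canister state keyed by a call id, rather than inside a suspended future, so that state can be carried across an upgrade and interpreted by the new code.

```rust
use std::collections::BTreeMap;

// Hypothetical identifier the canister assigns to each outstanding call;
// it would be passed along as the callback's environment.
type CallId = u64;

// State that `await` would otherwise keep hidden inside a compiled future.
// Here it is plain data, so it could be serialized and kept across an upgrade.
#[derive(Debug, Clone, PartialEq)]
struct PendingTransfer {
    from: String,
    amount: u64,
}

#[derive(Debug, Default)]
struct Ledger {
    balances: BTreeMap<String, u64>,
    pending: BTreeMap<CallId, PendingTransfer>,
    next_id: CallId,
}

impl Ledger {
    // Step 1: debit up front and record the pending call. A real canister would
    // then issue the inter-canister call, naming the reply/reject handlers below.
    fn start_transfer(&mut self, from: &str, amount: u64) -> Result<CallId, String> {
        let bal = self.balances.get_mut(from).ok_or("unknown account")?;
        if *bal < amount {
            return Err("insufficient funds".to_string());
        }
        *bal -= amount;
        let id = self.next_id;
        self.next_id += 1;
        self.pending.insert(id, PendingTransfer { from: from.to_string(), amount });
        Ok(id)
    }

    // Step 2a: reply handler. The debit stands; only the pending entry is dropped.
    fn on_transfer_reply(&mut self, id: CallId) {
        self.pending.remove(&id);
    }

    // Step 2b: reject handler. The debited amount is credited back.
    fn on_transfer_reject(&mut self, id: CallId) {
        if let Some(p) = self.pending.remove(&id) {
            *self.balances.entry(p.from).or_insert(0) += p.amount;
        }
    }
}
```

Because `pending` is plain data, a new version of such a canister would only need to keep understanding its format (and keep the handler names) to process responses to calls made by the old version.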
It will likely not be possible to use this from Motoko, or from Rust when using await, and that’s okay: it just has to be possible to do safe, instantaneous (i.e. no stopping) upgrades for those who can’t afford to have their upgradability at the whims of possibly malicious other canisters.
Yes, it’s not a sexy feature, and it’s unsatisfying that it’s not compatible with async/await. But it’s still sexier than getting stuck with an unupgradeable ledger (and then having to resort to patching the replica to synthesize the responses, as happened before).
Hear, hear. This seems like something we should make time for. If it would reduce the drag on upgrading the ledger, the sooner it’s done, the more time it could save. Also, reducing the likelihood of a foundation-backed “full stop” event on a proposal seems quite valuable. Such occurrences are bad PR, and I’m curious to see whether the voting turnout is as good the second time around.
Can you explain why named entry points for response callbacks will solve the problem of upgrading canisters with outstanding call contexts?
My understanding is that right now, inter-canister calls execute reply and reject callback functions that are stored in a WebAssembly table, which itself is part of the calling canister’s Wasm module. These callback functions are looked up by their table indexes (I think).
How does using entry points (i.e. ic0.reply_callback and ic0.reject_callback) actually help?
The core problem, that “some canisters may not be able to make sense of callbacks after upgrades”, still isn’t addressed.
Or are you saying that the upcoming work on safe canister upgrades will fix this, because the callback function would then be part of the canister’s Candid interface and could thus be statically checked for breaking changes?
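The difference between the two lookup schemes can be illustrated with a toy Rust model. This is an assumption-laden sketch, not the replica’s actual mechanism: `Module`, `resolve_by_index`, `resolve_by_name` and the handler names are hypothetical. A call made before an upgrade remembers its reject callback either as a table position or as a name; after the upgrade, the new module’s table layout may differ, so the same index can land on an unrelated function, whereas a name either still resolves to the intended handler or fails detectably.

```rust
use std::collections::HashMap;

// Toy model only: a handler receives the response payload and reports what ran.
type Handler = fn(&str) -> String;

// Hypothetical stand-in for a canister's Wasm module.
struct Module {
    // Like a WebAssembly table: callbacks are addressed by position.
    table: Vec<Handler>,
    // Like named callback entry points: callbacks are addressed by name.
    named: HashMap<&'static str, Handler>,
}

fn log_ok(r: &str) -> String {
    format!("logged: {r}")
}

fn refund(r: &str) -> String {
    format!("refund after: {r}")
}

fn unrelated(r: &str) -> String {
    format!("unrelated code ran on: {r}")
}

fn named_callbacks() -> HashMap<&'static str, Handler> {
    HashMap::from([("on_reply", log_ok as Handler), ("on_reject", refund as Handler)])
}

// Module before the upgrade: the reject handler sits at table index 1.
fn old_module() -> Module {
    Module { table: vec![log_ok as Handler, refund], named: named_callbacks() }
}

// Module after the upgrade: a new function shifted the table layout.
fn new_module() -> Module {
    Module { table: vec![unrelated as Handler, log_ok, refund], named: named_callbacks() }
}

// Resolve a callback that was recorded as a table index when the call was made.
fn resolve_by_index(m: &Module, idx: usize) -> Option<Handler> {
    m.table.get(idx).copied()
}

// Resolve a callback that was recorded as a name when the call was made.
fn resolve_by_name(m: &Module, name: &str) -> Option<Handler> {
    m.named.get(name).copied()
}
```

For example, a call made against `old_module()` recorded index 1 (`refund`) for its reject; resolved against `new_module()`, index 1 now yields `log_ok`, so the refund silently never happens, while looking up `"on_reject"` still finds `refund`. Names alone do not cover the callback’s environment argument, though: presumably that would also have to be plain data the new code understands, such as a call id, rather than a pointer into the old heap.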
Thanks for the note, @nomeata! I agree. I’ll take an action item to dig up the relevant PR and post it on the forum. I’m still hoping that the ic-ref repo will be open-sourced soon, which would make publishing such PRs easier.
Yes, there are lots of answers to @jzxchiang’s questions in that repo. Maybe I’ll wait and see if it becomes available soon, instead of typing long texts on my phone.
No, but with the Interface Specification repo now accepting public contributions, we can at least take some steps towards that goal without relying only on DFINITY devs. This PR is a first step, but I still need to write down what named callbacks would look like:
The callback method names would be separate from the public methods (they receive different data, and it would be dangerous if others could call them willy-nilly), so this requires a new kind of entry point into the canister. No rocket science, it just needs doing. (The “just needs doing” part would be easier to fuel with motivation if I knew that DFINITY actually plans to implement it, but writing it down may increase their motivation to do so, so we’ll see :-))
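One way to picture that separate kind of entry point is the following Rust sketch, again a hypothetical model rather than any proposed spec text: `Message`, `dispatch`, `is_one_of`, `PUBLIC_METHODS` and `CALLBACKS` are made-up names. The assumption is that only the system can deliver a `Message::Response`, while external callers can only produce `Message::Call`. Because the two namespaces are checked separately, a caller that names a callback in an ordinary call is rejected instead of being able to feed it forged response data.

```rust
// Hypothetical messages a canister might receive, for illustration only.
#[derive(Debug)]
enum Message<'a> {
    // An ordinary method call; anyone can send one.
    Call { method: &'a str, arg: &'a str },
    // A response delivered by the system to a named callback;
    // `env` identifies the original outstanding call.
    Response { callback: &'a str, reply: Result<&'a str, &'a str>, env: u64 },
}

// Two separate namespaces: public methods and callback entry points.
const PUBLIC_METHODS: &[&str] = &["transfer", "balance"];
const CALLBACKS: &[&str] = &["on_transfer_done"];

fn is_one_of(name: &str, list: &[&str]) -> bool {
    list.iter().any(|n| *n == name)
}

fn dispatch(msg: &Message) -> Result<String, String> {
    match msg {
        Message::Call { method, arg } => {
            // Only the public namespace is consulted for calls, so naming a
            // callback here fails instead of running it with forged data.
            if is_one_of(method, PUBLIC_METHODS) {
                Ok(format!("public {method}({arg})"))
            } else {
                Err(format!("no public method named {method}"))
            }
        }
        Message::Response { callback, reply, env } => {
            // Only the callback namespace is consulted for system-delivered responses.
            if !is_one_of(callback, CALLBACKS) {
                return Err(format!("no callback named {callback}"));
            }
            match reply {
                Ok(data) => Ok(format!("{callback}: reply {data} for call {env}")),
                Err(reason) => Ok(format!("{callback}: reject {reason} for call {env}")),
            }
        }
    }
}
```

In this model, a `Call` naming `on_transfer_done` returns an error, while a `Response` to `on_transfer_done` runs the handler with the reply or reject and the call id it carries.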