Watch out for foot guns with canister upgrades

akhilesh.singhania · November 19, 2021, 2:38pm

In general, the canister upgrade story still has a lot of foot guns that need addressing. I want to highlight the following known ones. I may have missed some.

Bugs in `pre_upgrade` hooks

If there is a bug in your pre_upgrade hook that causes it to panic, the canister can no longer be upgraded. This is because the pre_upgrade hook is part of the currently deployed wasm module, and the system always executes it before deploying the new wasm module. If the pre_upgrade hook fails, the system fails the whole upgrade.

Currently we do not have a good mitigation around this issue other than urging developers to make sure that their pre_upgrade is bug free by doing a lot of testing.
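To make the failure mode concrete, here is a minimal sketch of the upgrade flow in plain Rust. It is a toy model, not the system's implementation: the `Canister` struct, the `PreUpgrade` hook type, and the `try_upgrade`, `good_hook` and `buggy_hook` functions are all hypothetical names, and a trap is modelled as an `Err` value.

```rust
/// Toy model of a canister: a version number, heap state, and stable memory.
#[derive(Clone, Debug, PartialEq)]
pub struct Canister {
    pub version: u32,
    pub heap_state: Vec<u64>,
    pub stable_memory: Vec<u8>,
}

/// Hypothetical hook type: Ok(bytes) on success, Err(msg) models a trap or panic.
pub type PreUpgrade = fn(&[u64]) -> Result<Vec<u8>, String>;

/// Attempt an upgrade. The hook of the *currently installed* module runs first.
/// If it fails, nothing is modified and the canister stays on its old version.
pub fn try_upgrade(
    canister: &mut Canister,
    installed_pre_upgrade: PreUpgrade,
    new_version: u32,
) -> Result<(), String> {
    match installed_pre_upgrade(&canister.heap_state) {
        Ok(bytes) => {
            canister.stable_memory = bytes;
            canister.version = new_version;
            Ok(())
        }
        Err(msg) => Err(format!("upgrade failed in pre_upgrade: {}", msg)),
    }
}

/// A hook that serialises every value as little-endian bytes.
pub fn good_hook(state: &[u64]) -> Result<Vec<u8>, String> {
    Ok(state.iter().flat_map(|v| v.to_le_bytes()).collect())
}

/// A hook with a bug: it fails whenever there is any state to save.
pub fn buggy_hook(state: &[u64]) -> Result<Vec<u8>, String> {
    if state.is_empty() {
        Ok(Vec::new())
    } else {
        Err("simulated panic while serialising".to_string())
    }
}
```

In this model, every call to `try_upgrade` with `buggy_hook` installed returns an error and leaves the canister unchanged, no matter what the new module looks like. That is the sense in which the canister is stuck: the fix cannot be deployed because the broken hook runs first.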

Long running upgrades

Generally speaking, when a canister is being upgraded, the logic in the pre_upgrade hook serialises state from the wasm heap to stable memory and the logic in the post_upgrade hook deserialises it from stable memory back to the wasm heap. There is a bound on the number of instructions the upgrade process can execute. So if the canister has too much state or the [de]serialising logic is not very efficient, the whole process may not finish in time.

The recommended mitigation here is to ensure that the state that needs to be persisted across upgrades does not exceed what the canister can [de]serialise during the upgrade process.
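One rough way to reason about this mitigation is a back-of-the-envelope budget check. The sketch below assumes a simple linear cost model (a fixed number of instructions per serialised byte), which is an assumption made for illustration. The function names are hypothetical, and the actual limit and per-byte cost have to be measured for a given canister.

```rust
/// Estimated instructions to [de]serialise `state_bytes` under a linear cost model.
pub fn upgrade_cost(state_bytes: u64, instructions_per_byte: u64) -> u64 {
    state_bytes.saturating_mul(instructions_per_byte)
}

/// True if the estimated cost stays within the instruction limit.
pub fn fits_in_budget(state_bytes: u64, instructions_per_byte: u64, limit: u64) -> bool {
    upgrade_cost(state_bytes, instructions_per_byte) <= limit
}

/// Largest state size, in bytes, that still fits within the limit.
pub fn max_state_bytes(instructions_per_byte: u64, limit: u64) -> u64 {
    if instructions_per_byte == 0 {
        u64::MAX
    } else {
        limit / instructions_per_byte
    }
}
```

With placeholder numbers, a limit of 10,000 instructions and a cost of 10 instructions per byte gives `max_state_bytes(10, 10_000) == 1_000`, so 1,000 bytes of state fit and 1,001 bytes do not.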

[de]serialiser requiring additional wasm memory

Related issue in Motoko: GC: Reserve Wasm memory for upgrading canisters · Issue #2909 · dfinity/motoko · GitHub. Generally speaking, it is possible that the serialising logic requires some additional wasm heap to run. Let’s say that the canister has 3.5GiB of wasm heap in use and the serialising logic requires an additional 600MiB to serialise the data. Given that the wasm heap is limited to 4GiB, the upgrade process will again fail. Note that this issue will also be present for canisters written in Rust.

The recommended mitigation here is to again ensure that the state that needs to be persisted across upgrades does not exceed what the canister can [de]serialise during the upgrade process.
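The arithmetic in the example above can be checked directly. This sketch uses the figures from the post (3.5GiB in use, 600MiB extra, 4GiB limit); the constant and function names are hypothetical.

```rust
pub const MIB: u64 = 1024 * 1024;
pub const GIB: u64 = 1024 * MIB;

/// The 4GiB wasm heap limit mentioned in the post.
pub const WASM_HEAP_LIMIT: u64 = 4 * GIB;

/// Heap space still available before hitting the limit.
pub fn headroom(heap_in_use: u64) -> u64 {
    WASM_HEAP_LIMIT.saturating_sub(heap_in_use)
}

/// True if the serialiser's extra working memory fits in the remaining heap.
pub fn serialiser_fits(heap_in_use: u64, serialiser_extra: u64) -> bool {
    serialiser_extra <= headroom(heap_in_use)
}
```

With 3.5GiB in use the headroom is 512MiB, so a serialiser that needs 600MiB does not fit, while the same serialiser would fit if the canister used 3GiB.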

Planned features

We are continuously thinking about designs and improvements that we can make to address the above foot guns and balancing that with working on other various high priority projects. Some features that I am hoping that the team can prioritise working on in the near future are listed below. Note that the design for these features is not worked out at all and I may not be able to answer all questions related to them just yet.

Allow developers to download / upload canister state

Despite all the testing that a developer may do, they may still end up with a bricked canister. At this point, the least that the platform can do is allow the developer to download the state of the canister for backup. There are already existing developers like @rckprtr who are building this functionality into their canisters so that they can always backup their data.

Deterministic time slicing

Programming against a platform where messages have a bound on how long they can execute for is quite complicated. This is difficult not just for upgrading the canisters but also for general message execution. The idea of this feature would be that when a message hits the execution limit, instead of failing it, we pause execution, let some other canister execute for a while and then resume execution later. This way we could in theory let messages execute for arbitrarily long.
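Since the design is not worked out, the following is only a toy illustration of the pause-and-resume idea, not a description of how the feature would be built. `Message`, `SliceResult`, `run_slice` and `slices_needed` are hypothetical names, and "instructions" are just a counter.

```rust
/// Outcome of running one slice of a message.
#[derive(Debug, PartialEq)]
pub enum SliceResult {
    /// The per-slice limit was hit; execution can resume later.
    Paused,
    /// The message finished.
    Done,
}

/// A message with a known amount of work left, counted in instructions.
pub struct Message {
    pub remaining: u64,
}

impl Message {
    /// Execute at most `slice_limit` instructions, then pause instead of failing.
    pub fn run_slice(&mut self, slice_limit: u64) -> SliceResult {
        let executed = self.remaining.min(slice_limit);
        self.remaining -= executed;
        if self.remaining == 0 {
            SliceResult::Done
        } else {
            SliceResult::Paused
        }
    }
}

/// Count how many slices a message of `total` instructions needs.
pub fn slices_needed(total: u64, slice_limit: u64) -> u64 {
    assert!(slice_limit > 0, "a slice must allow at least one instruction");
    let mut msg = Message { remaining: total };
    let mut slices = 1;
    while msg.run_slice(slice_limit) == SliceResult::Paused {
        // In a real scheduler, other canisters would run here before resuming.
        slices += 1;
    }
    slices
}
```

In this model a message needing 25 instructions under a per-slice limit of 10 completes in three slices rather than failing at the first limit.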

jzxchiang · November 20, 2021, 12:23am

If there is a bug in your pre_upgrade hook that causes it to panic, the canister can no longer be upgraded.

Do you mean that one upgrade would fail (and revert the canister to its previous state, with all its previous data intact), or that the canister would also permanently lose its previous data?

Allow developers to download / upload canister state

This would be so useful! Even for local development.

nomeata · November 20, 2021, 2:19pm

Maybe worth adding to this list that (most) canisters must be stopped before upgrading, but that can be delayed (or even be impossible) depending on what kind of canisters you call?

We can make it safe enough to upgrade canisters without stopping in some cases, but it’s yet another of those system API changes that are not super sexy and are competing with the many other important things waiting…

akhilesh.singhania · November 22, 2021, 8:16am

I mean that the canister will be reverted to its previous state with all previous data intact.

Indeed, there are multiple reasons to get this going!

Good points. Note that we are trying to get a wiki going. I am going to make a list of current “limitations” on it and use this thread to seed it.

anthonymq · November 22, 2021, 10:42am

Oh really? We have to stop the canisters before an upgrade?
I’m following @rckprtr’s pattern to back up and restore my data until a clear solution is found.

akhilesh.singhania · November 22, 2021, 10:57am

Generally speaking, yes, it is a good idea to only upgrade Stopped canisters. Otherwise, it is possible that a Response from the previous version of the wasm module is executed against the newer wasm module and new state which may not be compatible and subtle corruptions could occur.
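A small model may help show how such a mismatch can arise. It assumes that a pending callback is remembered only as an index into the module's function table, and that the new module happens to lay out its table differently. The `Module` and `PendingCallback` types and the handler names are hypothetical.

```rust
/// Toy wasm module: a function table, with each entry represented by a name.
pub struct Module {
    pub table: Vec<&'static str>,
}

/// A callback registered when the request was sent, stored as a table index.
pub struct PendingCallback {
    pub reply_fun: usize,
}

/// Look up which function a pending callback would invoke in a given module.
pub fn resolve(module: &Module, cb: &PendingCallback) -> Option<&'static str> {
    module.table.get(cb.reply_fun).copied()
}

/// The module that was installed when the request went out.
pub fn old_module() -> Module {
    Module { table: vec!["on_transfer_reply", "on_balance_reply"] }
}

/// The upgraded module: same handlers, but the table order changed.
pub fn new_module() -> Module {
    Module { table: vec!["on_balance_reply", "on_transfer_reply", "on_audit_reply"] }
}
```

A callback registered with index 0 resolves to `on_transfer_reply` in the old module but to `on_balance_reply` in the new one, so a Response that arrives after the upgrade would run the wrong handler against the new state.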

jzxchiang · November 24, 2021, 5:59am

Wow interesting. So upgrading stopped canisters is merely a recommendation and not a requirement, is that right?

Also, can you clarify what you mean by Response?

akhilesh.singhania · November 24, 2021, 8:59am

I personally think that it should be a requirement. AFAIK, no compiler can guarantee that the new wasm will be compatible with the older wasm, so no realistic wasm module (i.e. not hand crafted) can manage this.

All messages between canisters must be either Requests or Responses. See Smart Contracts on the Internet Computer and Smart Contracts on the Internet Computer for more details. When a canister calls ic0.call_perform(), it is sending a Request to another canister. When a canister calls ic0.msg_reply() or ic0.msg_reject() (when replying to a Request from a canister), it sends a Response.

Smart Contracts on the Internet Computer has some more discussion as well.

nomeata · November 24, 2021, 2:06pm

I think that’s too pessimistic. Just because the two compilers we use right now can’t do this doesn’t mean that we shouldn’t at least allow someone to do better - either improving the compilers, or maybe using postprocessing. And with a better system API (see other thread) it’s in reach for Rust.

The whole idea of having to stop a canister like this, and thus always have downtimes of unpredictable length, is just silly given our claims about the Internet Computer (always available, people can put important stuff on it…). I hope we can fix these problems, rather than continuing to only manage them.

(That said, now that we introduce custom sections in the wasm for IC-specific metadata, maybe we can consider a section that indicates whether the canister can be upgraded without stopping, to prevent foot guns.)

claudio · November 24, 2021, 3:10pm

Motoko will actually prevent an upgrade if the canister has pending call-backs.
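A guard of this kind can be sketched as a counter of outstanding calls that is checked before an upgrade proceeds. This is not Motoko's actual implementation; `CallTracker` and its methods are hypothetical names used only to illustrate the idea.

```rust
/// Tracks how many inter-canister calls are still waiting for a response.
#[derive(Default)]
pub struct CallTracker {
    outstanding: u32,
}

impl CallTracker {
    /// Record that a call was sent and a call-back is now pending.
    pub fn call_sent(&mut self) {
        self.outstanding += 1;
    }

    /// Record that a response arrived and its call-back has run.
    pub fn response_handled(&mut self) {
        self.outstanding = self.outstanding.saturating_sub(1);
    }

    /// Refuse the upgrade while any call-back is still pending.
    pub fn check_upgrade(&self) -> Result<(), String> {
        if self.outstanding == 0 {
            Ok(())
        } else {
            Err(format!("{} call-back(s) still pending", self.outstanding))
        }
    }
}
```

A pre_upgrade hook could call something like `check_upgrade` and trap on `Err`, which turns a silent corruption risk into a visible, retryable upgrade failure.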

akhilesh.singhania · November 24, 2021, 3:23pm

I was of course referring to all the wasm modules that I have seen in the wild and what our developers are building. IMO, having a more restrictive system initially and then relaxing the constraints when we have built sufficient capabilities is probably a more user friendly approach than having a less restrictive system with many footguns.

nomeata · November 24, 2021, 5:41pm

Foot gun guards are better placed in layers above the system, I’d say. And as Claudio points out, Motoko does that - the Rust CDK probably should too.

The problem with putting restrictions into the system is that it stifles innovation:

Assume the system would prevent such upgrades, and you’d be in the position of implementing the first CDK (maybe for rust, maybe for another language) that allows safe instantaneous upgrades. Now you have a killer feature, but you can’t even use it before you convince DFINITY to flip a switch in their code, with all the fluff and politics involved. (See canisters holding ICPs).

In contrast, assume the system is like it is now. Someone forks the rust CDK to provide safe instantaneous upgrades, their developers immediately benefit.

Plus, allowing immediate upgrades and reinstallation can be useful as a matter of last resort (the bug fixed by the upgrade may incur higher risks than the possible state corruption, which can for a concrete case even be assessed by a wasm-reading person).

Plus, allowing these upgrades keeps having a safer API for that on the agenda, and keeps us on track to having canisters like the ledger (and many users’ canisters) upgradable.

(The ledger only sends notifies without caring about the result, just throwing them away, and makes no other calls? Then it would be easy to make it safely upgradable even with the current system API and compilers. I can expand if the ledger team would be interested.)

jzxchiang · November 25, 2021, 12:41am

Thanks. What’s the difference between ic0.msg_reply / ic0.msg_reject and the inter-canister call callbacks that are stored inside the canister table?

akhilesh.singhania · November 25, 2021, 9:06am

Let’s say canister A sends a request to canister B.

If you look at the arguments to ic0.call_new() here: The Internet Computer Interface Specification :: Internet Computer, the reply_fun and reply_env identify the function to be called if B replies to A using ic0.msg_reply, and reject_fun and reject_env identify the function to be called if B replies to A using ic0.msg_reject.
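The relationship between the four arguments and the two kinds of Response can be sketched as follows. This is a model in plain Rust, not the system API: in the real interface the functions are identified by table indices, whereas here they are ordinary function pointers, and `CallContext`, `Response`, `deliver`, `on_reply` and `on_reject` are hypothetical names.

```rust
/// The Response that canister B sends back to canister A.
pub enum Response {
    /// B used ic0.msg_reply.
    Reply(Vec<u8>),
    /// B used ic0.msg_reject.
    Reject(String),
}

/// What A registered when it created the call (modelled with fn pointers).
pub struct CallContext {
    pub reply_fun: fn(u32, &[u8]) -> String,
    pub reply_env: u32,
    pub reject_fun: fn(u32, &str) -> String,
    pub reject_env: u32,
}

/// Deliver B's Response to A by invoking the matching callback with its env.
pub fn deliver(ctx: &CallContext, response: Response) -> String {
    match response {
        Response::Reply(data) => (ctx.reply_fun)(ctx.reply_env, &data),
        Response::Reject(msg) => (ctx.reject_fun)(ctx.reject_env, &msg),
    }
}

/// Example reply handler for A.
pub fn on_reply(env: u32, data: &[u8]) -> String {
    format!("reply env={} bytes={}", env, data.len())
}

/// Example reject handler for A.
pub fn on_reject(env: u32, msg: &str) -> String {
    format!("reject env={} msg={}", env, msg)
}
```

Exactly one of the two callbacks runs for each call, and the env value lets A tell apart several calls that share the same handler.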

akhilesh.singhania · November 25, 2021, 9:07am

Looks like we are in agreement here. I am perfectly happy if it is the CDKs that are protecting the users from the footguns. The system should indeed be more expressive.

nomeata · November 25, 2021, 11:02am

Just checked, unfortunately the ledger does not just use “one-shot” calls, but has logic in the response handler. Too bad.

But we can still extract a pattern: if you want your canister to be upgradable anytime with zero downtime, try to structure your service so that it only makes calls without caring about the response (e.g. a pattern of notify and explicit acknowledge). Then you can use the existing system API in a way that upgrades don’t need stopping. In a nutshell: pass an invalid table index, e.g. -1, for the callbacks. This will always safely trap, even after an upgrade.
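Why an index of -1 is safe regardless of which module is installed can be shown with a tiny model of a table lookup. It assumes the index is interpreted as an unsigned 32-bit value and bounds-checked against the table length; `call_indirect` here is a hypothetical stand-in for that lookup, not a real API.

```rust
/// The deliberately invalid callback index suggested in the post.
pub const INVALID_CALLBACK: i32 = -1;

/// Model of a bounds-checked table lookup: Ok(slot) if valid, Err on a trap.
pub fn call_indirect(table_len: u32, index: i32) -> Result<u32, &'static str> {
    // Reinterpreted as unsigned, -1 becomes u32::MAX.
    let slot = index as u32;
    if slot < table_len {
        Ok(slot)
    } else {
        Err("trap: table index out of bounds")
    }
}
```

Because `u32::MAX` is never less than any `u32` table length, the lookup traps for every table size in this model, so the old module's callback can never accidentally resolve to a function in the new module.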

akhilesh.singhania · November 29, 2021, 9:17am

I am maintaining a list of current limitations of the IC on Current limitations of the IC - Internet Computer. I would love to clean up the discussion there and add more limitations to the page. Contributions are welcome!

jzxchiang · November 30, 2021, 5:46am

If you look at the arguments to ic0.call_new() here: The Internet Computer Interface Specification :: Internet Computer, the reply_fun and reply_env identify the function to be called if B replies to A using ic0.msg_reply, and reject_fun and reject_env identify the function to be called if B replies to A using ic0.msg_reject.

Gotcha, thanks. I had to read this sentence 3 times, but it makes sense now. I forgot that anything a canister does in terms of interacting with the outside world has to go through an IC system call. Hence, the distinction.

I am maintaining a list of current limitations of the IC on Current limitations of the IC - Internet Computer. I would love to clean up the discussion there and add more limitations to the page. Contributions are welcome!

Nice! I feel like it might be better if some of this is put in the official docs, especially the upgrades part. Also, more people may see it.

jzxchiang · January 25, 2022, 7:32am

One question of clarification:

If there’s a bug in the pre_upgrade hook, then the canister will become permanently un-upgradable. Does that also happen if a canister’s pre_upgrade hook is functionally correct but exceeds the instruction limit for the upgrade?

I didn’t read this right the first time, but this definitely sounds like a huge problem, especially for Motoko canisters that rely on stable variables…

akhilesh.singhania · January 25, 2022, 9:35am

Yes, both are problems. If the upgrade takes too long to run, the canister cannot be upgraded, and if there is a bug in the pre_upgrade hook, it cannot be upgraded either. These are huge problems.

The team has managed to start working on a feature called time slicing, which allows messages to execute for a very long time without impacting the performance of the subnet. This will solve the too-long issue. We are going to announce the details of this feature soon, after a couple more internal reviews are done.

We still need to figure out how to handle the case where there is a bug in the pre_upgrade hook. Ideas are welcome.
