In general, the canister upgrade story still has a lot of foot guns that need addressing. I want to highlight the following known foot guns. I may have missed some
Bugs in pre_upgrade
hooks
If there is a bug in your pre_upgrade
hook that causes it to panic, the canister can no longer be upgraded. This is because the pre_upgrade
hook is part of the currently deployed wasm module and the system will always execute it before deploying the new wasm module and if the pre_upgrade
hook fails, then the system will fail the whole upgrade.
Currently we do not have a good mitigation around this issue other than urging developers to make sure that their pre_upgrade
is bug free by doing a lot of testing.
Long running upgrades
Generally speaking, when a canister is being upgraded, the logic in the pre_upgrade
hook serialises state from the wasm heap to stable memory and the logic in the post_upgrade
hook deserialises it from stable memory back to wasm heap. There is an instructions bound on how long the upgrade process can run for. So it is possible that if the canister has too much state or the [de]serialising logic is not very efficient, then the whole process does not finish in time.
The recommended mitigation here is to ensure that the state that needs to be persisted across upgrades does not exceed what the canister can [de]serialise during the upgrade process.
[de]serialiser requiring additional wasm memory
Related issue in Motoko: GC: Reserve Wasm memory for upgrading canisters · Issue #2909 · dfinity/motoko · GitHub. Generally speaking, it is possible that the serialising logic requires some additional wasm heap to run. Let’s say that the canister has 3.5GiB of wasm heap and the serialising logic requires an additional 600MiB to serialise the data, given that the wasm heap is limited to 4GiB, the upgrade process will again fail. Note that this issue will also be present for canisters written in Rust.
The recommended mitigation here is to again ensure that the state that needs to be persisted across upgrades does not exceed what the canister can [de]serialise during the upgrade process.
Planned features
We are continuously thinking about designs and improvements that we can make to address the above foot guns and balancing that with working on other various high priority projects. Some features that I am hoping that the team can prioritise working on in the near future are listed below. Note that the design for these features is not worked out at all and I may not be able to answer all questions related to them just yet.
Allow developers to download / upload canister state
Despite all the testing that a developer may do, they may still end up with a bricked canister. At this point, the least that the platform can do is allow the developer to download the state of the canister for backup. There are already existing developers like @rckprtr who are building this functionality into their canisters so that they can always backup their data.
Deterministic time slicing
Programming against a platform where messages have a bound on how long they can execute for is quite complicated. This is difficult not just for upgrading the canisters but also for general message execution. The idea of this feature would be that when a message hits the execution limit, instead of failing it, we pause execution, let some other canister execute for a while and then resume execution later. This way we could in theory let messages execute for arbitrarily long.