In general, the canister upgrade story still has a lot of foot guns that need addressing. I want to highlight the following known ones, though I may have missed some.
If there is a bug in your pre_upgrade hook that causes it to panic, the canister can no longer be upgraded. This is because the pre_upgrade hook is part of the currently deployed wasm module, and the system always executes it before deploying the new wasm module. If the pre_upgrade hook fails, the system fails the whole upgrade.
Currently we do not have a good mitigation for this issue other than urging developers to test their pre_upgrade hook thoroughly to make sure that it is bug free.
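As a rough illustration of what that testing could look like, here is a minimal sketch in plain Rust that keeps the [de]serialising logic in ordinary functions so it can be exercised off-chain with a round-trip check. The State type, encode_state, decode_state and the byte layout are all hypothetical and not part of any SDK. A real canister would call its own equivalents from the pre_upgrade and post_upgrade hooks.

```rust
// Hypothetical canister state, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct State {
    pub counter: u64,
    pub names: Vec<String>,
}

/// Serialises the state with a simple length-prefixed layout (an assumption
/// made for this sketch). It has no failure path, so it cannot panic on
/// valid state.
pub fn encode_state(state: &State) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&state.counter.to_le_bytes());
    out.extend_from_slice(&(state.names.len() as u64).to_le_bytes());
    for name in &state.names {
        out.extend_from_slice(&(name.len() as u64).to_le_bytes());
        out.extend_from_slice(name.as_bytes());
    }
    out
}

/// Reads one little-endian u64 at `pos`, returning an error on short input.
fn read_u64(bytes: &[u8], pos: &mut usize) -> Result<u64, String> {
    let end = pos
        .checked_add(8)
        .ok_or_else(|| "offset overflow".to_string())?;
    let slice = bytes
        .get(*pos..end)
        .ok_or_else(|| "unexpected end of input".to_string())?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(slice);
    *pos = end;
    Ok(u64::from_le_bytes(buf))
}

/// Deserialises the state, returning an error instead of panicking when the
/// input is truncated or malformed.
pub fn decode_state(bytes: &[u8]) -> Result<State, String> {
    let mut pos = 0usize;
    let counter = read_u64(bytes, &mut pos)?;
    let count = read_u64(bytes, &mut pos)?;
    let mut names = Vec::new();
    for _ in 0..count {
        let len = read_u64(bytes, &mut pos)? as usize;
        let end = pos
            .checked_add(len)
            .ok_or_else(|| "offset overflow".to_string())?;
        let raw = bytes
            .get(pos..end)
            .ok_or_else(|| "unexpected end of input".to_string())?;
        names.push(String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())?);
        pos = end;
    }
    Ok(State { counter, names })
}
```

A round-trip test then checks that decode_state(&encode_state(&state)) returns the original state, and that truncated input produces an error rather than a panic. This does not remove the foot gun, it only makes it cheaper to catch bugs before deploying.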
Generally speaking, when a canister is being upgraded, the logic in the pre_upgrade hook serialises state from the wasm heap to stable memory and the logic in the post_upgrade hook deserialises it from stable memory back to the wasm heap. There is a bound on the number of instructions that the upgrade process can execute. So if the canister has too much state or the [de]serialising logic is not very efficient, the whole process may not finish within that bound.
The recommended mitigation here is to ensure that the state that needs to be persisted across upgrades does not exceed what the canister can [de]serialise during the upgrade process.
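To make that mitigation a bit more concrete, here is a back-of-the-envelope sketch in Rust. The instruction limit and the per-byte cost are parameters on purpose: I am not quoting real platform numbers here, and the per-byte cost is something you would have to measure for your own [de]serialising code. The function name is hypothetical.

```rust
/// Returns true if serialising `state_bytes` bytes is estimated to stay within
/// `instruction_limit`, assuming a roughly linear cost of
/// `instructions_per_byte` (a simplifying assumption for this sketch).
pub fn fits_in_upgrade_budget(
    state_bytes: u64,
    instructions_per_byte: u64,
    instruction_limit: u64,
) -> bool {
    match state_bytes.checked_mul(instructions_per_byte) {
        Some(estimate) => estimate <= instruction_limit,
        // The estimate overflowed u64, so it certainly exceeds the limit.
        None => false,
    }
}
```

A canister could run a check like this against its own measured numbers and refuse to grow its state beyond what the estimate allows, leaving some safety margin because the real cost is rarely perfectly linear.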
Related issue in Motoko: "GC: Reserve Wasm memory for upgrading canisters" (Issue #2909, dfinity/motoko on GitHub). Generally speaking, the serialising logic may require some additional wasm heap to run. Say the canister has 3.5GiB of wasm heap in use and the serialising logic requires an additional 600MiB to serialise the data. Given that the wasm heap is limited to 4GiB, the upgrade process will again fail. Note that this issue is also present for canisters written in Rust.
The recommended mitigation here is to again ensure that the state that needs to be persisted across upgrades does not exceed what the canister can [de]serialise during the upgrade process.
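The arithmetic from the example above can be written down as a small check. This is only a sketch: the 4GiB limit is the one mentioned above, while the function name and the idea of passing in a measured serialisation overhead are assumptions for illustration.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Wasm heap limit as stated in the text above.
const WASM_HEAP_LIMIT: u64 = 4 * GIB;

/// Returns true if the heap currently in use plus the extra heap needed by
/// the serialising logic still fits under the wasm heap limit.
pub fn upgrade_fits_in_heap(heap_in_use: u64, serialisation_overhead: u64) -> bool {
    match heap_in_use.checked_add(serialisation_overhead) {
        Some(total) => total <= WASM_HEAP_LIMIT,
        None => false,
    }
}
```

With 3.5GiB in use (3584 * MIB) and 600MiB of overhead, the total is 4184MiB, which is above the 4096MiB limit, so the check returns false. With 3GiB in use the same overhead still fits.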
We are continuously thinking about designs and improvements that we can make to address the above foot guns, while balancing that against work on various other high-priority projects. Some features that I am hoping the team can prioritise in the near future are listed below. Note that the design for these features is not worked out at all and I may not be able to answer all questions related to them just yet.
Despite all the testing that a developer may do, they may still end up with a bricked canister. At this point, the least that the platform can do is allow the developer to download the state of the canister for backup. There are already developers like @rckprtr who are building this functionality into their canisters so that they can always back up their data.
Programming against a platform where messages have a bound on how long they can execute for is quite complicated. This is difficult not just for upgrading the canisters but also for general message execution. The idea of this feature would be that when a message hits the execution limit, instead of failing it, we pause execution, let some other canister execute for a while and then resume execution later. This way we could in theory let messages execute for arbitrarily long.
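To illustrate the pause-and-resume idea (and only the idea, since the actual design is not worked out), here is a toy model in plain Rust. A long computation keeps its progress in a struct, runs for at most a fixed budget of steps per slice, and reports whether it paused or finished. All names here (SumJob, Progress, run_slice, run_to_completion) are hypothetical, and a "step" stands in for whatever unit the platform would actually meter.

```rust
/// Result of running one slice of work.
#[derive(Debug, PartialEq)]
pub enum Progress {
    /// The budget ran out; call run_slice again to continue.
    Paused,
    /// The work finished with this result.
    Done(u64),
}

/// Toy long-running job: sums the integers in 0..end, one per step.
pub struct SumJob {
    next: u64,
    end: u64,
    acc: u64,
}

impl SumJob {
    pub fn new(end: u64) -> Self {
        SumJob { next: 0, end, acc: 0 }
    }

    /// Runs at most `budget` steps, then yields instead of failing.
    pub fn run_slice(&mut self, budget: u64) -> Progress {
        let mut used = 0;
        while self.next < self.end {
            if used == budget {
                return Progress::Paused;
            }
            self.acc += self.next;
            self.next += 1;
            used += 1;
        }
        Progress::Done(self.acc)
    }
}

/// Drives a job to completion and returns (result, number of slices used).
/// Between slices is where other canisters would get to execute.
pub fn run_to_completion(job: &mut SumJob, budget: u64) -> (u64, u32) {
    assert!(budget > 0, "a zero budget would never make progress");
    let mut slices = 0;
    loop {
        slices += 1;
        if let Progress::Done(result) = job.run_slice(budget) {
            return (result, slices);
        }
    }
}
```

Summing 0..10 with a budget of 4 steps per slice takes three slices and yields 45. The difference from today's behaviour is that hitting the budget returns Paused with the progress preserved, rather than failing the message and discarding the work.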