FEATURE REQUEST: If pre_upgrade fails, revert canister to earlier state snapshot of before pre_upgrade ran and discard current pre_upgrade logic

Thank you for the alternative approach.

This thread is more of a request for better protocol-level support, rather than for developers to adopt clever workarounds that are not officially prescribed in the developer docs.

I meant something like the following:

// Assumes the ic-cdk attribute macros are in scope,
// e.g. use ic_cdk_macros::{pre_upgrade, post_upgrade};

#[pre_upgrade]
fn pre_upgrade() {
  persist_state();
}

#[post_upgrade]
fn post_upgrade() {
  restore_state();
  // In test builds, immediately re-run the serialization path, so a broken
  // persist path is caught by the same upgrade that deployed it.
  #[cfg(feature = "test_upgrade")]
  persist_state();
}
2 Likes

From an addition-by-subtraction perspective, perhaps the docs should be changed and pre and post should be deprecated. In hindsight they encourage poor data management practices. They made some sense when the protocol was being designed from scratch, but in hindsight they were an unnecessary complication at the protocol level.

2 Likes

Quick question: is there anything stopping me from calling the pre_upgrade function directly from post_upgrade, instead of going through an intermediate persist_state function?

Also, what’s the expectation in terms of computation allowed with this approach? If this is a prescribed workaround, could we have some estimates of how much data we are limited to serializing and deserializing in a single go, since post_upgrade now effectively performs both functions?

I think cloning/forking is a very useful feature, but the following still seems to hold:

It seems that, on top of cloning/forking, we still need to be able to fix the pre-upgrade code: something like install_code(mode = replace, wasm_module = ...), which would just replace the WASM module without calling any hooks or touching the canister heap.

So the canister recovery process would be:

  1. Clone a canister at version 1.
  2. Try to upgrade the clone to version 2, invoking pre-, start and post-upgrade hooks as normal.
  3. If the upgrade fails, replace version 1 with a fixed version 1’ (no hooks invoked) and try step 2 again.
  4. Once everything goes well, upgrade the primary instance, applying the same fix before the upgrade.
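
The retry loop above can be sketched abstractly. This is a toy simulation, not real management-canister calls: `Canister`, `recover`, and the `try_upgrade` closure are hypothetical stand-ins for the proposed clone/upgrade/replace operations.

```rust
// Toy model of the recovery loop: clone, try the upgrade, and on failure
// swap in a fixed module while keeping the heap, then retry.

#[derive(Clone, Debug, PartialEq)]
struct Canister {
    wasm_version: &'static str,
    heap: Vec<u8>, // untouched by a replace-style install
}

/// Steps 1-3 above: clone at version 1, try to upgrade the clone,
/// and on failure replace the module only (heap preserved) and retry.
fn recover(
    primary: &Canister,
    try_upgrade: impl Fn(&Canister) -> Result<Canister, String>,
    fixed_wasm: &'static str,
) -> Result<Canister, String> {
    let clone = primary.clone(); // step 1: clone at version 1
    match try_upgrade(&clone) {
        Ok(upgraded) => Ok(upgraded), // step 2 succeeded on the first try
        Err(_) => {
            // step 3: replace version 1 with fixed version 1' (no hooks run)
            let patched = Canister { wasm_version: fixed_wasm, heap: clone.heap };
            try_upgrade(&patched) // try step 2 again
        }
    }
}

fn main() {
    let primary = Canister { wasm_version: "v1", heap: vec![1, 2, 3] };
    // The upgrade only succeeds once the buggy pre_upgrade is patched out ("v1'").
    let try_upgrade = |c: &Canister| {
        if c.wasm_version == "v1'" {
            Ok(Canister { wasm_version: "v2", heap: c.heap.clone() })
        } else {
            Err("pre_upgrade trapped".to_string())
        }
    };
    let recovered = recover(&primary, try_upgrade, "v1'");
    assert_eq!(recovered.unwrap().heap, vec![1, 2, 3]); // data survived
    println!("recovery ok");
}
```

The primary instance is never touched until the clone has gone through the whole loop successfully (step 4).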

Yes, it’s hard to produce a binary with exactly the same memory layout. But IMO using reproducible builds, it might be possible to fix many silly mistakes/typos in pre-upgrade handlers.

Also, the replacement WASM module might be a completely new application. For example, it might be a special downloader that just allows downloading the canister state chunk by chunk for offline data recovery. So the final step in the recovery process might be:

  1. If there is no way to fix canister version 1, replace canister code with a state downloader, and download the state for offline recovery.
3 Likes

As I explained above, changing the code generally invalidates the heap, even if you just replace the pre-upgrade hook. We can never assume that works. The more likely outcome is that replacing the Wasm code irreversibly destroys the canister and all its heap data. It’s like attempting open-heart surgery with a saw.

hoi Andreas,

Right, and I pointed it out:

To be more concrete, I remember two cases when a canister pre-upgrade method was trapping in production:

  1. The pre-upgrade handler was exceeding the size of the stable memory, as ic0.stable_grow() was not called.
  2. The pre-upgrade handler was violating the contract, as it was calling an unsupported System API function.

It’s not a representative sample, but in both those cases the fix would just add or remove a single System API call in pre-upgrade.
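
The first of those cases can be modeled with a toy stable memory. The names mimic the System API (ic0.stable_grow / ic0.stable_write), but this is a plain in-process simulation, not the real thing:

```rust
// Failure mode 1: writing past the current stable memory size traps
// unless stable_grow was called first. The fix is a single added call.

const PAGE: usize = 65536; // Wasm page size in bytes

struct StableMemory {
    data: Vec<u8>,
}

impl StableMemory {
    fn new() -> Self {
        StableMemory { data: Vec::new() } // starts with zero pages
    }
    fn stable_grow(&mut self, pages: usize) {
        self.data.resize(self.data.len() + pages * PAGE, 0);
    }
    /// Returns Err (standing in for a trap) on out-of-bounds writes.
    fn stable_write(&mut self, offset: usize, src: &[u8]) -> Result<(), &'static str> {
        if offset + src.len() > self.data.len() {
            return Err("stable memory out of bounds");
        }
        self.data[offset..offset + src.len()].copy_from_slice(src);
        Ok(())
    }
}

fn main() {
    let mut mem = StableMemory::new();
    // Buggy pre_upgrade: write without growing first -> trap.
    assert!(mem.stable_write(0, b"state").is_err());
    // Fixed pre_upgrade: one extra stable_grow call and the write succeeds.
    mem.stable_grow(1);
    assert!(mem.stable_write(0, b"state").is_ok());
    println!("fix is a single added call");
}
```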

From my experience, adding/removing a System API call from the pre-upgrade hook in Rust does not change the memory layout nor invalidate the heap.

Reproducible builds are quite easy with Docker, but even if they were not used, the WASM binary could still be decoded to WAT, the System API call could easily be located and removed, and the WAT could be encoded back to WASM with the same memory layout.
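
The "easily located" part of such a patch can be illustrated with plain text manipulation over disassembled WAT. The function and syscall names below are made up for illustration, and a real patch would also have to keep the module valid (e.g. respect the removed call's stack effects), which this sketch ignores:

```rust
// Filter out WAT lines containing a given call instruction, as a crude
// stand-in for hand-removing a System API call from wasm2wat output.

fn strip_call(wat: &str, syscall: &str) -> String {
    wat.lines()
        .filter(|line| !line.contains(&format!("call ${syscall}")))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // Hypothetical disassembly of a buggy pre_upgrade.
    let wat = "(func $pre_upgrade\n  call $ic0_global_timer_set\n  call $persist_state)";
    let patched = strip_call(wat, "ic0_global_timer_set");
    assert!(!patched.contains("global_timer_set"));
    assert!(patched.contains("persist_state"));
    println!("patched");
}
```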


Also, as the canister cloning opens a way to snapshot the canister state, the code replacement opens a way to backup/restore the canister state locally:

  1. Replace the canister code with the state downloader.
  2. Download the canister state chunk by chunk locally.

and then to restore it:

  1. Replace the canister code with the canister state uploader.
  2. Upload the canister state from the local backup.
  3. Replace the canister state uploader with the actual canister code.

All the operations could be greatly facilitated by dfx:

  1. dfx could provide a user-friendly way to back up/restore the canister state, i.e. replacing canister code with the downloader/uploader, managing chunks, etc.
  2. dfx could do a sanity check that the memory layout is the same by comparing data segments and function tables for install(mode = replace, ...).
  3. dfx could clone the canister state before each install/upgrade operation, providing a user-friendly way to revert to the state before the install/upgrade, even if the upgrade itself was successful.

So, canister state cloning feature combined with a feature to replace WASM module, keeping the heap could get us:

  1. A way to snapshot and backup/restore the canister state locally.
  2. A practical way to fix some (maybe even most) of the pre-upgrade issues.
  3. A better overall user experience and upgrade confidence.
3 Likes

I think this motoko-migrations repo is gold. I will definitely use it.

Something that surprised me in the main.mo is that a non-stable variable initialized from a stable one acts like a pointer to the stable one? Intuitively I thought it would have been a copy.

For example:

  stable var stable_struct = { var toto: Text = "whatever"; };

  // This acts like a pointer to stable_struct, right?
  let non_stable_struct = stable_struct;

  public shared func updateToto(text: Text) : async() {
    // This actually updates the stable_struct in stable memory, doesn't it ?
    // Does this mean that there is no Text saved on the heap at all ?
    non_stable_struct.toto := text;
  };

So far I was avoiding using classes so that I could save everything in stable memory. But if that’s true, it means I could just use classes whose members are initialized from stable variables, and get the best of both worlds: OOP + stable memory. :exploding_head:
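
For what it’s worth, Motoko objects with mutable fields are reference types, so `let non_stable_struct = stable_struct` copies the reference, not the record. A rough analogue of the same aliasing in Rust (using `Rc<RefCell<...>>` to share one mutable value under two names):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Two names, one heap object: mutating through either handle is
// visible through the other, like non_stable_struct.toto := text
// updating stable_struct.toto in the Motoko snippet.
fn main() {
    let stable_struct = Rc::new(RefCell::new(String::from("whatever")));
    let non_stable_struct = Rc::clone(&stable_struct); // second handle, same object

    *non_stable_struct.borrow_mut() = String::from("updated");
    assert_eq!(*stable_struct.borrow(), "updated");
    println!("both names see: {}", stable_struct.borrow());
}
```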

1 Like

Yes! If you want to use “stable”, your classes just need to be functional in nature. It took me a good long while to get over that understanding hump. It is annoying to put that ref in every parameter set, but it works well.

1 Like

Sorry for the late reply.

Again, even the smallest changes might invalidate the heap. To give a simple example, even changing a single literal (number, string) in the code can already be a breaking change, because it might reside in the heap. Let alone adding or removing one, e.g., as argument for an extra call. In principle, even adding a call with no arguments can affect the heap.

While there might be cases that happen to work, that is entirely incidental and totally language, compiler, version, mode dependent. Hence I’m afraid it would be impractical to try coming up with reliable, stable(!), and general rules about what is guaranteed to work and what’s not, even with reproducible builds, even for Motoko, let alone for compilers we do not control, like Rust.

And there would be no way to safe-guard against mistakes either, and such mistakes would most likely result in fatal and permanent corruption: in the best case, the canister will be immediately destroyed and crash; in the worst case, you won’t even notice the data corruption until much later.

1 Like

It’s hard to debate here, as I completely agree with everything you wrote, Andreas: there is no generic way to guarantee the heap is still valid.

My point is rather practical: if we have a choice between a bricked canister with no way to recover the data or a practical way to fix it…

Leaving the toolchains aside: disassembling the code using wasm2wat, adding/commenting out a System API call, and assembling it back with wat2wasm. Do you agree the Wasm heap will still be valid in this case?

2 Likes

Of course, if you hand-craft assembly code then you can make it work. But that won’t help normal devs.

If we can’t provide devs with any sufficiently safe, practical recipe (or even a way to check!) that won’t nuke their canisters forever, then it’s clearly not a feature that would be responsible to put out. Especially if the likelihood of nuking it is as high as here.

A corrupted canister is much worse than a non-upgradable one. With the latter, you at least have some chance to still retrieve data.

2 Likes

Right, and that would solve 100% of canister pre-upgrade issues I know.

Again, fully agree. The code replacement must be implemented only once we have canister snapshotting/cloning in place, so there is an easy way to roll back.

Despite its complexity and unsafety, IMO this approach would still be way better than what we did to fix those two pre-upgrade issues…

1 Like

If this works as described, then this would solve our scenarios too where we got stuck with unupgradable canisters. Again, this mechanism of manual assembly coding should be annotated with the mandatory “HERE BE DRAGONS” warnings, but if canister developers do get stuck in an unupgradable loop, at least they have a way out.

Because as long as we have the existing likelihood of canister devs getting stuck with the pre_upgrade hook, a lot of them will. And having a way of getting unstuck would help.
Most canister devs would be blissfully unaware of this footgun until they run into this for a failing pre_upgrade.

1 Like

The percentage of problems that can realistically be fixed by hand-editing assembly is likely extremely small, especially for code produced by a high-level language compiler. At the same time, the percentage of devs who will destroy their data in an attempt to do so is probably orders of magnitude higher, especially among those who already failed to assess the original upgrade problem.

This would be like handing out a bone saw to everybody and telling them to try opening their thorax in case of a heart attack. How many heart surgeons are there among the general population, and what’s the likelihood of survival? And is that the solution for making driving safer?

1 Like

My experience is exactly the opposite. Again, it’s not representative at all…

While IMO it’s an underestimation of software engineers’ skills, I won’t argue here. But there is one important point: freedom.

If I’ve got a heart problem, I’d rather find a surgeon with the right tools to fix my heart. But I have the freedom to do the surgery myself.

vs

On average, people have no skills to do heart surgeries, so if I’ve got a heart problem, I just die. And next time, I better be born with a flawless heart…

N.B. in both pre-upgrade cases I faced, the bug was introduced by a seasoned engineer. People make mistakes.

1 Like

If I’ve learnt one thing then it’s to never overestimate my own software engineering skills, no matter how experienced I am.

Well, exactly. If somebody already made mistakes with the original program, where they had plenty of opportunity to test, why would they be confident of getting it right in a critical situation under way more difficult circumstances? Keep in mind, there would be no real way to test it; it’s a one-time chance in general, and if they screw it up just a little bit, years of data would be lost forever.

No, a patient in a coma at least isn’t entirely lost, and there might be ways to revive them.

1 Like

Ah no, I just mean a better analogy to “a random person has to do a heart surgery” would be “a random doctor has to do a heart surgery”…

If they screw it up, they roll back the changes and they can try again…

Right, there will be always an option to get back into the coma…

If the code update works, the new code runs “successfully”, but corrupts the data without failing the upgrade itself, then how would you roll that back? I think your patient is dead. This is not an unlikely scenario at all.

2 Likes

We agreed in this thread above that the “replace code” feature should not be standalone but implemented along with the canister snapshotting/cloning/forking.

Here Roman suggests a syscall to clone the canister and its state, but it could be also implemented as a take/revert snapshot, TBD.

The recovery process for non-upgradable canisters would be:

  1. Prepare a binary with a fixed pre-upgrade (using wasm2wat or other tools).
  2. Take a canister snapshot (dfx could take it automatically).
  3. Replace the canister binary with a fixed one, keeping the old heap (dfx could allow only tiny changes between the old and fixed binaries).
  4. Upgrade the canister as normal. The fixed pre-upgrade should succeed now.
  5. The newly deployed canister could initiate a data validation process.
  6. If something goes wrong, revert the state to the snapshot and try again.
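
The take/revert snapshot part of this loop (steps 2 and 6) can be sketched as a simulation. `State` here is a hypothetical stand-in for the full canister state; real snapshots would be taken by the system, not by the canister itself:

```rust
// Toy take/revert snapshot semantics: copy the state before the upgrade,
// run the (possibly bad) upgrade, validate, and revert on failure.

#[derive(Clone, Debug, PartialEq)]
struct State {
    records: Vec<String>,
}

fn main() {
    let mut state = State { records: vec!["a".into(), "b".into()] };

    let snapshot = state.clone(); // step 2: snapshot before the upgrade

    // Simulated bad upgrade: the later data validation finds corruption.
    state.records.clear();
    let validation_ok = !state.records.is_empty();

    if !validation_ok {
        state = snapshot; // step 6: revert to the snapshot and try again
    }
    assert_eq!(state.records, vec!["a".to_string(), "b".to_string()]);
    println!("reverted to snapshot");
}
```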

If there is no way to fix it, there is a way to download years of data:

  1. Replace the non-upgradable canister binary with a special state downloader, keeping the old heap.
  2. Download the heap chunk by chunk.
  3. Recover the data offline.
  4. Re-install the canister as normal with a new version (the heap will be lost at this point).
  5. Upload the recovered data.

Note, the state downloader does not use the WASM heap. All it does is ic0.msg_reply_data_append() a chunk of the old heap and then just ic0.msg_reply().
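
The chunk-by-chunk loop can be sketched as follows. `heap` stands in for the old canister heap and `read_chunk` plays the role of the downloader’s query method, which in a real canister would end with ic0.msg_reply_data_append and ic0.msg_reply; the 2 MiB chunk size is just an example bound:

```rust
// Client-side download loop: request successive chunks until a short
// chunk signals the end of the heap.

const CHUNK: usize = 2 * 1024 * 1024; // e.g. bounded by the message size limit

/// The downloader's "query": return up to CHUNK bytes starting at offset.
fn read_chunk(heap: &[u8], offset: usize) -> &[u8] {
    let end = usize::min(offset + CHUNK, heap.len());
    &heap[offset..end]
}

fn main() {
    // Stand-in for the old heap: ~5 MB of arbitrary bytes.
    let heap: Vec<u8> = (0..5_000_000).map(|i| (i % 251) as u8).collect();

    let mut downloaded = Vec::new();
    let mut offset = 0;
    loop {
        let chunk = read_chunk(&heap, offset);
        downloaded.extend_from_slice(chunk);
        offset += chunk.len();
        if chunk.len() < CHUNK {
            break; // short chunk: we reached the end of the heap
        }
    }
    assert_eq!(downloaded, heap); // offline copy matches the heap
    println!("downloaded {} bytes in chunks", downloaded.len());
}
```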

1 Like