FEATURE REQUEST: If pre_upgrade fails, revert the canister to the state snapshot taken before pre_upgrade ran and discard the current pre_upgrade logic

Right now, if pre_upgrade fails for whatever reason, your canister is stuck even if you try to push new canister upgrades. The only option is a complete reinstall, which will WIPE ALL STATE. Which is not ideal, to say the least :sob:

Today, if your app holds any significant amount of user data, upgrades are a scary and gut-wrenching experience. Even with precautions and stringent conventions in place, a simple slip-up that changes the memory layout of existing data structures is enough to corrupt your entire app state.

What I'm proposing is the introduction of another hook, something like try_pre_upgrade, that runs the hook's logic. If everything works as expected, it will behave exactly like pre_upgrade currently does.

However, if there's a panic during the pre_upgrade step, it will simply revert to the last working checkpoint, exactly the way panics during update calls are handled right now. I imagine this will require minimal changes to the platform, since it borrows the panic-handling behaviour that update calls already have.

Additionally, add another upgrade mode, alongside the existing upgrade, install, and reinstall, called lifecycle_upgrade, that lets us upgrade only the lifecycle hooks in canisters while retaining their current heap. I imagine this can also borrow implementation logic from how canisters are put to sleep and woken up on the execution layer, something along the lines of saving the entire heap memory contents to stable storage.

I imagine this entire feature can be incremental without requiring breaking changes to the existing APIs.

Thoughts?
@ulan

Tagging a couple of other devs who Iā€™ve recently chatted with regarding canister backups and upgrades.
@senior.joinu
Bruce from @AstroX
@icme

P.S. This request is born from a real-life use case for us at Hot or Not. I recently had to upgrade a canister with around 51 GB of memory used. The back story involves an implementation that was leaking stable memory, so migrating the data to heap memory required us to use the ic_cdk stable APIs. However, what we were unaware of was that those APIs are limited to the 32-bit internal API, which is normally fine since the heap maxes out at 4 GB. But because our canister had breached those sizes, our pre_upgrade was failing without any way for us to recover that canister, ultimately forcing us to reinstall the entire canister.
Horror story that we hope to never have to repeat :sob:

4 Likes

Is there a reason why pre_upgrade can't be changed to make it safe?

I don't understand why we'd want to preserve the behavior of bricking a canister even if it traps.

2 Likes

Mostly for backward compatibility. But that's just my take. I'm happy with a BREAKING change as well.

What I'm proposing is the introduction of another hook, something like try_pre_upgrade, that runs the hook's logic. If everything works as expected, it will behave exactly like pre_upgrade currently does.

However, if there's a panic during the pre_upgrade step, it will simply revert to the last working checkpoint, exactly the way panics during update calls are handled right now.

Maybe I misunderstand, but a trap in the pre_upgrade hook already causes a rollback, so how is that different?

Additionally, add another upgrade mode, alongside the existing upgrade, install, and reinstall, called lifecycle_upgrade, that lets us upgrade only the lifecycle hooks in canisters while retaining their current heap. I imagine this can also borrow implementation logic from how canisters are put to sleep and woken up on the execution layer, something along the lines of saving the entire heap memory contents to stable storage.

Of course that desire is understandable, and I agree that some solution is needed.

However, the fundamental reason why we have upgrade hooks in the first place instead of persisting the heap through upgrades is the fact that code changes can change the heap layout. That is a fact of life with almost every interesting compiler, including Rust and Motoko (e.g., because memory regions move, pointers change, indices shift, etc.). And it implies that new code cannot use the old heap.

Unfortunately, adding, removing, or replacing upgrade hooks is a code change as well. In fact, even recompiling the exact same source code might change the heap layout if you are using a different compiler version or different compiler options!

Thus, we simply cannot assume that a heap image still is valid and safe to use after an upgrade, no matter how small the change.

If you need a stable heap layout, then your only option currently is to use stable memory directly and manually lay out your data in it. But then the responsibility for keeping that layout backwards compatible is entirely yours. I would hope that the community would eventually build some decent libraries for that.
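As a rough illustration, here is a minimal sketch of what such a library-based approach can look like, assuming the ic-stable-structures crate (exact trait bounds and method signatures vary between crate versions). The map keeps its entries directly in stable memory, so nothing has to be serialized in pre_upgrade or post_upgrade:

use ic_stable_structures::{DefaultMemoryImpl, StableBTreeMap};
use std::cell::RefCell;

thread_local! {
    // The map stores its entries directly in stable memory; only the handle
    // lives on the heap, so no upgrade-hook serialization of the data is needed.
    static BALANCES: RefCell<StableBTreeMap<u64, u64, DefaultMemoryImpl>> =
        RefCell::new(StableBTreeMap::init(DefaultMemoryImpl::default()));
}

#[ic_cdk_macros::update]
fn set_balance(user_id: u64, amount: u64) {
    BALANCES.with(|b| {
        b.borrow_mut().insert(user_id, amount);
    });
}

#[ic_cdk_macros::query]
fn get_balance(user_id: u64) -> Option<u64> {
    BALANCES.with(|b| b.borrow().get(&user_id))
}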

3 Likes

My understanding is that trapping in pre_upgrade leaves your canister unusable.

In Effective Rust canisters, @roman-kashitsyn writes:

if your pre_upgrade hook traps, there is not much you can do about it. Changing canister behavior needs an upgrade, but that is what a broken pre_upgrade hook prevents you from doing

Is the behavior of "trap" different in Motoko?

That's precisely the semantics of the pre_upgrade hook: if it fails, we roll back the canister state as if the upgrade never happened (modulo charging cycles). I don't see how the try_pre_upgrade hook helps.

No, trapping in pre_upgrade rolls back to the state before the upgrade. Panicking in pre_upgrade usually makes your canister non-upgradable. From the same article you mentioned:

The pre_upgrade and post_upgrade hooks appear to be symmetrical. The canister returns to the pre-upgrade state if either of these hooks traps. This symmetry is deceptive.

1 Like

So, till now I have been assuming trap and panic are the same thing on the IC. Can you clarify how they are different? Also, pointing to any relevant documentation would be really helpful.

Update canister calls on the IC roll back to their last working checkpoint without becoming unrecoverable. Shouldn't the pre_upgrade behave the same in case of a panic, then?

Secondly, the panic I'm referring to originated in the ic_cdk, so I'd point out that, in this case, even the ic_cdk developers from Dfinity do not clearly distinguish between them.

For context, we had a canister whose memory usage had reached 60 GB, including both stable and heap memory. We were trying to move only the heap to stable storage in a pre_upgrade. The code looked like this:
[screenshot of the pre_upgrade code]

However, we were unaware (since this is an outlier scenario and not explicitly documented anywhere) that the ic_cdk::storage::stable APIs only support the 32-bit API and treat total memory usage (heap + stable) as the qualifying criterion for the limit shown below. So, even if your heap is below the 4 GB limit and a stable save is technically possible using the stable storage API, upgrades are disallowed if total memory usage exceeds 4 GB. The canister therefore started panicking during pre_upgrade. Note that the panic originates from the ic_cdk source code.

This is what the error message looked like:
[screenshot of the error message]

Hence, any upgrades to the canister are impossible at this stage without losing ALL state.

This is a nightmare scenario for any developer where they lose entire app state without any backup mechanisms.

And in general, this makes upgrades on the IC a truly scary and anxiety-inducing experience, especially if you're building a practical use case with real users rather than a hobby project.

We REALLY need a mechanism for developers to recover from this state, no matter what the lifecycle hooks look like.

Thoughts?
@rossberg
@roman-kashitsyn

4 Likes

There is no significant difference between trap and panic. As far as I know, ic_cdk sets up a panic hook that maps panics to traps for better error messages. Generally, the system behaves identically whether you panic! (unreachable!, unimplemented!, etc.) or trap.
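For illustration, a minimal sketch of such a hook (this is the shape of the mechanism, not the actual ic_cdk source):

fn set_panic_hook() {
    std::panic::set_hook(Box::new(|info| {
        // Forward the panic message and location to the system trap call, so the
        // caller sees it in the reject message instead of a bare wasm `unreachable`.
        ic_cdk::api::trap(&info.to_string());
    }));
}

// Installing it once, e.g. from the init hook, is enough for the whole canister.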

The canister is still usable after the pre_upgrade hook panicked, right? After the upgrade failed, the system reverted your canister to the last usable state. I don't see how a try_pre_upgrade hook would solve your problem (or how it's different from pre_upgrade in the first place).

In my opinion, the system can't do much once we reach this state. I have some ideas on organizing the canister code to prevent this kind of disaster (which I came up with while working on ckBTC minter), but this requires preventive action.

2 Likes

It did not, right? Let me explain.

Currently my canister is on version C1. I have now upgraded to a version C2, where the pre_upgrade panics.

Now, when I push a version C3, the running C2 code executes the pre_upgrade included in C2, which panics, so the canister remains on version C2.

So, no matter how many upgrades (C3, C4, C5) I push, the canister always reverts back to C2, never letting me reach a state from which I can upgrade to newer versions. It is effectively unupgradable, and hence unusable.

Does the above make sense?

The try_pre_upgrade hook, combined with an upgrade mode that only replaces the canister lifecycle hooks, is a naive solution that I came up with to handle this without making breaking changes to the existing API. It probably has holes that I am unable to foresee.

Having said that, you are the experts, and I defer to your judgement and expertise on how to solve this problem of unrecoverable canisters that dapp developers run into on a regular basis.

@roman-kashitsyn @rossberg

1 Like

Unusable in the context of the wider dapp, I guess. I.e. if you have a dapp consisting of multiple canisters and e.g. version C4 is incompatible with version C2, then C2 is unusable in that sense. But taken out of that context, the canister is still very much functional: it can handle requests from other C2 canisters or ingress messages from a v2 frontend. You and Roman are saying the same thing.

Those holes are exactly what rossberg above described. A canister Wasm with new lifecycle methods is a different canister Wasm. And there is no way of ensuring that it has the same heap layout as the earlier canister Wasm. Which is why you need pre_upgrade and post_upgrade hooks to begin with.

The suggested solution is to not rely on the heap, but instead design your own data layout and use it to lay out your data in stable memory. Then you won't need to implement pre_upgrade and post_upgrade hooks at all: all versions of your canister will understand and be able to work with your custom memory layout, provided you do the heavy lifting of ensuring backwards and forwards compatibility yourself. Which is not trivial.

3 Likes

I think I understand what you mean now.

You want the system to execute the following calls during an upgrade:

C1.pre_upgrade()
C2.post_upgrade()
C2.pre_upgrade()

So the system exercises the pre_upgrade hook at least once before accepting the new version.

Note that nothing stops you from calling the pre_upgrade hook from your post_upgrade hook. You don't need a new system feature for that.
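A minimal self-contained sketch of that idea (the State type and helper names are illustrative, not an existing API):

use candid::{CandidType, Deserialize};
use std::cell::RefCell;

#[derive(CandidType, Deserialize, Default)]
struct State {
    counter: u64,
}

thread_local! {
    static STATE: RefCell<State> = RefCell::new(State::default());
}

fn save_state() {
    STATE.with(|s| ic_cdk::storage::stable_save((&*s.borrow(),)))
        .expect("failed to write state to stable memory");
}

fn restore_state() {
    let (state,): (State,) =
        ic_cdk::storage::stable_restore().expect("failed to read state from stable memory");
    STATE.with(|s| *s.borrow_mut() = state);
}

#[ic_cdk_macros::pre_upgrade]
fn pre_upgrade() {
    save_state();
}

#[ic_cdk_macros::post_upgrade]
fn post_upgrade() {
    restore_state();
    // Dry-run the pre_upgrade path once, right after installing the new version:
    // a panic here fails this upgrade (which can still be rolled back) instead of
    // surfacing only at the next upgrade, when it would brick the canister.
    save_state();
}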

This scheme will save you from trivial errors in the pre_upgrade hooks but not errors caused by increased data size. If I understand correctly, the problem in your case was the following:

C1.pre_upgrade()
C2.post_upgrade()
... C2 functions normally
... Accumulating a lot of data and leaking memory
C2.pre_upgrade() <- traps

If that's the case, then calling pre_upgrade right after post_upgrade would not have resolved your issue.

4 Likes

This sounds like a mini database to me. Efficiently manage indexes in memory and store the actual data on hard drives.

On the IC, efficiently manage pointers to actual data (indexes) in heap memory and store the actual data in stable memory.
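As a rough sketch of that split, assuming ic_cdk's 64-bit stable API (the stable64_* functions used further down this thread) and ignoring deletion, compaction, and error handling. Note that the heap-resident index itself still has to be persisted or rebuilt across upgrades, which is exactly the pre_upgrade problem discussed above:

use ic_cdk::api::stable::{stable64_grow, stable64_read, stable64_size, stable64_write};
use std::cell::RefCell;
use std::collections::HashMap;

const WASM_PAGE_BYTES: u64 = 65_536;

thread_local! {
    // Heap-resident index: key -> (offset, length) of the value in stable memory.
    static INDEX: RefCell<HashMap<String, (u64, u64)>> = RefCell::new(HashMap::new());
    // Next free byte in stable memory (append-only for simplicity).
    static NEXT_OFFSET: RefCell<u64> = RefCell::new(0);
}

fn put(key: String, value: &[u8]) {
    let offset = NEXT_OFFSET.with(|o| *o.borrow());
    let needed = offset + value.len() as u64;
    let available = stable64_size() * WASM_PAGE_BYTES;
    if needed > available {
        stable64_grow((needed - available) / WASM_PAGE_BYTES + 1).expect("out of stable memory");
    }
    stable64_write(offset, value);
    INDEX.with(|i| i.borrow_mut().insert(key, (offset, value.len() as u64)));
    NEXT_OFFSET.with(|o| *o.borrow_mut() = needed);
}

fn get(key: &str) -> Option<Vec<u8>> {
    INDEX.with(|i| i.borrow().get(key).copied()).map(|(offset, len)| {
        let mut buf = vec![0u8; len as usize];
        stable64_read(offset, &mut buf);
        buf
    })
}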

Kinda impractical to expect every dapp developer to come up with this themselves. :sweat_smile:

1 Like

I suppose the trade-off here is that it could be an expensive operation.


There was a feature mentioned at some point that would allow backing up and restoring canisters.

Not only might that mitigate some of the risk around upgrades, it could potentially be used to test upgrades locally with a similar volume of data as in a production environment.

Yes, I fully agree. It might be worth doing in a "debug" canister build for local testing.

Yes, I'd love this feature.

In my opinion, the main issue here is downloading and verifying a replica's snapshot. Currently, there is no suitable mechanism for streaming gigabytes of data from the IC.

One approach that could work is introducing the fork system call or something similar. The workflow could be:

  1. Clone a canister at version 1 with a new ID. The clone could start its life in a Stopped state.
  2. Try to upgrade the clone to version 2.
  3. Try upgrading the clone to version 2 one more time to exercise the pre_upgrade hooks.
  4. If everything goes well, upgrade the primary instance. Drop the clone if you don't need it anymore.
2 Likes

Hi @saikatdas0790,

Here is what I do to make sure that if a canister is ever in that situation, I can always download the canister-data, reinstall, and upload the data.



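// Note: the snippet below relies on a few project-specific items that are not shown
// here: the CData state struct, the with/with_mut thread_local accessor helpers, a
// controller() lookup, and the constants WASM_PAGE_SIZE_BYTES,
// STABLE_MEMORY_HEADER_SIZE_BYTES, and MiB. It also assumes imports from ic_cdk
// (caller, trap, reply, the stable64_* APIs, and the lifecycle/endpoint macros) and
// from candid (encode_one, decode_one).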
thread_local! {
    static CANISTER_DATA: RefCell<CData> = RefCell::new(CData::new()); 
    static STATE_SNAPSHOT: RefCell<Vec<u8>> = RefCell::new(Vec::new());    
}


fn create_state_snapshot() {
    let cdata_candid_bytes: Vec<u8> = with(&CANISTER_DATA, |cdata| { encode_one(cdata).unwrap() });
    
    with_mut(&STATE_SNAPSHOT, |state_snapshot| {
        *state_snapshot = cdata_candid_bytes; 
    });
}

fn load_state_snapshot_data() {
    
    let cdata_of_the_state_snapshot: CData = with(&STATE_SNAPSHOT, |state_snapshot| {
        match decode_one::<CData>(state_snapshot) {
            Ok(cdata) => cdata,
            Err(_) => {
                trap("error decode of the state-snapshot CData");
                /*
                // or when upgrading the CData-struct, create a CData from the oldCData
                let old_cdata: OldCData = decode_one::<OldCData>(state_snapshot).unwrap();
                let cdata: CData = CData{
                    field_one: old_cdata.field_one,
                    ...
                };
                cdata
                */
            }
        }
    });

    with_mut(&CANISTER_DATA, |cdata| {
        *cdata = cdata_of_the_state_snapshot;    
    });
    
}


#[pre_upgrade]
fn pre_upgrade() {
    
    create_state_snapshot();
    
    let current_stable_size_wasm_pages: u64 = stable64_size();
    let current_stable_size_bytes: u64 = current_stable_size_wasm_pages * WASM_PAGE_SIZE_BYTES as u64;

    with(&STATE_SNAPSHOT, |state_snapshot| {
        let want_stable_memory_size_bytes: u64 = STABLE_MEMORY_HEADER_SIZE_BYTES + 8/*len of the state_snapshot*/ + state_snapshot.len() as u64; 
        if current_stable_size_bytes < want_stable_memory_size_bytes {
            stable64_grow(((want_stable_memory_size_bytes - current_stable_size_bytes) / WASM_PAGE_SIZE_BYTES as u64) + 1).unwrap();
        }
        stable64_write(STABLE_MEMORY_HEADER_SIZE_BYTES, &((state_snapshot.len() as u64).to_be_bytes()));
        stable64_write(STABLE_MEMORY_HEADER_SIZE_BYTES + 8, state_snapshot);
    });
}

#[post_upgrade]
fn post_upgrade() {
    let mut state_snapshot_len_u64_be_bytes: [u8; 8] = [0; 8];
    stable64_read(STABLE_MEMORY_HEADER_SIZE_BYTES, &mut state_snapshot_len_u64_be_bytes);
    let state_snapshot_len_u64: u64 = u64::from_be_bytes(state_snapshot_len_u64_be_bytes); 
    
    with_mut(&STATE_SNAPSHOT, |state_snapshot| {
        *state_snapshot = vec![0; state_snapshot_len_u64 as usize]; 
        stable64_read(STABLE_MEMORY_HEADER_SIZE_BYTES + 8, state_snapshot);
    });
    
    load_state_snapshot_data();
} 


#[update]
pub fn controller_create_state_snapshot() -> u64/*len of the state_snapshot*/ {
    if caller() != controller() {
        trap("Caller must be the controller for this method.");
    }
    
    create_state_snapshot();
    
    with(&STATE_SNAPSHOT, |state_snapshot| {
        state_snapshot.len() as u64
    })
}





#[query(manual_reply = true)]
pub fn controller_download_state_snapshot(chunk_i: u64) {
    if caller() != controller() {
        trap("Caller must be the controller for this method.");
    }
    let chunk_size: usize = 1 * MiB as usize;
    with(&STATE_SNAPSHOT, |state_snapshot| {
        reply::<(Option<&[u8]>,)>((state_snapshot.chunks(chunk_size).nth(chunk_i as usize),));
    });
}



#[update]
pub fn controller_clear_state_snapshot() {
    if caller() != controller() {
        trap("Caller must be the controller for this method.");
    }
    with_mut(&STATE_SNAPSHOT, |state_snapshot| {
        *state_snapshot = Vec::new();
    });    
}

#[update]
pub fn controller_append_state_snapshot(mut append_bytes: Vec<u8>) {
    if caller() != controller() {
        trap("Caller must be the controller for this method.");
    }
    with_mut(&STATE_SNAPSHOT, |state_snapshot| {
        state_snapshot.append(&mut append_bytes);
    });
}

#[update]
pub fn controller_load_state_snapshot_data() {
    if caller() != controller() {
        trap("Caller must be the controller for this method.");
    }
    
    load_state_snapshot_data();
}


We've been using GitHub - ZhenyaUsenko/motoko-migrations (a sample project structure to implement migrations in Motoko), which switches the upgrades to the core instantiating function... if this fails, I'd assume things would revert. It makes pre and post a bit irrelevant if you use only functional, stable types.

1 Like

I don't quite understand how this would work. Can you explain?

The pre_upgrade hook runs for the canister version that is being upgraded.

The post_upgrade hook runs for the canister version being installed.

How can I call the former from the latter?

If either of these is missing, the heap data is not serialized to stable memory and the app state is lost.

Please clarify if I'm misunderstanding something.

Yes, this please.

This would open the door to snapshot backups. And as a developer, I'm happy to pay the cycle storage costs for all the snapshot backups that I'm retaining.

Thank you for the sample code. I can definitely see the utility of this.

But this is boilerplate that every developer will have to write and maintain, because the solution is specific to how data is structured inside each canister.

The problem that I'm trying to raise is two-fold:

  • Canisters should not get stuck in an "un-upgradable" state.
  • Canisters need to provide a mechanism for data backup/restore that is data-agnostic, in case something does go wrong.

I believe both of the above require changes at the system/protocol level rather than have developers write more boilerplate code that increases the surface area of things that can go wrong.

2 Likes

There could be approaches where the upgrade hooks/system functions are not stored with the canister wasm but are treated as a separate wasm bundle, so that the application logic living inside a canister doesn't change at all. Of course, there are internal details/complications that I'm not aware of, hence such a naive suggestion.

However, my point is that these are technical hurdles at this stage. Dfinity engineers could definitely figure this out if safe backups/restores were prioritized on the roadmap.