Trying to get a canister to upgrade itself (succeeds with an error)

We’ve been working on a canister that needs to upgrade itself.

So far we’ve gotten to the point where the canister can successfully upgrade itself, but the call throws an error (even though the upgrade succeeds).

Link to minimal example:
icp-self-upgrade-demo

To reproduce:

  1. Deploy the canister with:
    dfx deploy

  2. Add the canister itself as a controller:
    dfx canister call aaaaa-aa update_settings '(record { canister_id = principal "'$(dfx canister id child)'"; settings = record { controllers = opt vec { principal "'$(dfx canister id child)'"; principal "'$(dfx identity get-principal)'"; }; }; })'

  3. Call the upgrade function:
    dfx canister call child upgrade_canister

Although the canister upgrades itself successfully, the call throws an error:

Failed update call.
  The Replica returned an error: code 5, message: "Canister rrkah-fqaaa-aaaaa-aaaaq-cai violated contract: table not found

call_on_cleanup also failed:

Canister rrkah-fqaaa-aaaaa-aaaaq-cai violated contract: table not found"

Has anyone tried this before, or could someone with more knowledge of the inner workings of the IC shed some light on it?


This makes sense. You are making an async call to the management canister to upgrade the canister that is running. When the upgrade succeeds, the management canister sends a response to a canister that is no longer running the same code, so it cannot process the response as the call context is no longer there. It can’t pick up the same execution thread that the await happened in. So the work succeeded, but the reply can’t be processed.

Currently, the best solution is a very simple proxy canister. See the NNS or SNS root implementation for how the Governance canisters are upgraded.


Hi @LiveDuo, currently callbacks are registered using 1 function pointer and 1 env pointer per callback. The rust-cdk abstracts over that by pinning the Future/Waker and then registering the callback with the function-pointer as the global-cdk-callback-function, and with the env pointer as the pointer to the Future/Waker pin. When the callback comes in, the global-cdk-callback-function uses the env pointer to find the Future/Waker pin of the original Future and wakes up the future, thus continuing from the await point.
In this case, the global-cdk-callback-function in the new module after the upgrade has a different function pointer value than the one registered for the callback. The new module also does not contain the Future/Waker pin of the original Future, because it was not carried over through the upgrade. So both the function pointer and the env pointer are invalid.
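To make the pointer problem concrete, here is a toy model in plain Rust. It is only a sketch and not how the replica or the CDK is actually implemented: `Instance`, `RegisteredCallback`, `call_and_await` and `deliver` are hypothetical names, and the "function table" is reduced to a single index. It assumes that a response can only be handled when both the registered function index and the env value still resolve in the currently installed instance.

```rust
use std::collections::HashMap;

/// What the system keeps for an outstanding call: two raw numbers.
#[derive(Clone, Copy, Debug, PartialEq)]
struct RegisteredCallback {
    fun: u32, // table index of the callback function in the calling module
    env: u32, // opaque value handed back to it (here: key of the suspended future)
}

/// Toy model of one installed module instance (hypothetical, for illustration).
struct Instance {
    callback_index: u32,                 // where this build placed its callback function
    futures: HashMap<u32, &'static str>, // heap state: env value -> suspended future
}

impl Instance {
    /// A fresh instance, as after install or upgrade: no suspended futures.
    fn install(callback_index: u32) -> Self {
        Instance { callback_index, futures: HashMap::new() }
    }

    /// Suspend a future at an await point and hand its callback to the system.
    fn call_and_await(&mut self, env: u32, label: &'static str) -> RegisteredCallback {
        self.futures.insert(env, label);
        RegisteredCallback { fun: self.callback_index, env }
    }

    /// Deliver a response: both the function index and the env must resolve.
    fn deliver(&mut self, cb: RegisteredCallback) -> Result<String, String> {
        if cb.fun != self.callback_index {
            return Err(format!("no callback function at table index {}", cb.fun));
        }
        match self.futures.remove(&cb.env) {
            Some(label) => Ok(format!("resumed '{label}'")),
            None => Err(format!("no suspended future for env {}", cb.env)),
        }
    }
}
```

In this model, a callback registered by `Instance::install(7)` cannot be delivered to `Instance::install(9)` because the index moved, and it cannot be delivered to a second `Instance::install(7)` either, because the suspended future lived in the heap of the old instance.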

One upcoming feature on the roadmap is name-callbacks, which lets canisters register callbacks using function names (instead of pointers), so that callbacks can come in across upgrades as long as the new module contains a callback function with the same name as the registered callback. The name-callbacks feature lets canisters upgrade themselves safely and without stopping.
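The idea of looking callbacks up by name can be sketched in the same toy style. This is not the proposed system API, which was not available at the time of writing. `Module`, `export`, `deliver` and the `on_upgrade_done_*` handlers are hypothetical names, and the sketch assumes the callback needs no state other than the reply itself.

```rust
use std::collections::HashMap;

/// A callback handler takes the reply payload and returns a log line.
type Handler = fn(&str) -> String;

// Two builds of the "same" callback, as it might look before and after an upgrade.
fn on_upgrade_done_v1(reply: &str) -> String {
    format!("v1 handled: {reply}")
}

fn on_upgrade_done_v2(reply: &str) -> String {
    format!("v2 handled: {reply}")
}

/// Toy module: a map from exported function names to handlers.
struct Module {
    exports: HashMap<&'static str, Handler>,
}

impl Module {
    fn new() -> Self {
        Module { exports: HashMap::new() }
    }

    /// Export a handler under a name (builder style).
    fn export(mut self, name: &'static str, handler: Handler) -> Self {
        self.exports.insert(name, handler);
        self
    }

    /// Deliver a reply to whichever module is installed now, by callback name.
    fn deliver(&self, callback_name: &str, reply: &str) -> Result<String, String> {
        match self.exports.get(callback_name) {
            Some(handler) => Ok(handler(reply)),
            None => Err(format!("no exported callback named {callback_name}")),
        }
    }
}
```

A call registered under the name `on_upgrade_done` while v1 was installed can be delivered to a v2 module that still exports that name. Delivery fails only if the upgrade drops or renames the function.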


Thanks @msumme for going deeper into the issue.

Something we noticed is that the error happens even if the canister is upgraded to the same WASM module. As you point out, though, that doesn’t matter for now, since being unable to handle the error could be a reliability problem.

We will have a look at the NNS implementation and hopefully get some ideas. For now we are exploring other options, as our canisters have EOA controllers, and we will keep an eye on the name-callbacks feature @levi mentioned above.

We are assessing our options at the moment so that we are not held back for long. Are there any updates on the name-callbacks proposal since December?

For now we are exploring other options, as our canisters have EOA controllers, and we will keep an eye on the name-callbacks feature @levi mentioned above.

There is an option you can use if you don’t care about the response you get back (unlikely, I suppose, as you seem to care whether the upgrade has succeeded or not): the “one-way” calls that you get if you use the CDK’s notify method.

Building on what the others have explained in the thread, notify sets a function pointer which is invalid by construction (technically a value of -1), which means that when the callback is received, the system will just not execute any Wasm because of the invalid pointer. The response is still delivered and the system does its own bookkeeping (so no worries that you’re left with dangling callbacks), only your canister doesn’t get the opportunity to execute any code in the callback.
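The behaviour described above can be sketched as a small toy model. It is a simplification under stated assumptions, not the real system code: `System`, `call`, `notify` and `deliver_response` are hypothetical names, and the only rule modelled is that a negative callback index is never executed while the bookkeeping still happens.

```rust
use std::collections::HashMap;

/// The "always invalid" index that notify is described as registering.
const NO_CALLBACK: i32 = -1;

/// Toy model of the system's view of outstanding calls (hypothetical).
struct System {
    outstanding: HashMap<u64, i32>, // call id -> registered callback index
    executed: Vec<i32>,             // callback indices that actually ran canister code
    next_id: u64,
}

impl System {
    fn new() -> Self {
        System { outstanding: HashMap::new(), executed: Vec::new(), next_id: 0 }
    }

    /// Regular call: registers a real callback index.
    fn call(&mut self, callback_index: i32) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.outstanding.insert(id, callback_index);
        id
    }

    /// One-way call: registers the invalid index instead of a real callback.
    fn notify(&mut self) -> u64 {
        self.call(NO_CALLBACK)
    }

    /// Deliver a response. The outstanding entry is always cleaned up;
    /// the return value says whether any canister code was executed.
    fn deliver_response(&mut self, call_id: u64) -> bool {
        match self.outstanding.remove(&call_id) {
            Some(index) if index >= 0 => {
                self.executed.push(index);
                true
            }
            _ => false,
        }
    }
}
```

After delivering the response to a `notify` call, `outstanding` is empty again and `executed` is unchanged. That is the property that matters for a self-upgrade: no stale pointer is ever followed.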

I should still warn that this workaround only works reliably for canisters that use only such notifications and send no other messages to other canisters. If yours does send other messages, then you need to stop your canister first, but of course that doesn’t help when you’re trying to do a self-upgrade. Named callbacks aim to tackle this problem and provide a safe upgrade path without even having to stop your canister.

We are assessing our options at the moment so that we are not held back for long. Are there any updates on the name-callbacks proposal since December?

The team is currently wrapping up some other work, and hopefully we can get back to named callbacks once we have more capacity. I believe that we can have a first version of it in the next couple of months.


@dsarlis So far we have used the call method inside a spawn. Are there any advantages to using the notify method?

Turns out we might achieve a similar result with the post_upgrade method. Whatever code would run after the install_code method may run in post_upgrade. It’s not ideal, as the migration logic is coupled with the upgrade logic and there seem to be more ways to break the canister. Nevertheless, it’s a way forward.

In the meantime we’ll keep an eye on the protocol level roadmap for when name-callbacks are shipped.


@dsarlis So far we have used the call method inside a spawn. Are there any advantages to using the notify method?

I don’t think using call inside a spawn actually solves the problem described in this thread. Even if you use spawn, you would still have some callback that comes back eventually, and it might have the same problems if it’s executed and points to a wrong function index in the function table. The notify approach goes around this potential issue by using the “always invalid” index of -1.

Whatever code would run after the install_code method may run in post_upgrade

There’s a catch here if you’re not aware: you can’t run asynchronous code in post_upgrade (similar to init). So, in practice, if you need to make calls to other canisters, that is not possible in post_upgrade.
