Queue + failing heartbeat + stopping canister = death spiral

Here are steps to replicate:

  1. Set up heartbeat so that every tick runs a specific function that pulls a job from a queue.
  2. Make sure that the job fails.
  3. Turn on heartbeat. It will now run forever, retrying the job in the queue and failing every time.
  4. Try to stop the canister. The canister won’t stop because heartbeat has outstanding callbacks, but every request from this point forward fails with “canister stopping”.
  5. At this point, you cannot call the canister at all (because it is stopping), the canister will never stop (because heartbeat is running continuously), and heartbeat can’t stop because there is an item in the queue that keeps failing.

This is what I am affectionately calling the canister death spiral.
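
The steps above can be sketched as a toy state machine. This is only a model of the behaviour described in this post, not replica or Motoko code: the names `ToyCanister`, `tick`, and `in_flight` are made up for illustration, and it assumes that heartbeat keeps firing while the canister is stopping and that a reply takes one round to come back.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"


class ToyCanister:
    """Hypothetical model of a canister whose heartbeat drains a job queue."""

    def __init__(self, jobs):
        self.status = Status.RUNNING
        self.queue = deque(jobs)   # each job is a callable returning True on success
        self.in_flight = []        # jobs awaiting a reply; each one is an open callback

    def call(self):
        # Ordinary requests are rejected unless the canister is running.
        if self.status is not Status.RUNNING:
            raise RuntimeError(f"canister {self.status.value}")
        return "ok"

    def stop(self):
        self.status = Status.STOPPING
        self._maybe_finish_stopping()

    def tick(self):
        """One round: heartbeat fires, then replies from the previous round arrive."""
        replies, self.in_flight = self.in_flight, []
        # Assumption: heartbeat still fires in the stopping state.
        if self.status is not Status.STOPPED and self.queue:
            self.in_flight.append(self.queue[0])
        # A failed job stays at the head of the queue and is retried next round.
        for job in replies:
            if job() and self.queue and self.queue[0] is job:
                self.queue.popleft()
        self._maybe_finish_stopping()

    def _maybe_finish_stopping(self):
        # Stopping only completes once no callbacks are outstanding.
        if self.status is Status.STOPPING and not self.in_flight:
            self.status = Status.STOPPED


stuck = ToyCanister([lambda: False])   # a job that always fails
stuck.tick()
stuck.stop()
for _ in range(1000):
    stuck.tick()
print(stuck.status)                    # still Status.STOPPING
```

In this model a job that eventually succeeds empties the queue, the last callback closes, and the canister reaches the stopped state; a job that always fails keeps one callback open in every round.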

I know this may not be the best way to implement a queue with heartbeat, but it seems like there should be something I can do to rescue a canister in this state. My current best idea is to wait until the canister runs really low on cycles (down to the freezing threshold) and then regain control of it.

Any other thoughts/ideas?

Upgrade the canister to remove the heartbeat function?

Tried it. Can’t upgrade the canister. Got this error:

error: code 5, message: "Canister 4fcza-biaaa-aaaah-abi4q-cai trapped explicitly: canister_pre_upgrade attempted with outstanding message callbacks (try stopping the canister before upgrade)"

Under “Canister Status”, the spec says this:

In all cases, calls to the management canister are processed, regardless of the state of the managed canister.

The management canister is the one that can install code and perform upgrades.

I guess using one-way functions would have prevented this. That doesn’t help now, though.

Depending on the data involved, maybe you can reinstall instead of upgrade. You would lose all state though.

I’m not sure there’s much benefit to reinstalling over uninstalling, other than that I don’t remember whether dfx has a command to uninstall.

Yeah, we really need to preserve state, so I’m hoping not to have to uninstall/reinstall.

What’s the nature of the failure with processing the queue? Does the queue live in another canister?

I’m wondering if you can upgrade that instead to address the problem from that side.

The queue lives on the same canister, so we can’t upgrade on that side either.

Does canister_pre_upgrade here only mean there’s a pre_upgrade function defined in your code?

If so, what happens if you remove both that and the heartbeat function and then try to upgrade using that?

If that doesn’t work, you should be able to downgrade the Motoko compiler to a version before this pull request and recompile your canister using that.

When you upgrade again you won’t get the canister_pre_upgrade error.

If downgrading is problematic you should be able to remove the relevant lines of code and build the Motoko compiler yourself.

I could do that and send you a build if you can tell me which version you’re using.

I’d also need to know some info on your system architecture. I’m on an M1 MacBook Pro so a similar device would be the most convenient. I also have a Linux VM that I can probably use if necessary.

Let me ping folks internally to see if anyone has any other ideas

It seems to me it would be better if heartbeat stopped waking up as soon as the canister is put into the stopping state.
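
A tiny model of that suggestion, for comparison. It is not replica code: the function name and the `heartbeat_while_stopping` flag are hypothetical, and it assumes exactly one heartbeat call is in flight when the stop is requested and that each reply arrives one round later.

```python
def rounds_until_stopped(heartbeat_while_stopping, max_rounds=100):
    """Return the round in which a stopping canister becomes stopped, or None."""
    open_callbacks = 1  # assumption: one heartbeat call is pending at stop time
    for round_no in range(1, max_rounds + 1):
        # The call from the previous round gets its reply, closing that callback.
        open_callbacks -= 1
        # If heartbeat still fires while stopping, it immediately opens a new one.
        if heartbeat_while_stopping:
            open_callbacks += 1
        if open_callbacks == 0:
            return round_no
    return None  # never reached the stopped state within max_rounds


print(rounds_until_stopped(heartbeat_while_stopping=True))
print(rounds_until_stopped(heartbeat_while_stopping=False))
```

Under these assumptions the stop never completes while heartbeat keeps firing, and completes as soon as the pending reply arrives once heartbeat is suppressed in the stopping state.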

Hey Paul, I’m working with Bob through this issue. The canister is controlled by my default identity on dfx.

Also, I have an M1, but mine is a MacBook Air 2022, running dfx 0.9.3. Any clue on how to rebuild the Motoko compiler and use it instead of my current compiler?

Your help is much appreciated.

It required some non-trivial tweaks, but I did a custom build of Motoko recently using a variation of the branch linked in dfinity/motoko issue #3041 (“nix-shell failure”) on GitHub.

I should easily be able to do a build of 0.9.3 that removes the trap in the pre upgrade hook based on that. You’d be trusting that I’m not doing anything malicious in a build I send you though.

If you want to try it yourself and you’re familiar with nix, using ninegua/ic-nix on GitHub (“Build Internet Computer projects with Nix”) might be easier. Then it would be a case of following the instructions in Building.md on the master branch of dfinity/motoko.

The pre-upgrade hook gets run on the module that is already running in the canister. Only the post-upgrade hook is called on the new module. There is no way to upgrade the canister if the pre-upgrade hook fails. The Motoko code that traps is in the pre-upgrade hook, so changing the new module does not avoid it. The only way to upgrade that canister (without reinstalling or uninstalling) is to stop all the pending callbacks. If the heartbeat keeps waking up when the canister is stopping, that seems like a bug to me.
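
To make the ordering concrete, here is a rough sketch of that upgrade sequence. It is a toy model, not the real system API: `OldModule`, `NewModule`, `upgrade`, and `Trap` are invented names, and the callback check is modelled only on the error text quoted earlier in the thread.

```python
class Trap(Exception):
    """Stands in for a canister trap."""


class OldModule:
    """The code already installed; its pre-upgrade hook checks for open callbacks."""

    def __init__(self, outstanding_callbacks, state):
        self.outstanding_callbacks = outstanding_callbacks
        self.state = state

    def pre_upgrade(self):
        if self.outstanding_callbacks > 0:
            raise Trap("canister_pre_upgrade attempted with outstanding message callbacks")
        return dict(self.state)  # pretend this copy is stable memory


class NewModule:
    """The replacement code, with no heartbeat and no pre-upgrade logic of its own."""

    def __init__(self):
        self.state = None

    def post_upgrade(self, stable):
        self.state = stable


def upgrade(installed, replacement):
    """Return whichever module is installed after the upgrade attempt."""
    try:
        stable = installed.pre_upgrade()  # runs the OLD module's hook
    except Trap:
        return installed                  # upgrade abandoned, old code stays
    replacement.post_upgrade(stable)      # only now does the NEW code run
    return replacement


stuck = OldModule(outstanding_callbacks=1, state={"queue": ["job"]})
print(upgrade(stuck, NewModule()) is stuck)  # True: the new module never ran
```

In this model nothing in the new module is consulted before the old hook has succeeded, which is why removing the heartbeat or the pre-upgrade function from the new code cannot help on its own.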

That would make sense.

In that case the only way to address this might be to propose a change to the way heartbeat works and wait for that to land.

In any case, I think the following error message should be reformatted to avoid this issue in the future:

Error: The Replica returned an error: code 5, message: "Canister <canister_id> trapped explicitly: canister_pre_upgrade attempted with outstanding message callbacks (try stopping the canister before upgrade)"

Maybe include a clause telling the user not to try to stop the canister if it is using heartbeat in its current state?