Queue + failing heartbeat + stopping canister = death spiral

bob11 · May 28, 2022, 4:58am

Here are steps to replicate:

1. Set up heartbeat to run a specific function (pulling from a queue) every time it runs.
2. Make sure that the process fails.
3. Turn on heartbeat. Heartbeat will now run forever trying to run the job in the queue, but will forever fail.
4. Try to stop the canister. The canister won’t stop because heartbeat has callbacks, but all requests from this point forward will fail and say “canister stopping”.
5. At this point, you cannot call the canister at all (because the canister is stopping), the canister will never stop (it can’t because heartbeat is running continuously), and heartbeat can’t stop because there is an item in the queue that keeps failing.
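
The steps above can be sketched as a toy model in Python. This is only an illustration of the behavior described in this post: it uses no Internet Computer API, and the `ToyCanister` class, its method names, and the order of events inside `tick` are all assumptions made up for the sketch, not the replica's actual scheduling.

```python
from collections import deque


class ToyCanister:
    """Toy model only: not the replica's real scheduler or any IC API."""

    def __init__(self, jobs):
        self.status = "running"
        self.queue = deque(jobs)
        self.outstanding_callbacks = 0

    def call(self):
        # Ordinary requests are only accepted while the canister is running.
        if self.status != "running":
            raise RuntimeError(f"canister {self.status}")
        return "ok"

    def request_stop(self):
        if self.status == "running":
            self.status = "stopping"

    def heartbeat(self):
        # Assumption taken from the post: heartbeat still fires while stopping.
        if self.status != "stopped" and self.queue:
            # Starting the job awaits a call, which registers a callback.
            self.outstanding_callbacks += 1

    def tick(self):
        # 1. Replies to last round's calls arrive. The job failed, so the
        #    item is never removed from the queue.
        self.outstanding_callbacks = 0
        # 2. Heartbeat retries the same item and opens a new callback.
        self.heartbeat()
        # 3. A stopping canister becomes stopped only with no callbacks left.
        if self.status == "stopping" and self.outstanding_callbacks == 0:
            self.status = "stopped"


canister = ToyCanister(["job that always fails"])
canister.tick()
canister.request_stop()
for _ in range(1000):
    canister.tick()
print(canister.status)  # still "stopping"
```

In this model the canister never reaches "stopped" because every round ends with one open callback, and `call` raises "canister stopping" for as long as that lasts.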

This is what I am affectionately calling the canister death spiral.

I know this may not be the best way to implement a queue with heartbeat, but it seems like there should be something I can do to rescue a canister in this state. My current best idea is to wait for the canister to run really low on cycles (down to the freezing threshold) and then regain control of the canister.

Any other thoughts/ideas?

paulyoung · May 28, 2022, 5:05am

Upgrade the canister to remove the heartbeat function?

bob11 · May 28, 2022, 5:10am

Tried it. Can’t upgrade the canister. Got this error:

error: code 5, message: "Canister 4fcza-biaaa-aaaah-abi4q-cai trapped explicitly: canister_pre_upgrade attempted with outstanding message callbacks (try stopping the canister before upgrade)"

paulyoung · May 28, 2022, 5:11am

Under “Canister Status”, the spec says this:

In all cases, calls to the management canister are processed, regardless of the state of the managed canister.

The management canister is the one that can install code and perform upgrades.

paulyoung · May 28, 2022, 5:12am

I guess using one-way functions would have prevented this. That doesn’t help now though.

paulyoung · May 28, 2022, 5:14am

Depending on the data involved, maybe you can reinstall instead of upgrade. You would lose all state though.

paulyoung · May 28, 2022, 5:16am

Not sure there’s much benefit to reinstalling over uninstalling other than I don’t remember if dfx has a command to uninstall.

bob11 · May 28, 2022, 5:17am

Yeah, we really need to preserve state, so hoping to not have to uninstall/reinstall

paulyoung · May 28, 2022, 5:20am

What’s the nature of the failure with processing the queue? Does the queue live in another canister?

I’m wondering if you can upgrade that instead to address the problem from that side.

bob11 · May 28, 2022, 5:27am

Queue lives on the same canister, so can’t upgrade on that side either

paulyoung · May 28, 2022, 5:35am

Does canister_pre_upgrade here only mean there’s a pre_upgrade function defined in your code?

If so, what happens if you remove both that and the heartbeat function and then try to upgrade using that?

paulyoung · May 28, 2022, 5:43am

If that doesn’t work, you should be able to downgrade the Motoko compiler to a version before this pull request and recompile your canister using that.

When you upgrade again you won’t get the canister_pre_upgrade error.

paulyoung · May 28, 2022, 7:49am

If downgrading is problematic you should be able to remove the relevant lines of code and build the Motoko compiler yourself.

I could do that and send you a build if you can tell me which version you’re using.

I’d also need to know some info on your system architecture. I’m on an M1 MacBook Pro so a similar device would be the most convenient. I also have a Linux VM that I can probably use if necessary.

diegop · May 28, 2022, 4:09pm

Let me ping folks internally to see if anyone has any other ideas

levi · May 28, 2022, 8:26pm

Seems to me it is better if the heartbeat stops waking up as soon as the canister is put into the stopping mode.
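
A minimal sketch of what this suggestion would change, again as a toy model in Python rather than real Internet Computer code. The function `rounds_until_stopped` and its parameters are hypothetical names invented for the illustration, and the round structure is an assumption based on the behavior described earlier in the thread.

```python
def rounds_until_stopped(heartbeat_while_stopping, max_rounds=1000):
    """Return the round in which a stopping toy canister stops, or None."""
    queue = ["job that always fails"]
    outstanding_callbacks = 1  # opened by a heartbeat before the stop request
    for round_number in range(1, max_rounds + 1):
        outstanding_callbacks = 0  # last round's failed call returns
        if heartbeat_while_stopping and queue:
            outstanding_callbacks += 1  # heartbeat retries and awaits again
        if outstanding_callbacks == 0:
            return round_number  # nothing left to wait for: stopped
    return None


print(rounds_until_stopped(heartbeat_while_stopping=True))   # None
print(rounds_until_stopped(heartbeat_while_stopping=False))  # 1
```

Under these assumptions, skipping heartbeat for a stopping canister lets the last callback drain, so the stop completes even though the failing item is still in the queue.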

torates · May 29, 2022, 12:33am

Hey Paul, I’m working with Bob through this issue. The canister is controlled by my default identity on dfx.

Also, I have an M1, but mine is a MacBook Air 2022, running dfx 0.9.3. Any clue on how to rebuild the Motoko compiler and use it instead of my current compiler?

Your help is much appreciated.

paulyoung · May 29, 2022, 12:41am

It required some non-trivial tweaks but I did a custom build of Motoko recently using a variation of the branch linked here: nix-shell failure · Issue #3041 · dfinity/motoko · GitHub

I should easily be able to do a build of 0.9.3 that removes the trap in the pre upgrade hook based on that. You’d be trusting that I’m not doing anything malicious in a build I send you though.

If you want to try it yourself and you’re familiar with nix, using GitHub - ninegua/ic-nix: Build Internet Computer projects with Nix might be easier. Then it would be a case of following the instructions here: motoko/Building.md at master · dfinity/motoko · GitHub

levi · May 29, 2022, 1:13am

The pre-upgrade hook gets run on the module that is already running in the canister. Only the post-upgrade hook is called on the new module. There is no way to upgrade the canister if the pre-upgrade hook fails. The Motoko code that traps is in the pre-upgrade hook, so it runs no matter what the new module contains. The only way to upgrade that canister (without reinstalling or uninstalling) is to stop all the pending callbacks. If the heartbeat keeps waking up when the canister is stopping, that seems like a bug to me.
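
The hook ordering described here can be sketched as a toy model in Python. Nothing below is a real Internet Computer or Motoko API: `ToyModule`, `upgrade`, `trapping_pre_upgrade`, and `Trap` are hypothetical names made up for the illustration, and the trap condition simply mirrors the error message quoted earlier in the thread.

```python
class Trap(Exception):
    """Stands in for a canister trap."""


class ToyModule:
    def __init__(self, name, pre_upgrade, post_upgrade):
        self.name = name
        self.pre_upgrade = pre_upgrade
        self.post_upgrade = post_upgrade


def upgrade(installed, new, outstanding_callbacks):
    """Return the module installed after the upgrade attempt."""
    try:
        # The hook of the module that is ALREADY installed runs first.
        installed.pre_upgrade(outstanding_callbacks)
    except Trap:
        return installed  # upgrade is rejected; the old code stays in place
    new.post_upgrade()  # only now does code from the new module run
    return new


def trapping_pre_upgrade(outstanding_callbacks):
    # Assumed check, modeled on the quoted error message.
    if outstanding_callbacks > 0:
        raise Trap("canister_pre_upgrade attempted with outstanding message callbacks")


def no_op(*_args):
    pass


old = ToyModule("old", trapping_pre_upgrade, no_op)
new = ToyModule("new", no_op, no_op)  # no heartbeat, no pre_upgrade check

print(upgrade(old, new, outstanding_callbacks=1).name)  # old
print(upgrade(old, new, outstanding_callbacks=0).name)  # new
```

In this model, removing the hook or the heartbeat from the new module makes no difference while callbacks are outstanding, because the check that traps belongs to the old module.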

paulyoung · May 29, 2022, 1:29am

That would make sense.

In that case the only way to address this might be to propose a change to the way heartbeat works and wait for that to land.

torates · May 29, 2022, 1:14pm

In any case, I think the following error message should be reformatted to avoid this issue in the future:

Error: The Replica returned an error: code 5, message: "Canister <canister_id> trapped explicitly: canister_pre_upgrade attempted with outstanding message callbacks (try stopping the canister before upgrade)"

Maybe include a clause telling the user not to try to stop the canister if it is using heartbeat in its current state?
