Queue + failing heartbeat + stopping canister = death spiral

nomeata · May 30, 2022, 8:13pm

One could argue that a canister in state stopping shouldn’t be running the heartbeat, just like it shouldn’t handle other “incoming” calls. And in fact that’s how it is specified in the Interface Spec (https://ic-interface-spec.netlify.app/#_call_context_creation, in “Call context creation: Heartbeat”).

So what should happen when you put the canister into stopping mode is that it waits for outstanding callbacks, but eventually comes to rest.

We should get confirmation from the execution team whether the heartbeat is really not running for stopping canisters.

berestovskyy · May 30, 2022, 8:49pm

hey all,
@nomeata yes, we do not allow heartbeats to run when the canister is in stopping state (GitHub link)

@bob11 I guess it’s not the heartbeat but rather a call inside a callback, which is allowed by the spec:

Note that when processing responses, a stopping canister can make calls to other canisters and thus create new call contexts.

I also agree that falling below a freezing threshold might help, as:

a canister cannot perform calls if that would, due the cost of the call and transferred cycles, would push the balance into frozen territory; these calls fail with ic0.call_perform returning a non-zero error code.

There is a hidden option in dfx which allows setting a freezing threshold:

dfx canister update-settings --freezing-threshold <FREEZING_THRESHOLD>

Could you please try to set it arbitrarily high and see if it breaks the death spiral?
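To illustrate why a very high threshold would cut off outgoing calls, here is a minimal sketch of the check described in the quoted spec text. The helper name `call_allowed` and the exact boundary condition (`>=`) are assumptions for illustration only; this is not an IC API, and the real check lives in the replica.

```rust
/// Returns true if a call costing `call_cost` cycles (fees plus any
/// transferred cycles) would leave the balance at or above the freezing
/// threshold expressed in cycles.
/// Hypothetical helper for illustration; not part of any IC library.
fn call_allowed(balance: u128, call_cost: u128, threshold_cycles: u128) -> bool {
    match balance.checked_sub(call_cost) {
        // The call is only allowed if the remainder stays out of frozen territory.
        Some(remaining) => remaining >= threshold_cycles,
        // The cost exceeds the balance outright.
        None => false,
    }
}
```

With a threshold far above the current balance, every call made from a callback would be rejected under this model, which is the effect being suggested here.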

torates · May 30, 2022, 9:53pm

We decided to reinstall the canister, we had a record of the transactions so we were able to recover the state somewhat. Seems I was late!

But this is a very useful option. Is there any particular reason why it is hidden? I think it should be included in the official documentation.

berestovskyy · May 31, 2022, 10:38am

Ah, super cool that you were able to recover the data, @torates

We should try to address this issue on the IC level, so other folks won’t get into the same troubles…

Looking into your source code, do you think you could have a retry logic like this:

async fn heartbeat() {
    // ...
    'retry: loop {
        // Retries immediately and without limit while the call keeps failing.
        if call().await.is_err() { continue 'retry }
        else { break 'retry }
    }
    // ...
}

Maybe not explicitly, but scattered across the code base?
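For comparison, a minimal sketch of a bounded variant of such a loop. It is a synchronous model only: `try_call` is a hypothetical stand-in for an inter-canister call, and `call_with_retry_limit` and `RetryOutcome` are names invented for this example. Real canister code would be async and would use the actual call API.

```rust
/// Outcome of a bounded retry loop.
#[derive(Debug, PartialEq)]
enum RetryOutcome {
    Succeeded { attempts: u32 },
    GaveUp { attempts: u32 },
}

/// Calls `try_call` until it succeeds or `max_attempts` is reached.
/// `try_call` is a hypothetical stand-in for an inter-canister call.
fn call_with_retry_limit<F>(mut try_call: F, max_attempts: u32) -> RetryOutcome
where
    F: FnMut() -> Result<(), String>,
{
    for attempt in 1..=max_attempts {
        if try_call().is_ok() {
            return RetryOutcome::Succeeded { attempts: attempt };
        }
    }
    // Give up instead of retrying forever, so no new call contexts
    // are created once the limit is hit.
    RetryOutcome::GaveUp { attempts: max_attempts }
}
```

The point of the cap is that a canister in stopping state eventually runs out of outstanding callbacks, rather than creating a new call on every failure.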

LightningLad91 · June 13, 2022, 1:40pm

Hello. I recently deployed a canister that has been burning through 2T cycles an hour. When I tried to upgrade the canister (adding CanisterGeek methods for monitoring) I received the same error as @bob11.

The canister provides a method to “kill” the heartbeat. It basically toggles a Boolean that is checked before the methods in the heartbeat function are called. Unfortunately, this does not resolve the issue.

@berestovskyy do you still think that raising the freezing threshold is a good path to take? Can I upgrade while the canister is frozen or would I need to lower the threshold first?

@torates would you be willing to share how you were able to recover the state of your canister?

Thanks all!

berestovskyy · June 13, 2022, 1:55pm

Raising the threshold should stop the Canister burning the Cycles. That’s a good first step IMO.

The next step would be to try to upgrade the Canister with a fix as normal before going into other hacks…

LightningLad91 · June 13, 2022, 2:24pm

I’m not sure what a fix to normal would look like in this case. I don’t really have a lot of insight into what the canister is doing; that’s why I was hoping to add some methods for monitoring activity. It sounds like raising the threshold is a good first step, so I will take this back to the team so we can plan our response. Thank you

LightningLad91 · June 13, 2022, 5:15pm

Raising the freezing threshold succeeded using the command provided. Unfortunately, it seems to have set the threshold to an absurdly high number.

Error: The replica returned an HTTP Error: Http Error: status 503 Service Unavailable, content type “application/cbor”, content: Canister t2mog-myaaa-aaaal-aas7q-cai is out of cycles: requested 1410000 cycles but the available balance is 29059112097781 cycles and the freezing threshold 10019854835671809501 cycles

This is the command I used to raise the threshold:

dfx canister --network=ic update-settings --all --freezing-threshold 27563000000000

I don’t understand why the threshold was set so high. I’ve submitted a support ticket to Dfinity. Hopefully we can find a path forward.

berestovskyy · June 13, 2022, 6:06pm

Oh, sorry @LightningLad91

It’s counter-intuitive, but the freezing threshold is in seconds, not in cycles: The Internet Computer Interface Specification | Internet Computer Home

Now the limit can’t be lowered, because the Canister is controlled by a user and the IC tries to charge the Canister for an ingress message.

We’re preparing a patch to unblock you…
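As a rough illustration of the seconds-versus-cycles mix-up, here is a small sketch that assumes the threshold in cycles scales linearly with the threshold in seconds times an idle burn rate. That linear model and both function names are assumptions made for this example; the actual estimate is computed by the IC.

```rust
/// Approximate freezing threshold in cycles, assuming it is simply
/// the threshold in seconds times an idle burn rate in cycles/second.
/// (Assumed model for illustration, not the replica's exact formula.)
fn threshold_in_cycles(threshold_seconds: u128, idle_cycles_per_second: u128) -> u128 {
    threshold_seconds * idle_cycles_per_second
}

/// Burn rate implied by a reported threshold (integer division).
fn implied_rate(threshold_cycles: u128, threshold_seconds: u128) -> u128 {
    threshold_cycles / threshold_seconds
}
```

Plugging in the numbers from the error message above, `implied_rate(10_019_854_835_671_809_501, 27_563_000_000_000)` gives 363,525 cycles per second under this model. The value 27563000000000 was meant as cycles but was read as seconds, which is roughly 870,000 years.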

LightningLad91 · June 13, 2022, 6:07pm

Thank you @berestovskyy ! Both Dfinity and Toniq have been very helpful in this matter and I appreciate that.

LightningLad91 · June 14, 2022, 11:00pm

Our canister is up and running! Thank you again to @berestovskyy @bob11 and everyone at Dfinity who helped push this fix so quickly.

icme · September 14, 2022, 5:34am

According to the description of freezing_threshold in this link

“A canister is considered frozen whenever the IC estimates that the canister would be depleted of cycles before freezing_threshold seconds pass, given the canister’s current size and the IC’s current cost for storage.”

Where and how is this estimate calculated?

A few follow-up questions regarding behavior after hitting the freezing threshold:

Does a canister continue to burn through cycles at roughly the storage cost rate?
Can the freezing threshold be lowered via the management canister in order to “un-freeze” the canister?
Can all remaining cycles be harvested from the canister in order to “zombify it”?

Severin · September 14, 2022, 6:29am

I’m no expert in this topic, but my understanding is as follows:

Yes, freezing means that your canister will be unresponsive for a while so you can notice it and top it up with cycles before it runs out of cycles entirely and the data is deleted.

Almost all cycles. dfx canister delete requires an unfrozen canister to withdraw cycles, but if you do the management call uninstall_code you should be able to unfreeze basically any canister. EDIT: I probably misunderstood this question, please disregard.

dsarlis · September 14, 2022, 12:53pm

Allow me to add some more colour to Severin’s response:

Can the freezing threshold be lowered via the management canister in order to “un-freeze” the canister

Yes, it’s possible.

Can all remaining cycles be harvested from the canister in order to “zombify it”?

Things are a bit more tricky if you really want to salvage cycles from a canister that’s frozen. If you use uninstall_code as Severin said, technically your canister will be unfrozen but that’s because your canister’s data will be deleted so there’s nothing that you need to keep paying for. The wasm module will also be gone, which means you won’t have a working canister to transfer cycles out.

I think you actually need to unfreeze it first and then transfer the cycles out somehow (e.g. if the canister exposes some method to transfer them to someone else). You’ll likely also need to tune down your freezing threshold so you can send out as many of your cycles as possible (you cannot dip into your freezing threshold because we need it to ensure your canister can pay for storage, but if you drop it to something that would be enough to pay for a couple of hours and then transfer out, your loss would be negligible).

Finally, canister deletion wouldn’t work since remaining cycles are discarded in that case as per the interface spec.
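The effect of tuning down the threshold before transferring out can be sketched with some back-of-the-envelope arithmetic. This assumes the reserve is the threshold in seconds times an idle burn rate, and it ignores the cost of the transfer call itself. The function name, the 1T balance and the burn rate of 100,000 cycles per second in the usage note are hypothetical values, not measurements.

```rust
/// Cycles that could be sent out without dipping below the freezing
/// threshold, under the assumed model `reserve = seconds * rate`.
/// Ignores the fee for the transfer call itself.
fn withdrawable(balance: u128, threshold_seconds: u128, idle_cycles_per_second: u128) -> u128 {
    let reserve = threshold_seconds.saturating_mul(idle_cycles_per_second);
    // Nothing can be withdrawn if the balance is already below the reserve.
    balance.saturating_sub(reserve)
}
```

With a balance of 1T cycles and that hypothetical rate, a threshold of 2,592,000 seconds (30 days) leaves 740,800,000,000 cycles withdrawable, while a threshold of 7,200 seconds (two hours) leaves 999,280,000,000.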

Severin · September 14, 2022, 4:20pm

Thank you for correcting, @dsarlis, I misunderstood the question. I was answering for “can the cycles be harvested from a zombified canister (I read this as ‘a frozen canister’)?”
