Queue + failing heartbeat + stopping canister = death spiral

One could argue that a canister in state stopping shouldn’t be running the heartbeat, just like it shouldn’t handle other “incoming” calls. And in fact that’s how it is specified in the Interface Spec (https://ic-interface-spec.netlify.app/#_call_context_creation, in “Call context creation: Heartbeat”).

So what should happen when you put the canister into stopping mode is that it waits for outstanding callbacks, but eventually comes to rest.

We should get confirmation from the execution team whether the heartbeat is really not running for stopping canisters.

hey all,
@nomeata yes, we do not allow heartbeats to run when the canister is in stopping state (GitHub link)

@bob11 I guess it’s not the heartbeat but rather a call inside a callback, which is allowed by the spec:

Note that when processing responses, a stopping canister can make calls to other canisters and thus create new call contexts.

I also agree that falling below a freezing threshold might help, as:

a canister cannot perform calls if that would, due the cost of the call and transferred cycles, would push the balance into frozen territory; these calls fail with ic0.call_perform returning a non-zero error code.

There is a hidden option in dfx which allows you to set a freezing threshold:

dfx canister update-settings --freezing-threshold <FREEZING_THRESHOLD>

Could you please try to set it arbitrarily high and see if it breaks the death spiral?
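
To illustrate why a high freezing threshold would stop new calls, here is a minimal sketch of the rule quoted above. It is a simplified model in plain Rust, not the replica's implementation: the function `call_perform` only mimics the behaviour of `ic0.call_perform`, the error value `1` is an arbitrary non-zero placeholder, and all numbers are made up.

```rust
/// Simplified model of the freezing rule (assumption, not replica code).
/// Returns 0 if the call may go out, and a non-zero code if paying for it
/// would push the balance below the frozen reserve.
fn call_perform(balance: u128, call_cost: u128, transferred: u128, frozen_reserve: u128) -> u32 {
    let needed = call_cost.saturating_add(transferred);
    match balance.checked_sub(needed) {
        // Enough cycles remain above the reserve: the call is allowed.
        Some(remaining) if remaining >= frozen_reserve => 0,
        // Either the balance cannot cover the call at all, or it would dip
        // into the reserve: the call is rejected.
        _ => 1,
    }
}

fn main() {
    let balance = 1_000_000_000_000u128; // hypothetical balance
    let call_cost = 1_000_000u128; // hypothetical cost of one call

    // With a small reserve the call goes through.
    println!("low threshold:  code {}", call_perform(balance, call_cost, 0, 10_000_000));
    // With a reserve above the balance every outgoing call is rejected,
    // so callbacks can no longer start new calls.
    println!("high threshold: code {}", call_perform(balance, call_cost, 0, 2_000_000_000_000));
}
```

Under this model, once the reserve exceeds the balance, a stopping canister cannot create new call contexts and should eventually come to rest.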

We decided to reinstall the canister, we had a record of the transactions so we were able to recover the state somewhat. Seems I was late!

But this is a very useful option. Is there any particular reason why it is hidden? I think it should be included in the official documentation.

Ah, super cool that you were able to recover the data, @torates

We should try to address this issue on the IC level, so other folks won’t get into the same troubles…

Looking into your source code, do you think you could have a retry logic like this:

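async fn heartbeat() {
    ...
    // Retry until the call succeeds. If the call keeps failing, every retry
    // creates a new outstanding call, so a stopping canister never comes to rest.
    'retry: loop {
        if call().await.is_err() { continue 'retry }
        else { break 'retry }
    }
    ...
}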

Maybe not explicitly, but scattered across the code base?
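
One way to keep such a loop from feeding the spiral is to cap the number of attempts. The following is a minimal sketch, not the canister's actual code: `call_with_retries` and its closure argument are hypothetical stand-ins for a real inter-canister call, written in plain synchronous Rust so that it runs outside the IC.

```rust
/// Runs `call` at most `max_attempts` times (hypothetical helper).
/// Returns the attempt number that succeeded, or the last error.
fn call_with_retries<F>(mut call: F, max_attempts: u32) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match call() {
            Ok(()) => return Ok(attempt),
            // Remember the error and try again until the budget is used up.
            Err(e) => last_err = e,
        }
    }
    // Give up instead of looping forever, so no further calls are created.
    Err(last_err)
}

fn main() {
    // A callee that always fails, standing in for a broken dependency.
    let mut calls = 0;
    let result = call_with_retries(
        || {
            calls += 1;
            Err("callee rejected".to_string())
        },
        3,
    );
    println!("{:?} after {} calls", result, calls);
}
```

With a bounded loop, a permanently failing callee costs a fixed number of calls per heartbeat rather than an unbounded chain of callbacks.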

Hello. I recently deployed a canister that has been burning through 2T cycles an hour. When I tried to upgrade the canister (adding CanisterGeek methods for monitoring) I received the same error as @bob11.

The canister provides a method to “kill” the heartbeat. It basically toggles a Boolean that is checked before the methods in the heartbeat function are called. Unfortunately, this does not resolve the issue.

@berestovskyy do you still think that raising the freezing threshold is a good path to take? Can I upgrade while the canister is frozen or would I need to lower the threshold first?

@torates would you be willing to share how you were able to recover the state of your canister?

Thanks all!

Raising the threshold should stop the Canister burning the Cycles. That’s a good first step IMO.

The next step would be to try to upgrade the Canister with a fix as normal before going into other hacks…

I’m not sure what a fix to normal would look like in this case. I don’t really have a lot of insight into what the canister is doing; that’s why I was hoping to add some methods for monitoring activity. It sounds like raising the threshold is a good first step, so I will take this back to the team so we can plan our response. Thank you.

Raising the freezing threshold succeeded using the command provided. Unfortunately, it seems to have set the threshold to an absurdly high number.

Error: The replica returned an HTTP Error: Http Error: status 503 Service Unavailable, content type “application/cbor”, content: Canister t2mog-myaaa-aaaal-aas7q-cai is out of cycles: requested 1410000 cycles but the available balance is 29059112097781 cycles and the freezing threshold 10019854835671809501 cycles

This is the command I used to raise the threshold:

dfx canister --network=ic update-settings --all --freezing-threshold 27563000000000

I don’t understand why the threshold was set so high. I’ve submitted a support ticket to Dfinity. Hopefully we can find a path forward.

Oh, sorry @LightningLad91

It’s counter-intuitive, but the freezing threshold is in seconds, not in cycles: The Internet Computer Interface Specification | Internet Computer Home
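
To make the unit mix-up concrete, here is a rough back-of-the-envelope sketch using the two numbers from the error message above. It assumes a purely linear model (reserve in cycles = idle burn rate per second × `freezing_threshold` in seconds); the IC's real estimate depends on the canister's size and current storage cost, and the helper names are made up for illustration.

```rust
const SECONDS_PER_YEAR: u128 = 365 * 24 * 60 * 60;

/// Idle burn rate (cycles/second) implied by a reserve and a threshold,
/// under the assumed linear model. None if the threshold is zero.
fn implied_burn_rate(reserve_cycles: u128, threshold_secs: u128) -> Option<u128> {
    reserve_cycles.checked_div(threshold_secs)
}

/// Threshold in seconds that would reserve roughly `target_cycles`.
/// None if the burn rate is zero.
fn seconds_for_reserve(target_cycles: u128, burn_per_sec: u128) -> Option<u128> {
    target_cycles.checked_div(burn_per_sec)
}

fn main() {
    // Value passed to --freezing-threshold, interpreted by the IC as seconds.
    let threshold_secs: u128 = 27_563_000_000_000;
    // Freezing threshold in cycles as reported in the error message.
    let reported_reserve: u128 = 10_019_854_835_671_809_501;

    let rate = implied_burn_rate(reported_reserve, threshold_secs).expect("threshold is non-zero");
    println!("implied idle burn: about {} cycles/second", rate);
    println!("threshold passed: about {} years", threshold_secs / SECONDS_PER_YEAR);

    // If the intent was to reserve 27_563_000_000_000 cycles, the matching
    // threshold in seconds would have been far smaller.
    let intended = seconds_for_reserve(27_563_000_000_000, rate).expect("rate is non-zero");
    println!("seconds for the intended reserve: about {}", intended);
}
```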

Now the limit can’t be lowered: the Canister is controlled by a user, so the IC tries to charge the Canister for the ingress message, and the frozen Canister can’t pay for it.

We’re preparing a patch to unblock you…

Thank you @berestovskyy ! Both Dfinity and Toniq have been very helpful in this matter and I appreciate that.

Our canister is up and running! Thank you again to @berestovskyy @bob11 and everyone at Dfinity who helped push this fix so quickly.

According to the description of freezing_threshold in the link above:

“A canister is considered frozen whenever the IC estimates that the canister would be depleted of cycles before freezing_threshold seconds pass, given the canister’s current size and the IC’s current cost for storage.”

Where and how is this estimate calculated?

A few follow-up questions regarding behavior after hitting the freezing threshold:

  • Does a canister continue to burn through cycles at roughly the storage cost rate?
  • Can the freezing threshold be lowered via the management canister in order to “un-freeze” the canister?
  • Can all remaining cycles be harvested from the canister in order to “zombify it”?

I’m no expert in this topic, but my understanding is as follows:

Yes, freezing means that your canister will be unresponsive for a while so you can notice it and top it up with cycles before it runs out of cycles entirely and the data is deleted.

Almost all cycles. dfx canister delete requires an unfrozen canister to withdraw cycles, but if you do the management call uninstall_code you should be able to unfreeze basically any canister. EDIT: I probably misunderstood this question, please disregard.

Allow me to add some more colour to Severin’s response:

Can the freezing threshold be lowered via the management canister in order to “un-freeze” the canister

Yes, it’s possible.

Can all remaining cycles be harvested from the canister in order to “zombify it”?

Things are a bit more tricky if you really want to salvage cycles from a canister that’s frozen. If you use uninstall_code as Severin said, technically your canister will be unfrozen but that’s because your canister’s data will be deleted so there’s nothing that you need to keep paying for. The wasm module will also be gone, which means you won’t have a working canister to transfer cycles out.

I think you actually need to unfreeze it first and then transfer the cycles out somehow (e.g. if the canister exposes some method to transfer them to someone else). You’ll likely also need to tune down your freezing threshold so you can send out as many of your cycles as possible. You cannot dip into your freezing threshold, because we need it to ensure your canister can pay for storage, but if you drop it to something that would be enough to pay for a couple of hours and then transfer out, your loss would be negligible.
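
As a rough illustration of that trade-off, the sketch below estimates how many cycles could leave a canister for two different freezing thresholds. It is a simplified model under stated assumptions: the reserve is taken to be burn rate × threshold seconds, the burn rate of 363_525 cycles/second is only a rough figure implied by the error message earlier in this thread, call and transfer fees are ignored, and `transferable_cycles` is a hypothetical helper.

```rust
/// Cycles that must stay in the canister under the assumed linear model.
fn frozen_reserve(burn_per_sec: u128, threshold_secs: u128) -> u128 {
    burn_per_sec.saturating_mul(threshold_secs)
}

/// Upper bound on cycles that could be sent out without dipping into
/// the reserve (fees ignored). Zero if the canister is already frozen.
fn transferable_cycles(balance: u128, burn_per_sec: u128, threshold_secs: u128) -> u128 {
    balance.saturating_sub(frozen_reserve(burn_per_sec, threshold_secs))
}

fn main() {
    let balance: u128 = 29_059_112_097_781; // balance from the error message above
    let burn_per_sec: u128 = 363_525; // rough assumption, see lead-in

    let thirty_days: u128 = 30 * 24 * 60 * 60;
    let two_hours: u128 = 2 * 60 * 60;

    println!(
        "30-day threshold: up to {} cycles transferable",
        transferable_cycles(balance, burn_per_sec, thirty_days)
    );
    println!(
        "2-hour threshold: up to {} cycles transferable",
        transferable_cycles(balance, burn_per_sec, two_hours)
    );
}
```

The lower the threshold, the smaller the reserve left behind, at the price of less warning time before the canister runs out of cycles.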

Finally, canister deletion wouldn’t work since remaining cycles are discarded in that case as per the interface spec.

Thank you for correcting, @dsarlis, I misunderstood the question. I was answering for “can the cycles be harvested from a zombified canister (I read this as ‘a frozen canister’) ?”