I tried to upgrade one of our canisters that makes guaranteed-response calls, and stopping it takes much longer than before: ~6 minutes, where it used to be ~10 seconds. I wonder if it’s related to the recent changes or just the subnet being overloaded right now.
I’d like to confirm that while one-way calls may not officially “exist”, from Motoko they seem to exist and can be used. I’d imagine that under the covers the replica is reserving cycles for these as usual, so you likely have to account for them, but they won’t block you from upgrading Motoko canisters: Zero-downtime upgrades of Internet Computer canisters – Blog – Joachim Breitner's Homepage
(This is a very old article; I’m including it to get confirmation that nothing has changed.)
We use one-shots like this extensively in the ICRC-72 Event system to guarantee that untrusted canisters can’t block our upgrades.
togwv-zqaaa-aaaal-qr7aa-cai
It’s stuck in the ‘stopping’ phase.
It is not sending calls to untrusted canisters. Stopping/deploying/starting worked for a long while without problems, until today.
A canister will only be unable to stop if it has outstanding calls. (Or if there was a bug in the replica implementation, but this has been working fine for years now and I don’t recall any recent related changes.) You don’t have to call an untrusted canister, it can be any canister, as long as it doesn’t return. It can even be your canister itself, stuck in a retry loop (e.g. forever retrying a CanisterNotFound): a Stopping canister can still make outgoing calls, it only refuses to accept incoming calls.
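To make that concrete, here is a minimal sketch of such a retry loop (my own illustration, assuming the pre-0.18 ic-cdk call API; not taken from the replica or the canister in question):

```rust
use candid::Principal;

// A call context that never closes: this update keeps retrying a
// downstream call forever, so the canister can never finish stopping.
#[ic_cdk::update]
async fn poke_forever(target: Principal) {
    loop {
        match ic_cdk::call::<(), ()>(target, "poke", ()).await {
            Ok(()) => break,
            // E.g. forever retrying a CanisterNotFound reject.
            Err(_e) => continue,
        }
    }
}
```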
Looking at the subnet metrics (we don’t have per-canister metrics, there would be just too many of those) there appear to be at least a handful of call contexts older than 8 weeks, consistently making almost 2 calls per second between them. It’s impossible to say whether this is your canister or a different one, but whoever is making those calls in a loop will be impossible to stop (without uninstalling).
I don’t know this for sure, so please take this with a grain of salt (someone on the Execution team may be better able to confirm or deny this): while the idea behind Motoko’s one-way calls was indeed to make it possible to stop a canister with such calls still outstanding, I don’t think such behavior was ever implemented. IIUC, the precondition for stopping a canister is simply “there are no open call contexts”; and the precondition for closing a call context is “there are no open callbacks” (no checking whether the callback points to a valid continuation or to -1).
I.e. if you call into a (trusted or untrusted) canister that goes into (the equivalent of) an infinite call retry loop, or if your canister itself goes into such a loop, then your canister becomes (the wrong kind of) unstoppable.
Best-effort calls will address this issue (at least the part where you’re waiting for another canister to respond; you can still shoot yourself in the foot by going into an infinite retry loop yourself) by ensuring that any call returns something (potentially a SYS_UNKNOWN reject) within at most 5 minutes.
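For illustration, handling that SYS_UNKNOWN case would look roughly like this (my sketch, assuming the newer ic-cdk Call API for best-effort calls; names may differ between versions):

```rust
use candid::Principal;
use ic_cdk::call::Call;

async fn best_effort_poke(target: Principal) {
    // A bounded-wait (best-effort) call: the system guarantees *some*
    // response within the deadline, possibly a SYS_UNKNOWN reject.
    match Call::bounded_wait(target, "poke").await {
        Ok(_reply) => { /* the callee actually replied */ }
        // On SYS_UNKNOWN, the call may or may not have been executed.
        Err(e) => ic_cdk::println!("no definite response: {:?}", e),
    }
}
```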
Edit: It is indeed the case that a canister will only stop when it has no open call contexts AND no open callbacks (the latter is implied by the former, but it’s still better to be safe than sorry, I guess).
```rust
pub fn try_stop_canister(
    &mut self,
    is_expired: impl Fn(&StopCanisterContext) -> bool,
) -> (bool, Vec<StopCanisterContext>) {
    match self.status {
        // Canister is not stopping so we can skip it.
        CanisterStatus::Running { .. } | CanisterStatus::Stopped => (false, Vec::new()),
        // Canister is ready to stop.
        CanisterStatus::Stopping {
            ref call_context_manager,
            ref mut stop_contexts,
        } if call_context_manager.callbacks().is_empty()
            && call_context_manager.call_contexts().is_empty() =>
        // … (rest of the method elided in the quote)
```
And since a callback is always created, even for Motoko’s one-way calls, the only way for the canister to stop is for said callback to be closed when a response is delivered.
You mean because that prevents the canister from making any more calls? If so, that’s a neat trick.
The reason why the system won’t prevent a stopping canister from making downstream calls is that trapping in an await (or even just synchronously failing) is likely to leave the average canister in an inconsistent state (e.g. holding a lock or whatnot). But this intentional emergency “forced stop” is very much preferable to having to uninstall the canister (which would lose all the canister state).
I’d say it is rather critical to get a definitive answer on this, as my understanding is that there are multiple significant canisters in the wild (some even potentially blackholed) that depend on the merge from 4 years ago being effective and doing what it says it was intended to do.
Problem is, in order for this to work as advertised, Motoko would either have to be able to make a call without creating a callback at all (which it doesn’t; it creates a callback with an invalid index, as per the description of that PR); or there would need to be logic to allow stopping a canister with all-invalid callbacks (which there isn’t).
Meaning that if you do make such a “one-way” call to a canister that does not respond (and keeps spinning indefinitely), your canister becomes un-upgradeable.
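For reference, the Rust analogue of such a one-way call is ic-cdk’s fire-and-forget notify (again a sketch of my own, using the pre-0.18 API):

```rust
use candid::Principal;

// Fire-and-forget: no future is created and no response handler will
// ever run. Per the discussion above, the system still registers a
// callback for the outgoing call, so if `target` never responds, this
// canister cannot finish stopping either.
fn fire_and_forget(target: Principal) {
    if let Err(code) = ic_cdk::api::call::notify(target, "poke", ()) {
        ic_cdk::println!("notify failed synchronously: {:?}", code);
    }
}
```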
UPDATE: It was brought to my attention that a canister can be upgraded without first stopping it. I was not aware of this possibility and always assumed that when people talked about “calls to untrusted canisters blocking upgrades” they meant that a canister with a pending call cannot be stopped and a canister that is not stopped cannot be upgraded. The former is true, the latter isn’t.
So if we now look at the possibility of upgrading a canister without first stopping it, then relying on something like Motoko’s one-way calls is safer on two accounts:

1. There is zero possibility that executing the response after an upgrade will result in anything other than trapping (and is thus guaranteed to be a no-op).
2. Canisters making one-way calls don’t have an await, so you never end up in an inconsistent state (as you might if an await has extra logic after it, e.g. releasing a lock, and that await traps or simply never returns).
That being said, it appears that you can always upgrade a canister regardless. There’s just a risk that you might need to sort through the aftermath (e.g. release locks and other resources held by ongoing calls).
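A minimal sketch of that aftermath (my illustration, pre-0.18 ic-cdk API): a “lock” flag held across an await is never released if the canister is upgraded while the call is in flight.

```rust
use std::cell::Cell;
use candid::Principal;

thread_local! {
    static LOCKED: Cell<bool> = Cell::new(false);
}

#[ic_cdk::update]
async fn guarded_call(target: Principal) {
    LOCKED.with(|l| l.set(true)); // acquire the "lock"
    let _ = ic_cdk::call::<(), ()>(target, "poke", ()).await;
    // If the canister is upgraded while the call above is outstanding,
    // this line never runs and LOCKED stays true forever; you'd need
    // e.g. an admin method to reset it.
    LOCKED.with(|l| l.set(false)); // release
}
```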
This worked: `dfx canister --network ic update-settings --freezing-threshold 90000000000 --confirm-very-long-freezing-threshold canistername`
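(An assumption on my part, not from the post above: after the upgrade you presumably want to restore a sane freezing threshold, e.g. the default of 30 days: `dfx canister --network ic update-settings --freezing-threshold 2592000 canistername`.)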
I don’t think these belong to our canister. We’ve been upgrading it with stop/upgrade/start almost every day for 2-3 weeks, and it only got stuck yesterday. Thanks for the quick response. We will double-check all async calls; it’s quite a fringe architecture.
How do calls to the management canister fit into the new messaging model? Are the same new message types available for them as for normal inter-canister calls? Are there any differences in the guarantees they provide or in the cycles that get reserved for them?
Everything should work the same as with regular canisters.
The management canister has its own set of input and output queues. If a best-effort message gets stuck in there for long enough, it will expire and be dropped. If a lot of huge messages are enqueued there, some of the largest ones might get shed. Any attached cycles are lost. (The caller controls whether a call, and thus the resulting request and response, are guaranteed response or best-effort. From the POV of the management canister, it’s just an incoming request or outgoing response like any other.)
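So, sketching it (my example, with the same caveats about the newer ic-cdk Call API as above), a best-effort call to the management canister looks just like one to any other canister:

```rust
use candid::Principal;
use ic_cdk::call::Call;

async fn random_bytes_best_effort() {
    let mgmt = Principal::management_canister();
    // The caller picks best-effort here; the resulting request and
    // response can expire or be shed like any other best-effort message.
    match Call::bounded_wait(mgmt, "raw_rand").await {
        Ok(_reply) => { /* decode the random bytes from the reply */ }
        Err(e) => ic_cdk::println!("failed (possibly SYS_UNKNOWN): {:?}", e),
    }
}
```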