Tried to upgrade one of our canisters that makes guaranteed-response calls, and stopping it now takes much longer than before: ~6 minutes, where it used to be about 10 seconds. I wonder whether this is related to the recent changes or whether the subnet is just overloaded right now.
I’d like to confirm that while one-way calls may not officially “exist”, from Motoko they seem to exist and can be used. I’d imagine that under the covers the replica is reserving cycles for these as usual, so you likely have to account for them, but they won’t block you from upgrading Motoko canisters: Zero-downtime upgrades of Internet Computer canisters – Blog – Joachim Breitner's Homepage
(This is a very old article and I’m including it to get confirmation that nothing has changed.)
We use one-shots like this extensively in the ICRC-72 Event system to guarantee that untrusted canisters can’t block our upgrades.
togwv-zqaaa-aaaal-qr7aa-cai
It’s stuck in the ‘stopping’ phase.
It is not sending calls to untrusted canisters. Stopping/deploying/starting worked without problems for a long while; only today did it get stuck.
A canister will only be unable to stop if it has outstanding calls. (Or if there was a bug in the replica implementation, but this has been working fine for years now and I don’t recall any recent related changes.) You don’t have to call an untrusted canister, it can be any canister, as long as it doesn’t return. It can even be your canister itself, stuck in a retry loop (e.g. forever retrying a CanisterNotFound): a Stopping canister can still make outgoing calls, it only refuses to accept incoming calls.
Looking at the subnet metrics (we don’t have per-canister metrics, there would be just too many of those) there appear to be at least a handful of call contexts older than 8 weeks, consistently making almost 2 calls per second between them. It’s impossible to say whether this is your canister or a different one, but whoever is making those calls in a loop will be impossible to stop (without uninstalling).
I don’t know this for sure, so please take this with a grain of salt (someone on the Execution team may be better able to confirm or deny this): while the idea behind Motoko’s one-way calls was indeed to make it possible to stop a canister with such calls still outstanding, I don’t think such behavior was ever implemented. IIUC, the precondition for stopping a canister is simply “there are no open call contexts”; and the precondition for closing a call context is “there are no open callbacks” (no checking whether the callback points to a valid continuation or to -1).
I.e. if you call into a (trusted or untrusted) canister that goes into (the equivalent of) an infinite call retry loop; or if your canister itself goes into such a loop; then your canister becomes (the wrong kind of) unstoppable.
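To illustrate (a minimal Rust sketch, assuming the pre-0.18 ic-cdk call API; the retry_forever method and the "ping" endpoint are hypothetical), this is the shape of a loop that keeps a call context open forever:

// Hypothetical sketch of the failure mode described above: while this loop
// runs there is always an open call context (and callback), so the canister
// can never complete the Stopping -> Stopped transition.
#[ic_cdk::update]
async fn retry_forever(target: candid::Principal) {
    loop {
        // Each await registers a callback. If the call is rejected (e.g.
        // with CanisterNotFound), we immediately retry, so the call context
        // never closes.
        let result: ic_cdk::api::call::CallResult<()> =
            ic_cdk::call(target, "ping", ()).await;
        if result.is_ok() {
            break;
        }
    }
}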
Best-effort calls will address this issue (at least the part where you’re waiting for another canister to respond; you can still shoot yourself in the foot by going into an infinite retry loop yourself) by ensuring that any call returns something (potentially a SYS_UNKNOWN reject) within at most 5 minutes.
Edit: It is indeed the case that a canister will only stop when it has no open call contexts AND no open callbacks (the latter is implied by the former, but it’s still better to be safe than sorry, I guess).
pub fn try_stop_canister(
    &mut self,
    is_expired: impl Fn(&StopCanisterContext) -> bool,
) -> (bool, Vec<StopCanisterContext>) {
    match self.status {
        // Canister is not stopping so we can skip it.
        CanisterStatus::Running { .. } | CanisterStatus::Stopped => (false, Vec::new()),
        // Canister is ready to stop.
        CanisterStatus::Stopping {
            ref call_context_manager,
            ref mut stop_contexts,
        } if call_context_manager.callbacks().is_empty()
            && call_context_manager.call_contexts().is_empty() =>
And since a callback is always created, even for Motoko’s one-way calls, the only way for the canister to stop is for said callback to be closed when a response is delivered.
You mean because that prevents the canister from making any more calls? If so, that’s a neat trick.
The reason why the system won’t prevent a stopping canister from making downstream calls is that trapping in an await (or even just synchronously failing) is likely to leave the average canister in an inconsistent state (e.g. holding a lock or whatnot). But this intentional emergency “forced stop” is very much preferable to having to uninstall the canister (which would lose all the canister state).
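To make the inconsistent-state concern concrete, here is a minimal hypothetical sketch (assuming ic-cdk; the lock and both method names are made up) of the common guard-across-an-await pattern that a forced failure would break:

use std::cell::RefCell;

thread_local! {
    // Hypothetical flag guarding a critical section across an await.
    static LOCKED: RefCell<bool> = RefCell::new(false);
}

#[ic_cdk::update]
async fn guarded_transfer(target: candid::Principal) {
    LOCKED.with(|l| *l.borrow_mut() = true); // take the "lock"
    // If the system force-failed the canister at this await (or if the
    // response simply never arrived), the release below would never run
    // and the lock would be held forever.
    let _: ic_cdk::api::call::CallResult<()> =
        ic_cdk::call(target, "do_transfer", ()).await;
    LOCKED.with(|l| *l.borrow_mut() = false); // release; skipped on trap
}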
I’d say it is rather critical to get a definitive answer on this, as my understanding is that there are multiple significant canisters in the wild (some even potentially blackholed) that depend on the merge from 4 years ago being effective and doing what it says it was intended to do.
Problem is, in order for this to work as advertised, Motoko would either have to be able to make a call without creating a callback at all (which it doesn’t; it creates a callback with an invalid index, as per the description of that PR); or there would need to be logic to allow stopping a canister with all-invalid callbacks (which there isn’t).
Meaning that if you do make such a “one-way” call to a canister that does not respond (and keeps spinning indefinitely), your canister becomes un-upgradeable.
UPDATE: It was brought to my attention that a canister can be upgraded without first stopping it. I was not aware of this possibility and always assumed that when people talked about “calls to untrusted canisters blocking upgrades” they meant that a canister with a pending call cannot be stopped and a canister that is not stopped cannot be upgraded. The former is true, the latter isn’t.
So if we now look at the possibility of upgrading a canister without first stopping it, then relying on something like Motoko’s one-way calls is safer on two accounts (see the sketch after the list):
There is zero possibility that executing the response after an upgrade will result in anything other than trapping (so it is guaranteed to be a no-op).
Canisters making one-way calls don’t have an await, so you never end up in an inconsistent state (as you might with an await that has extra logic after it, e.g. releasing a lock, if said await traps or simply never returns).
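For reference, a similar fire-and-forget pattern is available from Rust too (a minimal sketch, assuming ic-cdk’s notify, which enqueues a call without awaiting the response; the subscriber endpoint is made up). Note that, per the discussion above, the replica still creates a callback under the covers:

// Hypothetical sketch: fire-and-forget event delivery. There is no await in
// our own code, hence no post-await logic that could be skipped and leave
// the canister inconsistent.
fn publish_event(subscriber: candid::Principal, payload: Vec<u8>) {
    // The Err case only covers failing to *enqueue* the call (e.g. full
    // output queues); once enqueued, we never learn the outcome.
    if let Err(code) = ic_cdk::api::call::notify(subscriber, "on_event", (payload,)) {
        ic_cdk::println!("could not enqueue notification: {:?}", code);
    }
}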
That being said, it appears that you can always upgrade a canister regardless. There’s just a risk that you might need to sort through the aftermath (e.g. release locks and other resources held by ongoing calls).
This worked: dfx canister --network ic update-settings --freezing-threshold 90000000000 --confirm-very-long-freezing-threshold canistername
I don’t think these belong to our canister. We’ve been upgrading it with stop/upgrade/start almost every day for 2-3 weeks and only yesterday it got stuck. Thanks for the quick response. We will double check all async calls, it’s quite a fringe architecture.
How do calls to the management canister fall into the new messaging model? Are the same new message types available for them as for normal inter-canister calls? Are there any differences in the guarantees they provide or in the cycles that are getting reserved for them?
Everything should work the same as with regular canisters.
The management canister has its own set of input and output queues. If a best-effort message gets stuck in there for long enough, it will expire and be dropped. If a lot of huge messages are enqueued there, some of the largest ones might get shed. Any attached cycles are lost. (The caller controls whether a call, and thus the resulting request and response, are guaranteed response or best-effort. From the POV of the management canister, it’s just an incoming request or outgoing response like any other.)
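A minimal sketch of that last point, assuming ic-cdk 0.18’s Call builder (exact names may differ; check the current docs): the caller picks the flavor per call, management canister or not:

use candid::Principal;
use ic_cdk::call::Call;

// Sketch: the same management canister method invoked once as an
// unbounded-wait (guaranteed response) call and once as a bounded-wait
// (best-effort) call.
async fn fetch_randomness() {
    let mgmt = Principal::management_canister();

    // Guaranteed response: we will eventually get a reply or a reject,
    // however long that takes.
    let guaranteed = Call::unbounded_wait(mgmt, "raw_rand").await;

    // Best-effort: the request or response may expire or be shed under
    // load, in which case we get a SYS_UNKNOWN-style error instead.
    let best_effort = Call::bounded_wait(mgmt, "raw_rand").await;

    ic_cdk::println!("{} / {}", guaranteed.is_ok(), best_effort.is_ok());
}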
Announcing the General Availability of Best-Effort Calls
Best-effort calls (now known as “bounded-wait calls”, as opposed to the already existing “unbounded-wait calls”) are now generally available for use across the IC. The incremental rollout first saw OpenChat switching most of their traffic to bounded-wait calls; then various other dapps across application subnets; finally, system subnets had support for bounded-wait calls enabled a couple of weeks ago.
This has been a tremendous cross-team effort, with implementation driven by the Message Routing, SDK, Motoko and Execution teams, based on constant discussions with and inputs from a wide range of DFINITY and external dapp developers (Cross Chain, Financial Integration, NNS, OpenChat and many more).
How to Use Bounded-Wait Calls
First, if you think you might benefit from a concise refresher on the IC’s messaging model, you can refer to the top post of this thread. For more in-depth treatment, we strongly recommend consulting the reference documentation:
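As a quick taste, here is a minimal hypothetical sketch of a bounded-wait call from Rust (assuming ic-cdk 0.18’s Call builder; the counter canister and its increment method are made up):

use candid::Principal;
use ic_cdk::call::Call;

// Sketch: a bounded-wait call whose outcome may be unknown. Because
// "increment" is assumed idempotent here, it would be safe to retry it.
async fn bump_counter(counter: Principal) -> Option<u64> {
    match Call::bounded_wait(counter, "increment").await {
        Ok(response) => response.candid::<u64>().ok(),
        Err(err) => {
            // Inspect `err` to tell a definite reject (the call did not
            // execute) apart from an unknown outcome (SYS_UNKNOWN: it may
            // or may not have executed). Only idempotent calls should be
            // blindly retried after an unknown outcome.
            ic_cdk::println!("increment failed or timed out: {:?}", err);
            None
        }
    }
}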
There are two goals / functionality gaps that we are already working on addressing.
The first is reliable cycle transfers over bounded-wait calls. In the absence of delivery guarantees, cycles attached to bounded-wait calls can be lost if the message carrying them is dropped (because it timed out or because of high system load). While this is an acceptable risk for cycle amounts on the order of a message execution fee (e.g. when paying for the cost of a downstream call), it is problematic for larger cycle amounts (like loading a canister with cycles). So we need a reliable way to transfer cycles over bounded-wait calls, with any not delivered / not accepted cycles refunded to the caller.
Second is compatibility with / support for legacy, non-idempotent APIs. Safe and correct use of bounded-wait calls (or ingress messages) requires idempotent APIs: if the caller does not learn the outcome of a call, they must be able to issue the same call again; or have a side-channel they can use to learn the outcome (like, e.g. querying a ledger). For cases where canisters cannot (e.g. blackholed) or just will not provide an idempotent API, we must provide a way for callers primarily using bounded-wait calls to reliably interact with such legacy APIs.
We will follow up here as soon as we have something concrete.
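In the meantime, for the server side, here is one minimal hypothetical sketch of an idempotent API (a generic pattern, not an official building block; all names are made up): the caller supplies a request id, and a retry of the same id returns the recorded outcome instead of executing twice.

use std::cell::RefCell;
use std::collections::BTreeMap;
use candid::Principal;

thread_local! {
    // Hypothetical dedup table: (caller, request id) -> recorded outcome.
    // A real implementation would bound its size and expire old entries.
    static SEEN: RefCell<BTreeMap<(Principal, u64), u64>> =
        RefCell::new(BTreeMap::new());
}

#[ic_cdk::update]
fn transfer(request_id: u64, amount: u64) -> u64 {
    let caller = ic_cdk::api::caller();
    SEEN.with(|seen| {
        let mut seen = seen.borrow_mut();
        if let Some(&block) = seen.get(&(caller, request_id)) {
            return block; // duplicate: report the original outcome
        }
        let block = execute_transfer(amount); // hypothetical business logic
        seen.insert((caller, request_id), block);
        block
    })
}

fn execute_transfer(_amount: u64) -> u64 {
    0 // placeholder for the actual transfer logic
}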
Will boundary nodes be sending bounded-wait calls at some point? Either by direction (via the system call API) or by choice, to save cycles/processing? I’d imagine any idempotency strategy should assume so, but I was curious whether there are plans for that.
Has there been any formal discussion around a ubiquitous idempotency strategy as a best practice going forward? I hadn’t really thought about it, but it seems that most existing APIs may get a bit dicey, as you mention in your post. Even things like ICRC-1 transactions are going to be a pain to retry if you have to query the blocks and scan for your transaction (maybe by memo?) before retrying.
Bounded-wait vs. unbounded-wait only applies to canister-to-canister calls. Boundary nodes deal with ingress messages, which are different in significant ways.
E.g. I suppose you could say they look a lot like unbounded-wait calls, in that you just issue the call and then have to keep polling for arbitrarily long until it completes, in order to learn the outcome. OTOH, and more importantly, they share similar limitations with bounded-wait calls, in that there is no hard guarantee you will directly learn the outcome, whether because your network connection drops for 5+ minutes; the response gets dropped (and replaced with a plain Done) because of excessive load; clock skew; stalled subnets; etc.
If your question is “will ingress messages also get explicit timeouts”, then the answer is you can simply stop polling whenever you want to. I’m not particularly familiar with the various agent libraries and whether they already support this or not, but there isn’t much more that the protocol can do about timing out ingress messages, beyond maybe not bothering to keep around a message status after a while, as an optimization. As with bounded-wait calls, the call context can continue running indefinitely regardless.
There has, and @oggy has even put together a client library for upgrade-safe retry strategies: invoking both idempotent and non-idempotent APIs with bounded-wait calls; safe canister upgrades using bounded-wait calls; and even chaos-monkey testing of your canister’s bounded-wait calls. It’s in the process of being officially open sourced; there will be an update on this thread when it’s available.
Finally, there’s definitely more work to be done in this area, particularly on the server side: providing guidance and building blocks for implementing idempotent APIs. Especially since, as per the above, idempotent APIs are also crucially important for safely and correctly interacting with the IC via ingress messages.
Here’s the link if anyone’s interested in the Rust libraries that @free mentioned:
Note that this hasn’t been reviewed particularly thoroughly, so use it at your own risk for the moment. I hope to get a few more eyes on it in the next few days.