Scalable Messaging Model

I tried to upgrade one of our canisters that makes guaranteed-response calls, and stopping it now takes much longer than before: ~6 minutes, where it used to take ~10 seconds. I wonder whether this is related to the recent changes, or just the subnet being overloaded right now.

EDIT: actually, it has not stopped for an hour now

1 Like

I’d like to confirm that while one-way calls may not officially “exist”, from Motoko they do seem to exist and can be used. I’d imagine that under the covers the replica reserves cycles for them as usual, so you likely have to account for those, but they won’t block you from upgrading Motoko canisters: Zero-downtime upgrades of Internet Computer canisters – Blog – Joachim Breitner's Homepage

(This is a very old article; I’m including it to get confirmation that nothing has changed.)

We use one-shots like this extensively in the ICRC-72 Event system to guarantee that untrusted canisters can’t block our upgrades.

What’s the canister status?

togwv-zqaaa-aaaal-qr7aa-cai
It’s stuck in ‘stopping’ phase
It is not sending calls to untrusted canisters. Stopping/deploying/starting worked without problems for a long while until today.

Hmm, the togwv-zqaaa-aaaal-qr7aa-cai canister is on the e66qm subnet, which is functioning normally: https://dashboard.internetcomputer.org/subnet/e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe

A canister will only be unable to stop if it has outstanding calls. (Or if there was a bug in the replica implementation, but this has been working fine for years now and I don’t recall any recent related changes.) You don’t have to call an untrusted canister, it can be any canister, as long as it doesn’t return. It can even be your canister itself, stuck in a retry loop (e.g. forever retrying a CanisterNotFound): a Stopping canister can still make outgoing calls, it only refuses to accept incoming calls.

Looking at the subnet metrics (we don’t have per-canister metrics, there would be just too many of those) there appear to be at least a handful of call contexts older than 8 weeks, consistently making almost 2 calls per second between them. It’s impossible to say whether this is your canister or a different one, but whoever is making those calls in a loop will be impossible to stop (without uninstalling).

I don’t know this for sure, so please take this with a grain of salt (someone on the Execution team may be better able to confirm or deny this): while the idea behind Motoko’s one-way calls was indeed to make it possible to stop a canister with such calls still outstanding, I don’t think such behavior was ever implemented. IIUC, the precondition for stopping a canister is simply “there are no open call contexts”; and the precondition for closing a call context is “there are no open callbacks” (without checking whether the callback points to a valid continuation or to -1).

I.e. if you call into a (trusted or untrusted) canister that goes into (the equivalent of) an infinite call retry loop; or if your canister itself goes into such a loop; then your canister becomes (the wrong kind of) unstoppable.

Best-effort calls will address this issue (at least the part where you’re waiting for another canister to respond; you can still shoot yourself in the foot by going into an infinite retry loop yourself) by ensuring that any call returns something (potentially a SYS_UNKNOWN reject) within at most 5 minutes.

Edit: It is indeed the case that a canister will only stop when it has no open call contexts AND no open callbacks (the latter is implied by the former, but it’s still better to be safe than sorry, I guess).

    pub fn try_stop_canister(
        &mut self,
        is_expired: impl Fn(&StopCanisterContext) -> bool,
    ) -> (bool, Vec<StopCanisterContext>) {
        match self.status {
            // Canister is not stopping so we can skip it.
            CanisterStatus::Running { .. } | CanisterStatus::Stopped => (false, Vec::new()),

            // Canister is ready to stop.
            CanisterStatus::Stopping {
                ref call_context_manager,
                ref mut stop_contexts,
            } if call_context_manager.callbacks().is_empty()
                && call_context_manager.call_contexts().is_empty() =>

And since a callback is always created, even for Motoko’s one-way calls, the only way for the canister to stop is for said callback to be closed when a response is delivered.

1 Like

The usual trick is to set the freezing threshold really high. Then uninstalling isn’t required.

2 Likes

You mean because that prevents the canister from making any more calls? If so, that’s a neat trick.

The reason why the system won’t prevent a stopping canister from making downstream calls is that trapping in an await (or even just synchronously failing) is likely to leave the average canister in an inconsistent state (e.g. holding a lock or whatnot). But this intentional emergency “forced stop” is very much preferable to having to uninstall the canister (which would lose all the canister state).

There was at least an effort to implement this on the Motoko side (Use an invalid table index as the callback for one-way functions by nomeata · Pull Request #2942 · dfinity/motoko · GitHub). Can @ggreif or @claudio confirm what Motoko is doing in this case today? Perhaps this was refactored recently on the replica side to subvert what the previous behavior may have been?

I’d say it is rather critical to get a definitive, mutually agreed answer on this, as my understanding is that there are multiple significant canisters in the wild (some even potentially blackholed) that depend on that merge from four years ago being effective and doing what it says it was intended to do.

1 Like

I’m pretty sure it is implemented in Motoko.

Problem is, in order for this to work as advertised, Motoko would either have to be able to make a call without creating a callback at all (which it doesn’t; it creates a callback with an invalid index, as per the description of that PR); or there would need to be logic to allow stopping a canister with all-invalid callbacks (which there isn’t).

Meaning that if you do make such a “one-way” call to a canister that does not respond (and keeps spinning indefinitely) your canister becomes un-upgradeable.

UPDATE: It was brought to my attention that a canister can be upgraded without first stopping it. I was not aware of this possibility and always assumed that when people talked about “calls to untrusted canisters blocking upgrades” they meant that a canister with a pending call cannot be stopped and a canister that is not stopped cannot be upgraded. The former is true, the latter isn’t.

So if we now look at the possibility of upgrading a canister without first stopping it, then relying on something like Motoko’s one-way calls is safer on two counts:

  • There is zero possibility that executing the response after an upgrade will result in anything other than trapping (and is thus guaranteed to be a no-op).
  • Canisters making one-way calls don’t have an await, so you never end up in an inconsistent state (as you might with an await with extra logic (e.g. releasing a lock) after it; and said await traps or just never returns).

That being said, it appears that you can always upgrade a canister regardless. There’s just a risk that you might need to sort through the aftermath (e.g. release locks and other resources held by ongoing calls).

This worked
    dfx canister --network ic update-settings --freezing-threshold 90000000000 --confirm-very-long-freezing-threshold canistername

I don’t think these belong to our canister. We’ve been upgrading it with stop/upgrade/start almost every day for 2-3 weeks and only yesterday it got stuck. Thanks for the quick response. We will double check all async calls, it’s quite a fringe architecture.

I can confirm that Motoko still uses the technique of implementing one-ways by passing an invalid table index for the callbacks.

If that no longer works on the platform then that would be good to know, but I assume it still does.

1 Like

This was released in Motoko 0.14.2 and is part of dfx 0.25.1-beta1.

Cheers,

— Gabor

3 Likes

How do calls to the management canister fall into the new messaging model? Are the same new message types available for them as for normal inter-canister calls? Are there any differences in the guarantees they provide or in the cycles that are getting reserved for them?

Everything should work the same as with regular canisters.

The management canister has its own set of input and output queues. If a best-effort message gets stuck in there for long enough, it will expire and be dropped. If a lot of huge messages are enqueued there, some of the largest ones might get shed. Any attached cycles are lost. (The caller controls whether a call, and thus the resulting request and response, are guaranteed response or best-effort. From the POV of the management canister, it’s just an incoming request or outgoing response like any other.)

So in general best-effort messages can lose the cycles that are attached to them?

Probably not surprising, but still good to explicitly memorise it. Thanks!

Announcing the General Availability of Best-Effort Calls

Best-effort calls (now known as “bounded-wait calls”, as opposed to the already existing “unbounded-wait calls”) are now generally available for use across the IC. The incremental rollout first saw OpenChat switching most of their traffic to bounded-wait calls; then various other dapps across application subnets; finally, system subnets had support for bounded-wait calls enabled a couple of weeks ago.

This has been a tremendous cross-team effort, with implementation driven by the Message Routing, SDK, Motoko and Execution teams, based on constant discussions with and inputs from a wide range of DFINITY and external dapp developers (Cross Chain, Financial Integration, NNS, OpenChat and many more).

How to Use Bounded-Wait Calls

First, if you think you might benefit from a concise refresher on the IC’s messaging model, you can refer to the top post of this thread. For more in-depth treatment, we strongly recommend consulting the reference documentation.

Next, make sure you have Motoko 0.14.2 or newer; or ic_cdk 0.18.0 or newer, if you code your canisters in Rust.

Then, to issue a bounded-wait call with a 1-minute timeout, all you need to do is:

  • Motoko
    await (with timeout = 60) Counter.set(new_value)
    
  • Rust
    Call::bounded_wait(counter, "set")
        .with_arg(&new_value)
        .change_timeout(60)
        .await
    

For details, check out Motoko async data or Rust inter-canister calls documentation.

What’s Next

There are two goals / functionality gaps that we are already working on addressing.

The first is reliable cycle transfers over bounded-wait calls. In the absence of delivery guarantees, cycles attached to bounded-wait calls can be lost if the message carrying them is dropped (because it timed out or because of high system load). While this is an acceptable risk for cycle amounts on the order of a message execution fee (e.g. when paying for the cost of a downstream call), it is problematic for larger cycle amounts (like loading a canister with cycles). So we need a reliable way to transfer cycles over bounded-wait calls, with any not delivered / not accepted cycles refunded to the caller.

Second is compatibility with / support for legacy, non-idempotent APIs. Safe and correct use of bounded-wait calls (or ingress messages) requires idempotent APIs: if the caller does not learn the outcome of a call, they must be able to issue the same call again; or have a side channel they can use to learn the outcome (e.g. querying a ledger). For cases where canisters cannot (e.g. because they are blackholed) or simply will not provide an idempotent API, we must provide a way for callers primarily using bounded-wait calls to reliably interact with such legacy APIs.

We will follow up here as soon as we have something concrete.

4 Likes