Canister Output Message Queue Limits and IC Management Canister Throttling Limits

OK, but in that case Paul’s reasoning applies: the increment will be rolled back and the code is correct. I still don’t understand how the code can break as observed in the stress test.

@roman-kashitsyn @dsarlis

A few follow-ups that just came to mind.

What about adding this API with a limit to the # of principals passed (say 1000 canister principals)? That’s ~33KB.

It’s a huge improvement from a cycles cost perspective, since then only one inter-canister call is required.
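
For concreteness, a rough sketch of what such a batched method might look like from a Motoko caller’s perspective (purely hypothetical: neither canister_statuses nor this result shape exists on the management canister today, and the status record is trimmed down):

    import Buffer "mo:base/Buffer";

    actor BatchStatusSketch {
      // Hypothetical result: one entry per requested canister; `null` marks
      // canisters whose status could not be fetched. The real canister_status
      // reply carries more fields (status, settings, module_hash, ...).
      public type StatusResult = {
        canister_id : Principal;
        status : ?{ cycles : Nat; memory_size : Nat };
      };

      // Imagined management-canister method; canister_statuses does NOT exist
      // today. 1000 principals is roughly the ~33KB of payload mentioned above.
      let ic = actor "aaaaa-aa" : actor {
        canister_statuses : { canister_ids : [Principal] } -> async [StatusResult];
      };

      public func checkAll(ids : [Principal]) : async [StatusResult] {
        // One inter-canister call (one output-queue slot) instead of one per canister.
        await ic.canister_statuses({ canister_ids = ids })
      };
    }

The point is simply that the whole batch would cost a single inter-canister call and a single output-queue slot, instead of one per canister.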

Is there a particular reason why this queue capacity limit is currently 500 and not higher (e.g., performance degradation or DDoS prevention)?

What about adding this API with a limit to the # of principals passed (say 1000 canister principals)? That’s ~33KB.

This would alleviate the issue of the result potentially being too large. But I think we would still need to handle what @bogwar mentioned.

Further to @dsarlis’s point regarding canister_status: these calls are passed to the subnetwork hosting the canister, so batching canister_status would not work without a lot of abstraction breaking and low-level machinery and bookkeeping (i.e., parsing the message, constructing messages for the different subnetworks, and then reconstructing the reply from the replies we would get).

Making this work is non-trivial. We could also limit this API to canisters on the same subnet, but at that point I’d question whether the API would really be useful.

Is there a particular reason why this queue capacity limit is currently 500 and not higher (e.g., performance degradation or DDoS prevention)?

I believe the main reason is to ensure that no single queue can take up all the space, so it’s more like DDoS prevention, as you mentioned. A limit has to be there IMO; we cannot have unbounded queues. Whether the value of 500 is too low, I’m not sure I can answer off the top of my head. We’d likely need to experiment with higher values and see whether they would be viable from a subnet-health point of view. Is there a specific value that would be preferable? 2x? 3x?

I see how this would break the existing endpoint interface and require a new endpoint or type definition, but I’m not sure I quite understand the heavy lifting involved in sending out multiple parallel asynchronous requests from the subnet and then collecting the response - isn’t the subnet good at handling this type of stuff when dealing with cross-subnet calls?

In order of preference: being able to batch principals into a single canister_status call would be best, as that would let me drastically reduce the canister output queue between my status-checking canister and the IC management canister while simultaneously reducing cycle costs.

If this isn’t possible, raising the output queue limit by 2x would be a good start, but I’d just be throwing more parallel requests at the management canister.

I probably don’t fully understand the plumbing involved in what @bogwar and you are referring to, but it feels like a batch canister_status method would actually reduce load on the management canister, as fewer status calls would be necessary.

The same goes for many canister management methods.
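
To make this concrete, here is a minimal sketch of the kind of fan-out-and-collect loop involved today (trimmed-down canister_status record, simplified error handling). Every request that has been sent but not yet answered occupies one slot in the output queue towards the management canister, which is where the 500-message cap bites:

    import Buffer "mo:base/Buffer";

    actor StatusChecker {
      // Trimmed-down view of the canister_status reply; the real record
      // carries more fields (status, settings, module_hash, ...).
      type Status = { cycles : Nat; memory_size : Nat };

      let ic = actor "aaaaa-aa" : actor {
        canister_status : { canister_id : Principal } -> async Status;
      };

      public func check(ids : [Principal]) : async [?Status] {
        // Fan out: enqueue one request per canister without awaiting yet.
        let futures = Buffer.Buffer<async Status>(ids.size());
        for (id in ids.vals()) {
          futures.add(ic.canister_status({ canister_id = id }));
        };
        // Collect: await each reply; a reject (full queue, stopped canister,
        // trap in the callee, ...) is recorded as null instead of trapping.
        let results = Buffer.Buffer<?Status>(ids.size());
        for (f in futures.vals()) {
          try { results.add(?(await f)) } catch (_) { results.add(null) };
        };
        Buffer.toArray(results)
      };
    }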

in sending out multiple parallel asynchronous requests from the subnet and then collecting the response - isn’t the subnet good at handling this type of stuff when dealing with cross-subnet calls?

What you’re asking for is an endpoint that, when called, fans out to multiple messages and then collects the results. This would likely require keeping some call context on the management canister for these requests, as well as some logic to determine which messages need to be sent to which other subnets (based on the target principals). It’s not that we can’t do it, but it would require quite some work. Also, where things get a bit hairy is that routing messages to other subnets normally happens in the message routing layer, but in this case the management canister would probably need to do some of it (hence Bogdan’s comment on abstraction breaking).

As I said earlier in the thread, it’s a nice idea, but implementing it might be trickier than it initially seems. I can bring it up with some folks internally and see if we can sketch something out to gauge the effort level more concretely.


Caution is indeed needed if the solution to the issue is too costly.
A more conservative approach would be to:

  1. Raise the queue limit to 1000; 500 is really too low.
  2. Provide a system method that allows a canister to query its output queue’s currently available capacity, so that developers can throttle their calls without running into the limit (see the sketch below).
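
For illustration, a minimal sketch of how point 2 could be used, assuming a purely hypothetical availableOutputCapacity helper (no such system API exists today; the name, signature and hard-coded value are invented):

    import Iter "mo:base/Iter";
    import Principal "mo:base/Principal";

    actor Throttled {
      // Stand-in for the proposed system method: "how many more messages fit
      // into my output queue towards this target right now?". Purely
      // hypothetical; the hard-coded value is only here so the sketch compiles.
      func availableOutputCapacity(_target : Principal) : Nat {
        500
      };

      public func fanOut(target : actor { inc : () -> async () }, n : Nat) : async Nat {
        var sent = 0;
        label send for (_ in Iter.range(1, n)) {
          // Stop enqueueing before the queue fills up, instead of reacting
          // to a throw after the fact.
          if (availableOutputCapacity(Principal.fromActor(target)) == 0) break send;
          ignore target.inc();
          sent += 1;
        };
        sent
      };
    }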

I was looking further into queues and found DEFAULT_OUTPUT_REQUEST_LIFETIME.

There’s also a reference to a deadline and to requests “timing out”.

Is there a certain point at which inter-canister messages could be dropped (timed out) if the deadline has passed? Let’s say I have a canister or subnet that’s completely overloaded with messages and has a bunch of these messages in my canister’s output queue. What happens to these messages once the 300-second deadline has passed?


Note that the protocol guarantees that each request will receive a response, so requests are never dropped. Instead, the system produces synthetic reject responses whenever a request cannot be delivered/processed.

More precisely, for timeouts this means that for all requests that don’t make it into the subnet-to-subnet stream before the deadline has passed (and that were therefore not visible to anybody except the sending canister), the system will produce a synthetic reject response, in the same way as it would when, e.g., the receiver’s input queue is full, the receiving canister/subnet is at its memory limits, the receiver is stopped/frozen/traps while processing, and potentially other cases I might be missing now. So everything can be handled in the same way as in those cases, and canisters shouldn’t be required to explicitly handle these timeout cases; that is, assuming they handle the other cases correctly, they already implicitly handle timeouts, as a timeout is also just an asynchronous reject with reject code “transient”.

Requests that are already in streams currently don’t time out, as this would require more work to implement without breaking the messaging guarantees given by the IC (essentially, once messages are in streams we don’t know whether they have been seen/picked up by other subnets/canisters, and therefore we no longer have an easy way to time them out). Finally, for responses, the protocol guarantees delivery, so these won’t time out at all, irrespective of whether they are in streams or in queues.
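
To illustrate with a minimal sketch (the callee and its doWork method are placeholders for any inter-canister call you already make): a request that times out surfaces in the same catch branch as every other reject, so no timeout-specific code path is needed.

    import Error "mo:base/Error";
    import Debug "mo:base/Debug";

    actor Caller {
      // `callee` and `doWork` are placeholders for any canister and method
      // you already call.
      public func callOnce(callee : actor { doWork : () -> async Nat }) : async ?Nat {
        try {
          ?(await callee.doWork())
        } catch (e) {
          // A request that timed out before reaching the stream shows up here
          // as a synthetic reject, exactly like a full input queue, a stopped
          // callee or a trap in the callee; no timeout-specific branch needed.
          Debug.print("call rejected: " # Error.message(e));
          null
        }
      };
    }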


Hi @bitbruce, the changes needed to also allow calls to self to use the full queue capacity have meanwhile been made and are part of the release elected to be rolled out this week. Hope this helps make things smoother for you and other devs.


Great work!
We will be working on the canister code upgrade.


For those following this topic thread,

Tomorrow @dsarlis will be giving a talk in the Scalability & Performance working group on “Canister queues (ingress, input and output) overview and case studies of handling large message volumes” - Technical Working Group: Scalability & Performance - #33 by domwoe


Motoko 0.8.1 is out now. The previous release, 0.8.0, introduced the following (breaking) change to let users detect and handle message send failures using try ... catch expressions. These releases are not yet included in dfx to allow time for users to provide feedback.

  • BREAKING CHANGE

    Failure to send a message no longer traps but, instead, throws a catchable Error with new error code #call_error (#3630).

    On the IC, the act of making a call to a canister function can fail, so that the call cannot (and will not) be performed.
    This can happen due to a lack of canister resources, typically because the local message queue for the destination canister is full,
    or because performing the call would reduce the current cycle balance of the calling canister to a level below its freezing threshold.
    Such call failures are now reported by throwing an Error with new ErrorCode #call_error { err_code = n },
    where n is the non-zero err_code value returned by the IC.
    Like other errors, call errors can be caught and handled using try ... catch ... expressions, if desired.

    The constructs that now throw call errors, instead of trapping as in previous versions of Motoko, are:

    • calls to shared functions (including oneway functions that return ()).
      These involve sending a message to another canister, and can fail when the queue for the destination canister is full.
    • calls to local functions with return type async. These involve sending a message to self, and can fail when the local queue for sends to self is full.
    • async expressions. These involve sending a message to self, and can fail when the local queue for sends to self is full.
    • await expressions. These can fail on awaiting an already completed future, which requires sending a message to self to suspend and commit state.

    (On the other hand, async* (being delayed) cannot throw, and evaluating await* will at most propagate an error from its argument but not, in itself, throw.)

    Note that exiting a function call via an uncaught throw, rather than a trap, will commit any state changes and currently queued messages.
    The previous behaviour of trapping would, instead, discard such changes.

    To appreciate the change in semantics, consider the following example:

    actor { 
      var count = 0; 
      public func inc() : async () { 
        count += 1; 
      }; 
      public func repeat() : async () { 
        loop { 
          ignore inc(); 
        } 
      }; 
      public func repeatUntil() : async () { 
        try { 
          loop { 
           ignore inc(); 
          } 
        } catch (e) { 
        } 
      }; 
    } 
    

    In previous releases of Motoko, calling repeat() and repeatUntil() would trap, leaving count at 0, because
    each infinite loop would eventually exhaust the message queue and issue a trap, rolling back the effects of each call.
    With this release of Motoko, calling repeat() will enqueue several inc() messages (around 500), then throw an Error
    and exit with the error result, incrementing the count several times (asynchronously).
    Calling repeatUntil() will also enqueue several inc() messages (around 500) but the error is caught so the call returns,
    still incrementing count several times (asynchronously).

    The previous semantics of trapping on call errors can be enabled with compiler option --trap-on-call-error, if desired,
    or selectively emulated by forcing a trap (e.g. assert false) when an error is caught.

    For example,

      public func allOrNothing() : async () { 
        try { 
          loop { 
           ignore inc(); 
          } 
        } catch (e) { 
          assert false; // trap! 
        } 
      }; 
    

    Calling allOrNothing() will not send any messages: the loop exits with an error on queue full,
    the error is caught, but assert false traps so all queued inc() messages are aborted.
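
Not part of the release notes above, but as a rough sketch of how the new error code might be inspected, assuming the #call_error { err_code = n } shape described there (the inc callee is a placeholder):

    import Error "mo:base/Error";
    import Debug "mo:base/Debug";

    actor CallErrorDemo {
      // `target` and `inc` are placeholders; any oneway-style call will do.
      public func sendMany(target : actor { inc : () -> async () }, n : Nat) : async Nat {
        var sent = 0;
        try {
          var i = 0;
          while (i < n) {
            ignore target.inc(); // throws once the output queue is full
            sent += 1;
            i += 1;
          };
        } catch (e) {
          switch (Error.code(e)) {
            // Shape as described in the release note above.
            case (#call_error { err_code }) {
              Debug.print("send failed with err_code " # debug_show (err_code));
            };
            case (_) {
              Debug.print("other error: " # Error.message(e));
            };
          };
        };
        sent // number of inc() messages actually enqueued before the failure
      };
    }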


Hey Claudio, to clarify: this breaking change only changes how output queue overflow errors are handled, correct? (They need to be caught and explicitly trapped if the developer wishes to discard the intermediate state rather than have it committed.)

My understanding is that other inter-canister or non-output-queue async/await errors already exhibit this commit-the-intermediate-state behavior.

…Or is this a complete overhaul of how all async errors are handled?

Correct.

If a call itself returns an error (because the message is rejected or the callee traps), the caller’s state will, by necessity, already have been committed (otherwise the call would never have been issued).
