Canister Output Message Queue Limits and IC Management Canister Throttling Limits

OK, but in that case Paul’s reasoning applies: the increment will be rolled back and the code is correct. I still don’t understand how the code can break as observed in the stress test.

@roman-kashitsyn @dsarlis

A few follow-ups that just came to mind.

What about adding this API with a limit to the # of principals passed (say 1000 canister principals)? That’s ~33KB.

It’s a huge improvement from a cycles cost perspective, since then only one inter-canister call is required.
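
For concreteness, a rough sketch of what such a batched method might look like from a Motoko caller’s perspective (purely hypothetical: neither canister_statuses nor this result shape exists on the management canister today, and the status record is trimmed down):

    import Buffer "mo:base/Buffer";

    actor BatchStatusSketch {
      // Hypothetical result: one entry per requested canister; `null` marks
      // canisters whose status could not be fetched. The real canister_status
      // reply carries more fields (status, settings, module_hash, ...).
      public type StatusResult = {
        canister_id : Principal;
        status : ?{ cycles : Nat; memory_size : Nat };
      };

      // Imagined management-canister method; canister_statuses does NOT exist
      // today. 1000 principals is roughly the ~33KB of payload mentioned above.
      let ic = actor "aaaaa-aa" : actor {
        canister_statuses : { canister_ids : [Principal] } -> async [StatusResult];
      };

      public func checkAll(ids : [Principal]) : async [StatusResult] {
        // One inter-canister call (one output-queue slot) instead of one per canister.
        await ic.canister_statuses({ canister_ids = ids })
      };
    }

The point is simply that the whole batch would cost a single inter-canister call and a single output-queue slot, instead of one per canister.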

Is there a particular reason why this queue capacity limit is currently 500 and not higher (e.g., performance degradation or DDoS prevention)?

What about adding this API with a limit to the # of principals passed (say 1000 canister principals)? That’s ~33KB.

This would alleviate the issue of the result potentially being too large. But I think we would still need to handle what @bogwar mentioned.

Further to @dsarlis’s point regarding canister_status: these calls are passed to the subnetwork hosting the canister, so batching canister_status would not work without a lot of abstraction breaking and low-level machinery and bookkeeping (i.e., parsing the message, constructing messages for the different subnetworks, and then reconstructing the reply from the replies we would get).

Making this work is non-trivial. We could also limit this API to canisters on the same subnet, but at that point I’d question whether the API would really be useful.

Is there a particular reason why this queue capacity limit is currently 500 and not higher (e.g., performance degradation or DDoS prevention)?

I believe the main reason is to ensure that no single queue can take up all the space, so it’s more like DDoS prevention, as you mentioned. A limit has to be there IMO; we cannot have unbounded queues. Whether the value of 500 is too low, I’m not sure I can answer off the top of my head. We’d likely need to experiment with higher values and see whether they would be viable from a subnet-health point of view. Is there a specific value that would be preferable? 2x? 3x?

I see how this would break the existing endpoint interface and require a new endpoint or type definition, but I’m not sure I quite understand the heavy lifting involved in sending out multiple parallel asynchronous requests from the subnet and then collecting the response - isn’t the subnet good at handling this type of stuff when dealing with cross-subnet calls?

In order of preference: being able to batch principals into a single canister_status call would be best, as that would let me drastically reduce the canister output queue between my status-checking canister and the IC management canister while simultaneously reducing cycle costs.

If this isn’t possible, raising the output queue limit by 2x would be a good start, but I’d just be throwing more parallel requests at the management canister.

I probably don’t fully understand the plumbing involved in what @bogwar and you are referring to, but it feels like a batch canister_status method would actually reduce load on the management canister, as fewer status calls would be necessary.

The same goes for many canister management methods.
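
To make this concrete, here is a minimal sketch of the kind of fan-out-and-collect loop involved today (trimmed-down canister_status record, simplified error handling). Every request that has been sent but not yet answered occupies one slot in the output queue towards the management canister, which is where the 500-message cap bites:

    import Buffer "mo:base/Buffer";

    actor StatusChecker {
      // Trimmed-down view of the canister_status reply; the real record
      // carries more fields (status, settings, module_hash, ...).
      type Status = { cycles : Nat; memory_size : Nat };

      let ic = actor "aaaaa-aa" : actor {
        canister_status : { canister_id : Principal } -> async Status;
      };

      public func check(ids : [Principal]) : async [?Status] {
        // Fan out: enqueue one request per canister without awaiting yet.
        let futures = Buffer.Buffer<async Status>(ids.size());
        for (id in ids.vals()) {
          futures.add(ic.canister_status({ canister_id = id }));
        };
        // Collect: await each reply; a reject (full queue, stopped canister,
        // trap in the callee, ...) is recorded as null instead of trapping.
        let results = Buffer.Buffer<?Status>(ids.size());
        for (f in futures.vals()) {
          try { results.add(?(await f)) } catch (_) { results.add(null) };
        };
        Buffer.toArray(results)
      };
    }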

in sending out multiple parallel asynchronous requests from the subnet and then collecting the response - isn’t the subnet good at handling this type of stuff when dealing with cross-subnet calls?

What you’re asking for is an endpoint that, when called, fans out to multiple messages and then collects the results. This would likely require keeping some call context on the management canister for these requests, as well as some logic to determine which messages need to be sent to which other subnets (based on the target principals). It’s not that we can’t do it, but it would require quite some work. Also, where things get a bit hairy is that routing messages to other subnets normally happens in the message routing layer, but in this case the management canister would probably need to do some of it (hence Bogdan’s comment on abstraction breaking).

As I said earlier in the thread, it’s a nice idea, but implementing it might be trickier than it initially seems. I can bring it up with some folks internally and see if we can sketch something out to gauge the effort level more concretely.


Caution is indeed needed if the solution to the issue is too costly.
A more conservative approach would be to:

  1. Raise the queue limit to 1000; 500 is really too low.
  2. Provide a system method that allows a canister to query its output queue’s currently available capacity, so that developers can throttle their calls without running into the limit (see the sketch below).
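
For illustration, a minimal sketch of how point 2 could be used, assuming a purely hypothetical availableOutputCapacity helper (no such system API exists today; the name, signature and hard-coded value are invented):

    import Iter "mo:base/Iter";
    import Principal "mo:base/Principal";

    actor Throttled {
      // Stand-in for the proposed system method: "how many more messages fit
      // into my output queue towards this target right now?". Purely
      // hypothetical; the hard-coded value is only here so the sketch compiles.
      func availableOutputCapacity(_target : Principal) : Nat {
        500
      };

      public func fanOut(target : actor { inc : () -> async () }, n : Nat) : async Nat {
        var sent = 0;
        label send for (_ in Iter.range(1, n)) {
          // Stop enqueueing before the queue fills up, instead of reacting
          // to a throw after the fact.
          if (availableOutputCapacity(Principal.fromActor(target)) == 0) break send;
          ignore target.inc();
          sent += 1;
        };
        sent
      };
    }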

I was looking further into queues and found DEFAULT_OUTPUT_REQUEST_LIFETIME.

There’s also a reference to a deadline and to requests “timing out”.

Is there a certain point at which inter-canister messages could be dropped (timed out) if the deadline has passed? Let’s say I have a canister or subnet that’s completely overloaded with messages and has a bunch of these messages in my canister’s output queue. What happens to these messages once the 300-second deadline has passed?


Note that the protocol guarantees that each request will receive a response, so requests are never dropped. Instead, the system produces synthetic reject responses whenever a request cannot be delivered/processed.

More precisely, for timeouts this means that for all requests that don’t make it into the subnet-to-subnet stream before the deadline has passed (and that were therefore not visible to anybody except the sending canister), the system will produce a synthetic reject response, in the same way as it would when, e.g., the receiver’s input queue is full, the receiving canister/subnet is at its memory limits, the receiver is stopped/frozen/traps while processing, and potentially other cases I might be missing now. So everything can be handled in the same way as in those cases, and canisters shouldn’t be required to explicitly handle these timeout cases; that is, assuming they handle the other cases correctly, they already implicitly handle timeouts, as a timeout is also just an asynchronous reject with reject code “transient”.

Requests that are already in streams currently don’t time out, as this would require more work to implement without breaking the messaging guarantees given by the IC (essentially, once messages are in streams we don’t know whether they have been seen/picked up by other subnets/canisters, and therefore we no longer have an easy way to time them out). Finally, for responses, the protocol guarantees delivery, so these won’t time out at all, irrespective of whether they are in streams or in queues.
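
To illustrate with a minimal sketch (the callee and its doWork method are placeholders for any inter-canister call you already make): a request that times out surfaces in the same catch branch as every other reject, so no timeout-specific code path is needed.

    import Error "mo:base/Error";
    import Debug "mo:base/Debug";

    actor Caller {
      // `callee` and `doWork` are placeholders for any canister and method
      // you already call.
      public func callOnce(callee : actor { doWork : () -> async Nat }) : async ?Nat {
        try {
          ?(await callee.doWork())
        } catch (e) {
          // A request that timed out before reaching the stream shows up here
          // as a synthetic reject, exactly like a full input queue, a stopped
          // callee or a trap in the callee; no timeout-specific branch needed.
          Debug.print("call rejected: " # Error.message(e));
          null
        }
      };
    }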


Hi @bitbruce, the changes needed to also allow calls to self to use the full queue capacity have meanwhile been made and are part of the release elected to be rolled out this week. Hope this helps make things smoother for you and other devs.


Great work!
We will be working on the canister code upgrade.


For those following this topic thread,

Tomorrow @dsarlis will be giving a talk in the Scalability & Performance working group on “Canister queues (ingress, input and output) overview and case studies of handling large message volumes” - Technical Working Group: Scalability & Performance - #33 by domwoe


Motoko 0.8.1 is out now. The previous release, 0.8.0, introduced the following (breaking) change to let users detect and handle message send failures using try ... catch expressions. These releases are not yet included in dfx to allow time for users to provide feedback.

  • BREAKING CHANGE

    Failure to send a message no longer traps but, instead, throws a catchable Error with new error code #call_error (#3630).

    On the IC, the act of making a call to a canister function can fail, so that the call cannot (and will not) be performed.
    This can happen due to a lack of canister resources, typically because the local message queue for the destination canister is full,
    or because performing the call would reduce the current cycle balance of the calling canister to a level below its freezing threshold.
    Such call failures are now reported by throwing an Error with new ErrorCode #call_error { err_code = n },
    where n is the non-zero err_code value returned by the IC.
    Like other errors, call errors can be caught and handled using try ... catch ... expressions, if desired.

    The constructs that now throw call errors, instead of trapping as in previous versions of Motoko, are:

    • calls to shared functions (including oneway functions that return ()).
      These involve sending a message to another canister, and can fail when the queue for the destination canister is full.
    • calls to local functions with return type async. These involve sending a message to self, and can fail when the local queue for sends to self is full.
    • async expressions. These involve sending a message to self, and can fail when the local queue for sends to self is full.
    • await expressions. These can fail on awaiting an already completed future, which requires sending a message to self to suspend and commit state.

    (On the other hand, async* (being delayed) cannot throw, and evaluating await* will at most propagate an error from its argument but not, in itself, throw.)

    Note that exiting a function call via an uncaught throw, rather than a trap, will commit any state changes and currently queued messages.
    The previous behaviour of trapping would, instead, discard such changes.

    To appreciate the change in semantics, consider the following example:

    actor { 
      var count = 0; 
      public func inc() : async () { 
        count += 1; 
      }; 
      public func repeat() : async () { 
        loop { 
          ignore inc(); 
        } 
      }; 
      public func repeatUntil() : async () { 
        try { 
          loop { 
           ignore inc(); 
          } 
        } catch (e) { 
        } 
      }; 
    } 
    

    In previous releases of Motoko, calling repeat() and repeatUntil() would trap, leaving count at 0, because
    each infinite loop would eventually exhaust the message queue and issue a trap, rolling back the effects of each call.
    With this release of Motoko, calling repeat() will enqueue several inc() messages (around 500), then throw an Error
    and exit with the error result, incrementing the count several times (asynchronously).
    Calling repeatUntil() will also enqueue several inc() messages (around 500) but the error is caught so the call returns,
    still incrementing count several times (asynchronously).

    The previous semantics of trapping on call errors can be enabled with compiler option --trap-on-call-error, if desired,
    or selectively emulated by forcing a trap (e.g. assert false) when an error is caught.

    For example,

      public func allOrNothing() : async () { 
        try { 
          loop { 
           ignore inc(); 
          } 
        } catch (e) { 
          assert false; // trap! 
        } 
      }; 
    

    Calling allOrNothing() will not send any messages: the loop exits with an error on queue full,
    the error is caught, but assert false traps so all queued inc() messages are aborted.
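
Not part of the release notes above, but as a rough sketch of how the new error code might be inspected, assuming the #call_error { err_code = n } shape described there (the inc callee is a placeholder):

    import Error "mo:base/Error";
    import Debug "mo:base/Debug";

    actor CallErrorDemo {
      // `target` and `inc` are placeholders; any oneway-style call will do.
      public func sendMany(target : actor { inc : () -> async () }, n : Nat) : async Nat {
        var sent = 0;
        try {
          var i = 0;
          while (i < n) {
            ignore target.inc(); // throws once the output queue is full
            sent += 1;
            i += 1;
          };
        } catch (e) {
          switch (Error.code(e)) {
            // Shape as described in the release note above.
            case (#call_error { err_code }) {
              Debug.print("send failed with err_code " # debug_show (err_code));
            };
            case (_) {
              Debug.print("other error: " # Error.message(e));
            };
          };
        };
        sent // number of inc() messages actually enqueued before the failure
      };
    }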


Hey Claudio, to clarify: this breaking change only changes how output queue overflow errors are handled, correct? (They need to be caught and explicitly trapped if the developer wishes to discard the intermediate state rather than have it committed.)

My understanding is that other inter-canister or non-output-queue async/await errors already exhibit this commit-the-intermediate-state behavior.

…Or is this a complete overhaul of how all async errors are handled?

Correct.

If a call itself returns an error (because the message is rejected or the callee traps), the caller’s state will, by necessity, already have been committed (otherwise the call would never have been issued).
