Canister Output Message Queue Limits and IC Management Canister Throttling Limits

This makes sense, thanks @dsarlis and @bogwar for the additional context and explanation!

Instead of calls coming from a single canister, what if I had 500 canisters that are all batching calls (300 at a time) to the IC Management canister? What if this number of canisters making batch calls was raised to 10,000 canisters?

I guess what I’m trying to get at is, does the IC Management canister have any load limitations? I’ve heard that technically the IC Management canister is not a “canister”, so I’m curious about how it balances or queues up load.

@icme It’s not that different from hitting any other canister with this load. The queues are between pairs of canisters, as mentioned earlier in the thread. This means that if you have N canisters trying to hit the management canister, then the management canister will have N queues, each holding the incoming messages from one of the N canisters. We do not have a limit on N, but, as we’ve said, each queue has a default capacity of 500 messages. The next limit you might hit is the subnet message memory capacity.

Now, there are some more technicalities if you want to go deeper (I’m unclear how much of this is theoretical or whether you have actual use cases in mind). E.g. if your target canisters are on different subnets, then hitting the management canister means that the messages are eventually routed to each subnet hosting a target canister, so you get some more capacity that way (basically you take up the queue for the management canister on each of those subnets). Also, if you are doing install_code messages, we apply an extra rate limit on them if they’ve consumed too many instructions.


@roman-kashitsyn

With the REPL I linked here, I can’t even get close to 500 messages. Can you explain why that is (and maybe answer my other questions :angel: )?

And when running this REPL with 200 as an argument, the following error appears.

Server returned an error:
Code: 400 ()
Body: Specified ingress_expiry not within expected range:
Minimum allowed expiry: 2022-10-21 15:32:10.019462710 UTC
Maximum allowed expiry: 2022-10-21 15:37:40.019462710 UTC
Provided expiry: 2022-10-21 15:32:08.119 UTC
Local replica time: 2022-10-21 15:32:10.019464171 UTC

How do I interpret that?

It looks as if the agent signs the request for the state read (i.e. for polling the response) at the time it makes the call, but when the call takes a while (e.g. a loop around await) it does not extend the expiry in that read request, so after a while it expires.

TL;DR: possibly a bug in the agent


Thanks Joachim! I can confirm that calling it from dfx indeed works. Tagging @kpeacock because I think he’s the owner of the agent.

Oh, that’s an interesting theory - I’ll see if I can reproduce this with a new test


When developing complex projects, internal asynchronous functions are often factored out for reusability. For example:

private func fun1() : async Nat {
    await ledger.transfer(..);   // cross-canister call
    // ...
};
private func fun2() : async Nat {
    // ...
    await fun1();
};
private func fun3() : async Nat {
    // ...
    await fun2();
};
private func fun4() : async Nat {
    // ...
    await fun3();
};
private func fun5() : async Nat {
    // ...
    await fun4();
};
public shared func run() : async Nat {
    ignore fun5();
    // ...
};

The problem is that when 100 users call the canister at the same time, the number of messages in the output queues may exceed the limit of 500, and some await calls may trap.

My views:

  1. Classify await calls into outcalls and inner calls; inner calls should not be so heavily restricted.
  2. Optimise the role of await for private function calls. In the above example, it would be sufficient for only await ledger.transfer(…) to act asynchronously, ignoring the other awaits.
    The effect would be something like this:
private func fun1() : Nat {
    await ledger.transfer(..);   // cross-canister call
    // ...
};
private func fun2() : Nat {
    // ...
    fun1();
};
private func fun3() : Nat {
    // ...
    fun2();
};
private func fun4() : Nat {
    // ...
    fun3();
};
private func fun5() : Nat {
    // ...
    fun4();
};
public shared func run() : async Nat {
    fun5();
    // ...
};

OR,

private func fun1() : inner async Nat {
    await ledger.transfer(..);   // cross-canister call
    // ...
};
private func fun2() : inner async Nat {
    // ...
    inner await fun1();
};
private func fun3() : inner async Nat {
    // ...
    inner await fun2();
};
private func fun4() : inner async Nat {
    // ...
    inner await fun3();
};
private func fun5() : inner async Nat {
    // ...
    inner await fun4();
};
public shared func run() : async Nat {
    ignore fun5();
    // ...
};
  3. Allow futures to be returned from a synchronous function and saved in a global variable. For example:
private var f : ?Future<Nat> = null;
private func fun() : Nat {
    // ...
    f := ?ledger.transfer(..);   // no await
    // ...
};

There is another possible solution: add a countAwaitingCalls() method to the ExperimentalInternetComputer module so that a canister can limit the entry of new requests.
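A sketch of how such a method might be used as an admission guard. Note that countAwaitingCalls() is the hypothetical API proposed above, not something that exists in the module today; the limit of 400 is also just an illustrative assumption:

import Error "mo:base/Error";
// hypothetical: assumes countAwaitingCalls() were added to this module
import IC "mo:base/ExperimentalInternetComputer";

private let MAX_IN_FLIGHT = 400; // stay below the 500 default queue capacity

public shared func run() : async Nat {
    // reject new work up front instead of trapping later on a full queue
    if (IC.countAwaitingCalls() >= MAX_IN_FLIGHT) {
        throw Error.reject("too many outstanding calls, try again later");
    };
    await fun5();
};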


In your original example, each call uses await, which means it will schedule an outgoing self-call message and end the current call. So the total number of outstanding messages does not increase.

It is only when you use, for example, ignore fun2() that it will schedule more than one outgoing call.


Edit: I should add that calls like await fun2() will also reserve resources (e.g. place in the input queue) to make sure when fun2() returns a value, it will be processed. So nested await calls do consume more resources than a single one.


Yes, output messages accumulate whenever an ignore funN() is present in the call chain. This becomes uncontrollable when many users access the canister at the same time.


Hello everyone,

I talked to some people today to collect some recommendations on how to handle issues with too many outstanding messages filling up queues. Two general recommendations to be followed when aiming for scalable dapps that call other canisters came up quite consistently:

  • Make sure that the dapp maintains a counter or something similar tracking how many outstanding requests it has, and explicitly handle the situation where too many calls would be in flight at the same time. @roman-kashitsyn agreed to follow up with details on how this is (planned to be) done for the ckBTC canister.
  • If the design of the dapp allows batching together some of the calls to external canisters, aim to do so. For example, if there are multiple calls to the ledger involving the same account, it might be possible to batch them together and only do one transfer.
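The first recommendation can be implemented today with a manually maintained counter. A minimal sketch, assuming a ledger canister with a transfer method; the names, types, and the limit of 400 are placeholders, not an established API:

import Error "mo:base/Error";

private var inFlight : Nat = 0;
private let MAX_IN_FLIGHT = 400; // stay safely below the 500 default queue capacity

public shared func guardedTransfer(args : TransferArgs) : async TransferResult {
    // reject explicitly instead of letting the queue fill up and trap
    if (inFlight >= MAX_IN_FLIGHT) {
        throw Error.reject("too many outstanding calls, try again later");
    };
    inFlight += 1;
    try {
        let res = await ledger.transfer(args);
        inFlight -= 1;
        res
    } catch (e) {
        inFlight -= 1;   // make sure the counter is decremented on errors too
        throw e;
    };
};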

There are also certain things that the IC protocol could do differently. The things identified here seem to be in line with the suggestions already brought up earlier in this thread. However, I want to stress that these measures will not really help with scalability: they would only bump the limits by an order of magnitude or less, and the limits would still be easily hit as soon as, say, 1000 instead of 100 users try to do something. These things are:

  • Investigate whether it can be made easier to make nested function calls in Motoko without accumulating reservations, or whether there is an alternative pattern one could use. @PaulLiu already provided some pointers above, and @claudio agreed to follow up on details on this and what could be done.
  • Calls to self are currently treated in the same way as calls to other canisters. This means that a reservation for the response is made in both the input queue and the output queue, so calls to self effectively have only half the queue capacity available. This item is already in our list of backlog tasks and we will look into whether it can be picked up soon.

I think successive nested SELF calls can be optimised into one call.

call:fun5() -> call:fun4() -> call:fun3() -> call:fun2() -> call:fun1() -> out-call: ledger.transfer() ... 
return:fun5() <- return:fun4() <- return:fun3() <- return:fun2() <- return:fun1() <- out-return: ledger.transfer() <-

It can be optimised as below in the queue.

call:funs() -> out-call: ledger.transfer() ... 
return:funs() <- out-return: ledger.transfer() <-

In fact, the above five functions could be written as one function, but when coding there is a need for refactoring and code reuse.


Yes, this point is what @claudio will provide more details on. This is what I aimed to describe in the bullet point “Investigate whether …”.

What I meant to describe with the bullet point on reservations for responses in the snippet you are citing in your previous message is something that could be improved at the protocol level and that’s unrelated to how these things are handled in Motoko: roughly speaking, the protocol currently only allows DEFAULT_QUEUE_CAPACITY/2 requests in flight to self, while there can be DEFAULT_QUEUE_CAPACITY requests in flight to other canisters. This is because the protocol doesn’t distinguish between local and remote canisters; distinguishing between them could provide 2x more space for messages to self.


I will write a longer response when I get a chance, but, for now, to avoid the overhead of async/await associated with local functions that need to send messages, you need to remove those functions and inline them into their call sites.

I agree this is not good and have even proposed and implemented solutions to this problem in the past but they were felt to be too risky, blurring the distinction between await and state commit points.

I’ll elaborate on this in another reply, but fully agree that the current situation is not good enough for code-reuse and abstraction.

I’m happy to revisit addressing this, but there is no quick fix beyond inlining the calls to avoid the redundant async/await.
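To make the inlining workaround concrete for the fun1…fun5 example from earlier in the thread: the intermediate async functions are removed and their bodies merged into the shared function, so the only await left is the real cross-canister call (a sketch; ledger and its arguments are placeholders from the earlier example):

public shared func run() : async Nat {
    // ... body of fun5
    // ... body of fun4
    // ... body of fun3
    // ... body of fun2
    // body of fun1:
    let res = await ledger.transfer(..);   // the only actual cross-canister call
    // ...
    res
};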

This is important. Motoko is not a toy, and not just for writing demos; it needs to meet the needs of engineering.
Its risks can be mitigated by good programming habits and good IDE tools.

In the EVM, calls to smart contract functions are likewise divided into internal and external calls.


FTR, “Support direct abstraction of code that awaits into functions, without requiring an unnecessary async” · Issue #1482 · dfinity/motoko · GitHub is the original issue that discussed this, along with links to the PRs that fixed it but were then deemed too risky.

Yes, the introduction of new semantic expressions is a good solution, for example inner await and inner async.
An inner await would not be a data commit point, but it could contain an await data commit point inside it.


Indeed, with the FTX storm, centralization is facing more and more challenges in the foreseeable future, and we need to be prepared for the users who are moving to decentralization!
Looking forward to detailed reports on what preparations and changes we need to make (as soon as possible!) to support tens of millions (or even more) of users!

The “Canister trapped explicitly: could not perform self call” error caused by the input/output message queue limitation cannot currently be caught by try-catch.

This can break data consistency.

For example:

private stable var n : Nat = 0;
private func fun() : async () {
    try {
        n += 1;                  // this has already been executed
        let res = await fun1();  // trapped by the input/output message queue limitation
        // ...                   // this code will not be executed
    } catch (e) {
        n -= 1;                  // this code will not be executed
    };
};
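One defensive pattern (a sketch, not a complete fix for the uncatchable trap) is to avoid mutating state before an await that may fail to be scheduled, so that nothing needs to be rolled back in the catch branch:

private stable var n : Nat = 0;
private func fun() : async () {
    try {
        let res = await fun1(); // if this traps, n has not been touched yet
        n += 1;                 // state is only updated after the call succeeded
        // ...
    } catch (e) {
        // genuine rejections (as opposed to traps) can still be handled here
    };
};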