Inter-Canister Query Calls (Status: done)

The key insight here is that update calls happen “on-chain” whereas query calls happen off-chain.

If you think about the blockchain as a git repo, each time we execute an update call on a canister, we are adding a new commit to the main branch of the repo.

Each time we execute a query (or an ICQC), we create a branch in the repo. We know that we will eventually discard the whole branch (when the ICQC is done), but while the ICQC is ongoing, we have to maintain it so that we can serve the calls properly. If we then imagine a bunch of ICQCs happening in parallel, with quite complex or deep call graphs, that is a lot of state the node has to maintain.

That’s why the approach of not maintaining intermediate state during an ICQC will significantly reduce the amount of resources required to serve ICQCs.

8 Likes

Somewhat tangential but perhaps related, I noticed that “F+1 query calls” is on the roadmap for a Q4 2022 release.

IIUC, F := upper bound on the # of corrupt replicas.

My understanding is that ingress update calls (and inter-canister calls) currently go through N - F consensus, where N := # of replicas in a subnet, whereas ingress query calls don’t go through any consensus.

Since N - F consensus is a higher bar than F + 1 consensus, is it accurate to say that this feature will enable query calls that take longer to execute than current ingress queries but are faster than full-blown updates? If so, are inter-canister queries the intended use case for this? If not, what’s the purpose of this feature?
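(For concreteness, with assumed numbers: on a hypothetical 13-replica subnet tolerating F = 4 faults, an update needs N - F = 9 replicas to agree, while F + 1 = 5 matching responses already guarantee that at least one of them came from an honest replica.)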

(Just throwing out guesses, please let me know if I’m completely off-base.)

2 Likes

@akhilesh.singhania @ulan

I’m trying to understand the ICQC “state explosion” problem better.

  1. Does your same-subnet ICQC prototype use the same input and output queues that regular ingress and inter-canister calls use? If so, ICQCs would “compete” with update calls for canister execution. Or do they bypass this queue mechanism?

  2. Currently, does a query call spawn a new thread on a replica? (By the way, I’m curious if these threads all still run on a single core, or can they now run on multiple cores?) If so, maybe another approach is to maintain a thread pool that processes query requests from a shared queue (see the sketch after this list)? This could limit the impact of a “state explosion” to the size of the thread pool.
    a. Then, we could also add a restriction that ICQCs can only be initiated from either an ingress query call or another ICQC. This would reduce the likelihood that a single ICQC monopolizes a thread (and all of the threads depending on it).
    b. Another idea is to put a hard cap on query execution time. If that cap is exceeded, that query—and all its child ICQCs—fail, and an error is returned to the client.
    c. One more idea: while a query is waiting on an ICQC to return a response, that thread could execute another query in the meantime. Just like how the JavaScript event loop deals with async function calls.
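A minimal sketch of ideas 2 and 2b combined, assuming a fixed-size worker pool draining a shared queue plus a hard cap on total query time. All names and numbers here are made up for illustration; this is not how the replica is actually structured:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

struct QueryRequest {
    payload: Vec<u8>,
}

const POOL_SIZE: usize = 4; // bounds in-flight queries, and thus the state held
const QUERY_CAP: Duration = Duration::from_secs(5); // assumed hard cap

fn execute_query(req: &QueryRequest, started: Instant) -> Result<Vec<u8>, &'static str> {
    // A real executor would re-check the cap at every ICQC boundary and fail
    // the query (and all its child ICQCs) once the cap is exceeded.
    if started.elapsed() > QUERY_CAP {
        return Err("query exceeded hard cap; aborting it and its child ICQCs");
    }
    Ok(req.payload.clone()) // placeholder for actual query execution
}

fn main() {
    let (tx, rx) = mpsc::channel::<QueryRequest>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..POOL_SIZE)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The lock guard is a temporary dropped at the end of this
                // statement, so workers take turns pulling from the queue.
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok(req) => {
                        let _ = execute_query(&req, Instant::now());
                    }
                    Err(_) => break, // channel closed: no more queries
                }
            })
        })
        .collect();

    tx.send(QueryRequest { payload: vec![1, 2, 3] }).unwrap();
    drop(tx); // close the queue so idle workers exit
    for w in workers {
        w.join().unwrap();
    }
}
```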

WDYT? Or am I misunderstanding the problem?

1 Like

Queries and updates use completely different mechanisms. The problem is about holding on to dirty state of canisters while they wait for replies.

First let’s consider update calls. If a canister executes an update call at state A and modifies its state to A’, the new state is committed and the state of the canister becomes A’. All future update calls will run against A’. The system is free to drop A whenever it is convenient.

When you consider the current situation without ICQC, queries will either run against state A or state A’. So when state A’ is available, the system can let new queries run against A’, and when the last queries running against A finish, the system can safely drop A.

Note that currently without ICQC and with upper limits on how long queries can run for, we have a simple upper bound on how long we need to hold onto older states of canisters.

Problem 1 with ICQC

Let’s say the user queries canister 1 at state A. Canister 1 sends an ICQC to canister 2 and so on. The system has to hold on to state A of canister 1 till the entire ICQC call graph is finished. If we allow for call graphs with unlimited depth (which is currently the case with update call graphs), there would be no upper limit on how long the system has to hold onto state A for canister 1.

Problem 2 with ICQC

The problem above is actually much worse. Note that when canister 1 executed the query at state A, on top of sending queries to other canisters, it also modified its state to A’. The system has to hold on not to state A but actually to A’, because when the replies come back, they expect to run against A’.

This might not seem like a big problem immediately, as the number of states the system maintains does not actually change (A’ instead of A). However, if you consider how states are maintained in the system, this is a huge problem: we maintain different states of a canister as deltas from previous versions of the state. So A’ is actually maintained as A + delta. What we are seeing here is that each time a canister makes an ICQC, we have to maintain a separate additional delta for that canister. And the more canisters are involved in an ICQC call graph, the more deltas the system has to maintain. And without bounding the call graph, we cannot bound the number of deltas we have to maintain.
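To make the delta bookkeeping concrete, here is a minimal sketch, assuming a page-level copy-on-write representation. The type names are illustrative, not the replica’s actual data structures:

```rust
use std::collections::HashMap;

type PageIndex = u64;
type Page = Vec<u8>;

/// Pages modified relative to the previous version of the state.
struct Delta {
    dirty_pages: HashMap<PageIndex, Page>,
}

/// A canister state kept as a base plus a stack of deltas:
/// A' = A + delta, A'' = A' + delta, and so on.
struct CanisterState {
    base: HashMap<PageIndex, Page>,
    deltas: Vec<Delta>,
}

impl CanisterState {
    /// Reading a page walks the deltas from newest to oldest, falling back
    /// to the base. Every in-flight ICQC that dirtied pages forces the
    /// system to keep its delta alive until the call graph finishes.
    fn read_page(&self, idx: PageIndex) -> Option<&Page> {
        for delta in self.deltas.iter().rev() {
            if let Some(page) = delta.dirty_pages.get(&idx) {
                return Some(page);
            }
        }
        self.base.get(&idx)
    }
}
```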

Further, note that all the above deltas maintained for queries need to be kept only temporarily. Once an ICQC finishes, all its deltas can be dropped. They also need high availability, as we generally expect queries to run fairly quickly. This means that we need access to a large amount of storage that has low-latency access behaviour and can be recycled quickly. All of these are very challenging problems.

I hope the above gives some appreciation of how many different versions of a canister’s state (i.e. deltas) we would have to maintain if we implemented ICQC naively, i.e. with the same guarantees and semantics as we have for update messages.

This is why @ulan and I are proposing an ICQC design where the replies run against a “clean” state of the canister. With that design, we do not maintain any canister states beyond those we already maintain today. This will make programming with ICQC more difficult, as has already been discussed above.

3 Likes

Hm, “more difficult” sounds a bit like an understatement. Do we have a story for how such a semantics could be reconciled with async in any form or shape?

1 Like

Note that currently without ICQC and with upper limits on how long queries can run for, we have a simple upper bound on how long we need to hold onto older states of canisters.

I wasn’t aware there currently was an upper limit on how long a query can run for. Is that limit measured in time or cycles?

If we allow for call graphs with unlimited depth (which is currently the case with update call graphs), there would be no upper limit on how long the system has to hold onto state A for canister 1.

Maybe we should add an upper limit then? For example, a query and any ICQCs that it kicks off can only run for a grand total of X seconds (or cycles).

And the more canisters are involved in a ICQC call graph, the more deltas that system has to maintain. And without bounding the call graph, we cannot bound the number of deltas we have to maintain.

Why don’t we just impose a limit on the # of concurrent queries a single replica can process (and add pending queries to a queue)? For example, executing queries on a per-replica thread pool will naturally impose a limit. The call graph doesn’t need to be bounded then. Every canister’s replicas in the chain of ICQCs are responsible for limiting their own query concurrencies. Together with a total time/cycle limit on a query, this should prevent the state explosion problem, I think?

2 Likes

We don’t think it will be possible. If you do not see a mechanism either, then this will indeed be very tricky. Maybe canisters cannot use the async paradigm for ICQC.

1 Like

The limit is in instructions just like there is an upper limit on how long update messages can run for. Otherwise, a canister can trigger an infinite query and consume resources forever.

We have upper limits on how long a single query can run for. Putting upper bounds on entire ICQC call graphs breaks composability. Essentially it can mean that a call graph like A → B → C → D may fail while C → D on its own succeeds. So you are unable to compose functionality by combining smaller pieces of functionality together.
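To illustrate with assumed numbers: if the cap for a whole call graph were 10B instructions and C → D alone consumed 6B, then C → D would succeed when called directly, but fail when reached via A → B → C → D after the earlier hops had already consumed, say, 5B.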

We precisely have this right now. Currently we can process at most 4 queries in parallel. We explicitly do not have a queue because having a queue does not help: what do you do when it is full?

The problem with ICQCs is that when a canister on subnet A sends an ICQC to a canister on subnet B, what do you then do on subnet A? Do you block the query thread on subnet A? If you do, then you create a DoS vector. If you do not, you need to store the state of A as discussed above.

2 Likes

At an absolute minimum, the API would need to enable attaching temporary out-of-band state (of arbitrary size) to each ICQC, which the canister can query upon receiving the reply.

Because if there were no way for temporary state to survive the ICQC (other than just the i32 env value supplied), then for most practical purposes ICQCs would be a completely useless feature: a canister would not be able to put the reply it receives into any useful context.
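For example, something like the following hypothetical API shape (none of these functions or types exist in the System API today) would let a canister round-trip its temporary state through the replica:

```rust
// Hypothetical sketch: the caller attaches an arbitrary-size context buffer
// to the ICQC; the replica holds it and hands it back with the reply,
// instead of only the single i32 env value available today.
struct CanisterId(Vec<u8>);

fn call_with_context(callee: CanisterId, method: &str, args: &[u8], context: Vec<u8>) {
    // The replica would store `context` out-of-band until the reply arrives.
    let _ = (callee, method, args, context);
}

// Reply callback: `context` comes back verbatim, letting the canister put
// the reply into useful context even though its heap was reset to the
// clean pre-call state.
fn on_reply(reply: &[u8], context: Vec<u8>) {
    let _ = (reply, context);
}
```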

2 Likes

That makes sense. Maybe there is a way for a canister to hand off some state to the replica to hold on to till the reply comes back.

1 Like

We precisely have this right now. Currently we can process at most 4 queries in parallel. We explicitly do not have a queue because having a queue does not help: what do you do when it is full?

I’m curious if 4 is temporary and will increase, or is that there to stay?

The problem with ICQCs is that when a canister on subnet A sends an ICQC to a canister on subnet B, what do you then do on subnet A? Do you block the query thread on subnet A? If you do, then you create a DoS vector. If you do not, you need to store the state of A as discussed above.

Good question, although I do wonder why that’s not an issue with same-subnet ICQCs? In both cases there is latency (although it’s exacerbated cross-subnet).

I’d assume you would, as you said, hand off some state to the replica to hold onto in some kind of “heap”. And if that heap is full on that replica, then the canister can respond to the query call with a 50x error. (Or if the user is fine with throwing away state or does not use ICQCs, then the canister can still process the query.)

1 Like

I summarized the current state of ICQC and the trade-off space in this document.

Feedback and comments are very much appreciated!

3 Likes

Pasting the contents of the document here for better readability.

Background

The Internet Computer (IC) has two types of messages: updates and queries. As shown in the table below, queries are fast because they are read-only and don’t have to go through consensus.

| | Update | Query |
| --- | --- | --- |
| State changes persisted | :heavy_check_mark: | |
| Goes through consensus | :heavy_check_mark: | ✘* |
| Low latency, high throughput | | :heavy_check_mark: |
| Inter-canister calls | :heavy_check_mark: | ✘** |

(*) A user may choose to run a query in an update context, where queries are executed on all replicas along with update messages. The state changes of queries are discarded, but the results go through consensus. We will refer to queries as replicated and non-replicated depending on whether they run in an update context or not. Replicated queries should not be confused with certified queries that also go through consensus but not in an update context. Note that certified queries currently exist only as an idea and are not implemented.

An update message can call methods of other canisters even if they are located on different subnets. Such inter-canister calls are essential for composing canisters into scalable applications.

(**) Queries do not support inter-canister calls. There is an incomplete implementation of ICQC that is enabled on verified subnets, but it is not generally available for the reasons explained below.

Requirements

The main requirement for ICQC is consistency with the existing inter-canister calls in update messages. Developers already know how inter-canister calls work and have certain expectations. Specifically:

A. [async/await] A call in a query should have the same semantics as a call in an update message. More concretely, it should be possible to support async/await for query calls.
B. [caller-id] The callee of an inter-canister call should be able to read and trust the identity of the caller.
C. [replicated mode] A call in a query should work both in replicated and non-replicated modes.
D. [cross-subnet] It should be possible to call a canister in another subnet.

In addition to the four consistency requirements above, we want queries to remain fast and to not consume a lot of memory:

E. [fast execution] Queries are faster than update messages because they don’t need to keep track of modified memory pages. Ideally, ICQC does not regress this.
F. [low memory usage] Ideally, ICQC does not introduce additional memory overhead.

Note that requirements E and F are important for stability and availability of IC.

Prototype Implementation

Verified application subnets have an incomplete prototype implementation of ICQC. The prototype was implemented under time pressure before the launch of IC. There was not enough time to think through the design and to write a specification. The prototype satisfies only two requirements: [async/await] and [caller-id]. Other requirements are not satisfied:

  • it works only in non-replicated mode,
  • it works only for same-subnet calls,
  • it is up to 2x slower because of the need to keep track of modified pages,
  • it has to hold on to the state until all calls complete, so in some cases it may double the memory usage of the replica.

Trade-off Space

The bad news is that the requirements are in conflict with each other. We identified two trade-off pairs:

  1. [async/await] vs [replicated mode, fast execution, low memory usage].
  2. [caller-id] vs [cross-subnet].

In each trade-off pair we can choose only one alternative. For example, the prototype implementation corresponds to [async/await] + [caller-id]. It seems that [async/await] is non-negotiable. Sacrificing it would result in a strange, non-intuitive programming model. Given that, the only other viable combination is [async/await] + [cross-subnet], where all inter-canister query calls are effectively anonymous/public.

Explanation of Trade-off 1

Consider the following generic async query function that calls another query:

```rust
async fn query_foo(input: Input) -> Output {
    let data = pre_process(input);
    let result = call(canister_bar, "query_bar", data).await;
    post_process(result)
}
```

It prepares some data, issues a call to another query, awaits the result, and finally returns a processed result. The IC executes the function as two separate messages. The first message runs from the start of the function to the await point. At the await point, the runtime of the language (Rust/Motoko) saves all necessary information in the WebAssembly memory (future, task, environment) so that when the reply message of the call comes back, execution can continue from the await point to the end of the function.

The crucial part here is “save all necessary information in memory”, which means that the state changes are important and can no longer be discarded. Thus, a query becomes similar to an update message until the call completes. Figure 1 visualizes the effect of ICQC on canister states. Normally, the state of a canister evolves linearly, changing only from one round to the next. A query that supports ICQC introduces a branch in the chain of states. This doesn’t work in replicated mode because the replicated state supports only one state per canister. The need to keep track of state changes makes execution slower and increases memory usage.
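As a rough illustration of what “save all necessary information in memory” amounts to, here is a hand-rolled sketch of the two messages the runtime effectively generates. The real Rust/Motoko machinery is more involved, and all names here are made up:

```rust
enum QueryFooState {
    AwaitingReply { saved_env: Vec<u8> },
    Done,
}

fn pre_process(input: Vec<u8>) -> Vec<u8> { input }
fn post_process(result: Vec<u8>) -> Vec<u8> { result }
fn issue_call(_data: &[u8]) {} // stands in for sending the ICQC

// Message 1: from the start of the function to the await point. Everything
// needed to resume later is written into the canister's Wasm memory.
fn query_foo_part1(input: Vec<u8>) -> QueryFooState {
    let data = pre_process(input);
    issue_call(&data);
    QueryFooState::AwaitingReply { saved_env: data }
}

// Message 2: the reply callback resumes from the await point. It only works
// if the state saved by message 1 is still in memory, which is exactly why
// the state changes can no longer be discarded.
fn query_foo_part2(state: QueryFooState, result: Vec<u8>) -> Vec<u8> {
    match state {
        QueryFooState::AwaitingReply { saved_env: _ } => post_process(result),
        QueryFooState::Done => unreachable!("no call in flight"),
    }
}
```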

Figure 1. A query call creates a branch in the linear chain of canister states.

Explanation of Trade-off 2

All messages in IC are signed by the key of the sender, which can be either a user or a subnet. Figure 2 shows the scenario where a user sends a non-replicated query to a single replica. If the query calls a canister on another subnet, then that message cannot be signed because the replica does not have the key of the subnet (by design). This means that the callee cannot trust the id of the caller and has to treat the incoming query as anonymous or public.

Figure 2. The user sends a non-replicated query to one replica in the first subnet, which in turn sends an inter-canister query to a replica in another subnet.

Conclusion

Based on the trade-off space, we have the following options:

  1. ICQC in non-replicated mode and only on the same subnet. This corresponds to the existing prototype implementation. If we go with this option, then we would enable the prototype everywhere.
  2. ICQC in non-replicated mode without caller-id. In this case the callee canister has to assume that the request is anonymous and respond only with publicly visible data, which greatly reduces the utility of ICQC.
  3. Do not support ICQC. Developers would have to move logic that calls other canisters to the client side.
10 Likes

Thanks for the detailed post, very insightful. I wasn’t aware of the [caller-id] vs [cross-subnet] trade-off until reading this.

Note that certified queries currently exist only as an idea and are not implemented.

I was under the impression that certified queries have been available this entire time. If so, then I don’t see the benefit of replicated ingress queries, as users could instead choose to make certified queries, which are faster than but still as secure as replicated queries.

Note that requirements D and E are important for stability and availability of IC.

Did you mean requirements 5 and 6? I don’t see D or E anywhere.

It seems that [async/await] is non-negotiable. Sacrificing it would result in a strange, non-intuitive programming model.

In that case, how will you solve the unbounded memory problem that you’ve identified as a major concern? You mentioned it may double the memory usage, but is that due to a hard cap that you will implement as part of this proposal or due to other reasons?

It’s also interesting that ICQC will be 2x slower…

Figure 1. A query call creates a branch in the linear chain of canister states.

In this figure, isn’t 2’ equal to 2? Because until Canister B responds, the forked query has suspended execution.


Based on your analysis, I think option 1 is the most sensible. Since developers cannot currently choose which subnet to deploy their canister onto AFAIK, I think implementing that feature would be a blocker for this.

2 Likes

These are certified variables. There is also an idea of certified or secure queries, where queries would run on some nodes that then reach consensus among themselves, without that consensus being part of the blocks. This would ensure that the result of the query can be trusted even without certified variables.

Did you mean requirements 5 and 6? I don’t see D or E anywhere.

Thanks fixed. Letters were lost when copying the document contents.

In that case, how will you solve the unbounded memory problem that you’ve identified as a major concern? You mentioned it may double the memory usage, but is that due to a hard cap that you will implement as part of this proposal or due to other reasons?

Yeah, we have to introduce some hard cap and abort very long-running queries. If we only support same-subnet calls then this is less of a problem because we don’t have network latency.

In this figure, isn’t 2’ equal to 2? Because until Canister B responds, the forked query has suspended execution.

2’ contains changes in the memory (futures/tasks/environment) necessary to continue execution of the async query method after the await point.

Based on your analysis, I think option 1 is the most sensible. Since developers cannot currently choose which subnet to deploy their canister onto AFAIK, I think implementing that feature would be a blocker for this.

I am leaning towards option 1 as well.

2 Likes

Currently a user can run a query in replicated mode to ensure trustworthy result even without certified variables. If ICQC is only available in non-replicated mode, then we lose that nice property.

Thanks for the summary, @ulan!

It’s not overly surprising that caller IDs are in the way. Essentially, caller IDs and access checking by inspecting caller IDs are a form of security by call stack inspection, which has long been a known source of problems, e.g., in the JVM or JavaScript. Needless to say, this could be retired if the IC adopted a modern, capability-based access model. :slight_smile:

9 Likes

If option 1 precludes the possibility of replicated queries, does it still allow for the possibility of certified queries, if and when that gets implemented?

Also, I hate to ask this, but assuming option 1 is favored by the community (perhaps in a proposal vote), when do you expect ICQC to be generally available? Would it be a matter of 1-2 months or much, much longer?

1 Like

Then what happens when a multi-canister application making these non-replicated queries scales beyond a subnet? Do the queries break, or do they fall back to a different mechanism when the canisters end up on different subnets?

As a “hack” to get around this “publicly visible data access” issue, would this developer defined solution work instead of relying on the IC?

For example, what if the developer has the caller canister contract code generate a key that can be sent to the callee, and then the caller and callee canister can both store this key and agree upon it/update it as needed. Then anytime the caller queries the callee it would send this key as a function parameter, and the callee would verify that all incoming queries have this key before executing.

An improved version of this same idea would be to use a public/private key pair instead of just a single key, where the callee has the public key and a transformation function, and the caller sends the private key as a function parameter.
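A minimal sketch of the single-key variant, purely at the application level (nothing here is an IC feature, and a malicious replica serving the non-replicated query could still observe the key):

```rust
// The callee checks the pre-shared key before serving confidential data.
fn callee_query(presented_key: &[u8], stored_key: &[u8], data: &str) -> Result<String, String> {
    // NOTE: a real implementation should use a constant-time comparison.
    if presented_key == stored_key {
        Ok(format!("confidential result for: {data}"))
    } else {
        Err("unauthorized caller".to_string())
    }
}
```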

2 Likes

About the second trade-off: isn’t it good enough to have the replica of the calling canister sign with its node key, and the called replica check that this is indeed a signature from one replica of the subnet responsible for the calling canister? The called replica should have the necessary information in the registry.

It seems that the worst that can happen is that a malicious replica on the subnet of canister X can obtain data that is confidential and should only be visible to X. But we are already trusting that replica to keep the state of X confidential, so this doesn’t seem to weaken the IC’s guarantees substantially, does it?
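A rough sketch of that verification step, assuming the callee replica can look up the caller subnet’s node public keys in the registry. All types and methods here are hypothetical stand-ins, not the actual registry API:

```rust
struct SubnetId(u64);
struct Signature(Vec<u8>);

struct NodeKey(Vec<u8>);
impl NodeKey {
    fn verify(&self, _message: &[u8], _signature: &Signature) -> bool {
        unimplemented!("placeholder for actual signature verification")
    }
}

struct Registry;
impl Registry {
    fn node_keys_of_subnet(&self, _subnet: &SubnetId) -> Vec<NodeKey> {
        unimplemented!("placeholder for a registry lookup")
    }
}

// Accept the ICQC if the message is signed by any node key belonging to
// the subnet that hosts the calling canister.
fn verify_icqc_signature(
    registry: &Registry,
    caller_subnet: &SubnetId,
    message: &[u8],
    signature: &Signature,
) -> bool {
    registry
        .node_keys_of_subnet(caller_subnet)
        .iter()
        .any(|key| key.verify(message, signature))
}
```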

4 Likes