Inter-Canister Query Calls (Community Consideration)

Sorry for the delayed response. I was on vacation.

The latency of a cross-subnet query will be higher than that of a same-subnet query. The main issue I see is that it may take a long time until cross-subnet queries are supported. We have a prototype for same-subnet queries, but there is no clear solution for cross-subnet queries yet. It will likely take several months to come up with a solution and implement it.

I wonder if there is a way to unblock you in the meantime. Did you consider moving the cross-canister communication to client-side JavaScript in the browser, so that the JavaScript calls your canister and the other canister and combines the results? Alternatively, can you use update calls instead of query calls? The latency will be higher, but cross-subnet/cross-canister update calls are already supported today.
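To illustrate the second option, here is a rough sketch of a composite update method, assuming the Rust CDK (exact API names vary across ic-cdk versions); the called method "balance_of" and the argument/return types are made up for illustration:

```rust
use candid::Principal;

#[ic_cdk::update]
async fn combined_view(other_canister: Principal, user: Principal) -> (String, u64) {
    // Inter-canister *update* call: higher latency than a query,
    // but it already works across canisters and subnets today.
    let (balance,): (u64,) = ic_cdk::call(other_canister, "balance_of", (user,))
        .await
        .expect("inter-canister call failed");

    // Combine the remote result with data computed locally.
    (format!("user {}", user), balance)
}
```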

Yes, I think that should be OK. The called query would be a replicated query and would need to use the latest state anyway.

3 Likes

We have a prototype for same-subnet queries, but there is no clear solution for cross-subnet queries yet.

If I understand correctly and you already have a prototype for same-subnet inter-canister queries, when will that prototype be productionized and fully launched on mainnet?

I thought inter-canister queries were too difficult to implement. I’m surprised you already have a prototype for it.

2 Likes

The prototype works only for same-subnet queries. To make it production-ready, we need to support cross-subnet queries. Unfortunately, the prototype does not generalize to cross-subnet queries, so we need a completely new approach. In addition to that, we need to solve the spec issues raised in this thread and the state explosion problem caused by callers holding on to old states. All of this makes ICQC a very difficult problem.

2 Likes

One design that we are considering is executing the response callback of an inter-canister query call against the latest state of the canister (which differs from the state when the call was performed). That would resolve the main performance/memory blockers of ICQC but would make the programming model more complicated because the response callback would not see the memory changes done by the original query.

Does this design not work for cross-subnet queries? I’m curious what makes cross-subnet queries that much more challenging than same-subnet queries.

I think having some limited, working version of ICQC, even if not complete, could still be very useful.

2 Likes

Yes, from the implementation and performance point of view that design is the most promising. However, it is more difficult for developers to use. For example, it is not compatible with async Rust API because the response closure is stored on the heap, but the heap changes would not be preserved in the proposed design when the response arrives. This means that only static functions could be used as response callbacks. I am also worried that developers will miss the subtle point that all memory changes are discarded by the time the response callback runs, so it will be a major footgun.
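To make the footgun concrete, here is roughly what a developer might naively write under that design (Rust CDK assumed; a query performing an inter-canister call is the hypothetical ICQC feature being discussed, and the called "get" method is made up):

```rust
use std::cell::RefCell;
use candid::Principal;

thread_local! {
    // Ordinary canister state on the heap.
    static LAST_KEY: RefCell<u64> = RefCell::new(0);
}

// Hypothetical: a query that performs an inter-canister query call.
#[ic_cdk::query]
async fn lookup(other: Principal, key: u64) -> u64 {
    // This heap write happens in the query's speculative state...
    LAST_KEY.with(|k| *k.borrow_mut() = key);

    let (value,): (u64,) = ic_cdk::call(other, "get", (key,))
        .await
        .expect("inter-canister query failed");

    // ...but under the proposed design the response runs against the latest
    // *clean* state, so this read would NOT see the write above (and the
    // async machinery holding this very function's suspended state would be
    // gone too, which is why only static callbacks would work).
    let remembered = LAST_KEY.with(|k| *k.borrow());
    value + remembered // `remembered` would be 0 here, not `key`
}
```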

2 Likes

Interestingly, if canisters are programmed closer to the actor model, e.g. to be always upgradeable even with outstanding calls, then such a model becomes a bit more plausible.

2 Likes

Great insight and write up, @nomeata! Indeed, the design looks more reasonable with the actor model.

1 Like

I chatted with Ulan about one-shot messaging a bit more. I really like the idea of using it for update calls, but for ICQC it can be tricky. If a user queries canister A, and A sends a one-shot message to B and waits for a response, the system doesn’t know that A is waiting for a response and that it should wait.

I suppose we will need some mechanism for A to tell the system that it may still produce a response in the above case.
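For reference, this is roughly the one-shot pattern we have in mind, sketched with ic-cdk's fire-and-forget notify for update calls (the method names and the result-delivery endpoint are made up):

```rust
use candid::Principal;
use ic_cdk::api::call::notify;

// Canister A kicks off work with a one-shot message. No response callback is
// registered, so the system has no way of knowing that A still intends to
// produce a reply for the original caller.
#[ic_cdk::update]
fn kick_off(worker: Principal, job_id: u64) {
    notify(worker, "process", (job_id, ic_cdk::api::id()))
        .expect("failed to enqueue one-shot message");
}

// The worker later reports back with its own one-shot message to this
// (hypothetical) endpoint; only here could A assemble the final response.
#[ic_cdk::update]
fn job_done(job_id: u64, result: u64) {
    let _ = (job_id, result); // store or forward the result
}
```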

4 Likes

For example, it is not compatible with async Rust API because the response closure is stored on the heap, but the heap changes would not be preserved in the proposed design when the response arrives.

Can you explain why this is a problem with cross-subnet queries but not same-subnet queries? I’m not sure I understand why the heap is cleared for cross-subnet but not same-subnet.

I don’t think there will be any difference in the programming model / semantics depending on whether you are going cross-subnet or same-subnet. The async Rust API difficulties will be present in both cases.

2 Likes

Can you explain why this is a problem with cross-subnet queries but not same-subnet queries? I’m not sure I understand why the heap is cleared for cross-subnet but not same-subnet.

What we have currently is a prototype implementation of same-subnet ICQC that keeps all state changes in memory until all called inter-canister queries return. The main blocker for this prototype implementation is theoretically unbounded memory consumption (every call effectively forks the chain of state changes). The problem becomes worse with cross-subnet queries, which have much higher latencies.

One way to fix the memory consumption problem is to change the semantics of the calls to discard the state changes. But that makes the programming model difficult to use (the async Rust API problem). The problem applies equally to same- and cross-subnet calls, as @akhilesh.singhania mentioned.

2 Likes

Let’s say you have an ICQC that’s made in the context of a query.

Queries already discard state updates to canisters.

Are you saying that even local variables will be discarded within the context of a query that makes an ICQC?

1 Like

Queries already discard state updates to canisters.

A query discards the state updates after its execution is fully complete. What to do with the state updates when the query has a pending call (i.e. the query performed the call, but its response callback has not run yet) is an open question. Discarding the state changes leads to a confusing programming model.

Are you saying that even local variables will be discarded within the context of a query that makes an ICQC?

I guess you mean the local variables in async code while awaiting the result of a call. The local variables are stored in memory across the await points, so discarding the state changes would also discard the local variables. Even worse: the implementation of async/await in Rust stores internal information in memory, so discarding the state changes would make awaiting impossible (that’s what I meant by “not compatible with async Rust API” earlier).
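A tiny sketch of what that means in practice (Rust CDK assumed; a query awaiting an inter-canister call is the hypothetical ICQC feature, and the called method is made up):

```rust
use candid::Principal;

// Hypothetical ICQC: a query that awaits an inter-canister call.
#[ic_cdk::query]
async fn total(other: Principal) -> u64 {
    let local_part: u64 = 40; // a plain local variable

    // At this await point the compiled future -- including `local_part` and
    // the async runtime's bookkeeping -- lives in the canister's heap memory.
    let (remote_part,): (u64,) = ic_cdk::call(other, "get_part", ())
        .await
        .expect("inter-canister query failed");

    // If the memory changes made before the await were discarded, neither
    // `local_part` nor the suspended future itself would exist at this point.
    local_part + remote_part
}
```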

3 Likes

Thanks, this makes sense.

When I make an inter-canister update call, the IC runtime is able to save (i.e. fork) the current state of the canister at the time the call is made, and then resume from that state when the call returns.

Is it difficult to apply that same code to inter-canister query calls? You mention unbounded memory consumption as a blocker, but what I don’t quite understand is how the IC’s implementation of inter-canister update calls was able to avoid that problem. It seems to work fine for both same-subnet and cross-subnet updates.

1 Like

The key insight here is that update calls happen “on-chain” whereas query calls happen off-chain.

If you think about the blockchain as a git repo, each time we execute an update call on a canister, we are adding a new commit to the main branch of the repo.

Each time we execute a query (or an ICQC), we create a branch in the repo. We know that we will eventually discard the whole branch (when the ICQC is done), but while the ICQC is ongoing, we have to maintain it so that we can serve the calls properly. And if we imagine that there are a bunch of ICQCs happening in parallel and the call graphs are quite complex or deep, then that is a lot of state the node has to maintain.

That’s why the approach of not maintaining intermediate state during an ICQC will significantly reduce the amount of resources required to serve ICQCs.

8 Likes

Somewhat tangential but perhaps related, I noticed that “F+1 query calls” is on the roadmap for a Q4 2022 release.

IIUC, F := upper bound on the # of corrupt replicas.

My understanding is that ingress update calls (and inter-canister calls) currently go through N - F consensus, where N := # of replicas in a subnet, whereas ingress query calls don’t go through any consensus.

Since N - F consensus is a higher bar than F + 1 consensus, is it accurate to say that this feature will enable query calls that take longer to execute than current ingress queries but faster to execute than full-blown updates? If so, are inter-canister queries the intended use case for this? If not, what’s the purpose of this feature?
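(For concreteness, assuming the usual BFT sizing N = 3F + 1: a 13-node subnet would have F = 4, so N - F = 9 replicas versus F + 1 = 5.)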

(Just throwing out guesses, please let me know if I’m completely off-base.)

2 Likes

@akhilesh.singhania @ulan

I’m trying to understand the ICQC “state explosion” problem better.

  1. Does your same-subnet ICQC prototype use the same input and output queues that regular ingress and inter-canister calls use? In this case, ICQCs would “compete” with the update calls for canister execution. Or do they bypass this queue mechanism?

  2. Currently, does a query call spawn a new thread on a replica? (By the way, I’m curious if these threads all still run on a single core, or can they now run on multiple cores?) If so, maybe another approach is to maintain a thread pool that processes query requests from a shared queue? This could limit the impact of a “state explosion” to the size of the thread pool.
    a. Then, we could also add a restriction that ICQCs can only be initiated from either an ingress query call or another ICQC. This would reduce the likelihood that a single ICQC monopolizes a thread (and all of the threads depending on it).
    b. Another idea is to put a hard cap on query execution time. If that cap is exceeded, that query—and all its child ICQCs—fail, and an error is returned to the client.
    c. One more idea: while a query is waiting on an ICQC to return a response, that thread could execute another query in the meantime. Just like how the JavaScript event loop deals with async function calls.

WDYT? Or am I misunderstanding the problem?

1 Like

Queries and updates use completely different mechanisms. The problem is about holding on to dirty state of canisters while they wait for replies.

First, let’s consider update calls. If a canister executes an update call at state A and modifies its state to A’, the new state is committed and the state of the canister becomes A’. All future update calls will run against A’. The system is free to drop A whenever it is convenient.

And when you consider the current situation without ICQC, we see that queries will run against either state A or state A’. So when state A’ is available, the system can let new queries run against A’, and when the last queries running against A finish, the system can safely drop A.

Note that currently without ICQC and with upper limits on how long queries can run for, we have a simple upper bound on how long we need to hold onto older states of canisters.

Problem 1 with ICQC

Let’s say the user queries canister 1 at state A. Canister 1 sends an ICQC to canister 2, and so on. The system has to hold on to state A of canister 1 until the entire ICQC call graph is finished. If we allow for call graphs with unlimited depth (which is currently the case with update call graphs), there would be no upper limit on how long the system has to hold onto state A for canister 1.

Problem 2 with ICQC

The problem above is actually much worse. Note that when canister 1 executed the query at state A, on top of sending queries to other canisters, it also modified its state to A’. The system has to hold on not to state A but to A’, because when the replies come back, they expect to run against A’.

This might not seem like a big problem immediately, as the number of states the system maintains is not actually changing (A’ instead of A). However, if you consider how the states are maintained in the system, this is a huge problem: we maintain different states of a canister as deltas from previous versions of the state. So A’ is actually maintained as A + delta. What we are seeing here is that each time a canister makes an ICQC, we have to maintain a separate additional delta for that canister. And the more canisters are involved in an ICQC call graph, the more deltas the system has to maintain. And without bounding the call graph, we cannot bound the number of deltas we have to maintain.
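If it helps, here is a toy model (not the replica's actual data structures) of the bookkeeping described above:

```rust
use std::collections::HashMap;

type Delta = HashMap<u64, Vec<u8>>; // dirty page index -> new page contents (toy)

struct CanisterState {
    committed: Delta,      // the on-chain state A, shared by all callers
    in_flight: Vec<Delta>, // one speculative delta (A' = A + delta) per pending ICQC
}

impl CanisterState {
    // A query that makes an ICQC parks its writes in a fresh delta; the delta
    // must be retained until the entire call graph resolves, so with deep or
    // wide call graphs `in_flight` grows without bound.
    fn start_icqc(&mut self, writes: Delta) {
        self.in_flight.push(writes);
    }

    // An update call, by contrast, folds its writes into the committed state,
    // and nothing extra has to be retained afterwards.
    fn commit_update(&mut self, writes: Delta) {
        self.committed.extend(writes);
    }
}
```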

Further, note that all the above deltas maintained for queries need to be kept only temporarily: once an ICQC finishes, all the deltas can be dropped. They also need to be highly available, as we generally expect queries to run fairly quickly. This means that we need access to a large amount of storage that has low-latency access behaviour and can be recycled quickly. All of these are very challenging problems.

I hope the above gives some appreciation of how many different versions of a canister’s state (i.e. deltas) we would have to maintain if we implemented ICQC naively, i.e. with the same guarantees and semantics as we have for update messages.

This is why @ulan and I are proposing an ICQC design where the replies run against a “clean” state of the canister. With that design, we do not maintain any additional canister states beyond the ones we already maintain today. This will make programming with ICQC more difficult, as has already been discussed above.

3 Likes

Hm, “more difficult” sounds a bit like an understatement. Do we have a story for how such semantics could be reconciled with async in any shape or form?

1 Like

Note that currently without ICQC and with upper limits on how long queries can run for, we have a simple upper bound on how long we need to hold onto older states of canisters.

I wasn’t aware there currently was an upper limit on how long a query can run for. Is that limit measured in time or cycles?

If we allow for call graphs with unlimited depth (which is currently the case with update call graphs), there would be no upper limit on how long the system has to hold onto state A for canister 1.

Maybe we should add an upper limit then? For example, a query and any ICQCs that it kicks off can only run for a grand total of X seconds (or cycles).

And the more canisters are involved in a ICQC call graph, the more deltas that system has to maintain. And without bounding the call graph, we cannot bound the number of deltas we have to maintain.

Why don’t we just impose a limit on the # of concurrent queries a single replica can process (and add pending queries to a queue)? For example, executing queries on a per-replica thread pool would naturally impose such a limit. The call graph wouldn’t need to be bounded then. Every canister’s replicas in the chain of ICQCs would be responsible for limiting their own query concurrency. Together with a total time/cycle limit on a query, this should prevent the state explosion problem, I think?
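Something like this toy sketch is what I have in mind (tokio used purely for illustration; this is obviously not how the replica is structured internally):

```rust
// Toy sketch only; assumes tokio = { version = "1", features = ["full"] }.
use std::sync::Arc;
use tokio::sync::Semaphore;

const MAX_CONCURRENT_QUERIES: usize = 4;

async fn run_query(permits: Arc<Semaphore>, query_id: u64) {
    // A query (and, transitively, the ICQCs it spawns) only runs while it
    // holds a permit, so at most MAX_CONCURRENT_QUERIES speculative states
    // exist on this replica at any time; further queries wait in line here.
    let _permit = permits.acquire().await.expect("semaphore closed");
    // ... execute the query against a forked state, with a time/cycle cap ...
    let _ = query_id;
}

#[tokio::main]
async fn main() {
    let permits = Arc::new(Semaphore::new(MAX_CONCURRENT_QUERIES));
    let handles: Vec<_> = (0..32)
        .map(|id| tokio::spawn(run_query(permits.clone(), id)))
        .collect();
    for h in handles {
        h.await.unwrap();
    }
}
```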

2 Likes