Inter-Canister Query Calls (Community Consideration)

The limit is in instructions, just as there is an upper limit on how long update messages can run. Otherwise, a canister could trigger an infinite query and consume resources forever.

We have upper limits on how long a single query can run for. Putting upper bounds on ICQCs breaks composability: it can mean that a call graph like A → B → C → D fails even though C → D on its own succeeds, so you cannot compose larger functionality by combining smaller pieces.

We have precisely this right now. Currently we can process at most 4 queries in tandem. We explicitly do not have a queue because having a queue does not help: what do you do when it is full?

The problem with ICQCs is that when a canister on subnet A sends an ICQC to a canister on subnet B, what do you do on subnet A in the meantime? Do you block the query thread on subnet A? If you do, then you create a DoS vector. If you do not, you need to store the state of A, as discussed above.

2 Likes

At an absolute minimum, the API would need to enable attaching temporary out-of-band state (of arbitrary size) to each ICQC, which the canister can query upon receiving the reply.

Because if there were no way for temporary state to survive the ICQC (other than just the i32 env value supplied), then for most practical purposes, ICQCs would be a completely useless feature: a canister would not be able to put the reply it receives into any useful context.
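
As a purely hypothetical sketch of what such a mechanism might look like from the canister's side (none of these functions or types exist in the IC API today; all names are made up for illustration):

```rust
// Purely hypothetical sketch; no such system API exists today.
type CanisterId = Vec<u8>; // principal bytes, illustrative

struct CallHandle(u64);

/// Hypothetical: issue an ICQC and ask the replica to keep `state`
/// (of arbitrary size) out of band until the reply arrives.
fn icqc_call_with_state(callee: CanisterId, method: &str, arg: &[u8], state: Vec<u8>) -> CallHandle {
    unimplemented!("illustration only")
}

/// Hypothetical: inside the reply callback, retrieve the state that was
/// attached to the original call, so the reply can be put into context.
fn icqc_attached_state(call: CallHandle) -> Vec<u8> {
    unimplemented!("illustration only")
}
```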

2 Likes

That makes sense. Maybe there is a way for a canister to hand off some state to the replica to hold on to till the reply comes back.

1 Like

We have precisely this right now. Currently we can process at most 4 queries in tandem. We explicitly do not have a queue because having a queue does not help: what do you do when it is full?

I’m curious if 4 is temporary and will increase, or is it here to stay?

The problem with ICQCs is that when a canister on subnet A sends an ICQC to a canister on subnet B, what do you do on subnet A in the meantime? Do you block the query thread on subnet A? If you do, then you create a DoS vector. If you do not, you need to store the state of A, as discussed above.

Good question, although I do wonder why that’s not an issue with same-subnet ICQCs. In both cases there is latency (although it’s exacerbated cross-subnet).

I’d assume you would, as you said, hand off some state to the replica to hold onto in some kind of “heap”. And if that heap is full on that replica, then the canister can respond to the query call with a 50x error. (Or if the user is fine with throwing away state or does not use ICQCs, then the canister can still process the query.)

1 Like

I summarized the current state of ICQC and the trade-off space in this document.

Feedback and comments are very much appreciated!

3 Likes

Pasting the contents of the document here for better readability.

Background

The Internet Computer (IC) has two types of messages: updates and queries. As shown in the table below, queries are fast because they are read-only and don’t have to go through consensus.

|                              | Update             | Query              |
| ---------------------------- | ------------------ | ------------------ |
| State changes persisted      | :heavy_check_mark: |                    |
| Goes through consensus       | :heavy_check_mark: | ✘*                 |
| Low latency, high throughput |                    | :heavy_check_mark: |
| Inter-canister calls         | :heavy_check_mark: | ✘**                |

(*) A user may choose to run a query in an update context, where queries are executed on all replicas along with update messages. The state changes of queries are discarded, but the results go through consensus. We will refer to queries as replicated and non-replicated depending on whether they run in an update context or not. Replicated queries should not be confused with certified queries that also go through consensus but not in an update context. Note that certified queries currently exist only as an idea and are not implemented.
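
To illustrate footnote (*) from the client side, here is a hedged sketch using the ic-agent Rust crate: the same exported query method can be invoked either as a regular (non-replicated) query or, if the user wants the result to go through consensus, as an update call. Builder method names may differ slightly between ic-agent versions, and `get_balance` is a made-up method name:

```rust
use candid::Principal;
use ic_agent::{Agent, AgentError};

async fn call_both_ways(
    agent: &Agent,
    canister_id: Principal,
    arg: Vec<u8>,
) -> Result<(Vec<u8>, Vec<u8>), AgentError> {
    // Non-replicated: answered by a single replica, no consensus, fast.
    let fast = agent
        .query(&canister_id, "get_balance")
        .with_arg(arg.clone())
        .call()
        .await?;

    // Replicated ("query in an update context"): the same method submitted as
    // an update call, so the result goes through consensus (slower, trusted).
    let replicated = agent
        .update(&canister_id, "get_balance")
        .with_arg(arg)
        .call_and_wait()
        .await?;

    Ok((fast, replicated))
}
```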

An update message can call methods of other canisters even if they are located on different subnets. Such inter-canister calls are essential for composing canisters into scalable applications.

(**) Queries do not support inter-canister calls. There is an incomplete implementation of ICQC that is enabled on verified subnets, but it is not generally available for the reasons explained below.

Requirements

The main requirement for ICQC is consistency with the existing inter-canister calls in update messages. Developers already know how inter-canister calls work and have certain expectations. Specifically:

A. [async/await] A call in a query should have the same semantics as a call in an update message. More concretely, it should be possible to support async/await for query calls.
B. [caller-id] The callee of an inter-canister call should be able to read and trust the identity of the caller.
C. [replicated mode] A call in a query should work both in replicated and non-replicated modes.
D. [cross-subnet] It should be possible to call a canister in another subnet.

In addition to the four consistency requirements above, we want queries to remain fast and not consume a lot of memory:

E. [fast execution] Queries are faster than update messages because they don’t need to keep track of modified memory pages. Ideally, ICQC does not regress this.
F. [low memory usage] Ideally, ICQC does not introduce additional memory overhead.

Note that requirements E and F are important for stability and availability of IC.

Prototype Implementation

Verified application subnets have an incomplete prototype implementation of ICQC. The prototype was implemented under time pressure before the launch of IC. There was not enough time to think through the design and to write a specification. The prototype satisfies only two requirements: [async/await] and [caller-id]. Other requirements are not satisfied:

  • it works only in non-replicated mode,
  • it works only for same-subnet calls,
  • it is up to 2x slower because of the need to keep track of modified pages,
  • it has to hold on to the state until all calls complete, so in some cases it may double the memory usage of the replica.

Trade-off Space

The bad news is that the requirements are in conflict with each other. We identified two trade-off pairs:

  1. [async/await] vs [replicated mode, fast execution, low memory usage].
  2. [caller-id] vs [cross-subnet].

In each trade-off pair we can choose only one alternative. For example, the prototype implementation corresponds to [async/await] + [caller-id]. It seems that [async/await] is non-negotiable. Sacrificing it would result in a strange, non-intuitive programming model. Given that, the only other viable combination is [async/await] + [cross-subnet], where all inter-canister query calls are effectively anonymous/public.

Explanation of Trade-off 1

Consider the following generic async query function that calls another query:

async fn query_foo(input: Input) -> Output {
    // First message: runs from the start of the function to the await point.
    let data = pre_process(input);
    // The runtime saves the future/task/environment in Wasm memory here so
    // that execution can resume when the reply comes back.
    let result = call(canister_bar, "query_bar", data).await;
    // Second message: continues from the await point to the end.
    post_process(result)
}

It prepares some data, issues a call to another query, awaits the result, and finally returns a processed result. The IC executes the function as two separate messages. The first message runs from the start of the function to the await point. At the await point, the runtime of the language (Rust/Motoko) saves all necessary information in the WebAssembly memory (future, task, environment) such that when the reply message of the call comes back, execution can continue from the await point to the end of the function.

The crucial part here is “save all necessary information in memory”, which means that the state changes are important and can no longer be discarded. Thus, a query becomes similar to an update message until the call completes. Figure 1 visualizes the effect of ICQC on canister states. Normally, the state of a canister evolves linearly, changing only from one round to the next. A query that supports ICQC introduces a branch in the chain of states. This doesn’t work in replicated mode because the replicated state supports only one state per canister. The need to keep track of state changes makes execution slower and increases memory usage.
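
To make the “save all necessary information in memory” point concrete, here is a minimal hand-written sketch of the kind of state machine an async function such as `query_foo` desugars into (illustrative names only, not the actual compiler output). The suspended variant is exactly the data that must survive in Wasm memory across the await point, which is why the query’s memory changes can no longer be discarded:

```rust
// Minimal sketch, not the actual compiler output: the state machine behind an
// async query like `query_foo`. All type and field names are illustrative.
struct Input(Vec<u8>);
struct Output(Vec<u8>);

enum QueryFooState {
    // Not yet polled: only the original input is held.
    Start { input: Input },
    // Suspended at `.await`: the bookkeeping needed to resume after the reply
    // from `query_bar` arrives. This is the state that makes the query behave
    // like an update message until the call completes.
    AwaitingReply { pending_call_id: u64 },
    // The reply has been processed and the final output produced.
    Done(Output),
}
```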

Figure 1. A query call creates a branch in the linear chain of canister states.

Explanation of Trade-off 2

All messages in IC are signed by the key of the sender, which can be either a user or a subnet. Figure 2 shows the scenario where a user sends a non-replicated query to a single replica. If the query calls another canister on another subnet, then that message cannot be signed because the replica does not have the key of the subnet (by design). This means that the callee cannot trust the id of the caller and has to treat the incoming query as anonymous or public.

Figure 2. The user sends a non-replicated query to one replica in the first subnet, which in turn sends an inter-canister query to a replica in another subnet.

Conclusion

Based on the trade-off space, we have the following options:

  1. ICQC in non-replicated mode and only on the same subnet. This corresponds to the existing prototype implementation. If we go with this option, then we would enable the prototype everywhere.
  2. ICQC in non-replicated mode without caller-id. In this case the callee canister has to assume that the request is anonymous and respond only with publicly visible data, which greatly reduces the utility of ICQC.
  3. Do not support ICQC. Developers would have to move logic that calls other canisters to the client side.
9 Likes

Thanks for the detailed post―very insightful. I wasn’t aware of the [caller-id] vs [cross-subnet] trade-off until reading this.

Note that certified queries currently exist only as an idea and are not implemented.

I was under the impression that certified queries have been available this entire time. If so, then I don’t see the benefit of replicated ingress queries, as users could instead choose to make certified queries, which are faster than but still as secure as replicated queries.

Note that requirements D and E are important for stability and availability of IC.

Did you mean requirements 5 and 6? I don’t see D or E anywhere.

It seems that [async/await] is non-negotiable. Sacrificing it would result in a strange, non-intuitive programming model.

In that case, how will you solve the unbounded memory problem that you’ve identified as a major concern? You mentioned it may double the memory usage, but is that due to a hard cap that you will implement as part of this proposal or due to other reasons?

It’s also interesting that ICQC will be 2x slower…

Figure 1. A query call creates a branch in the linear chain of canister states.

In this figure, isn’t 2’ equal to 2? Because until Canister B responds, the forked query has suspended execution.


Based on your analysis, I think option 1 is the most sensible. Since developers cannot currently choose which subnet to deploy their canister onto AFAIK, I think implementing that feature would be a blocker for this.

2 Likes

These are certified variables. There is also an idea of certified or secure queries, where queries would run on some nodes that then reach consensus among themselves without that being part of the blocks. This would ensure that the result of the query can be trusted even without certified variables.

Did you mean requirements 5 and 6? I don’t see D or E anywhere.

Thanks, fixed. Letters were lost when copying the document contents.

In that case, how will you solve the unbounded memory problem that you’ve identified as a major concern? You mentioned it may double the memory usage, but is that due to a hard cap that you will implement as part of this proposal or due to other reasons?

Yeah, we have to introduce some hard cap and abort very long-running queries. If we only support same-subnet calls then this is less of a problem because we don’t have network latency.

In this figure, isn’t 2’ equal to 2? Because until Canister B responds, the forked query has suspended execution.

2’ contains changes in the memory (futures/tasks/environment) necessary to continue execution of the async query method after the await point.

Based on your analysis, I think option 1 is the most sensible. Since developers cannot currently choose which subnet to deploy their canister onto AFAIK, I think implementing that feature would be a blocker for this.

I am leaning towards option 1 as well.

2 Likes

Currently a user can run a query in replicated mode to ensure a trustworthy result even without certified variables. If ICQC is only available in non-replicated mode, then we lose that nice property.

Thanks for the summary, @ulan!

It’s not overly surprising that caller ids are in the way. Essentially, caller ids and access checks based on caller ids are a form of security by call-stack inspection, which has long been a known source of problems, e.g., in the JVM or JavaScript. Needless to say, this could be retired if the IC adopted a modern, capability-based access model. :slight_smile:

9 Likes

If option 1 precludes the possibility of replicated queries, does it still allow for the possibility of certified queries, if and when that gets implemented?

Also, I hate to ask this, but assuming option 1 is favored by the community (perhaps in a proposal vote), when do you expect ICQC to be generally available? Would it be in a matter of 1-2 months or much much longer?

1 Like

Then what happens when a multi-canister application making these non-replicated queries scales beyond a subnet? Do the queries break, or do they fall back to a different mechanism when the canisters are on different subnets?

As a “hack” to get around this “publicly visible data access” issue, would this developer-defined solution work instead of relying on the IC?

For example, what if the developer has the caller canister’s code generate a key that is sent to the callee, and the caller and callee canisters both store this key and agree upon it/update it as needed. Then any time the caller queries the callee, it would send this key as a function parameter, and the callee would verify that all incoming queries carry this key before executing.

An improved version of this idea would be to use a public/private key pair instead of a single key, where the callee holds the public key and a transformation function, and the caller sends the private key as a function parameter.
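
For what it’s worth, here is a minimal sketch of the first (shared-secret) variant as a canister method, assuming a recent ic_cdk where the `#[ic_cdk::query]` attribute and `ic_cdk::trap` are available. Method and variable names are illustrative, and this is a sketch of the workaround, not an endorsed pattern:

```rust
use std::cell::RefCell;

thread_local! {
    // Secret agreed upon out of band with the trusted caller canister
    // (e.g. set via a guarded update method, not shown here).
    static SHARED_SECRET: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

#[ic_cdk::query]
fn private_data(secret: Vec<u8>) -> Vec<u8> {
    // Reject queries that do not present the agreed-upon, non-empty secret.
    let authorized =
        SHARED_SECRET.with(|s| !secret.is_empty() && *s.borrow() == secret);
    if !authorized {
        ic_cdk::trap("unauthorized: secret does not match");
    }
    // ... return the data that should only be visible to the trusted caller ...
    Vec::new()
}
```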

2 Likes

About the second trade-off: isn’t it good enough to have the replica of the calling canister sign with its node key, and the called replica check that this is indeed a signature from one replica of the subnet responsible for the calling canister? The called replica should have the necessary information in the registry.

It seems that the worst that can happen is that a malicious replica on the subnet of canister X can obtain data that is confidential and should only be visible to X. But we are already trusting that replica to keep the state of X confidential, so this doesn’t seem to weaken the IC’s guarantees substantially, does it?

4 Likes

There is no concrete design for certified queries, so I am going to speculate here. If we disallow cross-subnet calls, then I don’t see any blockers. I think the existing prototype would work also for certified queries. If we support cross-subnet calls, then we probably need multiple rounds of consensus, which would increase the complexity of certified queries but at the same time would allow signing of the cross-subnet messages (as @nomeata noted in the very first comment of this thread).

Also, I hate to ask this, but assuming option 1 is favored by the community (perhaps in a proposal vote), when do you expect ICQC to be generally available? Would it be in a matter of 1-2 months or much much longer?

In terms of engineering work 1-3 SWE-months sounds like the right ballpark. Please note that this doesn’t directly translate to real months because there are other high-priority projects queued up and the teams are overloaded.

Some of the remaining work:

Since the calls are limited to the same subnet, we would need some mechanism to create canisters on the same subnet and also keep them together in case of subnet splitting. I think this is the main blocker.

We need to consider whether we want to introduce some way of charging for queries because the execution cost of all queries is going to increase significantly. That’s because we don’t know beforehand whether a query is going to call another query or not, so we would have to keep track of modified memory for all queries. (We could also introduce some annotation for developers to indicate that a query is going to use ICQC, but that seems a bit ad-hoc to me).

We need to introduce limits to guarantee that all queries terminate within a reasonable deadline.

1 Like

The cross-subnet query calls would indeed fail if we go with option 1. I don’t see a working fallback here.

For example, what if the developer has the caller canister’s code generate
a key that is sent to the callee, and the caller and callee canisters
both store this key and agree upon it/update it as needed.
An improved version of this idea would be to use a public/private key pair
instead of a single key, where the callee holds the public key and a
transformation function, and the caller sends the private key as a function parameter.

Thanks for sharing the idea! I am assuming you meant that the caller signs the message using the private key and sends the signature instead of sending the private key itself. I think it would work and seems to have somewhat similar security properties to the scheme proposed by @nomeata (for the case when a malicious replica obtains the private key):

Good point! I haven’t considered it from this angle. I think the answer depends on the security/threat model. In theory we trust the replica to keep the state of X confidential, but we can never exclude bugs in the implementation. In the very worst case the replica exploits such bugs and manages to leak the confidential data. Do we want to limit the damage to the subnet of the malicious replica or allow it to spread to other subnets?

I like having the subnet boundary as an additional security barrier, but maybe I am too paranoid here. I wonder if the security folks have an opinion. @robin-kunzler

4 Likes

… or run them opportunistically in the fast, non-tracking mode, and if they do make calls, abort and re-run in the memory-tracking mode. If the fraction of queries that make calls is low enough (as it likely will be), determinism allows all kinds of neat optimizations :slight_smile:
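
Roughly, the opportunistic strategy being described looks like this (an illustrative sketch only, not the replica’s actual implementation):

```rust
// Illustrative sketch of "run fast first, re-run with tracking only if needed".
enum Outcome {
    Completed(Vec<u8>),
    AttemptedCall, // the query tried to make an inter-canister call
}

/// `run(track_memory)` executes the query; determinism means a re-run behaves
/// the same way up to the point where the call is attempted.
fn execute_query(run: impl Fn(bool) -> Outcome) -> Vec<u8> {
    match run(false) {
        // Common case: no calls made, the fast non-tracking pass suffices.
        Outcome::Completed(result) => result,
        // The query made a call: abort and re-run with memory tracking on,
        // so its state can be kept until the reply comes back.
        Outcome::AttemptedCall => match run(true) {
            Outcome::Completed(result) => result,
            Outcome::AttemptedCall => unreachable!("calls are allowed in tracking mode"),
        },
    }
}
```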

It doesn’t really “spread” in the sense that it is still only data accessible to X that is under threat: data owned by an unrelated canister Y is unaffected. The data may be living somewhere else, but conceptually it is not a huge deal, or at least small enough not to sacrifice the programming model for it :slight_smile:

Yep, that’s a neat optimization and I implemented that a while ago :smiley: ic/query_context.rs at master · dfinity/ic · GitHub
The tradeoff is that queries using ICQC become even slower, so we might have to disable the optimization if popular canisters start relying on ICQC (not sure how likely it is).

@manu made a good point in our offline discussion yesterday: since a malicious replica can forge any query, it can issue a query that would be impossible in a valid execution. For example, if the caller performs the access check and the callee trusts caller_id, then the malicious replica can get data that shouldn’t be accessible to X. I know it is a bad idea to have access checks on the caller side, but this is just to highlight that the issue is subtle and trusting caller_id may be a security footgun.

Hmm, good example. Although it’s still within “only data accessible to canister X is at risk”, even if an uncompromised canister wouldn’t usually query that data.

And note that the leak you describe is comparable in impact (actually less) to the callee’s subnet having a single compromised node.

I’m still not convinced that this is worth breaking (or at least worsening) the programming model over a small strengthening of data privacy guarantees in a corner case. It’s odd if a normal call has the correct sender, but a call to a query function doesn’t.

3 Likes

We need to consider whether we want to introduce some way of charging for queries because the execution cost of all queries is going to increase significantly. That’s because we don’t know beforehand whether a query is going to call another query or not, so we would have to keep track of modified memory for all queries. (We could also introduce some annotation for developers to indicate that a query is going to use ICQC, but that seems a bit ad-hoc to me).

My personal opinion is that having an annotation wouldn’t be too bad in terms of developer experience if it can save them some cycles.

I think charging for queries may change the decision calculus for voters. Do you plan on submitting a proposal for Option 1 before implementation? I think it’d be nice to get community feedback on whether this tradeoff is worth it. (A proposal with a timer also makes it more likely for people to read and comment.)

1 Like