Given that there were quite a few questions regarding messaging guarantees on the IC in the global R&D today, I’m starting this thread, which will provide more details (I hope by tomorrow) and can then be used as a basis for further discussion.
I’m trying to decide if I think this is a feature or a bug with regards to one shots. I was relying on a high probability of delivery and expecting failure only in a small number of instances, most of which were predictable like upgrades or stopped canisters where I could check for missed messages in start up. Typically the one shots were confirming themselves via other one shots back to the broadcaster. If I keep this pattern then I guess I will now be able to guarantee that failure has occurred after 5 minutes and mark the item as undelivered.
This works for instances where I’m waiting for confirmation, but I can see how it is suboptimal for true fire-and-forget use cases, as they will now fail under heavy load as opposed to only under extreme and prolonged heavy load.
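To make the pattern above concrete, here is a minimal sketch of the bookkeeping on the broadcaster side: record each one-shot when it is sent, clear it when the confirming one-shot comes back, and mark anything older than the window as undelivered. The class name, method names and the plain Python model are all illustrative assumptions, not IC APIs. Note that "undelivered" here is an application-level verdict; a late confirmation can still arrive, so the sketch lets it overturn the verdict.

```python
from dataclasses import dataclass, field

# The 5-minute figure discussed in this thread (assumed here as a constant).
TIMEOUT_SECONDS = 5 * 60


@dataclass
class PendingTracker:
    """Tracks one-shot messages awaiting a confirming one-shot from the receiver."""
    pending: dict = field(default_factory=dict)     # message id -> send time (seconds)
    undelivered: set = field(default_factory=set)   # ids given up on

    def sent(self, msg_id: str, now: float) -> None:
        self.pending[msg_id] = now

    def confirmed(self, msg_id: str) -> None:
        # A late confirmation also clears an earlier "undelivered" verdict.
        self.pending.pop(msg_id, None)
        self.undelivered.discard(msg_id)

    def sweep(self, now: float) -> set:
        """Move every message older than the timeout to the undelivered set."""
        expired = {m for m, t in self.pending.items() if now - t >= TIMEOUT_SECONDS}
        for m in expired:
            del self.pending[m]
        self.undelivered |= expired
        return expired
```

Calling `sweep` from a periodic task (or on startup after an upgrade) would give the "check for missed messages" behaviour described above.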
Thanks for kicking this off @derlerd-dfinity1, and my apologies if this comes off as a bit of a rant - I was just a bit taken off guard by the request lifetime/deadlines announcement in the R&D, and it seems that several other developers had a different interpretation of the inter-canister transactional guarantees the IC has been providing.
If a developer wants to interact with a 3rd party canister on the IC, the safest way to do so is through a one-shot call. There are several security issues with inter-canister calls if the canister you are communicating with is not blackholed, DAO controlled, or owned by you.
An example of setting up a canister to lock out any canister that calls it was shown in this simple example by @nomeata, who then proposes one-shot or “fire-and-forget” calls as a good workaround in this blog post:
https://www.joachim-breitner.de/blog/789-Zero-downtime_upgrades_of_Internet_Computer_canisters
@JensGroth brought up the security issues related to inter-canister calls in this comment, saying that for this reason (non-responsive canisters) it is recommended to only talk to “trustworthy” canisters
For this reason, many developers over the past year reading the forums have decided to use one-shot calls in their applications to relay messages to other canisters, as they thought that even if backed-up, there was a high probability or guarantee that they would eventually be delivered (only wouldn’t be delivered if the receiving canister was deleted or the API was changed).
This all changes if outgoing messages are dropped after 5 minutes. That regularly happens when a canister is under heavy load or a subnet gets backed up. Just this past week, I spoke with a developer who filled up their output queue, so then they started batching transactions and built a canister load balancer to deal with the traffic and get around these output queue limitations. What seemed at the time like a good way to balance load and jump over the output queue limits now seems like it may have just ended up being a clever way to overwhelm the downstream consuming canister and drop messages just as fast (after 5 minutes of waiting in queue).
However, saying that one-shot calls are unreliable and should not be used means that developers have to trust the canisters they interact with. 99% of the canisters on the IC are mutable and are not DAO controlled. Therefore, any sort of interoperability based on trust is unreasonable, and if developers are unable to rely on one-shot calls, then they have to choose between the security risk of using 3rd party canisters and the chance that their one-shot call will fail/time out.
It also just so happens that last week I was looking through the IC’s queue code when further looking into queue limitations and found this:
It looks like the queue deadline (message dropping) feature was introduced 4 months ago
At the time, I wasn’t as up to speed regarding what input and output queues are, but right now this seems like a breaking change that should have gotten more attention. For most smaller applications, this isn’t a huge deal, but for larger applications that interact with many different services (e.g. DSCVR) or process a lot of load (e.g. SNS sales), this means that if the canister queues get behind or the subnet slows down, not just ingress messages but also inter-canister messages can and will be lost.
It’s extremely difficult to think about transactional guarantees and to implement any sort of async transactional workflow across multiple canisters, and with a lack of transactional guarantees one needs to fully flesh out a SAGA-type transaction design in order to be able to revert distributed transactions that failed. All of a sudden, building reliable, resilient applications on the IC gets a lot harder.
Next steps?
Securing inter-canister calls without requiring trust
It might make sense to first think about how the IC can make inter-canister calls with 3rd party services safe for canisters. If there’s a timeout on messages in the output queue, what about a timeout on messages being executed by a canister? We already have DTS which allows messages to span several rounds of consensus, so maybe a reasonable message should be expected to execute within 2 minutes (threshold)?
Identifying best practices and patterns for applications for handling high throughput
In 2023 there’s a good chance that we’re going to see the first IC applications go “viral”. This means a lot of traffic, and hopefully dev teams are prepared (know the limitations and boundaries of the IC’s scalability).
There are definitely ways to get around message dropping (implementing distributed queues for example), but this then involves adding more canisters into the architecture, more latency, more chances for failure, and more complexity. Ideally if these problems can be solved at the protocol level or certain SLA guarantees can be given about uptime/message delivery (i.e. 5 9s+), then developers can focus on building their app instead of having to worry about message delivery.
Thanks a lot for your posts. We hope to be able to clarify all the questions above – in particular why the introduction of timeouts is not a breaking change – with the explanation below. Also, thanks a lot for your suggestions on next steps to discuss. We like the suggestions but would propose to discuss them in separate threads.
Canister messaging guarantees
The IC protocol provides two guarantees for canister-to-canister messages:
- Request ordering: Successfully delivered requests are received in the order in which they were sent. In particular, if a canister A sends m1 and m2 to canister B in that order, then, if both are accepted, m1 is executed before m2.
- Guaranteed responses: Every request is guaranteed a response (either from the canister, or synthetically produced by the protocol).
(No) Guaranteed request delivery
There are no guarantees regarding successful request delivery: a request may fail synchronously (e.g. if the sender’s output queue is full; or when trying to enqueue an output request when the sender subnet is at memory limit) or asynchronously (if the receiver’s input queue is full; or the receiver is stopped; or frozen; or the receiver subnet is at memory limit). If the receiving canister panics when handling the request (which may also happen in the absence of canister code bugs, e.g. canister exceeding the instruction limit or running out of memory), the effect is similar to the request never having been delivered in the first place: an asynchronous reject response.
One additional potential source of request delivery failure is being added in the form of request timeouts (enabled in the release elected this week). Apart from the exact error message, this will look the same to the caller as a full input queue at the receiving end (or the receiver having trapped; or rejected the request; or being stopped/frozen; or the receiving subnet being at limit in terms of memory). Before timeouts one would have seen errors in these cases; the only difference now is that backlogs are cleared sooner. Nothing changes from the perspective of the canisters – these failure modes needed to be handled before and they need to be handled now.
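The distinction between synchronous and asynchronous failure described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Canister` class, the `send` and `deliver` functions and the outcome strings are invented for illustration and are not part of any IC API, and real queues also account for memory limits, frozen canisters and reply slots. The point is just that a synchronous failure is seen at the call site, while an asynchronous one comes back later as a reject response.

```python
from collections import deque


class SyncReject(Exception):
    """Raised when the request never leaves the sender (e.g. output queue full)."""


class Canister:
    def __init__(self, name, queue_capacity, stopped=False):
        self.name = name
        self.capacity = queue_capacity
        self.stopped = stopped
        self.output = deque()
        self.input = deque()


def send(sender, request):
    # Synchronous failure: the caller learns about it immediately.
    if len(sender.output) >= sender.capacity:
        raise SyncReject("output queue full")
    sender.output.append(request)


def deliver(sender, receiver):
    """Route everything in sender.output; return (request, outcome) pairs.

    Every request gets exactly one outcome, mirroring the guaranteed response.
    """
    outcomes = []
    while sender.output:
        request = sender.output.popleft()
        if receiver.stopped:
            outcomes.append((request, "reject: receiver stopped"))
        elif len(receiver.input) >= receiver.capacity:
            outcomes.append((request, "reject: input queue full"))
        else:
            receiver.input.append(request)
            outcomes.append((request, "accepted"))
    return outcomes
```

In this model a timeout would simply be one more reject outcome, which is the sense in which the caller-side handling does not change.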
There was a blog post introducing the concept of one-way messages, which essentially suggests ways to enable ignoring a response. This post also explicitly mentions that one-way calls can only be used if one does not care about potential failures:
“A one-way call is a call where you don’t care about the response; neither the replied data, nor possible failure conditions.”
So again, in cases where one cares about potential failure modes, one needs to implement an explicit confirmation measure for such calls on the other side. Timeouts do not change anything to that. The blog post also suggests that in an example:
“Maybe you want to add archiving functionality, where the ledger canister streams its data to an archive canister. There, again, instead of using successful responses to confirm receipt, the archive canister can ping the ledger canister with the latest received index directly.”
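The archive example from the blog post can be sketched roughly as follows. This is a plain Python model with hypothetical class and method names (`Ledger`, `Archive`, `on_ping`), not canister code: the ledger keeps resending from the last index the archive has confirmed, and the archive reports how many entries it holds. In the model the return value of `receive` stands in for the separate one-way ping, which may be lost.

```python
class Archive:
    def __init__(self):
        self.entries = []

    def receive(self, start_index, batch):
        """Append a batch only if it lines up with what is already stored."""
        if start_index == len(self.entries):
            self.entries.extend(batch)
        # Either way, report how many entries are held so far (the "ping").
        return len(self.entries)


class Ledger:
    def __init__(self, entries):
        self.entries = list(entries)
        self.archived_upto = 0  # highest index the archive has confirmed

    def next_batch(self, size):
        """Next batch to stream, starting at the last confirmed index."""
        start = self.archived_upto
        return start, self.entries[start:start + size]

    def on_ping(self, latest_received_index):
        # Only ever move forward; stale or duplicate pings are ignored.
        self.archived_upto = max(self.archived_upto, latest_received_index)
```

If a batch or a ping is lost, the ledger simply sends the same range again, and the archive ignores anything that does not line up with its current length, so duplicates are harmless.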
Going back to basics, we also want to note that guaranteed message delivery is impractical for a few reasons:
- It requires infinite resources at the receiving end.
- Even if infinite memory were available, it would lead to arbitrarily high latency.
- Even if a canister would be OK trading off arbitrarily high latency for guaranteed message delivery, there is no way of insulating other canisters sharing the same XNet stream from similarly high latency.
Definitely did not come across as a rant to me, I think we’re all happy with your feedback and it highlighted that we need more discussion, so thanks for raising this.
So as @free explained one-way calls were never reliable. We are thinking about a solution to both get reliability and ensure that you can always update your canister, with named-callbacks. You would still upgrade your canister without stopping (like Joachim also suggests with one-way calls), so upgrading always works. But now that you have valid (named) callbacks, you can actually make things reliable, eg by retrying in the callback if you get an error back. Note that this allows you to build reliable communication, because there is a guarantee that the callback will always be called for every call.
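As a rough illustration of "retrying in the callback", here is a small Python model. Named callbacks are only a proposal in this thread, so everything here is hypothetical: `ReliableSender`, the `transport(payload, callback)` signature and the `("ok", ...)` / `("err", ...)` result tuples are assumptions made for the sketch. The only property borrowed from the protocol is that the callback is invoked exactly once per call, with either a reply or a reject.

```python
class ReliableSender:
    """Retries a call from its own response callback, up to a fixed budget."""

    def __init__(self, transport, max_attempts=3):
        # transport(payload, callback) must invoke callback exactly once with
        # ("ok", value) or ("err", reason); this models the guaranteed response.
        self.transport = transport
        self.max_attempts = max_attempts
        self.attempts = {}    # message id -> number of attempts so far
        self.delivered = []
        self.failed = []

    def send(self, msg_id, payload):
        self.attempts[msg_id] = self.attempts.get(msg_id, 0) + 1
        self.transport(payload, lambda result: self.on_response(msg_id, payload, result))

    def on_response(self, msg_id, payload, result):
        status, _detail = result
        if status == "ok":
            self.delivered.append(msg_id)
        elif self.attempts[msg_id] < self.max_attempts:
            self.send(msg_id, payload)   # retry from within the callback
        else:
            self.failed.append(msg_id)   # budget exhausted: surface the failure
```

The attempt cap is a design choice of the sketch, so that a persistent failure ends in an explicit "failed" state rather than retrying forever.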
I think it is important to distinguish between reliable and probable. In the past, it was more probable that a one-shot message would be delivered under a high load. Now it will for sure die after 5 minutes. That changes the texture of applications that are written using the methodology.
It seems to me that it also creates a potential bandwidth snowball as messages are retried. I’d be interested in an interpretation of how the network will respond here if all those cleared messages are instantly retried after the timeouts…will congestion spawn in other areas? If those calls got backed up because they are 2MB messages, how will the network handle that load? Is clearing out the messages quickly productive if they all just show up again?
Could we request a retry without rebroadcasting all the data in the message again?
How do these timed-out messages get their cycle cost calculated? Who ends up paying for them? Is there any cycle cost?
Just some random questions as I think through this.
This makes perfect sense. 100% guaranteed message delivery does feel impractical from an engineering standpoint, and would definitely end up bogging down the network - either through one-shots or awaited calls. However, it’s common for applications to handle brief spikes in traffic where a queue or the network will temporarily get behind - while it’s convenient for the network to throw away requests, developers who are focused on a top-tier end user experience need to build in excess capacity to handle “reasonably expected” spikes in traffic and to hold onto backlog for a certain amount of time.
Awaiting calls then imposes its own throughput constraints via canister output queue limitations.
Is this live? Are there any implementation examples in Rust or Motoko?
What about the case where awaiting on a call from a 3rd party canister that is stalled?
What is currently implemented is timing out of requests in canister output queues. These messages never made it out of the canister state, so there cannot be any effect on the network.
The worst possible outcome of canisters naively retrying said requests is the canister ending up in the exact same situation: the same messages, with different deadlines, enqueued in the output queue.
Per the above, the payload never left the canister (much less been broadcast). Retrying the request would mean making the exact same request again, with the same payload.
The sender already paid for the request. This could make naive retries very costly.
Note that if you are doing one-shot messages as described in the blog post linked above, then this will result in exactly the same constraints on output queues as with awaiting. This is because for every request, when put into an output queue, there is a slot reserved in the corresponding input queue. This slot can not be used until the reply arrives and is then consumed once the reply arrives. So–if I understand correctly–the approach to one-shot messages from the blog essentially allows canisters to tell the IC they don’t care about the reply but the protocol still waits for the reply and then delivers it. The only difference is that the canister will panic and therefore not change its state when the reply arrives because an invalid callback id is passed upon making the call.
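The slot reservation described above can be pictured with a very small counter model. This is a sketch only; `InputQueueSlots` and its capacity parameter are hypothetical names and the real queue accounting is more involved. It shows why ignoring the reply in canister code does not free capacity any sooner: the slot is held until the reply has arrived and been consumed, regardless of whether the caller cares about it.

```python
class InputQueueSlots:
    """Counts reply slots reserved in a caller's input queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = 0

    def try_send_request(self):
        """Reserve a slot for the eventual reply; refuse if none are left."""
        if self.reserved >= self.capacity:
            return False
        self.reserved += 1
        return True

    def reply_consumed(self):
        """Free the slot once the reply has arrived and been consumed."""
        if self.reserved == 0:
            raise ValueError("no outstanding request")
        self.reserved -= 1
```

Under this model a one-shot caller and an awaiting caller hit the same limit, because both make `try_send_request` fail once `capacity` replies are outstanding.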
As far as I know there were some rough discussions about this, but nothing close to an implementation.
If one does one-shot messages and the 3rd party canister sends back one-shot ack messages this would still work (as also suggested in the blog post). When awaiting I can currently not think of any solution.
Does this solve upgrades, since everything will time out after 5 minutes? Or at least, if you put an if (halt == true) { assert(false) } at the top of all your update functions so that they return quickly, would all outstanding requests eventually time out even if a malicious canister were holding you hostage?
I guess another way, is the limit for being held hostage now 5 minutes per request?
Unfortunately not. Only requests in output queues that didn’t make it into the subnet-to-subnet streams are timed out. This is because as soon as a message is in a stream others have seen it and we don’t know whether they have started processing it at the destination or not. Responses don’t time out at all. Would be an interesting direction to further explore, though.
Another idea that once came up to solve the upgrade problem was to limit the depth of the call graph – so similar to the limit on the number of hops a message can take when being routed through the internet – a request could also define how many “hops” it can make (i.e., how deep the call graph can become). This would prevent a malicious canister from going into an endless loop using self calls as the “hop” limit would be exceeded at some point. One would, however, also have to think very carefully about whether this really solves all problems. But this is also an interesting direction to explore.
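The hop-limit idea can be sketched as a budget that is decremented on every downstream call, similar to a TTL. The function below is a toy model with assumed names (`call`, `graph`, `hops_left`); it is not a proposal for how the protocol would implement it. It only shows that a canister calling itself in a loop runs out of budget after a bounded number of calls.

```python
def call(graph, canister, hops_left, trace=None):
    """Simulate a call into `canister` with a remaining hop budget.

    `graph` maps each canister to the canisters it calls while handling a
    request. Returns the visited canisters in order, with "reject:<name>"
    entries where a downstream call was refused for lack of budget.
    """
    trace = [] if trace is None else trace
    trace.append(canister)
    for callee in graph.get(canister, []):
        if hops_left == 0:
            # Budget exhausted: the downstream call is rejected, not made.
            trace.append(f"reject:{callee}")
            continue
        call(graph, callee, hops_left - 1, trace)
    return trace
```

For example, with `graph = {"A": ["M"], "M": ["M"]}` (M keeps calling itself), `call(graph, "A", 3)` terminates after three hops instead of looping forever.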
Does this intersect with time slicing at all? I’m guessing from the previous response that the answer is no…but wanted to double-check. If the message is accessed then it gets processed even if we get to Nx time-slicing even if Nx is greater than 5 minutes.
Is this an example of this:
Server returned an error:
Code: 400 ()
Body: Specified ingress_expiry not within expected range:
Minimum allowed expiry: 2022-12-08 22:44:43.809049592 UTC
Maximum allowed expiry: 2022-12-08 22:50:13.809049592 UTC
Provided expiry: 2022-12-08 22:44:01.591 UTC
Local replica time: 2022-12-08 22:44:43.809050223 UTC
Run the say function with “45” and wait a good long while:
I was trying to experiment with trying to do some awaits in parallel with futures and figuring out how many I could batch together. With the following code I could do 30 but not 45:
So I tried the first one expecting it to take a long time, but not to fail…maybe this is just the response being rejected by the boundary node for being too slow?
I’m surprised I haven’t seen this before…maybe the proxy only lets requests take a certain amount of time? likely the update function actually would have run?
Update: I ran it with DFX and it eventually ran…I added a tracker var to confirm that the state was being updated: Motoko Playground - DFINITY
So maybe this is just a strange thing with the playground? I have confirmed that even though the playground gives the above error, the update did run and increment the tracking variable.
@derlerd-dfinity1
The complete solution is this:
@Manu @free
Is there someone specific that can make the name-callbacks feature happen?
I am building a system where each user gets a smart contract that can talk with other smart-contracts using the system’s communication-standard. One of the points, is that anyone/company/other-services can write their own smart-contract that implements the communication-standard and be compatible with all of the user-smart-contracts in the original-service, and that user-smart-contracts can be owned and controlled by the users with opt-in upgrades. For now, when calling other canisters, the user-smart-contracts go through a special safe-caller canister that first replies to the user-canister, then awaits the call, then calls a “callback”-method on the user-canister. This makes sure that the user-canister is always upgrade-able, but the final callee in this way does not know the original caller. We don’t want the callee to have to trust the safe-caller(for who the original-caller is) because that would make inconsistencies for who the original caller is, where some canisters trust the safe-caller and some don’t, and some go through it to call other canisters and some don’t. Conclusion: let’s get the name-callbacks feature moving so canisters can make direct calls to each other in a safe way.
As I’ve programmed with one shots it has occurred to me that there is a version of the IC in an alternate universe that ONLY allows one shots and that forces much simpler, safer, and well thought out code.
If we had gotten the panacea of orthogonal persistence that we thought we were going to get, then the headaches of managing async messages would have been unthinkable, but given that IC development is as difficult as it is, I’m not sure that asking users to manage their message flow would have been a heavy lift.
No. As soon as a message leaves an output queue it can no longer time out.
No, this is unrelated. What the error seems to say is that an ingress expiry smaller than the minimum ingress expiry was provided. @chenyan do you know how the error above could happen in the playground?
Yes, this is high on our list. The relevant team for this spent a lot of time on the Bitcoin integration, and we’re about to open up a higher replication subnet to the public (with cost scaled linearly). Another thing we’re working on is ensuring good subnet performance with many (say 100k) canisters, and then we have some heavily requested features that we want to pick up next, like the ability to install larger wasms and the named callbacks for safe upgrades. So it will take some time, but I hope you agree that all the things the team is working on are useful :).
Is it possible to “simply” check the timestamp at the canister level? From what I understand the undefined behavior only happens when the canister receives a response that might call into a function that’s no longer there after the upgrade, and thus calling into some random memory.
Could this not be prevented by checking a timestamp on the response and dropping it if it’s 5min+ old?
This way canister operators could put their canister in “maintenance mode” where outgoing calls are no longer possible, wait 5min (or whatever timeout gets implemented) and then upgrade safely.
edit: alternatively, as a “slot” on the incoming queue is reserved for every outgoing message, could this logic be applied on the queue as well? Check if the reservation was made 5min+ ago, if so drop.
(I’m just asking, I’m not really aware of what such an approach would entail on the technical side)
Note that one can never be sure that a message will time out after 5 minutes. Timeouts are (currently) only applied while a message is in the output queue. However, as long as there is space in the respective subnet-to-subnet stream, messages will be routed into the stream at the end of each execution round and from then on no longer time out.
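To illustrate why a message is not guaranteed to time out, here is a toy model of an output queue feeding a subnet-to-subnet stream. The class, the `end_of_round` method and the order of the two steps (expire first, then route) are assumptions made for this sketch; only the overall behaviour is taken from the explanation above: deadlines apply while a message sits in the output queue, and once it is routed into the stream it no longer times out.

```python
from collections import deque


class OutputQueue:
    def __init__(self, timeout):
        self.timeout = timeout
        self.queue = deque()   # (deadline, message), oldest first

    def enqueue(self, message, now):
        self.queue.append((now + self.timeout, message))

    def end_of_round(self, now, stream_space):
        """Return (routed, timed_out) for one execution round."""
        timed_out = []
        # Messages still in the output queue past their deadline are dropped.
        while self.queue and self.queue[0][0] <= now:
            timed_out.append(self.queue.popleft()[1])
        routed = []
        # Whatever fits is moved into the stream; from here on no timeout applies.
        while self.queue and len(routed) < stream_space:
            routed.append(self.queue.popleft()[1])
        return routed, timed_out
```

In this model, whether a given message times out depends entirely on how much stream space happens to be available before its deadline, which is why the sender cannot rely on a fixed 5-minute bound.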
Right. I guess a better question would have been “is this technically possible / feasible”? Additionally, would it help, or would it break other promises that the protocol is based on?