Messaging guarantees

Thanks for kicking this off @derlerd-dfinity1, and my apologies if this comes off as a bit of a rant. I was caught off guard by the request lifetime/deadlines announcement in the R&D, and it seems that several other developers had a different interpretation of the inter-canister transactional guarantees the IC has been providing.



If a developer wants to interact with a 3rd party canister on the IC, the safest way to do so is through a one-shot call. There are several security issues with inter-canister calls if the canister you are communicating with is not blackholed, DAO controlled, or owned by you.

An example of setting up a canister to lock out any canister that calls it was shown in this simple example by @nomeata, who then proposes one-shot or “fire-and-forget” calls as a good workaround in this blog post
https://www.joachim-breitner.de/blog/789-Zero-downtime_upgrades_of_Internet_Computer_canisters

@JensGroth brought up the security issues related to inter-canister calls in this comment, saying that for this reason (non-responsive canisters) it is recommended to only talk to “trustworthy” canisters

For this reason, many developers reading the forums over the past year have decided to use one-shot calls in their applications to relay messages to other canisters. They assumed that even if the queues were backed up, there was a high probability or even a guarantee that the messages would eventually be delivered, and that delivery would only fail if the receiving canister was deleted or its API was changed.

This all changes if outgoing messages are dropped after 5 minutes. Waits of that length regularly happen when a canister is under heavy load or a subnet gets backed up. Just this past week, I spoke with a developer who filled up their output queue, so they started batching transactions and built a canister load balancer to deal with the traffic and get around the output queue limitations. What seemed at the time like a good way to balance load and work around the output queue limits may have just ended up being a clever way to overwhelm the downstream consuming canister and drop messages just as fast (after 5 minutes of waiting in the queue).
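To make the concern concrete, here is a toy model of a FIFO queue whose entries expire after a deadline. It is only a sketch under my own assumptions, not the IC's actual queue code: `simulate`, its parameters, and the one-second time step are all made up for illustration, and the 300-second value is just the 5-minute figure mentioned above.

```python
from collections import deque

DEADLINE_SECONDS = 300  # assumed: the 5-minute limit discussed above


def simulate(enqueue_per_sec, drain_per_sec, duration_sec, deadline=DEADLINE_SECONDS):
    """Return (delivered, dropped) for a FIFO queue whose entries expire.

    Hypothetical model: time advances in 1-second steps, messages arrive
    and are drained at fixed integer rates.
    """
    queue = deque()  # holds the enqueue time of each pending message
    delivered = dropped = 0
    for now in range(duration_sec):
        # Drop every message that has waited at least `deadline` seconds.
        while queue and now - queue[0] >= deadline:
            queue.popleft()
            dropped += 1
        # New messages arrive.
        queue.extend([now] * enqueue_per_sec)
        # The consumer handles at most `drain_per_sec` messages this second.
        for _ in range(min(drain_per_sec, len(queue))):
            queue.popleft()
            delivered += 1
    return delivered, dropped
```

When arrivals match the drain rate nothing is dropped. When arrivals are twice the drain rate, the wait at the head of the queue keeps growing until it hits the deadline, and from then on the excess is dropped no matter how the sender batches or load-balances upstream.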

However, saying that one-shot calls are unreliable and should not be used means that developers have to trust the canisters they interact with. 99% of the canisters on the IC are mutable and are not DAO controlled. Therefore, any sort of interoperability based on trust is unreasonable, and if developers are unable to rely on one-shot calls, then they have to choose between the security risk of calling 3rd party canisters and the chance that their one-shot call will fail or time out.



It also just so happens that last week I was looking through the IC’s queue code when further looking into queue limitations and found this:

It looks like the queue deadline (message dropping) feature was introduced 4 months ago

At the time, I wasn’t as up to speed on what input and output queues are, but right now this seems like a breaking change that should have gotten more attention. For most smaller applications, this isn’t a huge deal. But for larger applications that interact with many different services (e.g. DSCVR) or process a lot of load (e.g. SNS sales), this means that if the canister queues get behind or the subnet slows down, not just ingress messages but also inter-canister messages can and will be lost.

It’s extremely difficult to reason about transactional guarantees and to implement any sort of async transactional workflow across multiple canisters. Without transactional guarantees, one needs to fully flesh out a saga-type transaction design in order to be able to revert distributed transactions that failed. All of a sudden, building reliable, resilient applications on the IC gets a lot harder :sweat_smile:
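For anyone who hasn't run into the saga pattern before, here is a minimal sketch of the idea: every step is paired with a compensating action, and when a step fails the completed steps are undone in reverse order. All names here (`run_saga`, `debit`, `refund`, `credit`, the `balances` dict) are hypothetical, and plain function calls stand in for inter-canister calls.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []  # compensations for the steps that already succeeded
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Revert in reverse order so the latest step is undone first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


# Hypothetical two-step transfer where the second step never arrives.
balances = {"alice": 10, "bob": 0}


def debit():
    balances["alice"] -= 5


def refund():
    balances["alice"] += 5


def credit():
    raise TimeoutError("message dropped before delivery")


ok = run_saga([(debit, refund), (credit, lambda: None)])
```

The hard part on the IC is what this sketch leaves out: the compensations are themselves inter-canister calls that can be dropped, so a real design would also need retries and persistent state for in-flight sagas.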

Next steps?

Securing inter-canister calls without requiring trust

It might make sense to first think about how the IC can make inter-canister calls to 3rd party services safe for canisters. If there’s a timeout on messages in the output queue, what about a timeout on messages being executed by a canister? We already have DTS, which allows messages to span several rounds of consensus, so maybe a reasonable message should be expected to execute within some threshold, say 2 minutes?

Identifying best practices and patterns for applications for handling high throughput

In 2023 there’s a good chance that we’re going to see the first IC applications go “viral”. This means a lot of traffic, and hopefully dev teams are prepared (know the limitations and boundaries of the IC’s scalability).

There are definitely ways to get around message dropping (implementing distributed queues, for example), but this involves adding more canisters to the architecture, more latency, more chances for failure, and more complexity. Ideally, if these problems can be solved at the protocol level, or if certain SLA guarantees can be given about uptime and message delivery (e.g. five nines or better), then developers can focus on building their app instead of having to worry about message delivery.
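As a rough idea of what such a workaround involves, here is a sketch of an outbox that keeps resending until it sees an acknowledgement, paired with a receiver that deduplicates by message id. Everything in it (`Outbox`, `Receiver`, `make_lossy_send`) is hypothetical and runs in one process. The ack is modeled as a return value; with one-shot calls it would have to come back as a separate message.

```python
class Receiver:
    """Applies each message id at most once, so retries are harmless."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, msg_id, payload):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.applied.append(payload)
        return msg_id  # acknowledgement


class Outbox:
    """Keeps every message until an acknowledgement for it comes back."""

    def __init__(self):
        self.pending = {}
        self.next_id = 0

    def enqueue(self, payload):
        self.pending[self.next_id] = payload
        self.next_id += 1

    def flush(self, send):
        # Iterate over a copy because acked entries are deleted on the fly.
        for msg_id, payload in list(self.pending.items()):
            if send(msg_id, payload) == msg_id:
                del self.pending[msg_id]


def make_lossy_send(receiver):
    """Hypothetical transport that loses the ack on each message's first attempt."""
    attempts = {}

    def send(msg_id, payload):
        attempts[msg_id] = attempts.get(msg_id, 0) + 1
        ack = receiver.handle(msg_id, payload)
        return None if attempts[msg_id] == 1 else ack

    return send


receiver = Receiver()
outbox = Outbox()
outbox.enqueue("a")
outbox.enqueue("b")
send = make_lossy_send(receiver)
outbox.flush(send)  # acks are lost, so both messages stay pending
outbox.flush(send)  # retries are acked and removed; the receiver ignores duplicates
```

Even in this tiny form you can see the cost: extra state on both sides, message ids, and a retry loop that something has to drive.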
