Question about Message Guarantees

I had a question about this concept.

  • Property 6: inter-canister messaging is not reliable.

Every inter-canister call is guaranteed to receive a response, either from the canister, or synthetically produced by the protocol. However, the response does not have to be successful, but can also be a reject response. The reject may come from the called canister, but it may also be generated by the Internet Computer. Such system-generated rejects can occur at any time, and cannot be controlled by canister authors. Thus, it’s important that the calling canister handles reject responses as well. Note that a reject response generally guarantees that the message had not been processed successfully by the called canister.

Suppose you have a three-canister message chain,

i.e. A calls B, which calls C, and A’s final behavior depends on whether C succeeded or failed. Would this be impossible to implement safely?

Because if B’s call to C succeeds but C’s response fails to get back to B (B gets a reject response instead of C’s confirmation that it stored the message), A would receive a failure, right?

The second question, which essentially follows from the first: does rule #6 apply to responses to a call? A response to a call isn’t always guaranteed to get back successfully, right?


Depends on B. B receives the reject response and can then handle it either by trapping as well (returning a CANISTER_ERROR reject to A) or by properly returning some value. A then has to handle that response properly.
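To make the two options for B concrete, here is a toy Python model of the A -> B -> C chain. It is only a sketch: the function names, the `Reply`/`Reject` classes, and the `swallow_rejects` flag are made up for illustration and are not IC APIs.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    value: str


@dataclass
class Reject:
    code: str
    message: str


def canister_c(store_ok: bool):
    """C either stores the message and replies, or the call ends in a reject."""
    if store_ok:
        return Reply("stored")
    return Reject("CANISTER_ERROR", "C trapped")


def canister_b(store_ok: bool, swallow_rejects: bool):
    """B calls C and decides what A gets to see."""
    response = canister_c(store_ok)
    if isinstance(response, Reply):
        return Reply("C said: " + response.value)
    if swallow_rejects:
        # B handles the reject and still returns a proper value to A.
        return Reply("C failed: " + response.message)
    # B traps in its callback, so A sees a reject attributed to B.
    return Reject("CANISTER_ERROR", "B trapped while handling C's reject")


def canister_a(store_ok: bool, swallow_rejects: bool) -> str:
    """A must handle both outcomes of its call to B."""
    response = canister_b(store_ok, swallow_rejects)
    if isinstance(response, Reply):
        return "A got reply: " + response.value
    return "A got reject: " + response.message
```

Either way A learns something about C only through whatever B chooses to report, so what A can conclude depends on B's code.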


A canister’s-response to a call is guaranteed to get back to the canister that made the call.

It is only when calling a canister that the call might not reach the callee-canister (due to network traffic or various reasons). But if the call does reach the callee-canister, then the callee-canister’s-response will always make it back to the caller-canister.

This makes smart-contract programming possible. Let’s say a caller-canister is making a call to the ICP ledger canister to transfer some ICP. The caller-canister needs to know if the transfer is successful, so it is guaranteed that if the ledger does perform the transfer, the caller-canister will receive the successful response with the block height.

It is possible to implement this in a safe way, because of the guarantee that if a canister responds to a call, that response will always make it back to the caller.


A canister’s-response to a call is guaranteed to get back to the canister that made the call.
It is only when calling a canister that the call might not reach the callee-canister (due to network traffic or various reasons). But if the call does reach the callee-canister, then the callee-canister’s-response will always make it back to the caller-canister.

But rule 6 is about a response, verbatim. It’s not talking about the initial call, no? I mean yeah, your logic makes sense and I agree it would be very dumb not to guarantee that a response would occur, but is there any proof, like source code or documentation, backing it?

But let’s assume what you’re saying is true. Say you make a call to a canister, which calls another canister, which calls another, and so on x 100000 (a chain of calls that would take a stupidly long time, like 30 minutes), like

for canister 1

func push_to_canister_n(nat : Nat) : async Nat {
  if (nat == 1000000000000000) {
    return 5;
  } else {
    // canister_<n> stands for the next canister in the chain
    var x = await canister_<n>.push_to_canister_n(nat + 1);
    update_hashmap(x);
  }
}

then, given that the call makes it to the last canister in the chain, canister 1 will eventually return 5 and whatever hashmap is being updated in update_hashmap(x) will update properly? Because if what you’re saying holds true, there would never be an await timeout, given that the call makes it to the last canister in the call chain successfully.

Yea the documentation there is misleading. System-generated rejects can occur at any time before the call reaches the callee-canister. If the call reaches the callee-canister, then the system guarantees that the callee-canister’s response goes back to the caller-canister. I made a pull-request to make it clearer Update rust-canister-development-security-best-practices.md by levifeldman · Pull Request #1693 · dfinity/portal · GitHub.

In your code sample it looks like the canister-#1000000000000000 returns 5 and canister-#1 updates the hashmap and returns nothing.
But yes, if every new call reaches the next canister in the chain (it is not guaranteed that a call will reach the destination canister), then canister-#1000000000000000’s response is guaranteed to propagate back down the chain all the way back to the first canister.

Once the call reaches the callee, there will never be a timeout on getting the callee’s-response back to the caller.

Only if the call takes too long to reach the callee can the system generate a reject response, send it back to the caller after a timeout, and cancel the call.
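The claim in this post can be sketched as a small Python model of the call chain. This is a toy model of the statement as made here, not of the replica: `call_chain` and the `delivered` hook are hypothetical names, and the only failure it models is a request that never reaches the next canister.

```python
def call_chain(n: int, last: int, delivered) -> tuple:
    """Run canister n; it calls canister n+1 unless it is the last one.

    `delivered(k)` says whether the request addressed to canister k reaches it
    (a hypothetical stand-in for the network and the system).
    """
    if n == last:
        return ("reply", 5)
    if not delivered(n + 1):
        # Request never reached canister n+1: the system generates a reject.
        return ("reject", f"request to canister {n + 1} was not delivered")
    kind, payload = call_chain(n + 1, last, delivered)
    if kind == "reject":
        # Pass the failure further back down the chain.
        return (kind, payload)
    return ("reply", payload)
```

With `delivered = lambda k: True` canister 1 ends up with `("reply", 5)`; if any single request is not delivered, every canister below that point sees a reject instead.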


Really appreciate it. Your answers were super helpful.

And oh yea whoops, meant to return the x in the else in the function.

I have replied to your pull request, but I’ll also clarify here: basically every request only ever produces a single response (whether from the system, before delivery; from the canister, upon success; or from the system during the execution of the message or some downstream response).

But while delivery of this unique response is guaranteed by the system, there are many ways in which this is not quite foolproof:

  • There exists no deadline for producing or delivering a response (see canisters that cannot be upgraded because of open call contexts).
  • A reject response may be initiated by the callee (by trapping) or by the system, through no fault of the callee (e.g. if the subnet runs out of memory).
  • Such a reject response may also be produced at any time after the call has started being processed. Including, if the callee makes downstream calls of its own, while handling a downstream response message, after the request message has been successfully processed and any resulting changes persisted.
  • Even if the unique response is guaranteed to be delivered to the caller, the caller may still fail to handle it successfully, whether by trapping or (again) through no fault of its own due to the system (e.g. subnet running out of memory).

In summary:

  1. Request delivery is not guaranteed.
  2. Response delivery is guaranteed and there is always a single response (technically the callee and caller will both “agree” that this is the response).
  3. But the callee may have already persisted all changes (e.g. made a ledger transfer) before it trapped (possibly through no fault of its own).
  4. And the caller may also trap (possibly through no fault of its own) while processing the response.

I believe that points (1), (3) and (4) are why the developer docs claim that “inter-canister messaging is not reliable”. It is definitely not what you get from e.g. Ethereum, where a whole call tree executes atomically. By ACID standards, IC calls are definitely not atomic and (when traps or system errors are involved) not even consistent (canisters A and B may well end up with different views of the world).
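Point (3) is easier to see with a toy model in which every message execution is atomic on its own, but a call that awaits a downstream response spans two executions. The Python sketch below is an illustration under that assumption; `Callee`, `Trap`, and `handle_request` are invented names, not IC APIs.

```python
import copy


class Trap(Exception):
    """Stands in for a canister trap; the current message execution is rolled back."""


class Callee:
    def __init__(self):
        self.state = {"balance": 0, "log": []}

    def _execute(self, step) -> bool:
        """Run one message execution atomically: commit on success, roll back on trap."""
        snapshot = copy.deepcopy(self.state)
        try:
            step()
            return True
        except Trap:
            self.state = snapshot
            return False

    def handle_request(self, amount: int, callback_traps: bool) -> str:
        # Execution 1: the request handler up to its first await. Committed.
        def before_await():
            self.state["balance"] += amount

        if not self._execute(before_await):
            return "reject"

        # ... a downstream call happens here and its response arrives later ...

        # Execution 2: the callback handling the downstream response.
        def callback():
            self.state["log"].append("credited")
            if callback_traps:
                raise Trap()

        if not self._execute(callback):
            # The caller sees a reject, yet execution 1 stays committed.
            return "reject"
        return "reply"
```

Calling `Callee().handle_request(10, callback_traps=True)` returns `"reject"` while the balance is already 10: the caller and the callee now disagree about whether the operation happened.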


Such a reject response may also be produced at any time after the call has started being processed. Including, if the callee makes downstream calls of its own, while handling a downstream response message, after the request message has been successfully processed and any resulting changes persisted.
3. But the callee may have already persisted all changes (e.g. made a ledger transfer) before it trapped (possibly through no fault of its own).

I’m assuming these go together. So you’re saying it is possible to stop a response from chaining back in a scenario like this, by adding a thrown exception after the await? But if the thrown error were not there, is the response guaranteed to get back to the initial canister?

func push_to_canister_n(nat : Nat) : async Nat {
  if (nat == 1000000000000000) {
    return 5;
  } else {
    var x = await canister_<n>.push_to_canister_n(nat + 1);
    update_hashmap(x);

    // pseudocode: if this is canister #15, throw an error here
    // Therefore canisters 16 -> 1000000 update, while canisters <= 15 aren’t able to process the returns of the other canisters???
    return x;
  }
}

Now for my second question.

Even if the unique response is guaranteed to be delivered to the caller, the caller may still fail to handle it successfully, whether by trapping or (again) through no fault of its own due to the system (e.g. subnet running out of memory).
4. And the caller may also trap (possibly through no fault of its own) while processing the response.

If the calling canister can trap after receiving the response through no fault of its own (i.e. not because of cycles, not because of canister memory, not because of an error by the canister developer, but because of something outside the developer’s control), isn’t the end result basically the same as if the callee canister had finished processing a message successfully and the response had failed to make it back? End-result-wise, that contradicts Levi’s statement quoted below. I know it doesn’t contradict it literally, but I think you get my point. Or are you saying the caller will for sure process the response if received, but can trap randomly afterwards?

Yea the documentation there is misleading. System-generated rejects can occur at any time before the call reaches the callee-canister. If the call reaches the callee-canister, then the system guarantees that the callee-canister’s response goes back to the caller-canister. I made a pull-request to make it clearer: Update rust-canister-development-security-best-practices.md by levifeldman

This does contradict my statement. My statement is that if the canister’s code is correct and the canister settings are set correctly, then if the callee processes a call and returns a response, the system guarantees that the callee’s response goes back to the caller. @free is saying that some things outside the developer’s control can happen to either canister, through no fault of its own, that can make it trap.

@free, the canister controller can set the memory_allocation field in the canister settings to be sure that the subnet reserves enough memory space for the canister to function. If the canister settings are not set correctly, it is the fault of the canister. Are there specific things outside the developer’s control that can happen that cause a canister to trap through no fault of its own?

A functioning service on the IC must have a way to guarantee that if the configuration of the canister code and settings is correct, then responses will be delivered, and the canister will not trap randomly. It is not possible to build something that talks to ledger canisters and transfers tokens and transfers cycles if callbacks can randomly trap because of things outside the developers control.

There are many potential situations under which a message execution (in this case the handling of the response) could fail and get rolled back. A subnet is a replicated virtual machine. You may get failures which are specific to one replica (e.g. a memory or disk error). Those would not lead to the message failing. But you also have myriads of ways in which execution could fail deterministically across all replicas.

Running out of memory is the obvious one. And, to be honest, most canisters don’t have explicit memory reservations; so when the subnet runs out of memory, virtually all calls will run out of memory and trap. But it could also fail due to instruction limits, running out of cycles or a stopped/stopping canister. Or trap due to a stack overflow or division by zero. Or, due to bugs in either the replica or the canister.

Even in the absence of traps and aborts, you would have issues due to request delivery and/or processing not being guaranteed. E.g. imagine that your canister makes 2 downstream calls (whether sequential or parallel), with only one of them succeeding; after which (e.g. due to running out of cycles or some canisters getting uninstalled) neither call can be retried or rolled back.

Regardless of how and why message execution may fail, my point is that there are many ways in which two canisters may end up disagreeing on the state of the world even given guaranteed responses. This is different from an (inherently much more limited) atomic all-or-nothing model, such as that of Ethereum.

As soon as you have even one minor, exceedingly unlikely way in which canister B may process a request but canister A may fail to process the response, you no longer have guarantees. Only a high likelihood of success.

Not true. People have been building highly reliable systems from quite unreliable components for a long time now. You just need to program defensively, retry, use two-phase commits, etc.

Not all applications will require that level of reliability, so you should only care about this if e.g. you’re building a DEX.

Just to clarify: so a subnet running out of memory is considered a random failure that I cannot do anything about as a developer? The other examples seem to be controllable by developers: 1. you don’t allow your program to exceed the instruction limit in any synchronous block of execution; 2. if you control every canister in an async call, you can ensure you don’t uninstall any, and ensure they are topped up every time by deploying a pay-per-transaction model; 3. etc.

But speaking of the bigger picture:
I think what is out of the developer’s control is what really needs to be known. Because if I do every single thing possible to ensure that callee B can successfully process a message from A and I still randomly receive a failure, that is a problem: there are applications (as you said, a DEX) in which the caller needs to be able to track a successful response from a callee if the callee processed the message correctly.

An example: I call the ICP token ledger to transfer 10 tokens to you, and it successfully applies the transfer in its own data structures and then notifies me. If both the developer of the ICP ledger and I do everything in our control on the developer end (no traps, topped-up canisters, messages that aren’t too long, etc.) and it still fails, DeFi would be impossible to implement outside of a single-canister model.

Let me take a step back. Leaving aside the specific failure modes, there is a fundamental difference between e.g. Ethereum’s atomic execution model (where, barring any bugs, either everything succeeds or nothing does) and the IC’s asynchronous execution model (where you have independent canisters; processing every call across multiple, fallible atomic executions; running across multiple virtual machines).

These virtual machines have hard limits, beyond which they must (and are specifically allowed to) start failing message executions. Canisters also have hard limits that may cause them to either trap or at least refuse to execute messages. Guaranteed response delivery necessarily implies (quite literally) arbitrarily long delivery times. (E.g. you are guaranteed to get a response from a canister, but it is entirely reasonable for this response to be delivered years later, if ever.)

Under such a system there cannot be an end-to-end 100% guarantee that a transaction involving multiple parties will complete correctly or even just complete (as you’d get with a naive atomic approach, a la Ethereum). The best you can do (if you ensure you reserve enough memory; you never use recursive calls; you only ever call trusted, bug-free canisters; someone is always keeping an eye on subnet health and making adjustments as necessary) is something like 99.99% likelihood of correct behavior. In practice, in my experience this means close to perfect behavior under good weather and large flare-ups in failures under heavy load or other adverse conditions.

Given the above, I would rather code defensively (and rely on known, proven solutions – timeouts, retries, idempotent operations, two-phase commit, etc.) than try to obsessively tick as many boxes (reserved memory and compute; foolproof cycle provisioning; only ever calling my own or system canisters, etc.) and hope that nothing ever fails even though almost everything except response delivery is explicitly allowed to fail.
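As a rough illustration of the retry plus idempotent operation combination mentioned above, here is a minimal Python sketch. It assumes the caller picks a unique request id and the ledger remembers which ids it has already applied; `Ledger`, `transfer_with_retries`, and the `response_lost` hook are hypothetical and do not reflect the interface of any real ledger canister.

```python
class Ledger:
    """Toy ledger that deduplicates transfers by a caller-chosen request id."""

    def __init__(self):
        self.balances = {}
        self.done = {}  # request_id -> block index of the applied transfer

    def transfer(self, request_id: str, to: str, amount: int) -> int:
        if request_id in self.done:
            # Already applied: return the earlier result instead of paying twice.
            return self.done[request_id]
        self.balances[to] = self.balances.get(to, 0) + amount
        self.done[request_id] = len(self.done)
        return self.done[request_id]


def transfer_with_retries(ledger, request_id, to, amount, response_lost, max_attempts=5):
    """Retry until an outcome is observed.

    `response_lost(attempt)` is a hypothetical hook meaning the caller failed
    to observe the outcome of that attempt (e.g. it trapped in its callback).
    """
    for attempt in range(max_attempts):
        block = ledger.transfer(request_id, to, amount)
        if not response_lost(attempt):
            return block
    raise RuntimeError("gave up; outcome unknown, reconcile later")
```

Because the retry reuses the same request id, a transfer whose outcome was missed twice is still applied exactly once. The caller does not have to know whether the earlier attempts went through.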

Not true. You can have idempotent operations, implement rollbacks, two-phase commit, etc. People have been successfully doing this for decades.

What I find dangerous is the position that “I’m doing everything safely and correctly so, if you just do everything safely and correctly on your end, the guarantees that the system provides mean that we should never bother to implement basic safety features such as idempotent operations, write ahead logs or transactionality”. It gives people the false impression that (since no safety nets are built in) they are in fact never needed “if you just do everything right” (whatever that means). And it necessarily results in fragile systems and eventual cascade failures (even assuming that everyone is doing a perfect job writing and deploying their canisters, which we all know is impossible).


This confirms my original statement, that response delivery is guaranteed by the system. (if there are no bugs in the replica, and if the node machines are functioning)

Now, about canister code taking into account possible replica bugs:

  1. Thank you for mentioning the value of making logs. I made a proposal here for logging each canister message: its cause, caller, arguments, and result. This can be a great help for checking the correctness of the canister state, and can also help to fix the canister state in case of a trap rollback.

  2. How far should a canister coder go to ensure replica bugs don’t harm the canister’s state? What happens if the replica incorrectly checks the caller’s signature? Should the canister also ask for the caller’s signature on the request in case there is a bug in the replica? Should a canister take into account that an outgoing message can reach the wrong canister because of a bug in the replica? Should a canister take into account that a wasm instruction like 2+2 can come out to equal 5 because of a bug in the replica? Is it dangerous to rely on a 2+2 operation being equal to 4? Is it dangerous to rely on the protocol delivering a call to the correct intended callee? There are many more samples of this. How far does it go?

  3. A quick check of the ckBTC canisters looks like they bank/rely on this response delivery guarantee for the correctness of the ckBTC canisters’ code. This line here in the ckBTC-minter canister mints the ckBTC for the user based on the received btc-utxos and marks the utxos as minted once a successful response comes back. If the ckBTC ledger canister mints the ckBTC but then the minter-canister receives an error response back (or worse doesn’t receive a response at all), the minter-canister will not mark those utxos as minted and will mint them again next time the update_balance method is called. The minter is not using idempotency when calling the ckbtc-ledger. It can be great if we can get a confirmation by the ckBTC team if they wrote the canisters to rely on response delivery or if they intended to use multiple phase commits, response timeouts, and idempotent operations on every receiving and outgoing inter-canister call.


That being said, you are free to believe in the infallibility of the protocol. Even though there are clearly specified cases in which the system is explicitly allowed to fail to execute a message. And no trace of any statement to the effect of “these are the only cases in which message execution will ever fail”. On a platform that is constantly evolving.

Looking at it a bit differently, this response delivery guarantee is not a guarantee that you will get any response. It is merely a guarantee that you will never get a bogus response.

If I can make a (possibly unsavory) analogy, if you are the canister and I am the system, then the guarantee you are getting from me is that I will never poison you. And (as I interpret it) the conclusion you draw from this is that, because I will never poison you, if you just eat right, exercise and never leave the house, you will never get sick and will live happily ever after.

  1. A quick check of the ckBTC canisters looks like they bank/rely on this response delivery guarantee for the correctness of the ckBTC canisters’ code. This line here in the ckBTC-minter canister mints the ckBTC for the user based on the received btc-utxos and marks the utxos as minted once a successful response comes back. If the ckBTC ledger canister mints the ckBTC but then the minter-canister receives an error response back (or worse doesn’t receive a response at all), the minter-canister will not mark those utxos as minted and will mint them again next time the update_balance method is called. The minter is not using idempotency when calling the ckbtc-ledger. It can be great if we can get a confirmation by the ckBTC team if they wrote the canisters to rely on response delivery or if they intended to use multiple phase commits, response timeouts, and idempotent operations on every receiving and outgoing inter-canister call.

But what is being pointed out in this statement is that this is production-ready code (audited, I think?), yet its authors themselves assume that if a reject response is received from the callee canister, they can process it as if the callee canister (the ckBTC canister) did not mint the transaction.

So if the minting canister (the caller) can receive a reject from the callee even though the callee minted/processed the tx, then this should be brought up to the team, right? Even a 0.0001% chance of a double spend is a problem.

What I find dangerous is the position that “I’m doing everything safely and correctly so, if you just do everything safely and correctly on your end, the guarantees that the system provides mean that we should never bother to implement basic safety features such as idempotent operations, write ahead logs or transactionality”. It gives people the false impression that (since no safety nets are built in) they are in fact never needed “if you just do everything right” (whatever that means). And it necessarily results in fragile systems and eventual cascade failures (even assuming that everyone is doing a perfect job writing and deploying their canisters, which we all know is impossible).

But what you’re saying is similar to saying that all developers on ETH and BTC should prepare for a 51% attack.
With Bitcoin and ETH, we do have certain guarantees: e.g. with Bitcoin we are guaranteed no double spend. With ETH, I can guarantee that if I divide by 0 ten lines down a block of code, every single line above that will be rolled back. If a 51% attack happened, someone could change that. But if that’s the case, we consider the system faulty. No developer prepares for a double spend on Bitcoin. The 0.00001% case in which something faulty occurs invalidates the system entirely.

That’s what I am asking with my original question. If the caller receives a reject while the callee successfully updated itself (i.e. the callee updated a balance in its own hashmap), then do we consider the system faulty? There is no need to prepare for the failure of a guarantee that, if it failed, would invalidate the system as a smart contract platform. Sure, making logs, rolling back, etc. would work, but one big thing that smart contract platforms sell themselves on, especially in the financial world, is immutability and trustworthiness, at the very minimum with respect to double-spending bugs.

If you’re saying rollbacks, logs, etc. are a must, then for anything financial where we may want certain guarantees, you’re essentially saying we’re better off going to one of those EVM-style L1s, and that ICP is just another cloud compute platform but with better decentralization and governance than AWS, GCP, etc. You’re basically saying you cannot guarantee no double spend (outside of a self-contained, single-canister DeFi smart contract).


Yes, that is pretty much what I’m saying.

Do note a couple of points, though:

  1. I’m saying this as myself. I’m not claiming this is DFINITY’s official position on the matter. (As partially illustrated by the ckBTC and ICP ledger code.)
  2. The ckBTC and ICP canisters run on system subnets. These subnets e.g. have much tighter limits on message backlogs and they have A LOT more memory headroom than the average busy application subnet. They only run next to other known and trusted canisters (among other things, there are assumptions regarding these other canisters not making (many) cross-subnet calls and not abusing the CPU). And they don’t have to pay cycles, so they cannot run out of cycles.

This being said, I am not going to argue my position anymore. I believe it is wrong to piece together a bunch of disjoint low-level guarantees and/or expectations to construct an overarching high-level guarantee. Because it is fundamentally different from a top-down, atomic, all-or-nothing guarantee (as you would get from Ethereum).

And, as you point out, you can get the exact same guarantees as Ethereum (and quite a bit more throughput) if you limit yourself to a single canister (and treat everything outside of it as fallible).


What you’ve said makes sense and I was only searching for the truth to make a safer decision on my code implementations. I appreciate the discussion and will definitely keep in mind your points when developing.


That assumption is correct. If you receive a reject response from the ckBTC canister then the mint did not happen.

This cannot happen.

Yes, in that case the system would be faulty. It would be a protocol failure.

But note that you have to be precise about the definition of what a “reject response” is. There are 5 reject codes described in the spec here. 4 of them mean and guarantee that the callee did not update. 1 of them refers to the voluntary reject of the callee (aka throw). It is called CANISTER_REJECT and has reject code 4. The callee can intentionally update its state and send this reject response. So in summary, as the caller, if you get reject code 1,2,3,5 then the system guarantees you that the callee didn’t update (i.e. effectively didn’t see the message) and if you get reject code 4 then you rely on the callee’s code to have done the right thing.
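The rule in this post can be written down as a small lookup. The sketch below uses the reject code names from the IC interface specification for codes 1 to 5; the helper `must_trust_callee` is a made-up name, and it encodes only the rule as stated in this post, not an independent guarantee.

```python
from enum import IntEnum


class RejectCode(IntEnum):
    SYS_FATAL = 1
    SYS_TRANSIENT = 2
    DESTINATION_INVALID = 3
    CANISTER_REJECT = 4
    CANISTER_ERROR = 5


def must_trust_callee(code: RejectCode) -> bool:
    """True only for the voluntary reject (code 4), per the rule stated in this post.

    For every other code the post says the callee did not update its state.
    """
    return code == RejectCode.CANISTER_REJECT
```

A caller could branch on this when deciding whether a retry is safe, e.g. retry directly on codes 1, 2, 3 and 5 but consult the callee's documented behavior on code 4.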

In the case of the ckBTC canister of course the ckBTC canister wouldn’t mint something and then respond with CANISTER_REJECT.


This cannot happen as per the messaging guarantees.

No, I suppose the minter is conservatively programmed. It locks a given utxo before calling the ckBTC canister to mint and only unlocks the utxo if a reject response comes back. Hence, if no response comes back (or it is indefinitely delayed), the utxo would just stay locked forever. So the worst-case scenario is that a user doesn’t get their minted tokens, not that something gets double-minted.
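The lock-before-call pattern described here can be sketched in a few lines of Python. This is a toy model based only on the description in this thread, not the actual ckBTC minter code; `Minter`, `update_balance`, and the `ledger_mint` hook are illustrative names, and moving a utxo from locked to minted on a successful reply is an assumption of the sketch.

```python
class Minter:
    """Toy version of the lock-then-call pattern described above."""

    def __init__(self):
        self.locked = set()
        self.minted = set()

    def update_balance(self, utxo: str, ledger_mint) -> str:
        if utxo in self.locked or utxo in self.minted:
            return "skipped"  # never issue a second mint for this utxo
        self.locked.add(utxo)  # lock BEFORE the inter-canister call
        outcome = ledger_mint(utxo)  # "reply", "reject", or None (no response yet)
        if outcome == "reply":
            self.locked.discard(utxo)
            self.minted.add(utxo)
            return "minted"
        if outcome == "reject":
            self.locked.discard(utxo)  # reject taken to mean no mint happened
            return "retry later"
        return "pending"  # utxo stays locked, so it cannot be minted twice
```

If the response never arrives, the utxo stays in `locked` and later calls for it are skipped, which matches the worst case described above: tokens not received rather than tokens minted twice.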