Scalable Messaging Model

A subnet running out of memory because input/output queues fill up will not cause a canister to trap mid-execution. It would only cause synchronous error results on outgoing calls, which the canister can handle without trapping, and it would prevent the canister from receiving incoming calls at its method entry points; but again, it would not cause the canister to trap mid-execution through something outside the canister’s control.

A canister (or a canister author/maintainer) can make sure that there are always enough cycles, and can check with the canister_balance API before trying to send cycles or create a canister with cycles. This is still within the canister’s control.
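For illustration, a minimal Rust sketch of that kind of balance check before attaching cycles to a call (assuming the ic-cdk crate; the target method name, the amount, and the safety margin are hypothetical):

```rust
use candid::Principal;
use ic_cdk::api::call::call_with_payment128;

const CYCLES_TO_ATTACH: u128 = 1_000_000_000; // hypothetical amount to send
const SAFETY_MARGIN: u128 = 10_000_000_000;   // hypothetical reserve kept back

#[ic_cdk::update]
async fn send_with_cycles(target: Principal) -> Result<(), String> {
    // Check the canister's own cycles balance before attaching cycles to a call.
    if ic_cdk::api::canister_balance128() < CYCLES_TO_ATTACH + SAFETY_MARGIN {
        return Err("not enough cycles to make this call safely".to_string());
    }
    // Attach the cycles to an inter-canister call; "receive_cycles" is hypothetical.
    call_with_payment128::<(), ()>(target, "receive_cycles", (), CYCLES_TO_ATTACH)
        .await
        .map_err(|(code, msg)| format!("call rejected: {:?} {}", code, msg))
}
```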

I think I know what is going on here. Even though there are no specific cases where a canister could trap mid-execution because of something outside the canister’s control, you would still hold the assumption that it could happen. It looks to me like this is because there is no claim to the contrary anywhere in the protocol spec, so, as the good engineer that you are, you assume the worst case. In this case, however, that assumption is not held by DFINITY as a whole, as the ckBTC minter canister’s code shows (discussed in the last thread): the ckBTC minter does rely on the guarantee that a canister cannot trap mid-execution because of something outside its control, and thereby relies on the guaranteed delivery of the ckBTC ledger’s response to make sure it mints ckBTC 1:1 for BTC.

If you are now designing the scalable messaging model around the assumption that a canister could trap mid-execution because of something outside its control, and therefore around the assumption that current responses are not guaranteed (because if a canister can trap mid-execution through something outside its control, that trap can happen while handling a response, making it as if the response never reached the canister), then this would worsen the fragmentation even more. It is good that this underlying question is now being brought to the surface so we can settle the matter and create harmony between the canister implementors (the ckBTC canister implementers) and the replica implementers (the execution team). The one location that is the authority on the contract between the canister implementors and the replica implementers is the IC interface specification.

Hi @bjoern and @mraszyk, does the protocol guarantee that a canister will not trap mid-execution because of something outside the canister’s or canister maintainer’s control? Or does the protocol allow for the possibility of a canister trapping mid-execution because of something outside the canister’s control?

There is a big difference between these two statements, with far-reaching consequences in different directions depending on which one is correct.

Once the correct statement is chosen, it fits into the interface-spec as the single and final authoritative source of the contract between the replica implementers and the canister implementers.

2 Likes

A canister can trap in the middle of its message execution at least in the following cases:

  • the canister runs out of Wasm heap memory (4 GB): the canister traps whenever the Wasm code tries to allocate more heap memory
  • the subnet runs out of memory available for canisters: the canister traps whenever the Wasm code tries to allocate more heap memory (although here one can argue that this can be mitigated with an explicit memory allocation)
  • it runs out of instructions (although here one can argue that, using the performance counter system API, the canister can detect that this is approaching; see the sketch below)
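As a rough illustration of the last point, a hedged Rust sketch of checking the performance counter and bailing out before hitting the instruction limit (the soft limit constant and the `expensive_work` helper are hypothetical; the actual per-message limit depends on the message type and current protocol parameters):

```rust
// Hypothetical soft budget, well below the per-message instruction limit.
const INSTRUCTION_SOFT_LIMIT: u64 = 4_000_000_000;

#[ic_cdk::update]
fn process_batch(items: Vec<u64>) -> u64 {
    let mut processed = 0;
    for item in items {
        // performance_counter(0) returns the instructions used so far in this call.
        if ic_cdk::api::performance_counter(0) > INSTRUCTION_SOFT_LIMIT {
            // Stop early instead of running into the hard limit and trapping;
            // the caller can resubmit the remaining items.
            break;
        }
        expensive_work(item);
        processed += 1;
    }
    processed
}

fn expensive_work(_item: u64) {
    // placeholder for the real per-item work
}
```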
4 Likes

Since the conversation was specifically about responses: when a message is sent out, cycles are reserved for processing the response, so that even if the canister is stopping or frozen, the response will still get processed.

I am also very interested in how a response could fail because I have been designing all my protocols around the assumption that they won’t. It seems to me what remains, outside the developer’s control, is this one:

It might be hard to exclude the possibility that memory has to grow during the response. Even the tiniest temporary heap allocation can cause it. Most likely, the runtime will already make an allocation when parsing the value carried by the response. For practical purposes at least this is outside the developer’s control. Hence, an explicit memory allocation with a large enough margin seems to be the only way to mitigate it.
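For concreteness, a sketch of how a controller canister might reserve an explicit memory allocation for a canister it controls, using ic-cdk's management canister bindings (module paths and whether `CanisterSettings` implements `Default` vary between ic-cdk versions; the 8 GiB figure is an arbitrary example):

```rust
use candid::{Nat, Principal};
use ic_cdk::api::management_canister::main::{
    update_settings, CanisterSettings, UpdateSettingsArgument,
};

// Reserve an explicit memory allocation for a canister we control, so that
// growing its Wasm heap later cannot fail because the subnet ran out of memory.
async fn reserve_memory(canister_id: Principal) -> Result<(), String> {
    let settings = CanisterSettings {
        // 8 GiB is an arbitrary example; pick a margin large enough for your canister.
        memory_allocation: Some(Nat::from(8u64 * 1024 * 1024 * 1024)),
        // Leave every other setting unchanged (assumes CanisterSettings: Default
        // in the ic-cdk version in use).
        ..Default::default()
    };
    update_settings(UpdateSettingsArgument { canister_id, settings })
        .await
        .map_err(|(code, msg)| format!("update_settings failed: {:?} {}", code, msg))
}
```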

1 Like

OK, I understand the scenario and problem now but am not convinced about the argument. I agree that if the failure is asynchronous then I have more options. The advantage comes from the fact that an asynchronous failure takes time, so it functions like a “timer” to reschedule a subsequent attempt at a later time, one that can also keep the call context open (something a real timer cannot). By retrying multiple times, spaced apart in time, I increase the chances of success, so I increase the chance that my synchronous API can report success. But I cannot retry indefinitely, i.e. my synchronous API cannot guarantee success. It still has to account for the possibility of failure. I can reduce the frequency of failure and thus improve the average user experience.

But from an API design standpoint I still have to account for the possibility of the failure that you describe, where I get stuck with a half-committed transaction and have to exit the call context in that state. I end up having to provide a solution just like I have to with synchronous failures, for example putting unfinished tasks in a backlog and having a way to trigger processing the backlog in a different call context. In summary, I understand that I may be able to improve the average user experience, but I don’t see a difference in terms of implementation complexity. I don’t follow the argument that this new approach lets me “provide a straightforward synchronous API” where the old approach doesn’t.
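To make the backlog idea concrete, a minimal Rust sketch of parking a half-committed transaction and reprocessing it in a separate call context (all names are hypothetical; this is just one way to structure it):

```rust
use std::cell::RefCell;

// Hypothetical record of a transaction whose second step still has to run.
#[derive(Clone)]
struct PendingTx {
    id: u64,
    // ...whatever is needed to roll the transaction forward...
}

thread_local! {
    static BACKLOG: RefCell<Vec<PendingTx>> = RefCell::new(Vec::new());
}

// Called from the original call context when the second step fails.
fn park(tx: PendingTx) {
    BACKLOG.with(|b| b.borrow_mut().push(tx));
}

// A separate entry point (it could also be driven by a timer) that retries
// the parked transactions in a fresh call context.
#[ic_cdk::update]
async fn process_backlog() {
    let pending: Vec<PendingTx> = BACKLOG.with(|b| b.borrow_mut().drain(..).collect());
    for tx in pending {
        if roll_forward(&tx).await.is_err() {
            // Still failing: put it back and try again later.
            park(tx);
        }
    }
}

async fn roll_forward(_tx: &PendingTx) -> Result<(), String> {
    // hypothetical: re-attempt the second step of the transaction
    Ok(())
}
```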

1 Like

Following up on what Timo and Martin have already said: if you consider the “wasm-level execution,” a canister does not “just trap” in the middle of an execution due to random external influences. So it is technically possible to build a canister that does not trap. But it is still hard, for the reasons Alin, Martin, and Timo have already stated; the canister has to be very precise in managing its resources (memory, cycles per execution, cycles balance) as well as carefully consider which APIs are safe to call. Practically, hardly any canister would meet these criteria, and thus building the canister defensively, so it does not rely on this strong consistency, seems a commendable approach.

3 Likes

I like this option best because I think it’s quite a nice practical compromise. In many cases, just by looking at the callee’s interface (e.g., its .did file), the caller can already know that the callee will produce small responses in the success case. For example, the ICRC-1 and the ICP ledger return just a number when the call succeeds. In theory they may return a larger response in case of a failure, as they can return strings, but at that point the caller likely doesn’t care so much why the request failed. This way the caller doesn’t have to wait for the callee to upgrade before they can start using small guaranteed responses. And note that this doesn’t preclude further user-level aids where the callee could explicitly indicate maximum response sizes in its Candid definitions, and where Motoko and the CDKs could do some compile-time safety checks based on that.

If the callee tries to produce a larger response despite its promise, the system would still generate a synthetic reject response, so in a way it wouldn’t be globally unique. And assuming that we let the callee trap, it’s a more draconian failure mode than a soft failure of converting the response. But granted, it’s a violation of a promise.

2 Likes

Thing is, if you take that approach (and IMHO anything dealing with non-trivial amounts of tokens/cycles should), the API provided by your canister is now by definition asynchronous: on the happy path it is “make a call, get a response”; but if there’s a non-zero chance that it can return an “oops, transaction still in progress” response, then for the caller it is not materially different from the possibility of getting a timeout response (maybe with a bit more context, such as a transaction ID).
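In other words, the API surface ends up looking something like this hypothetical Rust sketch, where the caller has to handle an in-progress outcome in any case:

```rust
use candid::{CandidType, Deserialize};

// Hypothetical result type for a token-moving API: even on the "synchronous"
// path, the caller has to be prepared for an in-progress outcome.
#[derive(CandidType, Deserialize)]
enum TransferOutcome {
    // Happy path: the transfer fully completed within this call.
    Completed { block_index: u64 },
    // The transfer was started but its fate is not yet known; the caller must
    // poll (or wait for a notification) using this id, which is materially the
    // same situation as receiving a timeout response.
    InProgress { tx_id: u64 },
    // Definitely failed; safe to retry.
    Failed { reason: String },
}
```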

I guess my point is that the existing messaging model and the newly proposed additions all live in a gray area and (IMHO, not speaking for anyone else) it is wrong to expand any guarantee actually provided by the spec/system (such as the existing response delivery guarantee, or the proposed call quotas) beyond its very narrow, very specific definition. Specifically, I wouldn’t expand the response delivery guarantee into a “consistency across canisters” guarantee; just as I wouldn’t interpret the call quotas as guaranteeing correct and consistent fully synchronous APIs.

1 Like

Since there appears to be general agreement that the additions we suggested to the messaging model are likely to be useful, and the discussion mostly consisted of questions and the pros and cons of specific design alternatives, we will be submitting an NNS motion proposal soon (aiming for this Friday). The text of the proposal will be based on the post at the top of this thread and will link back to this thread for further discussion.

We very much appreciate all the feedback we have received so far and look forward to continued community input.

3 Likes

Thanks, guys, for your responses and clarifications; knowing the protocol specifics for a fact is a great help.

I think a big point here is that this fact puts both the control and the responsibility into the hands of the canister at the same time: it is up to the canister to make sure it handles its messages without a trap.

I for sure do not want to expand or assume beyond the guarantees. Is it correct to say for a fact (for the current messaging guarantees that exist at the time of this post) that it is the canister’s responsibility, and within the canister’s control, to make sure it does not trap while handling a response, and that it is the protocol’s responsibility to deliver (and only deliver) the correct globally unique response?

1 Like

I have a question related to the proposal for best-effort messages. They (a) introduce a timeout response and (b) do not guarantee uniqueness (the callee may respond while the caller sees a timeout). I haven’t thought through examples to see the usefulness. But my question is: what about an approach with timeouts but with uniqueness, so that the caller can always tell whether the callee has processed the message or not? If there is congestion and a timeout happens, no matter where, then at least the responses are very small because they only contain the timeout error, so hopefully they would be easy to deliver.

Has this been considered? Or is it too close to the current model? In the current model, messages can time out in the caller’s output queue but not later, so their return time is unbounded, which can be a problem for applications.

1 Like

The protocol provides a hard guarantee regarding response delivery. And it provides a hard guarantee regarding atomically persisting changes resulting from executing said response if the canister does not trap while executing it.

But there is no explicit guarantee to the effect of “it is within the canister’s control to make sure it does not trap while handling a response”. I suppose that it is a design goal to avoid to the extent possible trapping while handling responses (it is e.g. why cycles are set aside for executing the response as part of making a call). But the spec is just not detailed enough to be able to guarantee it. (Nor should it, IMHO, as that level of detail would inevitably prevent meaningful and necessary protocol development.)

So while I’m not 100% sure about this, it may even be possible to ensure that your canister never traps if it executes on the current replica version. But it is just as possible that a replica change that is entirely within spec and otherwise harmless (e.g. increasing the instruction cost of something or other) will push your carefully managed and tuned canister over the limit and cause it to trap. With new replica versions released once a week.

To put it differently, this is an assumption you are free to make when designing your dapp (just as, to pick a random example, payment providers implement a transaction deduplication window of N days, under the assumption that all transactions will complete within less than N days). But it is not a guarantee in the same way that response delivery is a guarantee.

1 Like

We have considered something similar, namely, within the context of guaranteed response delivery, allowing callers to set a deadline on the whole call tree resulting from a call; and when that deadline was exceeded, to synchronously fail any downstream calls and force the call tree to unroll from the bottom up. On paper it sounded like a really nice idea, but it would basically guarantee that multipart transactions would end up in an inconsistent state (the same “first part committed, second part failed” scenario as above, but with a hard guarantee that you could neither roll back nor forward). And it still wouldn’t provide a meaningful time bound, since an arbitrarily deep tree would have to be rolled up completely, with a message scheduled and executed for every call in said tree.

Now thinking of your suggestion, the time bound guarantee would be even weaker: as long as the request was not yet delivered, you would get a timeout response guaranteeing non-delivery within however long it would take to deliver that response cross-subnet. But as soon as the request was delivered and started execution, all bets are off. Which might still be useful within the context of a scenario where you control both caller and callee (or at least trust the callee and know its worst-case behavior). But not very useful outside of that very specific use case.

Thinking about it some more, if you control both caller and callee, you may even design a protocol of your own with deadlines attached to calls; and if the timeout was the same for all calls targeting a given callee, you would get behavior that was very similar to the system timing out your request while still in-flight.
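For instance, a hedged Rust sketch of such an application-level deadline (the request type is hypothetical; `ic_cdk::api::time()` returns nanoseconds since the Unix epoch):

```rust
use candid::{CandidType, Deserialize};

#[derive(CandidType, Deserialize)]
struct Request {
    // Deadline chosen by the caller, in nanoseconds since the Unix epoch.
    deadline_ns: u64,
    payload: Vec<u8>,
}

// Callee side: refuse to start work once the caller's deadline has passed,
// so the caller can treat "deadline exceeded" much like a timeout it set itself.
#[ic_cdk::update]
fn handle_request(req: Request) -> Result<(), String> {
    if ic_cdk::api::time() > req.deadline_ns {
        return Err("deadline exceeded; request not processed".to_string());
    }
    // ...process req.payload...
    Ok(())
}
```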

1 Like

@bjoern is your above statement a guarantee of the spec? Can you write down in the spec whether it is or isn’t within the canister’s control to make sure it does not trap while handling a response?

I do not want to make any assumptions. I want to build on the facts of the protocol specification. If there is no explicit guarantee, then can there be an explicit non-guarantee written down in the spec?

Can we get a table like this (for one canister):

Values are in messages/sec.
Limits measured by calling a simple function at (B) that returns something right away.

| | Before | Jan 2024 | with proposed changes |
| --- | --- | --- | --- |
| one way A → B | ? | ? | ? |
| one way A → B (diff subnet) | 27 m/s | ? | ? |
| A → B | ? | ? | ? |
| A → B (composite query) | ? | ? | ? |
| A → B (diff subnet) | ? | ? | ? |
| A → any | ? | ? | ? |
| A → any (diff subnet) | ? | ? | ? |
| …add more… | ? | ? | ? |

We conducted a test ~40 days ago and it showed something like 27 m/s for “one way A → B (diff subnet)”. It was throwing errors when new calls were made: “one way call limit reached”.

3 Likes

The changes to the messaging model are not about increasing throughput. Taking this to the extreme, if we consider that introducing new message types will require adding one or two more fields (guaranteed/best-effort flag; deadline) to message headers, one could actually argue that there will be a minute reduction in raw throughput.

The point of best-effort messages is (to allow for) improvements in fairness and latency, while guaranteeing time bounds for responses and per-canister quotas, and reducing costs (or at least keeping them flat). Small guaranteed-response messages only address the latter two issues (guaranteed quota and low cost).

E.g. assume a large user-facing application consisting of multiple canisters (e.g. a sharded ledger or document storage) under very high load. With guaranteed responses your canisters’ queues may be clogged with responses going back and forth long after everyone has given up refreshing their browsers. Whereas if your canisters were using best-effort messages, much of that backlog would disappear within tens of seconds or minutes, and part of the user traffic would still make it through. Or simply consider calling out to untrusted canisters. Or ensuring timely upgrades. And so on and so forth.

Looking at your specific test, even though it’s not entirely clear from your description, I believe that you ran into the backpressure limit: we allow up to 500 in-flight calls between a pair of canisters, beyond which any new calls will fail synchronously. Otherwise you may end up in the situation where canister A produces a request to canister B every second (e.g. from a heartbeat) and canister B needs 2 seconds to produce a response (or, equivalently, the system requires 2 rounds to deliver one response because they are huge); the resulting backlog will grow indefinitely, especially when using guaranteed-response calls. We have no plans of increasing that 500 in-flight call limit.
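For completeness, a hedged Rust sketch of what hitting that limit can look like from the caller’s side (assuming ic-cdk; the method name is hypothetical, and treating the full-queue case as a transient rejection is an assumption here):

```rust
use candid::Principal;
use ic_cdk::api::call::RejectionCode;

// Fire off a call to canister B; if the output queue toward B is full,
// the call fails synchronously with a rejection instead of being queued
// indefinitely.
async fn call_b(b: Principal) -> Result<(), String> {
    match ic_cdk::call::<(), ()>(b, "do_something", ()).await {
        Ok(()) => Ok(()),
        Err((RejectionCode::SysTransient, msg)) => {
            // Typical backpressure case: too many in-flight calls to B.
            // Back off and retry later (e.g. from a timer) instead of
            // hammering the queue from a heartbeat.
            Err(format!("backpressure, retry later: {}", msg))
        }
        Err((code, msg)) => Err(format!("call failed: {:?} {}", code, msg)),
    }
}
```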

Edit: We do have ideas about how to increase XNet throughput (by using a side channel to transfer data between subnets and only including payload hashes in blocks), but that is not part of this proposal. And it is unfortunately not a trivial addition.

3 Likes

The spec describes when and why a wasm module can trap. So yes, it is a guarantee of the spec that, on the wasm level, canisters do not “just trap.” Note that depending on things like the CDK used, it may not be trivial to translate the same guarantees to the higher levels.

3 Likes

Awesome, thank you for that clarification :pray:.

So now that we know it is guaranteed in the spec that it is within the canister’s control to make sure it does not trap while handling a response, and we know that, with the current messaging guarantees, the replica makes sure to deliver the one single globally unique response, my current feedback is that I will not be switching to any new message types that remove or change these facts.

1 Like

Note that the proposed changes are completely independent of whether or not it is possible to write a canister that avoids trapping while handling a message. If anything, the new message types make it slightly easier to ensure that a canister doesn’t trap in certain situations.

Also putting it explicitly here, just in case people are worried: there is no intention to change anything with respect to the spec’ed behavior of canisters in terms of (not) “just trapping” randomly. I think the above discussion is very interesting because it points out a couple of cases where it is far from trivial to actually ensure not trapping in practice (even with a spec-conformant replica implementation), but I’d suggest keeping the two discussions separate from each other.

3 Likes