From reading this thread, it seems there will be a change that allows developers to choose between Best-Effort Messages and Small Messages with Guaranteed Responses.
That post also mentions:
The long term goal of messaging on the IC is to ensure that canisters can rely on canister-to-canister messaging, irrespective of subnet load and with reasonable operational costs.
This goal is impossible to achieve with the current messaging model;
So does that mean, in practice, something like this is possible:
try {
  let #ok(response) = await canisterB.getMyResponse() else return;
  savedResponses := response; // Never reached, because neither #ok nor #err was ever sent back
} catch (error) {
  // error never caught
}
If canister B executes some code successfully and tries sending the #ok response back, but canister A never gets it, that’s when things get very tricky for a developer to handle: if you call it a second time, you might get an error like “response already sent” and you will never know what the result of the first call was.
Does anyone have any knowledge, experiences or thoughts on this?
This has not been released yet. When it is, the pattern to follow is to produce a unique request ID that can be sent along with a second request. The other canister can then drop the retry if it has already processed that request ID.
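Very roughly, on the callee side (a Motoko sketch; processed, getMyResponse and the Text request IDs are placeholder names, not any released API):

import HashMap "mo:base/HashMap";
import Text "mo:base/Text";
import Result "mo:base/Result";

actor CanisterB {
  // Outcomes of already-processed requests, keyed by the caller-chosen
  // request ID. (Not stable across upgrades; a real version would also
  // evict old entries.)
  let processed = HashMap.HashMap<Text, Result.Result<Nat, Text>>(16, Text.equal, Text.hash);

  public func getMyResponse(requestId : Text) : async Result.Result<Nat, Text> {
    switch (processed.get(requestId)) {
      // A retry of a request we already handled: return the stored
      // outcome instead of doing the work a second time.
      case (?stored) { return stored };
      case null {};
    };
    let result : Result.Result<Nat, Text> = #ok(42); // stand-in for the real work
    processed.put(requestId, result);
    result
  };
}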
The current system guarantees a response to the point that, if you call a malicious canister and they decide to just hold on to the request, you can’t restart your canister because it will wait for the response. You can time out if your request stays in the outgoing queue for too long (5 minutes, I think), but once it is gone you are going to get a response or die trying.
So there is also a scenario where, let’s say, canister B executed some code successfully and tried sending the response back, but maybe it’s lagging and it hit a timeout in its outgoing queue.
So canister A never got anything in this scenario?
Well… in that scenario it wouldn’t be a response. If you send a request out, you cannot shut down your canister until you are delivered a response. If B is delayed, you just kind of chill until B is ready to respond. You can process other requests, but you can’t start/stop your canister.
The scenario you describe will not happen with the existing guaranteed responses. But, as Austin says, it is possible for canister B (if it has enough cycles) to just spin forever (making downstream calls of its own) and to never produce a response. Which will make your canister impossible to stop. Also, even after canister B has produced a response, while there’s a guarantee that the response will be delivered, there is no guarantee regarding how long it takes to deliver said response.
Leaving aside the fact that the IC’s messaging layer guarantees response delivery, what you describe is something that virtually every single distributed system has to deal with: there is no such thing as guaranteed message delivery unless you’re willing to wait arbitrarily long. And a system that takes arbitrarily long to do even the simplest thing is not a very useful system. Just how likely would you be to use this forum if every 10th interaction took a few hours?
Also, this is exactly how ingress messages work: you send an ingress message; it is executed and the response certified; you have exactly 5 minutes to query for its status. If your internet connection goes down for 5+ minutes, you will never find out what the response was. If your canister does not provide an alternative way of learning the outcome, then you have not built a very useful application.
That would be an actively hostile API, unless there was some side channel to query for the status of your request (such as a ledger you can scan for your transaction). It’s a bit (or very much) like having an endpoint that replies with “I won’t tell you the outcome” on every single call.
Could you confirm for clarity, possibly with simple yes or no answers? I’m not debating whether the new messaging system is good or bad—I just need to be sure of how it works.
Does that mean it will happen with the new messaging system?
Does this mean that each canister endpoint must independently manage request idempotency? And should each endpoint offer a way to check the outcome if no response is received?
@dfxjesse Is every endpoint you use to manage neurons providing these?
Currently no. I am dealing with a scenario like the example I showed earlier, except with spawning neurons. The code looks something like this:
try {
  let #ok({ created_neuron_id }) = await neuron.spawnMaturity() else return;
  lastSpawn := created_neuron_id; // Never reached, because neither #ok nor #err was ever sent back
} catch (error) {
  // error never caught
}
If the code is executed correctly on the NNS but I never get a response, I can go in and check the list of spawning neurons, but I can’t call the endpoint again to see what the last spawn was, nor can I pass any request ID to see the result of the last request.
I don’t think many (if any?) canisters provide this luxury and we just have to jump through extra hoops to make sure things are robust right now.
Is it possible you’re getting an error? This code doesn’t handle the error case and will just return… I don’t mean an error in processing, but a Candid #Err(“some message”) response.
Yeah, the example is missing the #Err case, but what we want is to also handle the case where something goes wrong and we don’t get a response.
These things need to be 100% working or we will have to upgrade the canister with hotfixes if something goes wrong and tokens/neurons go missing inside the canister.
My understanding is that this should be impossible with the current model. If your request times out in the outgoing message pool, you’ll get that lovely Ingress error message we’ve seen so much of on the BoB subnet.
Yes. Best-effort calls are explicitly intended to (among other things) prevent canisters from hanging forever if the callee doesn’t (want to) respond. This also means that there will be situations where the callee does respond, but only after the call has timed out (and a reject response was produced and handled) on the caller’s side.
Do note, however, that the new messaging system will not replace the existing one. Best-effort calls will be an extra option that can be separately selected or not on each call.
There are ways other than idempotency to deal with best-effort message delivery. And you don’t need separate handling of each endpoint. For the most part, if you are e.g. implementing a ledger, a single endpoint to query / scan the ledger should be sufficient to support any and all operations: just set an expiration time on every transaction; then scan the ledger until just after said expiration time; if your transaction isn’t there, it didn’t happen.
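As a rough sketch of that scan pattern in Motoko (the Tx type and the scan endpoint are made up for illustration, not any real ledger API):

import Time "mo:base/Time";
import Array "mo:base/Array";
import Option "mo:base/Option";

actor Watcher {
  // Hypothetical ledger interface: returns all transactions up to `upTo`.
  type Tx = { memo : Nat64; timestamp : Time.Time };
  let ledger : actor { scan : shared query (upTo : Time.Time) -> async [Tx] } =
    actor ("aaaaa-aa"); // placeholder principal

  // null = too early to decide; ?true / ?false = definitive outcome.
  public func didItHappen(memo : Nat64, expiresAt : Time.Time) : async ?Bool {
    if (Time.now() <= expiresAt) { return null }; // window still open
    let txs = await ledger.scan(expiresAt);
    // Past the expiration, absence from the ledger means "definitely failed".
    ?(Option.isSome(Array.find<Tx>(txs, func(t : Tx) : Bool { t.memo == memo })))
  };
}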
Or just pick any of dozens of mechanisms that distributed systems have been using to ensure consistency across components.
Or, you can stick with guaranteed response calls. They’re not going away. But do be aware of their shortcomings: a call may “never” (in practical terms) complete; they are expensive in terms of system resources (your canister needs to reserve memory for the largest possible response for the guarantee to be an actual guarantee); and while the response is guaranteed (because you’ve reserved the resources for it) a new request isn’t (because you may not be able to reserve resources for the response and because request delivery cannot possibly be guaranteed).
That is exactly right. All calls are guaranteed to get a response.
Right now (and in the future, unless you change your canister code to explicitly ask for a best-effort call) you will be getting “the” response. I.e. every guaranteed response call has one single response (whether that’s generated by the callee; or by the system because it was unable to deliver your request); and the caller will eventually receive that unique response. There is no such thing as “not getting a response”.
With best-effort calls, you will still always get a response, but it may not be “the one”. Instead, if your specified timeout expires or the subnet is under heavy load you may get a “no idea what happened with your call” reject response instead of the real thing. Which needs to be handled differently from either a canister response or a reject response stating “your request was definitely not delivered because X” (or even the more gray-area “canister trapped” reject response).
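In Motoko, that classification might look something like the sketch below. The callee and its doWork endpoint are placeholders, and the exact error code that a timed-out best-effort call surfaces depends on the Motoko version, so the catch-all case is an assumption:

import Error "mo:base/Error";
import Debug "mo:base/Debug";

actor Caller {
  let canisterB : actor { doWork : shared () -> async () } =
    actor ("aaaaa-aa"); // placeholder principal and endpoint

  public func callAndClassify() : async () {
    try {
      await canisterB.doWork();
      // We received the callee's actual reply: safe to record success.
    } catch (e) {
      switch (Error.code(e)) {
        // The callee was reached and explicitly rejected or trapped.
        case (#canister_reject or #canister_error) {
          Debug.print("definitely failed: " # Error.message(e));
        };
        // The destination doesn't exist: the request was never delivered.
        case (#destination_invalid) {
          Debug.print("not delivered, safe to retry");
        };
        // Everything else, including the "no idea what happened" reject a
        // timed-out best-effort call can produce: outcome unknown, so query
        // the callee's state before retrying.
        case _ {
          Debug.print("unknown outcome: " # Error.message(e));
        };
      };
    };
  };
}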
It’s what we are doing with our ledger middleware even now.
Being able to select which messaging system we want to use is a great option to have, since some API endpoints of swaps and other token smart contracts don’t have deduplication or logs.
Expiration time, I guess, will be part of the message options, because there isn’t an expire_at field in icrc1_transfer.
Right, but expiration time doesn’t give you a hard guarantee: even if a subnet will drop a request as soon as it expires (which is how message expiration will work), it’s still possible that the request started processing before it timed out, but continued after (because of DTS or because of downstream calls). So there is no upper limit for how late your transaction might have completed. (I.e. you can’t safely say “if it’s not in the ledger by time T, then it has definitely failed”.)
We are right now using deduplication (memo + created_at), which guarantees there won’t be double spending during a 24h window, so we can send as many transactions as we want and only one should go through. If a transaction from our queue is not sent during the 24h window, we adjust the created_at so it gets retried more. But when we switch to the new window, if the previous transaction is still going through but lagging behind for hours, then we will double spend. I guess we can just set the created_at to be back in time, ~5 min before the window ends; then we will essentially have something like expire_at. Or we just send the transaction once every ~24 hours, but I think that’s too much. So we will have to put created_at back in time if we want something like a 30 min expiration.
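Something like this for the back-dating arithmetic (a sketch; the 24h dedup window and the 30 min target are assumptions about this particular ledger):

import Time "mo:base/Time";
import Int "mo:base/Int";
import Nat64 "mo:base/Nat64";

// Assumed 24h ledger dedup window; back-dating created_at_time leaves only
// ~30 minutes of it, so a lagging retry is rejected as too old after that.
let DEDUP_WINDOW_NS : Int = 24 * 60 * 60 * 1_000_000_000;
let DESIRED_EXPIRY_NS : Int = 30 * 60 * 1_000_000_000;

func backdatedCreatedAt() : Nat64 {
  // e.g. pass this as created_at_time (nanoseconds) in icrc1_transfer
  Nat64.fromNat(Int.abs(Time.now() - (DEDUP_WINDOW_NS - DESIRED_EXPIRY_NS)))
};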
Note to self: when designing API endpoints, requests should have request_id and expires_at.