We’ve been trying to execute several methods within the canisters, but I’m encountering continuous timeouts with every execution on mainnet. As an example, I’ve attached screenshots of a simple request in the user_index canister, where I attempt to fetch the wasm_module from GitHub and place it within the user_index canister for use. During this test execution, no logs or prints were received from the user_index canister, and the same issue occurs with other requests.
It’s worth noting that when running everything locally, it works correctly. We’ve been seeing these errors since last Friday. Everything worked correctly before; we even did a demo with the DFINITY team on Wednesday, and the code wasn’t changed at all between those days.
I’ve also attached the code for the USER_INDEX canister and logs for your review. I appreciate your assistance and look forward to any guidance you may provide.
You seem to be calling registerWasmArray as an inter-canister call on several different canisters.
To troubleshoot, have you tried calling the registerWasmArray function on each canister individually (not as an inter-canister call) to see if they time out?
I tried making individual calls, but they produce the same error.
After running several tests I realized that the failure is caused by the HTTP request to GitHub that fetches the wasm module.
I tried executing the method on other canisters and found that it only works for requests whose responses are smaller in bytes; otherwise the method seems to execute several times on its own and then times out.
Thank you for the insight. After reviewing the http_service canister, it looks like you are attempting to get the WASM module from a GitHub API using HTTPS Outcalls.
What failure are you referring to? Do you get a specific error?
Have you tried to call the API directly without using HTTPS Outcalls? Do you run into an error?
Specifically, the error I see is that the code is executed multiple times right at the point where the outcall request is attempted. However, I managed to find a solution:
The problem is that the “getWasmModule” method was declared in a separate module called “ic_management_interface”. This method is called from the “user_index” actor, and when it runs it executes multiple times and produces the timeout error (I suspect this is an idempotence issue, with no consensus being reached between the replicated requests).
To solve it, I moved all the code of the “getWasmModule” method into the “user_index” canister itself, and the spontaneous multiple executions that caused the timeout no longer occur.
-before
/// register wasm module to dynamic users canister, only admin can run it
public shared({ caller }) func registerWasmArray(): async() {
  _callValidation(caller);

  // register wasm
  wasm_module := await IC_MANAGEMENT.getWasmModule(#users("users"));

  // update deployed canisters
  for ({ canister_id } in usersDirectory.vals()) {
    await IC_MANAGEMENT.ic.install_code({
      arg = to_candid();
      wasm_module;
      mode = #upgrade;
      canister_id;
    });
  };
};
-after
/// register wasm module to dynamic users canister, only admin can run it
public shared({ caller }) func registerWasmArray(): async() {
  _callValidation(caller);

  let wasmModule = await HTTP.canister.get({
    url = "https://raw.githubusercontent.com/Cero-Trade/CeroTrade-IREC-LATAM/" # T.githubBranch() # "/wasm_modules/users.json";
    port = null;
    uid = null;
    headers = []
  });

  let parts = Text.split(Text.replace(Text.replace(wasmModule, #char '[', ""), #char ']', ""), #char ',');
  let wasm_array = Array.map<Text, Nat>(Iter.toArray(parts), func(part) {
    switch (Nat.fromText(part)) {
      case null 0;
      case (?n) n;
    }
  });
  let nums8 : [Nat8] = Array.map<Nat, Nat8>(wasm_array, Nat8.fromNat);

  // register wasm
  wasm_module := Blob.fromArray(nums8);

  // update deployed canisters
  for ({ canister_id } in usersDirectory.vals()) {
    await IC_MANAGEMENT.ic.install_code({
      arg = to_candid();
      wasm_module;
      mode = #upgrade;
      canister_id;
    });
  };
};
I thought the same thing too, but I find it curious that some requests worked and others didn’t. Then I noticed that the only difference between the other requests and this one is that here many prints were executed at the same time, as if there were multiple calls where there shouldn’t be. So I think the real problem was this concurrency.
Maybe. My canisters had a bout of about 2 hours of the same error, but with different code, and then they became responsive again. If it’s working now, and you roll your code back and it continues to work, then it’s probably the subnet. There are a number of other posts in the forums that reference the same error message, so I’m betting it’s related to the subnet.
It seems that when I revert the code to how it was before, it still times out; it is only fixed by applying the modifications I described above.
After many tests I managed to gather some interesting information. First of all, the error started because an HTTP outcall request to GitHub was executed without handling idempotence; for some reason this causes the method in question to execute in a loop, for example:
actor {
  public func anyFunction() : async () {
    // SomeModule is the separate module (e.g. ic_management_interface) imported by the actor
    await SomeModule.concurrentFunction();
  };
};

module {
  public func concurrentFunction() : async () {
    let _ = await HTTP.canister.get({
      url = "someurl";
      port = null;
      headers = []
    });

    Debug.print("this will be printed in a loop if idempotence is not managed through a proxy");
  };
};
As I mentioned before, the solution is to use some proxy to handle the repeated requests and avoid the loop being triggered in the Motoko module.
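For context on what such a “proxy” actually has to solve: HTTPS outcalls are executed by every replica in the subnet, and the responses must be made byte-for-byte identical before consensus can be reached, which is what the transform function of the management canister’s http_request is for (it typically strips the response headers). Below is a minimal sketch of that pattern calling the management canister directly rather than through the HTTP.canister wrapper used above; the fetchWasm helper, the cycle amount, and the 2 MB response cap are illustrative assumptions, not code from our project.

import Blob "mo:base/Blob";
import Cycles "mo:base/ExperimentalCycles";

actor {
  // Types matching the management canister's HTTPS-outcall interface
  type HttpHeader = { name : Text; value : Text };
  type HttpResponsePayload = {
    status : Nat;
    headers : [HttpHeader];
    body : [Nat8];
  };
  type TransformArgs = {
    response : HttpResponsePayload;
    context : Blob;
  };
  type HttpRequestArgs = {
    url : Text;
    max_response_bytes : ?Nat64;
    headers : [HttpHeader];
    body : ?[Nat8];
    method : { #get; #head; #post };
    transform : ?{
      function : shared query TransformArgs -> async HttpResponsePayload;
      context : Blob;
    };
  };

  let ic : actor { http_request : HttpRequestArgs -> async HttpResponsePayload } = actor ("aaaaa-aa");

  // Strips response headers so every replica sees an identical response
  // and the subnet can reach consensus on the outcall result.
  public query func transformResponse(args : TransformArgs) : async HttpResponsePayload {
    { status = args.response.status; headers = []; body = args.response.body }
  };

  // Hypothetical helper for illustration only, not taken from our codebase.
  public func fetchWasm(url : Text) : async [Nat8] {
    // Rough cycle attachment for the outcall; adjust to your subnet and response size.
    // On older Motoko compilers this is written Cycles.add(...).
    Cycles.add<system>(230_000_000_000);

    let response = await ic.http_request({
      url;
      max_response_bytes = ?(2_000_000 : Nat64); // bound the response size up front
      headers = [];
      body = null;
      method = #get;
      transform = ?{ function = transformResponse; context = Blob.fromArray([]) };
    });

    response.body
  };
};

With a transform like this in place, the replicas should agree on the response instead of the call failing (or appearing to re-execute); bounding max_response_bytes also keeps large payloads like a wasm module from destabilizing the canister.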
Now, once the function falls into a loop, the canister keeps an endless process running which, depending on the size in bytes of the response, destabilizes the entire canister. At that point any method you execute times out every time.
My canisters had this problem. To resolve it, I had to stop them with the command dfx canister stop <CanisterID> --network ic and wait several hours for it to take effect. Once a canister was stopped, we could modify whatever code was causing the loop. I tried to find a way to stop the canisters faster, but I couldn’t find one that forces them to stop.
On the other hand, we currently make only one HTTP request in a given period. This applies to all implemented methods, with the intention of reducing waiting times and costs. We are currently able to deploy the canisters without any timeout issues; I was testing today and have not encountered any more so far.
As part of our investigation, I will detail below a step-by-step account of the timeout errors we found during testing, which we attribute to network latency:
When trying to register a user we got this error message
This error doesn’t indicate that your canister is unresponsive. This CycleOps error actually indicates that someone else has added this same canister ID to their account. Are you monitoring this canister in a different CycleOps account (individual or team account)? For blackhole monitoring we only allow a single account to monitor a specific canister once you’ve added our blackhole as a controller and verified ownership of the canister, in order to ensure data privacy.
If you want to transfer canisters between your accounts (from individual to team accounts), you can use canister projects and CycleOps teams to transfer canisters between accounts and preserve historical metric data.
Alternatively, you can delete the canister from the previous account and then add it to the new account (although the historical metric data for that canister will not carry over in this case).
That seems very strange; I have only 2 Internet Identities, and neither of them has registered this canister. However, thank you for the response! The platform seems to be running OK. The trouble seemed to come from the network, as my partner described, and functions took a long time to run at some point.