This evening at around 1AM UTC (October 4th), our timer canister on subnet 6pbhf-qzpdk-kuqbr-pklfa-5ehhf-jfjps-zsj6q-57nrl-kzhpd-mu7hc-vae was delayed by 23 minutes.
For reference, this is one of the subnets listed in "Subnets with heavy compute load: what can you do now & next steps".
Log of events
October 4th, 1:02AM UTC
I noticed that the timer canister was delayed when a regularly scheduled process didn't kick off at its usual time.
1:05AM UTC
I then attempted to call into the timer canister, but received SysTransient rejections and timeout errors:
% dfx canister --ic call topup_timer healthCheck
Error: Failed update call.
Caused by: Failed update call.
The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0xa477e20fcded503a4004a8531f18ddddf4fae340b37eaf8b60c42821d5eafcf6 timed out waiting to start executing., error code None
% dfx canister --ic call topup_timer healthCheck
Error: Failed update call.
Caused by: Failed update call.
The request timed out.
Our timer canister calls into our main canister (on a different subnet), and I could see that the main canister had not yet received a call from the timer canister: no canister logs had accumulated on the main canister, and execution of the regularly scheduled process had not started.
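(For reference, checking for accumulated logs can be done with dfx's canister logs command; main_canister below is a placeholder for our main canister's name, and at this point it showed no new entries.)
% dfx canister --ic logs main_canister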
1:08AM UTC
I then attempted to manually trigger execution of the process on our main canister and received the same error.
% dfx canister --ic call ...(manual api on canister)
Error: Failed update call.
Caused by: Failed update call.
The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0x34323590b6cea47215e3fc098ef9eefde3a4c0629de28c9bfa99b9164a96cb66 timed out waiting to start executing., error code None
I then stopped and restarted our main canister, and then tried manually running the process once more, only to get the same error.
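The restart was just the usual dfx stop/start sequence, along the lines of (main_canister again being a placeholder for our main canister's name):
% dfx canister --ic stop main_canister
% dfx canister --ic start main_canister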
At this point I was totally baffled at what was happening.
~1:12-1:15AM UTC
Then I tried stopping and restarting the timer canister as well, but wasn't able to stop it, as the stop request also timed out:
% dfx canister --ic stop topup_timer
Stopping code for canister topup_timer, with canister_id uekfe-taaaa-aaaal-aca7a-cai
Error: Failed to stop canister uekfe-taaaa-aaaal-aca7a-cai.
Caused by: Failed to stop canister uekfe-taaaa-aaaal-aca7a-cai.
Failed to call update function 'stop_canister' regarding canister 'uekfe-taaaa-aaaal-aca7a-cai'.
Update call (without wallet) failed.
The replica returned a rejection error: reject code SysTransient, reject message Stop canister request timed out, error code None
1:23AM UTC
Confused about what could be happening, I was about to deploy a rollback to our main canister (in case this was related to a code update) when I checked the main canister once more and noticed that its job execution had just started. It took a few minutes and then completed successfully.
1:30AM UTC
Next, I went back to the topup timer canister and found that it was stuck in a stopping state. It stayed in that state for roughly 30 minutes.
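The stuck state shows up in dfx's status output, roughly like this (trimmed to the relevant line; exact formatting varies between dfx versions):
% dfx canister --ic status topup_timer
...
Status: Stopping
...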
2:05AM UTC
The topup timer canister finally stopped, and I was able to restart it without any issues.
What next?
At this point I'm considering a few options, ranging from simply letting the timer be slow periodically to spinning up this specific timer canister service on a different subnet (not difficult). But I also don't want to abandon ship on the current canister if this was just a spike and the load will come down soon.
Ideally, I’d prefer solutions that don’t involve bumping compute allocation.
Is this mainly related to the recent subnet load, and should it be resolved by next week's scheduler improvement? Or is it worth it for developer teams to spend some time future-proofing timers and expected computations (e.g. by periodically bumping the canister's compute allocation before a scheduled task needs to run and then bringing it back down afterwards, as sketched below)?
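For concreteness, that "cycle the allocation up and down" idea would look roughly like this with plain dfx (the value of 5 is an arbitrary placeholder; keep in mind that a non-zero compute allocation is charged for as long as it is set):
% dfx canister --ic update-settings topup_timer --compute-allocation 5
# ...let the scheduled task run...
% dfx canister --ic update-settings topup_timer --compute-allocation 0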
Other error information from the past day (potentially unrelated?)
For most of the past day, our external monitoring of our service has been hitting ingress limits, resulting in regular 400 errors, demonstrating that either the IC or the boundary nodes are experiencing serious load. We run a health check of our canisters once an hour. In the 15 hours between October 3rd 1PM UTC and October 4th 2AM UTC, we received errors from at least one canister in 11/15 (73%) of the hourly checks. The errors look like this:
Error: Server returned an error:
Code: 400 (Bad Request)
Body: Specified ingress_expiry not within expected range: Minimum allowed expiry: 2024-10-04 01:12:02.849247150 UTC, Maximum allowed expiry: 2024-10-04 01:17:32.849247150 UTC, Provided expiry: 2024-10-04 01:11:23.794 UTC
Note that the provided expiry is already earlier than the minimum allowed expiry, i.e. the request was treated as having expired by the time it was validated.