Timer canister delayed by 23 minutes, then stuck in a stopping state for ~30 minutes

This evening at around 1AM UTC (October 4th), our timer canister on subnet 6pbhf-qzpdk-kuqbr-pklfa-5ehhf-jfjps-zsj6q-57nrl-kzhpd-mu7hc-vae was delayed by 23 minutes.

For reference, this is one of the subnets listed in the thread “Subnets with heavy compute load: what can you do now & next steps”.

Log of events

October 4th, 1:02AM UTC
I noticed that the timer canister was delayed when a regularly scheduled process didn’t kick off at its normally scheduled time.

1:05AM UTC
I then attempted to call into the timer canister, but received SysTransient and timeout errors:

% dfx canister --ic call topup_timer healthCheck
Error: Failed update call.
Caused by: Failed update call.
  The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0xa477e20fcded503a4004a8531f18ddddf4fae340b37eaf8b60c42821d5eafcf6 timed out waiting to start executing., error code None
% dfx canister --ic call topup_timer healthCheck
Error: Failed update call.
Caused by: Failed update call.
  The request timed out.

Our timer canister calls into our main canister (on a different subnet). I noticed that our main canister had not yet received a call from the timer canister, that no canister logs had accumulated on it, and that execution of the regularly scheduled process had not yet started.
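
(For context, the shape of that setup is roughly the following. This is a simplified sketch; the interval, method name, and principal are placeholders rather than our actual code.)

import Timer "mo:base/Timer";

actor TopupTimer {
  // Placeholder interface and principal for the main canister on the other subnet.
  let main = actor "ryjl3-tyaaa-aaaaa-aaaba-cai" : actor {
    runScheduledJob : () -> async ();
  };

  // Recurring cross-subnet call. The callback only runs when this canister
  // gets scheduled, so heavy subnet load shows up as a delayed kickoff.
  func kickOff() : async () {
    await main.runScheduledJob();
  };

  ignore Timer.recurringTimer<system>(#seconds (60 * 60), kickOff);
}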

1:08AM UTC
I then attempted to manually trigger execution of the process on our main canister and received the same error.

% dfx canister --ic call ...(manual api on canister)
Error: Failed update call.
Caused by: Failed update call.
  The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0x34323590b6cea47215e3fc098ef9eefde3a4c0629de28c9bfa99b9164a96cb66 timed out waiting to start executing., error code None

I then stopped and restarted our main canister, then tried manually running the process once more, only to get the same error :thinking:

At this point I was totally baffled about what was happening :thinking:

~1:12-1:15 AM
Then I tried stopping/restarting the timer canister, but wasn’t able to stop it, as I received the same kind of timeout as before:

% dfx canister --ic stop topup_timer     
Stopping code for canister topup_timer, with canister_id uekfe-taaaa-aaaal-aca7a-cai
Error: Failed to stop canister uekfe-taaaa-aaaal-aca7a-cai.
Caused by: Failed to stop canister uekfe-taaaa-aaaal-aca7a-cai.
  Failed to call update function 'stop_canister' regarding canister 'uekfe-taaaa-aaaal-aca7a-cai'.
    Update call (without wallet) failed.
      The replica returned a rejection error: reject code SysTransient, reject message Stop canister request timed out, error code None

1:23AM
Confused about what could be happening, I was about to deploy a rollback to our main canister (in case this was related to a code update) when I checked the main canister once more and noticed that its job execution had just started :partying_face:. The job took a few minutes and then completed successfully.

1:30AM
Next, I went back to the topup timer canister and found that it was stuck in a stopping state. It stayed in that state for roughly 30 minutes.

2:05AM
The topup timer canister finally stopped and could be restarted without any issues.

What next?

At this point, I’m considering a few options, from just letting the timer be slow periodically to spinning up this specific timer canister service on a different subnet (not difficult). But I also don’t want to abandon ship on the current canister if this was just a spike and the load comes back down soon.

Ideally, I’d prefer solutions that don’t involve bumping compute allocation.

Is this mainly related to the recent subnet load, and should it be resolved by next week’s scheduler improvement? Or is it worth developer teams spending some time future-proofing timers/expected computations (e.g., by periodically raising a canister’s compute allocation before a task needs to run and then bringing the allocation back down)?
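
For concreteness, the allocation-cycling idea would look roughly like this from a controller canister. This is only a sketch: it assumes the caller is a controller of the target, the target principal and the job are placeholders, and a compute allocation costs cycles for as long as it is held.

import Principal "mo:base/Principal";
import Timer "mo:base/Timer";

actor AllocationBumper {
  // Minimal slice of the management canister interface (only the field used here).
  let ic = actor "aaaaa-aa" : actor {
    update_settings : shared {
      canister_id : Principal;
      settings : { compute_allocation : ?Nat };
    } -> async ();
  };

  // Placeholder: the canister whose allocation gets bumped around its task.
  let target = Principal.fromText("ryjl3-tyaaa-aaaaa-aaaba-cai");

  func setAllocation(percent : Nat) : async () {
    await ic.update_settings({
      canister_id = target;
      settings = { compute_allocation = ?percent };
    });
  };

  func runWithBoost() : async () {
    await setAllocation(1); // reserve roughly 1% of rounds before the task
    // ... kick off the scheduled task here ...
    await setAllocation(0); // release the reservation afterwards
  };

  ignore Timer.recurringTimer<system>(#seconds (60 * 60), runWithBoost);
}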

Other error information from the past day, potentially unrelated?

For most of the past day, our external monitoring of the service has been hitting ingress limits, resulting in regular 400 errors, indicating that either the IC or the boundary nodes are under serious load. We run a health check of our canisters once an hour. In the 15 hours between October 3rd 1PM UTC and October 4th 2AM UTC, we received errors from at least one canister in 11/15 (73%) of the hourly checks. The errors look like this:

Error: Server returned an error:
  Code: 400 (Bad Request)
  Body: Specified ingress_expiry not within expected range: Minimum allowed expiry: 2024-10-04 01:12:02.849247150 UTC, Maximum allowed expiry: 2024-10-04 01:17:32.849247150 UTC, Provided expiry:        2024-10-04 01:11:23.794 UTC

Boy, do I have some logs to share with you.

From my research over the last couple of weeks, what you are experiencing is likely the behavior you should expect to see with any sort of significant population or activity on the Internet Computer with the current setup.

I have not only had timers delayed by 23 minutes, but I have caused them to be delayed by an additional 23 minutes on purpose, in an attempt to tilt the game in my favor. I couldn’t quite get enough control over it to actually get the advantage, but I could tell that I was affecting the use curve of the Bob swarm.

This is what killed the original pBoB canister:

Queue up too many of the futures we’ve been playing with (150 or so) on a subnet that only gets to process a few of them every 23 minutes, with that back-and-forth competing with your regularly scheduled timer, and you could be waiting days for your canister to stop. At a certain point compute allocation doesn’t even help, because you’re waiting on calls to come back from other canisters that have a compute allocation of zero, which you can’t affect.
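
One way that shape arises, as a sketch (placeholder names and principal; whether you await the futures or drop them, it is the open call contexts that block stopping):

import Timer "mo:base/Timer";

actor Swarm {
  // Placeholder for some other canister on the subnet with zero compute allocation.
  let worker = actor "ryjl3-tyaaa-aaaaa-aaaba-cai" : actor { poke : () -> async () };

  // Every minute, start ~150 inter-canister calls without awaiting them.
  // Each call keeps a call context open until the callee responds, and a
  // canister cannot finish stopping while any call context is still open.
  // If the callee only gets scheduled occasionally, that can take days.
  func flood() : async () {
    var i = 0;
    while (i < 150) {
      ignore worker.poke();
      i += 1;
    };
  };

  ignore Timer.recurringTimer<system>(#seconds 60, flood);
}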

Fun times in computation architecture!

(And this was all on the same subnet, so I didn’t have to worry about the ingress issue, which I’ve been suspicious of since I ran my experiments. Cross-canister could get weird.)


Yes, I would say it’s related to the load these subnets are experiencing. We believe that the next RC will contain some fixes that will improve this experience.

I don’t think that would be the right approach. The system should give you better behaviour without you having to mess around with compute allocation this way. That was not the intent of compute allocations; rather, it was to give you a way to reserve compute in case you really need it.

We do plan to keep looking into a better story around handling load more gracefully. There are certainly things we can improve on the platform level.


I’ll note that the @PanIndustrial TimerTool can help with this a bit by filing multiple executions under one proxy timer call. Whenever the library is invoked, it pulls all set timers with targets earlier than the current timestamp and attempts to execute them, with no-op self-awaits between them if you specify that, so that if one fails it isn’t removed from the queue. (In some instances these execute mostly in sync, but I’m not sure how a Bob situation affects that… i.e., I don’t know whether Motoko does some virtual scheduling before handing back to the replica scheduler; I think there is, based on some issues I’ve seen on GitHub.) Or you can force only sync items and specify an upper limit on tasks to handle during one round*.

The downside is that you have to manage your own recurrence. This is both good and bad. Good in that if you are 23 minutes late on a one-minute recurring timer you don’t have 23 others scheduled*. Bad in that you have to manage it, but the code to do so is trivial (see the sketch below).

  • My understanding of the way Motoko does recurring timers is that something like 10 are scheduled into a priority queue that is picked up on a callback to schedule the next one, as @ggreif explains here: Motoko - Timers - Specific Behavior - #3 by ggreif. I’m assuming that stale queue items that have a fresher sibling invocation are dropped, but I’d love to get confirmation on that given the whole BOB situation.
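
Without the library, the manage-your-own-recurrence idea is roughly this in plain Motoko (a sketch, not the TimerTool API; the work function is a placeholder):

import Timer "mo:base/Timer";

actor SelfScheduling {
  let intervalSeconds : Nat = 60; // one-minute cadence

  func doWork() : async () { /* placeholder for the real job */ };

  // Schedule the next run only after the current one finishes, so a
  // 23-minute stall produces one late run instead of a backlog of firings.
  func runOnce() : async () {
    try {
      await doWork();
    } catch (_) {
      // keep the schedule alive even if this run failed
    };
    ignore Timer.setTimer<system>(#seconds intervalSeconds, runOnce);
  };

  ignore Timer.setTimer<system>(#seconds intervalSeconds, runOnce);
}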

Using native Motoko timers, if you schedule 10 different tasks with 10 timers at 10 different times during a delay cycle, I’d expect that currently possibly only the first one will get scheduled (especially if you have other ingress messages), and you might have to wait 23 minutes to see the next one (or longer if you have a lot of ingress, as timers can get de-prioritized).
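
For comparison, the native pattern described above is just N independent one-shot timers, each waiting for its own execution slot (a sketch; the task bodies are placeholders):

import Timer "mo:base/Timer";
import Iter "mo:base/Iter";
import Nat "mo:base/Nat";
import Debug "mo:base/Debug";

actor ManyTimers {
  // Ten one-shot timers, one minute apart. Under heavy subnet load each
  // callback fires whenever the canister next gets a round, not on time.
  for (i in Iter.range(1, 10)) {
    ignore Timer.setTimer<system>(#seconds (i * 60), func () : async () {
      Debug.print("task " # Nat.toText(i));
    });
  };
}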

It would be really, really nice to have a detailed flow diagram of the scheduler that includes an SDK layer.

  • I need to check out the new finally, as I may now be able to recover properly if finally is run even when you hit the cycle limit.