Subnets with heavy compute load: what can you do now & next steps

The aim is not to put a hard cap on batch / background workloads. It’s simply to balance (not even prioritize, I got a bit carried away there) interactive and batch workloads. So (to take a simplistic example) if you have 100 canisters with ingress messages and 300 canisters with heartbeats, you give 1 CPU core to the former and 3 CPU cores to the latter. Or something along those lines.
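
To make the proportional idea concrete, here is a minimal sketch (TypeScript, purely illustrative; this is not the replica's actual scheduler code, and all names and the reserve-one-core rule are assumptions):

```typescript
// Hypothetical sketch of a proportional split of CPU cores between
// "interactive" canisters (those with pending ingress messages) and
// "background" canisters (those only running heartbeats).

interface WorkloadSplit {
  interactiveCores: number;
  backgroundCores: number;
}

function splitCores(
  totalCores: number,
  interactiveCanisters: number,
  backgroundCanisters: number
): WorkloadSplit {
  const total = interactiveCanisters + backgroundCanisters;
  if (total === 0) {
    return { interactiveCores: 0, backgroundCores: 0 };
  }
  // Proportional split, always reserving at least one core for each
  // non-empty group so neither workload is starved outright.
  let interactiveCores = Math.round((interactiveCanisters / total) * totalCores);
  if (interactiveCanisters > 0) interactiveCores = Math.max(interactiveCores, 1);
  if (backgroundCanisters > 0) interactiveCores = Math.min(interactiveCores, totalCores - 1);
  return { interactiveCores, backgroundCores: totalCores - interactiveCores };
}

// The example from the post: 100 canisters with ingress messages and
// 300 canisters with heartbeats on a (hypothetical) 4-core subnet.
console.log(splitCores(4, 100, 300)); // { interactiveCores: 1, backgroundCores: 3 }
```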

Because the problem we were (and to some extent still are) seeing is that heartbeats are basically starving interactive requests. Before the scheduler improvements, we had subnets where it took thousands of rounds to go over all canisters, mostly executing heartbeats that did no work, while backlogs of ingress messages (which we had been able to handle just fine days before) built up into the thousands and timed out. Had the ingress messages only had to compete against other ingress messages, we would have had no user-visible issues, and it would have only marginally affected the throughput of heartbeats.

Regardless, this was just a discussion; nothing has materialized yet. We’re still thinking through the alternatives open to us, now that we also have a better idea of the issue at hand ourselves.

4 Likes

Hi @Manu,

It was working fine, but now this error has returned. Is yinp6-35cfo-wgcd2-oc4ty-2kqpf-t4dul-rfk33-fsq3r-mfmua-m2ngh-jqe not upgraded yet?

Still experiencing the ingress expiry issue with yinp6… Will the new replica version fix it?

That error message has nothing to do with ingress expiry due to subnet load (which would show up about 5 minutes after you submitted the message, not immediately). It simply says that you submitted an ingress message with an expiration time in the past. The wall-clock time on the replica you sent the ingress message to was 15:53:27 UTC, and it was willing to accept ingress messages with an ingress_expiry anywhere between 15:53:27 UTC and 15:58:57 UTC. The ingress message you submitted, however, had ingress_expiry equal to 15:53:00 UTC, i.e. it was already expired by your own request. You essentially told the replica “do this for me and give me the response 30 seconds ago, at the latest”.
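
As a minimal sketch of that check (TypeScript, illustrative only; the window constant matches the times quoted above, the date is a placeholder, and none of this is the replica’s actual implementation):

```typescript
// The replica only accepts an ingress message whose ingress_expiry falls
// inside a window starting at its current wall-clock time; an expiry in the
// past is rejected immediately, regardless of subnet load.

// In the example above the window spans 5.5 minutes (15:53:27 -> 15:58:57).
const MAX_INGRESS_EXPIRY_MS = 5 * 60 * 1000 + 30 * 1000;

function isExpiryAcceptable(ingressExpiryMs: number, replicaNowMs: number): boolean {
  return (
    ingressExpiryMs >= replicaNowMs &&
    ingressExpiryMs <= replicaNowMs + MAX_INGRESS_EXPIRY_MS
  );
}

// The case from the error message (date is hypothetical): replica wall time
// 15:53:27 UTC, submitted ingress_expiry 15:53:00 UTC => ~27 seconds in the past.
const replicaNow = Date.parse("2024-01-01T15:53:27Z");
const submittedExpiry = Date.parse("2024-01-01T15:53:00Z");
console.log(isExpiryAcceptable(submittedExpiry, replicaNow)); // false
```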

1 Like

What would you recommend (using the js agent)? Ingress expiry issue

Hi Xalkan,
@free is correct: the expiry set in the request you sent to the replica was too old.
What tools were you using that exhibited this behavior? dfx? The browser with Candid? Maybe it’s due to an old agent; we recently fixed some bugs related to this. If you’re using agent-js, check that it’s on 2.1.3.

1 Like

Hi Yvonne,

It’s a Next.js app using Juno’s core-peer (about to update it to its latest version, which includes agent-js v2.1.3 :crossed_fingers:).

Assuming your machine’s time is synced, the update to 2.1.3 should make the problem go away. Let us know if it persists.
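
For context, a rough sketch of why clock sync matters here (illustrative TypeScript, not the actual agent-js implementation; the 4-minute TTL and the function name are assumptions):

```typescript
// The agent derives ingress_expiry from the local clock (roughly "now + a few
// minutes"). If the machine's clock lags far enough behind the replica's, the
// computed expiry can already be in the past by the replica's reckoning.

function computeIngressExpiryMs(localNowMs: number, ttlMs = 4 * 60 * 1000): number {
  return localNowMs + ttlMs;
}

const replicaNowMs = Date.now();
const skewedLocalNowMs = replicaNowMs - 5 * 60 * 1000; // local clock 5 minutes behind

// With a badly skewed local clock the expiry lands before the replica's "now",
// which triggers exactly the "ingress_expiry ... is in the past" rejection.
console.log(computeIngressExpiryMs(skewedLocalNowMs) < replicaNowMs); // true
```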

2 Likes