Subnets with heavy compute load: what can you do now & next steps

@Manu ,

Will the update you describe help with the canister creation issues as well, or is it only for existing canisters?

I am trying hard to migrate my canisters to a less busy subnet, but so far I have not been able to create any canisters. The dfx canister create command always times out with the following message:

Error: Failed to create canister 'blablabla'.
Caused by: The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0x7c1d6c6ce838dad2455bc33e5731252c9ccaebfc8fbc372dc776c8f5bfdd8bf8 timed out waiting to start executing., error code None

On which subnet are you trying to create canisters? From the error, it looks like you’re targeting a busy subnet, which is probably not what you want. Note that you can add a --subnet option to dfx canister create.
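As a rough sketch (the canister name and subnet principal below are placeholders, not values from this thread), with a recent dfx version the command looks something like:

  # create the canister on an explicitly chosen subnet
  dfx canister create my_canister --network ic --subnet <subnet-principal-id>

The rest of the workflow (dfx deploy / dfx canister install) then targets that canister as usual.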

1 Like

I tried several subnets without luck.

Just now tried this one:

  • w4asl-4nmyj-qnr7c-6cqq4-tkwmt-o26di-iupkq-vx4kt-asbrx-jzuxh-4ae

You can’t target all subnets (for historical reasons that I hope we can get rid of soon). See https://dashboard.internetcomputer.org/proposal/132409 to figure out which subnets you can currently deploy to.

3 Likes

@Manu
Sorry to say, these are not solutions for a running application. If you migrate your canister to another subnet, what happens to the database, all the existing data, wallet connections, and the work already put in? This could potentially destroy a content-based DApp.

Regarding the two options: Is the cost increase a one-time thing, or will it be permanent?

I am totally confused—do I risk my data or face increased costs? It’s mentally exhausting.

7 Likes

@Manu , thank you, I found a subnet where I can create my canisters. I do not have any state to move over, so this is a solution for my situation.

1 Like

Hey @Manu, I would like to ask when the proposal will affect the European subnet and when we can expect to see an improvement. Currently, storing a string takes very long; see https://hengx-riaaa-aaaas-ajw5a-cai.icp0.io/

1 Like

Great!

The replica version proposal took a bit of discussion (see this forum thread) but has now been adopted, so it will start getting rolled out to subnets soon. Typically all subnets get the newly elected version during the week, so either today or in the coming days.

5 Likes

It would be great if we could get confirmation of when the new updates are rolled out to the subnets. I have canisters on some of the affected subnets, and they are barely moving now:

From looking at the dashboard, it looks like Yral is spinning up a lot of new canisters with timers/heartbeats? I’ve been on one of these subnets for nearly 2 years and it was fine during that time, but now it’s severely lagging and calls are failing. I agree a lot with this point:

3 Likes

I am in a similar situation. I’ve been on this subnet for a year now and never had a single problem. I can’t migrate because my data is hosted by Juno, and Juno doesn’t support that. Is there a way DFINITY can designate a special subnet for Yral, for example?

I am currently unable to make update calls to my satellite because of this issue, and my app doesn’t receive enough requests to be causing it.

3 Likes

I thought I found a subnet where I could redeploy my canisters, but it deployed two canisters, and then it started failing again.

I feel Yral needs to be banned from (or kindly requested not to take over) certain subnets so we can build again.

For now, I am going to stop trying to deploy to mainnet. Wasted so many hours on this.

5 Likes

Great news! Thanks Manu. Excited to see those improvements!

2 Likes

It seems like everyone is affected by this issue. I hope DFINITY is making this their top priority right now, because many of the canisters I have had about 2 hours of outage today. I don’t want to come across as overly pessimistic, but this really does need to be the top priority. How can anyone build and maintain an active user base when you can’t interact with the canisters properly?

5 Likes

Unfortunately, the replica version that was just elected did not yet deliver the improvements we hoped for. The main issue seems to be that the changes to the scheduler do make it more “fair”, in the sense that it cycles through the canisters much more quickly. However, this quicker cycling through canisters leads to significant memory contention on the replicas, making the actual execution more than an order of magnitude slower.

The DFINITY team has ideas to address this issue and is treating it as their top priority, but unfortunately it still means that we’ll have to wait a bit longer to see the situation on busy subnets improve.

16 Likes

Thanks for keeping the community updated, Manu. I guess growing pains are the cost of being on the bleeding edge. I’m interested to see how this will be resolved in an upcoming replica release.

4 Likes

Hoping something can be done soon. Our backend apps experienced an outage today at the worst possible moment (during a weekly token distribution). We have measures in place to make sure these tokens do eventually get distributed, but it doesn’t look good to our community. Please DFINITY, do something soon :melting_face:

1 Like

All ICPSwap DApps and swap pool canisters are located on the lhg73 subnet (lhg73-sax6z-2zank-6oer2-575lz-zgbxx-ptudx-5korm-fy7we-kh4hl-pqe).

It’s true that the swap speed has slowed down quite a bit recently. Hope things get better soon!

7 Likes

One effective approach would be to allocate dedicated subnets for compute-heavy canisters, or to limit the number of such canisters (for example, heartbeat canisters) within a single subnet. This would help prevent subnet overload and ensure that each DApp receives its fair share of compute resources without incurring excessive costs.

RuBaRu, being a SocialFi DApp, also aims to onboard a large user base, and we currently have over 200 canisters on the network. However, we’re facing challenges with frequently failing update calls. Increasing compute allocation doesn’t seem viable, nor is migrating to another subnet feasible. We would love to have a long-term solution; if there’s a short-term solution the team can suggest, that would be great as well, so we can restart our onboarding campaigns.
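For context, the compute allocation discussed here is a per-canister setting; a minimal sketch of raising it (the canister name and the value 5 are placeholders) would be:

  # reserve a guaranteed percentage (0-100) of subnet compute for this canister;
  # the allocation is paid for in cycles continuously for as long as it is set
  dfx canister update-settings my_canister --compute-allocation 5 --network ic

The ongoing cycle cost of that reservation is exactly why it may not be viable for a project running hundreds of canisters.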

4 Likes

Quick update: DFINITY plans to propose two distinct replica versions today: one “standard” version and one targeted at improving the situation on busy subnets. The latter includes scheduler changes similar to the ones I mentioned in the first post of this thread, which caused a slowdown on the first busy subnet, fuqsr, this week. We believe that slowdown comes from memory contention in the replica, stemming from the fact that the modified scheduler rotates more quickly through all canisters. This week’s version includes additional changes that should help prevent this memory contention. Hopefully we’ll see improvement from that next week, and the DFINITY team is still working very hard to further improve how the replica handles high compute load from many small messages.

20 Likes

Good news & progress @Manu. Thank you for the update.