Subnets with heavy compute load: what can you do now & next steps

@Manu ,

Will the update you describe help with the canister creation issues as well, or is it only for existing canisters?

I am trying hard to migrate my canisters to a less busy subnet, but so far I have not been able to create any canisters. The dfx canister create command always times out with the following message:

Error: Failed to create canister 'blablabla'.
Caused by: The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0x7c1d6c6ce838dad2455bc33e5731252c9ccaebfc8fbc372dc776c8f5bfdd8bf8 timed out waiting to start executing., error code None

On which subnet are you trying to create canisters? From the error, it looks like you’re targeting a busy subnet, which is probably not what you want. Note that you can add a --subnet option to dfx canister create.
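As a rough sketch (the canister name and subnet principal below are placeholders, not values from this thread), with a recent dfx version the command looks something like:

  # create the canister on an explicitly chosen subnet
  dfx canister create my_canister --network ic --subnet <subnet-principal-id>

The rest of the workflow (dfx deploy / dfx canister install) then targets that canister as usual.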

1 Like

I tried several subnets without luck.

Just now tried this one:

  • w4asl-4nmyj-qnr7c-6cqq4-tkwmt-o26di-iupkq-vx4kt-asbrx-jzuxh-4ae

You can’t target all subnets (for historical reasons that I hope we can get rid of soon). See https://dashboard.internetcomputer.org/proposal/132409 to figure out which subnets you can currently deploy to.

3 Likes

@Manu
Sorry to say, these are not solutions for a running application. If you migrate your canister to another subnet, what happens to the database, all the existing data, wallet connections, and the work already put in? This could potentially destroy a content-based DApp.

Regarding the two options: Is the cost increase a one-time thing, or will it be permanent?

I am totally confused—do I risk my data or face increased costs? It’s mentally exhausting.

7 Likes

@Manu , thank you, I found a subnet where I can create my canisters. I do not have any state to move over, so this is a solution for my situation.

1 Like

Hey @Manu, I would like to ask when the proposal will affect the European subnet and when we can expect to see an improvement. Currently, storing a string takes very long; see https://hengx-riaaa-aaaas-ajw5a-cai.icp0.io/

1 Like

Great!

The replica version proposal took a bit of discussion (see this forum thread) but has now been adopted, so it will start getting rolled out to subnets soon. Typically all subnets get the newly elected version during the week, so either today or in the coming days.

5 Likes

It would be great if we could get confirmation of when the new updates are rolled out to the subnets. I have canisters on some of the affected subnets, and they are barely moving now:

From looking at the dashboard, it looks like Yral is spinning up a lot of new canisters with timers/heartbeats? I’ve been on one of these subnets for nearly 2 years and it was fine during that time, but now it’s severely lagging and calls are failing. I agree a lot with this point:

3 Likes

I am in a similar situation. I’ve been on this subnet for a year now and never had a single problem. I can’t migrate because my data is hosted by Juno, and Juno doesn’t support that. Is there a way DFINITY can designate a special subnet for Yral, for example?

I am currently unable to make update calls to my satellite because of this issue, and my app doesn’t receive enough requests to be causing it.

3 Likes

I thought I found a subnet where I could redeploy my canisters, but it deployed two canisters, and then it started failing again.

I feel Yral needs to be banned from (or kindly requested not to take over) certain subnets so we can build again.

For now, I am going to stop trying to deploy to mainnet. Wasted so many hours on this.

5 Likes

Great news! Thanks Manu. Excited to see those improvements!

2 Likes

It seems like everyone is affected by this issue. I hope DFINITY is making this their top priority right now, because many of the canisters I have had about 2 hours of outage today. I don’t want to come across as overly pessimistic, but this really does need to be the top priority. How can anyone build and maintain an active user base when you can’t interact with the canisters properly?

5 Likes

Unfortunately, the replica version that was just elected did not yet deliver the improvements we hoped for. The main issue seems to be that the changes to the scheduler do make it more “fair”, in the sense that it cycles through the canisters much more quickly. However, this quicker cycling through canisters leads to significant memory contention on the replicas, making the actual execution more than an order of magnitude slower.

The DFINITY team has ideas to address this issue and is treating it as their top priority, but unfortunately it still means that we’ll have to wait a bit longer to see the situation on busy subnets improve.

16 Likes

Thanks for keeping the community updated, Manu. I guess growing pains are the cost of being on the bleeding edge. I’m interested to see how this will be resolved in an upcoming replica release.

4 Likes

Hoping something can be done soon. Our backend apps experienced an outage today at the worst possible moment (during a weekly token distribution). We have measures in place to make sure these tokens do eventually get distributed, but it doesn’t look good to our community. Please DFINITY, do something soon :melting_face:

1 Like

All ICPSwap DApps and swap pool canisters are located on the lhg73 subnet (lhg73-sax6z-2zank-6oer2-575lz-zgbxx-ptudx-5korm-fy7we-kh4hl-pqe).

It’s true that the swap speed has slowed down quite a bit recently. Hope things get better soon!

7 Likes

One effective approach would be to allocate dedicated subnets for compute-heavy canisters, or to limit the number of such canisters (for example, heartbeat canisters) within a single subnet. This would help prevent subnet overload and ensure that each DApp receives its fair share of compute resources without incurring excessive costs.

RuBaRu, being a SocialFi DApp, also aims to onboard a large user base, and we currently have over 200 canisters on the network. However, we’re facing challenges with frequently failing update calls. Increasing compute allocation doesn’t seem viable, nor is migrating to another subnet feasible. We would love to have a long-term solution; if there’s a short-term solution the team can suggest, that would be great as well, so we can restart our onboarding campaigns.
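For context, the compute allocation discussed here is a per-canister setting; a minimal sketch of raising it (the canister name and the value 5 are placeholders) would be:

  # reserve a guaranteed percentage (0-100) of subnet compute for this canister;
  # the allocation is paid for in cycles continuously for as long as it is set
  dfx canister update-settings my_canister --compute-allocation 5 --network ic

The ongoing cycle cost of that reservation is exactly why it may not be viable for a project running hundreds of canisters.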

4 Likes

Quick update: DFINITY plans to propose two distinct replica versions today: one “standard” version and one targeted at improving the situation on busy subnets. The latter includes scheduler changes similar to the ones I mentioned in the first post of this thread, which caused a slowdown on the first busy subnet, fuqsr, this week. We believe that slowdown comes from memory contention in the replica, stemming from the fact that the modified scheduler rotates more quickly through all canisters. This week’s version includes additional changes that should help prevent this memory contention. Hopefully we’ll see improvement from that next week, and the DFINITY team is still working very hard to further improve how the replica handles high compute load from many small messages.

20 Likes

Good news & progress @Manu. Thank you for the update.