Subnets with heavy compute load: what can you do now & next steps

For me it’s problematic to just increase prices when something is overloaded. There should be better mechanics, e.g. something like a load balancer that throttles canisters if necessary.
ICP strives to be a world computer, so it will have to deal with high load again and again. Will the answer in the future be to increase costs as well? I hope not… that doesn’t scale if the main competitor (= web2) ends up being cheaper…

7 Likes

Hello team, I have noticed some strange behaviour recently: update calls that are registered as generic proposals in the SNS and contain calls to the management canister have started failing.
When I tested the same update calls (which contain calls to the IC management canister) on staging canisters, I got an error that says:

Call failed:
Canister: na2jz-uqaaa-aaaal-qbtfq-cai
Method: updateWorldCanisterSettings (update)
"Request ID": "6156eb25676b8b639ff876e1f3271f8878ccd8ffaebfcbf05bc28582a692479f"
"Reject code": "undefined"
"Reject message": undefined

I have seen there is already some discussion related to this above, but I don’t think it’s related, because I never set a compute allocation on any of these canisters, so it should still be the default of 0. I also have never attached cycles to these update calls, and until now they worked absolutely fine. So I wanted to know whether anything has changed here recently?

Subnet on which the canisters exist: 3hhby-wmtmw-umt4t-7ieyg-bbiig-xiylg-sblrt-voxgt-bqckd-a75bf-rqe

Let me know if there are any other details i can provide.
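
For reference, this is roughly how the call can be reproduced outside Candid UI with agent-js. This is only a sketch: the empty argument list for updateWorldCanisterSettings is a placeholder, and the real signature comes from the canister’s .did file.

```typescript
// Sketch: reproduce the failing update call with agent-js and log the full
// reject details. The empty argument list for updateWorldCanisterSettings is
// a placeholder -- use the real Candid signature from the canister's .did file.
import { HttpAgent, Actor } from "@dfinity/agent";
import { IDL } from "@dfinity/candid";

const idlFactory: IDL.InterfaceFactory = ({ IDL }) =>
  IDL.Service({
    updateWorldCanisterSettings: IDL.Func([], [], []), // update call
  });

async function main() {
  const agent = new HttpAgent({ host: "https://ic0.app" });
  const actor = Actor.createActor(idlFactory, {
    agent,
    canisterId: "na2jz-uqaaa-aaaal-qbtfq-cai",
  });
  try {
    await actor.updateWorldCanisterSettings();
    console.log("call succeeded");
  } catch (err) {
    // Candid UI showed "Reject code: undefined"; logging the raw error here
    // should surface the actual reject code and message, if any.
    console.error("call failed:", err);
  }
}

main();
```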

1 Like

+1 for this question

2 Likes

What’s the name of your project?

Hi @h1teshtr1path1,

Could you clarify a few things, please:

  1. When you say “staging canisters,” do you mean that na2jz-uqaaa-aaaal-qbtfq-cai is a directly controlled instance of a canister, and there’s another instance of the same canister controlled by your SNS?

  2. It seems that this canister is on the subnet e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe, so I’m a bit confused about this remark:

    https://dashboard.internetcomputer.org/subnet/3hhby-wmtmw-umt4t-7ieyg-bbiig-xiylg-sblrt-voxgt-bqckd-a75bf-rqe

    Could you clarify which canisters are on which subnet?

  3. Which SNS do the failed generic proposals belong to? Is this Boom DAO?

Subnet bkfrj was also upgraded today and the results look good so far.

7 Likes

Can you also clarify which agent you are using to make these calls? Is this the Rust agent? JS agent? Thank you.

Thank you very much, it works! A store request currently takes between 3 and 15 seconds. I will monitor this more closely tomorrow.

2 Likes

Here you can find a test project to measure how long it takes to store a string on the European subnet:

The results vary widely, ranging from 3 to 26 seconds, with no clear pattern or consistency.

It’s great that we can use the European subnet again, but we need to return to a storage duration of 2-3 seconds, as it was before.

You can find the GitHub repository here: GitHub - samlinux-development/ic-performance: Measures the store’s performance on the European subnet. This allows you to install the project on a different subnet and measure the storage duration there.
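
If you want a quick way to reproduce the measurement without the full project, something like the following sketch works. The store method name, its signature, and the canister id are placeholders here; the repository defines the real interface.

```typescript
// Rough latency probe: time a series of update calls to a store method and
// print min/median/max. Method name, signature and canister id are
// placeholders -- adapt them to the deployed test canister.
import { HttpAgent, Actor } from "@dfinity/agent";
import { IDL } from "@dfinity/candid";

const idlFactory: IDL.InterfaceFactory = ({ IDL }) =>
  IDL.Service({ store: IDL.Func([IDL.Text], [], []) }); // hypothetical signature

async function probe(canisterId: string, rounds = 10) {
  const agent = new HttpAgent({ host: "https://ic0.app" });
  const actor = Actor.createActor(idlFactory, { agent, canisterId });
  const samples: number[] = [];
  for (let i = 0; i < rounds; i++) {
    const t0 = Date.now();
    await actor.store(`sample-${i}`); // one update call per iteration
    samples.push(Date.now() - t0);
  }
  samples.sort((a, b) => a - b);
  console.log(
    `min=${samples[0]}ms median=${samples[rounds >> 1]}ms max=${samples[rounds - 1]}ms`
  );
}

probe("<canister-id>"); // replace with the canister id of your deployment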

1 Like

  1. Right @aterga. na2jz-uqaaa-aaaal-qbtfq-cai is the staging canister, and the same instance of the canister is under the control of the Boom DAO SNS as well; its canister id is js5r2-paaaa-aaaap-abf7q-cai, and that SNS canister is on subnet 3hhby-wmtmw-umt4t-7ieyg-bbiig-xiylg-sblrt-voxgt-bqckd-a75bf-rqe.
  2. The staging canister na2jz-uqaaa-aaaal-qbtfq-cai is on subnet e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe.
  3. Yes, Boom DAO.

I was actually using the Candid UI to run the endpoint, so it’s agent-js.

I have now tried again with the staging canister that is controlled by me, and it’s working fine. But the SNS-controlled canister js5r2-paaaa-aaaap-abf7q-cai still has the problem and fails when executing generic proposal 1100 on the Boom DAO SNS.

Hey Manu, I tested it over the weekend and can confirm that an update call takes more than 15 seconds, some even over 20 seconds.

When can we get back to the 2-3 seconds that DFINITY mentioned?

1 Like

Sadly, it’s the same for us. We’re on lhg73.

We use k44fs, and due to congestion, our dapp’s user experience is so poor that our team doesn’t dare promote it on a large scale. We often do not receive a response, and we have implemented our own timeout detection. We hope this problem can be solved as soon as possible.
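
(For anyone who wants the same safety net: client-side timeout detection can be a simple wrapper that races the call against a timer. The 30-second threshold below is just an example, not a recommendation.)

```typescript
// Minimal timeout wrapper around any agent-js call (or any promise).
// The default of 30s is an illustrative assumption.
async function withTimeout<T>(call: Promise<T>, ms = 30_000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`no response after ${ms} ms`)), ms);
  });
  try {
    // Whichever settles first wins: the real call or the timeout rejection.
    return await Promise.race([call, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Usage (actor created via Actor.createActor as usual):
// const result = await withTimeout(actor.someUpdateMethod(), 30_000);
```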

1 Like

Hey folks, indeed some subnets are still experiencing issues in the form of increased latency for update calls, as some of you reported here (though significantly better than last week, when a lot of messages would expire before being executed). The ones affected according to our internal metrics are bkfrj, lhg73, opn46 and k44fs, with latencies on the order of 20-30 seconds. We’re exploring options for how to further improve the situation on those subnets and bring them closer to the expected 2-3 seconds.

FWIW,

We often do not receive a response, and we have implemented our own timeout detection.

This is not something we see on our side. Calls take longer to complete on k44fs, but we do not see frequent timeouts.

5 Likes

Please see this post, which captures DFINITY’s view on the next steps regarding heavily loaded subnets and ICP scalability.

7 Likes

Hey @Manu, thanks for the detailed post! As a developer, I’d like to know when we can start using the European Subnet again for CRUD applications. With an execution time between 3 and 21 seconds, it’s currently not feasible to replace an existing Web2 application on the Internet Computer.

Could you clarify when the proposed measures will achieve the targeted 2-second update time?

I think the realistic answer is that we can’t guarantee that every subnet will always have a ~2 second update latency, because every subnet has bounded capacity, and if the load grows big enough, it will exceed what the subnet can handle and some messages will have to wait.
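
To illustrate why bounded capacity translates into growing latency, here is a toy backlog model; the numbers are made up and are not real subnet parameters:

```typescript
// Toy queue model of a subnet with bounded per-round capacity: if messages
// arrive faster than the subnet can execute per round (~1s each), the backlog
// -- and thus the wait for newly arriving messages -- grows without bound.
function simulate(arrivalPerRound: number, capacityPerRound: number, rounds: number) {
  let backlog = 0;
  for (let r = 1; r <= rounds; r++) {
    // Each round, new messages arrive and up to `capacityPerRound` execute.
    backlog = Math.max(0, backlog + arrivalPerRound - capacityPerRound);
    if (r % 20 === 0) {
      // A newly arriving message waits roughly backlog/capacity rounds.
      console.log(
        `round ${r}: backlog=${backlog}, est. wait ~${(backlog / capacityPerRound).toFixed(1)} rounds`
      );
    }
  }
}

simulate(900, 1000, 100);  // under capacity: backlog stays at zero, ~1 round latency
simulate(1100, 1000, 100); // 10% over capacity: backlog and wait grow every round
```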

I hope that the community will adopt the cycles cost adjustments that DFINITY proposed, which may lead to reduced load, especially for the type of workload that we now see putting a lot of strain on the subnets. This would make it less likely that subnets become overloaded. These changes are easy to make, so they can happen as soon as the ICP community decides that it wants them.

Separately, replica improvements are being worked on that will hopefully increase the capacity of each subnet, so that may help in the coming weeks.

Furthermore, adding more subnets (topic) would increase the overall capacity of ICP. Out of curiosity, do you want your canisters to run on the European subnet specifically, or could you migrate to another subnet? If you really need a European subnet, it would be good to mention that; maybe we should explore whether a second European subnet should be created.

Finally, canister migration would help fully distribute load across all subnets and should remove the problem of highly loaded subnets (as canisters would simply move to a quieter subnet). However, this is on the order of months away, not weeks.

3 Likes

Thanks for the information!

We’re an Austria-based company, and GDPR is a significant concern for us. We plan to migrate several applications from Web2 to the Internet Computer. By law, we’re required to store PII data within the European Union — I’m sure you’re aware of that.

To be honest, achieving this with a 2-second update time is already challenging, and it’s simply not feasible with delays of 5 to 20 seconds.

In general, I don’t think increasing the price is the right approach. The Internet Computer aims to be a “world computer,” and it should be capable of handling a massive amount of load. But that’s only my opinion!

A second European Subnet is a good idea, but what happens if “Bob2” takes over this subnet again? For me, addressing this kind of issue is crucial if the Internet Computer is to be accepted as a true alternative to web infrastructure in the medium and long term.

What does it really mean when you say the subnet is overloaded? The European Subnet currently has a transaction rate of 150-200 TX/s. Additionally, the issues you’ve described in previous posts suggest a systemic problem rather than an overload problem. In my opinion, the overload is only a symptom, not the cause.

3 Likes

As Manu said, the proper way to address this is with canister migration (ideally, along with some way for canister controllers to designate e.g. which canisters should stay together on the same subnet and which such groups of canisters should be hosted by different subnets; plus rules such as “only on GDPR subnets”). Then we could automatically balance load across the whole of the IC. However, something like this is (very optimistically) at least a year out even if we start working on it full time starting on Monday.

The systemic problem you are hinting at is simply the combination of limited resources (after all, a subnet is simply a replicated virtual machine, with all of the overhead of consensus, deterministic computation, virtualization, encryption, sandboxing, etc. running on top of each replica machine) and up to 100k competing canisters crammed onto a subnet. If you think of a canister as a process (which is what it is), then there is no way to give even 1% of those canisters a reasonable chunk of the CPU should they all want it at the same time. Try running 1k processes, each doing a reasonable amount of active computation, on your laptop/desktop (the comparison is more or less fair because of all the overhead I mentioned).
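
If you want to actually try that experiment, here is a small sketch using Node.js worker threads (run as an ES module, e.g. with tsx; the amount of busy work is arbitrary). It shows how the completion time of each individual task degrades once concurrent tasks outnumber CPU cores, which is the same effect canisters see on a crowded subnet.

```typescript
// Spawn the same fixed amount of busy work at increasing concurrency and
// watch per-task completion time degrade once tasks outnumber CPU cores.
import { Worker, isMainThread, parentPort } from "node:worker_threads";
import { fileURLToPath } from "node:url";

const WORK = 2e8; // fixed busy-loop iterations per task (illustrative)

if (isMainThread) {
  for (const n of [1, 8, 64]) {
    const t0 = Date.now();
    await Promise.all(
      Array.from({ length: n }, () =>
        new Promise<void>((resolve) => {
          // Each worker re-runs this file and takes the `else` branch below.
          new Worker(fileURLToPath(import.meta.url)).on("exit", () => resolve());
        })
      )
    );
    // All n tasks ran concurrently and shared the CPU, so each individual
    // task only finished after roughly the total wall time.
    console.log(`${n} concurrent tasks: each finished after ~${Date.now() - t0} ms`);
  }
} else {
  let x = 0;
  for (let i = 0; i < WORK; i++) x += i % 7; // simulate active computation
  parentPort?.postMessage(x);
}
```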

As things stand, 100k canisters can only work for something like OpenChat, which has its own subnets and thus can more or less control the load. On the average subnet, where there is zero control over how the load is spread among tens of thousands of canisters belonging to thousands of controllers, the only way to deal with spikes (or persistent load, such as the Bob canisters) is to be able to more or less instantly shift load across subnets. There’s nothing more to it than that (except smaller or larger optimizations to be made here or there, none of which would actually solve the problem of thousands of canisters suddenly deciding they all want sub-2-second latency all at once).

5 Likes