LAMENT: A tale of constant struggle of what it's like trying to scale on ICP

peterparker · October 7, 2024, 1:55pm

I guess a new version of the NNS dapp was released and the menu item has been moved.

jamesbeadle · October 7, 2024, 2:03pm

Canister detail view has errors:

I’m definitely logged in.

Manu · October 7, 2024, 2:08pm

Could we please try to keep this thread on-topic? I think it would be very helpful if you could report that NNS FE issue in this thread.

Moschus · October 7, 2024, 4:04pm

@Manu I would like to hear your response to what @saikatdas0790 posted about problems with the solution proposed by Dfinity as an alternative to the new canister per user model they adopted. He seems to make a good case that the alternative Dfinity suggested was unworkable.

jamesbeadle · October 7, 2024, 5:36pm

So after the canister is created the following message appears on both apps:

Both apps have lost 50T cycles, but I don’t have a reference to the canisters created:

The bit of code that stores the reference runs shortly after the canister is created, so I’m not sure if or how I can get those cycles back.

The issue after the canister is created looks like it’s related to setting the compute_allocation to 1.

jamesbeadle · October 7, 2024, 5:42pm

Yeah so saving the team for the first user creates the canister which ultimately sets compute_allocation to 1:

Here is the code for setting compute allocation, which I call after the canister has been created:

// Sets the controller, compute allocation and freezing threshold on a newly created canister.
public func updateCanister_(a : actor {}, backendCanisterController : ?Principal, IC : Management.Management) : async () {
  let cid = { canister_id = Principal.fromActor(a) };
  switch (backendCanisterController) {
    case (null) {}; // no controller supplied: leave the settings untouched
    case (?controller) {
      await (
        IC.update_settings({
          canister_id = cid.canister_id;
          settings = {
            controllers = ?[controller];
            compute_allocation = ?1; // 1% compute allocation
            memory_allocation = null;
            freezing_threshold = ?31_540_000; // in seconds, roughly one year
            reserved_cycles_limit = null
          };
          sender_canister_version = null
        }),
      );
    };
  };
};

It’s not growing in memory really, I’ve chunked a canister to take 12000 managers and this is the first one.

Manu · October 7, 2024, 6:25pm

             freezing_threshold = ?31_540_000;

@jamesbeadle As explained, you need to be able to pay for your compute allocation for the duration of your freezing threshold. You’re configuring your canister to have a freezing threshold of one year, so then you need to have the cycles to pay for 1 year of compute allocation.

jamesbeadle · October 7, 2024, 7:35pm

Hi Manu,

I’ve been reading this:

Specifically:

‘Compute allocation of 100% means that the canister will be scheduled every round. The current fee for 1 percent compute allocation per second is 10M cycles (or $0.0000133661 USD).’

So as I want compute on every round, this would cost me 1 billion cycles per second… or 31,536 Trillion cycles per year… with 6T cycles for 1 ICP… that’s 5,256 ICP a year… per canister…

If you could let me know where I’ve gone wrong that would be great!
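
The arithmetic in the post above can be checked with a short Python sketch. The fee of 10M cycles per percent per second is the figure quoted from the docs, and the 6T cycles per ICP rate is the poster's own estimate. The function and constant names are illustrative only and are not part of any ICP API.

```python
# Figures quoted in the post; names are illustrative, not an ICP API.
FEE_PER_PERCENT_PER_SECOND = 10_000_000  # cycles, as quoted from the docs
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # 31_536_000
CYCLES_PER_ICP = 6 * 10**12              # poster's assumed conversion rate


def yearly_compute_cost_cycles(allocation_percent: int) -> int:
    """Cycles charged over one year for a compute allocation of 0-100 percent."""
    return allocation_percent * FEE_PER_PERCENT_PER_SECOND * SECONDS_PER_YEAR


def cycles_to_icp(cycles: int) -> float:
    """Convert cycles to ICP at the assumed rate."""
    return cycles / CYCLES_PER_ICP


full = yearly_compute_cost_cycles(100)
print(full // 10**12)       # 31536 (trillion cycles per year)
print(cycles_to_icp(full))  # 5256.0 (ICP per year)
```

Under the same assumptions, a 1% allocation costs one hundredth of that, about 315T cycles per year.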

berestovskyy · October 7, 2024, 8:19pm

James,
Because we’re increasing the freezing threshold to ~31.5M seconds (one year), the canister should have a minimum balance of at least 10M cycles multiplied by ~31.5M seconds, which equals ~315T cycles. This isn’t a payment; it’s the minimum balance we’ve specified in settings:

public func updateCanister_(...) : async () {
[...]
          IC.update_settings({
[...]
              freezing_threshold = ?31_540_000;

By reducing this freezing threshold, we can resolve the errors. See more details in my previous message.
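
As a rough illustration of the point above, the sketch below computes the minimum balance implied by a compute allocation and a freezing threshold, and the longest threshold a given balance could cover. It counts compute allocation only and ignores storage and other idle costs, so real requirements will be higher. The function names and the 50T example balance are hypothetical.

```python
FEE_PER_PERCENT_PER_SECOND = 10_000_000  # cycles; figure quoted in the thread


def min_balance_for_compute(allocation_percent: int, freezing_threshold_s: int) -> int:
    """Cycles that must remain in the canister to cover the compute
    allocation for the whole freezing threshold (compute cost only)."""
    return allocation_percent * FEE_PER_PERCENT_PER_SECOND * freezing_threshold_s


def max_threshold_for_balance(balance_cycles: int, allocation_percent: int) -> int:
    """Longest freezing threshold in seconds that a balance can cover
    (compute cost only)."""
    if allocation_percent <= 0:
        raise ValueError("allocation_percent must be positive")
    return balance_cycles // (allocation_percent * FEE_PER_PERCENT_PER_SECOND)


# Settings from the thread: 1% allocation, 31_540_000 s threshold.
print(min_balance_for_compute(1, 31_540_000) / 10**12)  # 315.4 (trillion cycles)

# Hypothetical balance of 50T cycles at 1% allocation.
seconds = max_threshold_for_balance(50 * 10**12, 1)
print(seconds, round(seconds / 86_400, 1))  # 5000000 57.9 (seconds, days)
```

Read this way, either a larger balance or a shorter freezing threshold satisfies the check; lowering the compute allocation reduces the requirement proportionally.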

jamesbeadle · October 7, 2024, 8:31pm

Yeah, I understand the calculation there a lot better, thanks.

You didn’t address my post here? All those figures come from the docs.

berestovskyy · October 7, 2024, 9:01pm

I agree, it’s expensive because it’s an absolute allocation, meaning you’re guaranteed a specific CPU fraction (replicated 13+ times). And I strongly advocate for switching to relative priorities with more reasonable costs…

jamesbeadle · October 7, 2024, 9:14pm

If the subnet isn’t full, is absolute allocation guaranteed? Because I feel like we’ve always had absolute allocation…

I ran the system at the start of the season (end of August) and there weren’t any problems with speed, was running really smoothly. Fast forward to today, even before running the save which creates a dynamic canister, it’s running slowly on absolute guarantee. I have done a load of refactoring etc but even with my data caching it feels slow.

It’s almost like something between the end of August and now filled up subnets and caused all apps that were not requiring absolute guarantee to then require absolute guarantee to maintain their current user experience.

skilesare · October 7, 2024, 9:44pm

The difference between 20,000 canisters sitting there doing nothing and 20,000 canisters asking to set a timer to do something every 30 seconds is huge. In case 1, anyone who wants to get scheduled for every necessary round (when their timer goes off or when a user or canister sends them a message) is almost guaranteed to get slotted for executing. In the second case, your 30-second timer may only go off every 15 minutes (such is the case with the BoB subnet) because 20,000 canisters / 3 cores / 15 slots every 30 seconds is 222 canisters competing for each slot and only 1 can get it. Thus they end up in a round robin and you get your chance every 15 minutes or so.
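
The round-robin effect described above can be sketched as a back-of-the-envelope estimate. This is not a model of the real ICP scheduler, and it does not try to reproduce the post's exact figures. The number of canisters each core can run per round and the round duration are assumed parameters, and the function names are hypothetical.

```python
import math


def rounds_between_turns(active_canisters: int, cores: int,
                         canisters_per_core_per_round: int) -> int:
    """Rounds needed to give every active canister one turn under plain
    round-robin, when all of them want to run."""
    per_round = cores * canisters_per_core_per_round
    if per_round <= 0:
        raise ValueError("cores and canisters_per_core_per_round must be positive")
    return math.ceil(active_canisters / per_round)


def wait_minutes(active_canisters: int, cores: int,
                 canisters_per_core_per_round: int,
                 round_seconds: float = 1.0) -> float:
    """Approximate wait between turns, assuming a fixed round duration."""
    rounds = rounds_between_turns(active_canisters, cores,
                                  canisters_per_core_per_round)
    return rounds * round_seconds / 60


# 20,000 active canisters on 3 cores; 8 canisters per core per round and
# 1-second rounds are assumptions chosen only for illustration.
print(rounds_between_turns(20_000, 3, 8))    # 834
print(round(wait_minutes(20_000, 3, 8), 1))  # 13.9
```

The useful property is the scaling: the wait grows linearly with the number of canisters that want to run in every round, which is why idle canisters cost other apps little while timer-driven ones cost a lot.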

dfx-json · October 7, 2024, 9:49pm

We are reviewing the docs now for any errors or opportunities to clarify information.

jamesbeadle · October 7, 2024, 10:22pm

Is it a case of moving to emptier subnets to get a better execution priority? Where do we even check for this?

What filled up my subnet that wasn’t full for like a year? Just general apps eventually filling it up or did something like BOB go round and just make every old subnet full?

I’m trying to decide the best way to go forwards. Realistically, my app has more canisters than it needs. It was designed to ensure none ever filled up, but I can get the 49 required to run for up to 12K managers down to 4 (combining my 38 weekly leaderboards, 10 monthly and 1 season into 1).

I need to test the latency based on the compute_allocation, which can only be done live, I’ll start low and try and find the balance between ingress_expiry and allocation = 1.

Manu · October 8, 2024, 7:02am

The new load is coming from Yral spinning up many heartbeat canisters.

jamesbeadle · October 8, 2024, 7:22am

I see, I thought they only started doing that recently, not weeks ago.

Gwojda · October 8, 2024, 10:05am

Hey guys,
Any update here? The proposals passed, but nothing has changed on our side. It’s even worse than before: a lot of our canisters are down/unreachable if we do not provide compute allocation. I imagine it may also be related to the 160k canisters Yral has deployed since yesterday.
Has the scheduler update already been released? Do you still plan to release it this week?
Another question: if we know Yral is consuming a lot, why don’t we create new subnets for them? Once Yral is consuming all the resources of one of their subnets, we create another one. Since Yral is constantly consuming a lot of resources, this should be the right way to handle it, no? Again, we have around 1k available nodes; I don’t understand why we do not scale up here.
Thanks

icme · October 8, 2024, 4:46pm

I’m pretty convinced that at this point, fewer, beefier canisters is the most efficient way to scale within a single subnet, and on the Internet Computer in general.

First off, I have a huge amount of respect for the engineering work that’s gone into building dynamic massively multi-canister (MMC) architectures on ICP. A lot of engineering blood, sweat, and sleepless nights has gone into scaling apps across multiple subnets. However, as shown over the past few weeks, this architecture of spinning up tens of thousands of actively computing canisters on a single subnet inefficiently exhausts that subnet’s resources and does not scale well.

Spawning thousands of canisters that do roughly the same thing slows down checkpointing & state sync, and slows down execution for other canisters (OpenFPL and everyone else). Adding a timer or redundant computation into all of those canisters that performs the same compute at the same time is inefficient from the protocol’s perspective, and is directly causing the slowdowns to other ICP apps these past few weeks. DFINITY engineering has been pretty clear about canisters per subnet and active canisters per subnet limitations in working groups, ICP.Labs, Global R&Ds, and communications on the forums over the past year and a half, so it’s not like these limits are new.

From building CanDB back in early 2022, I’m a huge fan of the actor model and the canister service architectures that dynamically spin up thousands/millions of canisters. While I understand the composable benefits of a canister per user mentioned by @saikatdas0790 above, these MMC architectures are complicated to build and maintain, are several orders of magnitude more expensive in terms of cycles costs, and don’t scale well performance-wise within an ICP subnet. In fact, now that Rust has stable structures and with Motoko’s upcoming Enhanced Orthogonal Persistence, I’d recommend that most devs start out with a single canister architecture and avoid premature splitting/optimization unless absolutely necessary.

If you’re an app that has chosen the dynamic 100k - 1million+ canisters per app path, then the only option you have moving forwards is to spill over onto other subnets. But this architecture is an incredibly inefficient use of resources, so instead of getting more out of the hardware and software you have in a single subnet, it makes inefficient use of multiple subnets.

As this “1M+ canister architecture” spills over into other subnets, since ICP subnets are shared resources, the architecture’s inefficiency ends up starving canisters on those subnets of compute and slowing down the rest of the apps on the Internet Computer. In a way, regardless of the intentions of the application, rapidly spinning up canisters on ICP becomes the quickest and cheapest way to DDOS a subnet.

DFINITY made a number of performance and scalability optimizations over the past two years that now allow a larger number of canisters per subnet, but these optimizations assume that most of the canisters on a subnet are idle within a single round of execution. As a messaging app where most of the daily active users are not constantly sending messages, OpenChat is a perfect example of what the subnet improvements are optimized for. This is why OpenChat can have 91k canisters on a single subnet while maintaining decent performance.

However, in the Yral/Bob case, all of the canisters regularly perform repeated computations (index/ledger canisters, timer-based algorithmic feed re-compute, etc.), so these performance optimizations don’t work, and ~20k regularly active and computing canisters will fully utilize a subnet. At this level of compute, it’s more efficient for the subnet and the app if the app condenses into just a few canisters (1-100) per subnet. This has the added benefit of the app being able to raise each individual canister’s compute allocation as needed, which would be difficult/cost prohibitive with 1M+ canisters or an MMC architecture.

If you’re curious about canister-subnet limits and want to learn more, a few resources I recommend are:

These forums (great search resource)
Global R&D (highly recommend attending the weekly ones if you’re a developer)
Performance and Scalability working group - meets once a month to discuss a variety of new feature & scalability related topics. I highly recommend attending if you want to learn about the latest protocol features and the best way to scale your app on the Internet Computer. They also record sessions, which you can watch after the fact!

Here’s a meeting recording from the Scalability and Performance working group this past July that dives into many of the soft & hard limits of canisters per subnet, while also touching on how many of these scalability assumptions don’t hold if most of the canisters on the subnet are constantly active & executing computations.

@jamesbeadle from many of the comments in this thread, it also seems that there’s a lot of confusion around how to build and scale an app on ICP. And reasonably so: there are many different approaches people have taken (single canister, few canister, MMC/canister per user, etc.) and no clear signal of which approaches scale the best (everyone is biased towards their own solution).

Maybe something that would help at this point would be to create a space where ICP devs can receive architectural feedback on the design and scalability of their apps from DFINITY engineers. Ideally, DFINITY engineers can provide supporting charts/metrics/data and other developers can learn from these communications.

zenicp · October 8, 2024, 8:42pm

Thanks for sharing the Zoom replay link from last month. Are other sessions cataloged somewhere? I’m new to the forum and IC development. A YouTube playlist would be a convenient/accessible way to archive and catalog working group sessions.
