That would surprise me tbh, do you have a canister id for me?
iyehc-lqaaa-aaaap-ab25a-cai.
We have canister logs that show when it started and finished its work.
It started at Oct 9, 2024, 3:11:41 PM UTC.
It finished its work at Oct 10, 2024, 2:50:01 AM UTC.
That's roughly 12 hours, but it's more like an 11-hour delay once you factor in the time the actual work takes.
My guess is that you do this work with a timer over many rounds, and that each round calls a canister on the same (or a similarly contested) subnet to get some value or distribute data? If you don't increase your compute allocation on all of the involved canisters, then yes, you will see this behavior.
Example:
Canister A sets a timer for the next round.
Canister A is not scheduled for 23 minutes.
23 minutes later: Canister A is activated and does work, but awaits Canister B for a value to finish the round.
Canister B is not scheduled for up to 23 minutes.
Up to 23 minutes later (+46): Canister B processes the request and returns the value to A.
Up to 23 minutes later (+69): Canister A is scheduled to process the response and starts loop 2.
If you do a few of these loops, you end up with a very long wait.
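To make that pattern concrete, here is a minimal sketch of the round-per-timer structure, assuming a Rust canister using ic-cdk and ic-cdk-timers (the same shape applies in Motoko); `other_canister` and `get_value` are hypothetical placeholders. Every `await` on a call to a contested canister is another point where the scheduler can delay you.

```rust
use std::time::Duration;
use candid::Principal;

// Hypothetical round driver: each round is kicked off by a timer and has to
// await another canister before it can queue the next round.
fn schedule_next_round(other_canister: Principal) {
    ic_cdk_timers::set_timer(Duration::from_secs(0), move || {
        ic_cdk::spawn(async move {
            // One scheduling delay to run this closure, one for the callee,
            // and one more for us to resume after the await.
            let (_value,): (u64,) = ic_cdk::call(other_canister, "get_value", ())
                .await
                .expect("inter-canister call failed");
            // ... do this round's work, then queue the next round ...
            schedule_next_round(other_canister);
        });
    });
}
```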
If this work can be parallelized at all, you should do so. Initiate all the calls as futures and then await them all at the same time (rough sketch below). Make sure you have enough cycles to cover all the potential calls (up to 500).
I’m guessing your work is minting or sending tokens to a group of accounts? Parallelize it.
See this thread: All or nothing batch transaction ICRC standard? - #14 by icme
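For reference, a minimal sketch of that pattern, assuming a Rust canister with the ic-cdk, futures, and icrc-ledger-types crates (type and crate paths may need adjusting to your setup). The point is to start every icrc1_transfer call before awaiting any of them, so the whole batch costs one await round instead of one per transfer.

```rust
use candid::{Nat, Principal};
use futures::future::join_all;
use ic_cdk::api::call::CallResult;
use icrc_ledger_types::icrc1::transfer::{TransferArg, TransferError};

// Fire off all transfers at once and collect the individual results; failed
// entries can be retried in a later pass.
async fn parallel_transfers(
    ledger: Principal,
    transfers: Vec<TransferArg>,
) -> Vec<CallResult<(Result<Nat, TransferError>,)>> {
    // Build every call future up front; nothing is awaited yet.
    let calls = transfers.into_iter().map(|arg| {
        ic_cdk::call::<_, (Result<Nat, TransferError>,)>(ledger, "icrc1_transfer", (arg,))
    });
    // Await all of them together.
    join_all(calls).await
}
```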
The primary work of the canister is to distribute rewards. The only other canister it interacts with is a ledger canister, and it is already doing work in parallel to some degree: we calculate all the transfers that need to happen and then do batches of 45 icrc1_transfer calls to the ledger. The canister was developed around 4-5 months ago, and at the time we found that doing more than 50 calls resulted in errors related to making too many calls. Perhaps we could make it do 500 and just retry the failures, but last time I did that I ended up with so many failures that it took more time to retry them than if I had just batched at 45.
If the ledger canister had timed out some of our icrc1_transfer calls, then we should have seen errors in our reward payments, but we didn't; they all went through, so I think the ledger canister was actually performing well. We also noticed that once the transfers started to work, they all went through in the normal time it would take, which would indicate that subsequent rounds were being processed in a timely manner, wouldn't it? It seems like there was just a huge 11-hour delay and then all the rounds got processed like normal.
Is it safe to parallelize up to 500 icrc1_transfers now?
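(For context, a rough sketch of the chunked approach described above, reusing the hypothetical `parallel_transfers` helper from the earlier sketch: send the transfers in fixed-size batches and collect anything that failed for a retry pass.)

```rust
use candid::Principal;
use icrc_ledger_types::icrc1::transfer::TransferArg;

// Send transfers in chunks of `batch_size`, returning the ones that failed so
// the caller can retry them later. Builds on the parallel_transfers sketch above.
async fn batched_transfers(
    ledger: Principal,
    transfers: Vec<TransferArg>,
    batch_size: usize,
) -> Vec<TransferArg> {
    let mut failed = Vec::new();
    for chunk in transfers.chunks(batch_size) {
        let results = parallel_transfers(ledger, chunk.to_vec()).await;
        for (arg, result) in chunk.iter().zip(results) {
            // Keep anything rejected at the call level or by the ledger itself.
            if !matches!(result, Ok((Ok(_),))) {
                failed.push(arg.clone());
            }
        }
    }
    failed
}
```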
I see, yeah so if you interact with another canister then it makes more sense. Your canister must definitely get scheduled more frequently than once in 11 hours, but if you make multiple calls to another canister and need to await the result (which means getting scheduled again), I understand how the whole task can take a long time.
That sounds more correct. I guess the only thing we can do is try to bump the batch amount from 45 to closer to 500.
You can fill the outgoing queue with up to 500 calls, but you need the right number of cycles in reserve. Each call reserves 20B cycles. You get them back when the calls complete (minus the actual execution cost, which is far, far lower), but the system has to reserve them because it doesn't know how long it needs to hold the call context for.
500 * 20_000_000_000 = 10_000_000_000_000 = 10T cycles.
So make sure you have at least 10T cycles + your freezing threshold (I'd double it just to be safe), and you should be able to up your parallelization to close to 500. I'd leave some room in case your canister needs to do something else.
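A minimal sketch of that headroom check, assuming the 20B-per-call reservation figure quoted above; `freezing_threshold_cycles` is a hypothetical value you'd take from your own canister settings:

```rust
// Check whether the canister has enough spare cycles to open ~500 calls at once.
const RESERVED_PER_CALL: u128 = 20_000_000_000; // 20B cycles reserved per open call
const MAX_PARALLEL_CALLS: u128 = 500;

fn has_headroom(freezing_threshold_cycles: u128) -> bool {
    // 500 * 20B = 10_000_000_000_000 = 10T cycles of reservations.
    let reserve_needed = MAX_PARALLEL_CALLS * RESERVED_PER_CALL;
    let balance = ic_cdk::api::canister_balance128();
    // Double the reserve as a safety margin, per the suggestion above.
    balance >= 2 * reserve_needed + freezing_threshold_cycles
}
```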
Also, if this is a custom ledger, consider moving it to the @PanIndustrial ledger at GitHub - PanIndustrial-Org/ICRC_fungible: A full implementation of an ICRC 1,2,3 compatible fungible token, which implements ICRC-4, and you could do hundreds of transfers in one call.
Thanks for the recommendations, this should also be a nice boost for us <3
A lot of our staging canisters are affected by the problem at large too. We will also have to add a lot of cycles just to test that our application works. I'm wondering how this will affect smaller developers without our budget.
My knowledge of how the scheduler works is based on @skilesare's comment.
Let’s see what the engineers working on it will figure out. I also have the same scheduling problem with what I am working on, so I am interested in how this gets solved, but we are now far away from the original topic and should move to another thread.
I suppose selecting canisters with multiple algorithms at the same time would make it feel faster, each algorithm getting a different % of the unreserved computation power: 1) by cycle balance, 2) round robin, 3) by canister age (against DoS), 4) by least cycles used. Here (4) is going to make devs who write optimized apps that don't get used a lot happy.
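Purely as an illustration of that idea (entirely hypothetical, not how the replica scheduler actually works), a sketch where each policy gets a fixed share of the free compute per round:

```rust
// Hypothetical split of free scheduler capacity across several selection policies.
#[derive(Clone, Copy)]
enum Policy {
    ByCycleBalance,    // 1) prefer canisters with larger cycle balances
    RoundRobin,        // 2) plain rotation
    ByCanisterAge,     // 3) older canisters first (against DoS)
    ByLeastCyclesUsed, // 4) reward canisters that burn little compute
}

/// Split `free_cores` across the policies by percentage weight.
fn allocate_cores(free_cores: u32) -> Vec<(Policy, u32)> {
    let weights = [
        (Policy::ByCycleBalance, 40u32),
        (Policy::RoundRobin, 30),
        (Policy::ByCanisterAge, 20),
        (Policy::ByLeastCyclesUsed, 10),
    ];
    weights
        .iter()
        .map(|&(policy, pct)| (policy, free_cores * pct / 100))
        .collect()
}
```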
Did they make it through?? Sounds like they haven’t found a solution yet.
Hey guys,
Seems the release DFINITY did yesterday fixed all our issues on our side. We are back to compute allocation 0, and update requests are working correctly.
Thanks DFINITY for your work here.
Some thoughts from people prompted by your reply and the way you handled this situation, @Manu.
I'm a strong supporter of your work and your knowledge, but sometimes you don't respond properly when concerns come from builders on the network.
Instead of telling developers that they are doing it wrong, DFINITY should be working day and night to fix these limitations and deliver a production-ready product faster. It's fine if a dapp builder's dapp can't scale because it isn't using the correct architecture; in that situation you can tell them not to use a single canister per user. BUT ONCE the approach taken by the dapp builder AFFECTS THE NETWORK ITSELF, you can't come back with that answer, because the network should be able to adapt and not be affected by what other people do on it. So does that mean you are going to say to an attacker,
"hey, don't use a canister per user because the network will slow down and this will affect us"? NO, THAT'S NOT THE ANSWER. The network should handle that properly. If what Yral is doing didn't affect other dapps on the subnet and the subnet itself, your answer would be okay, but once this affects everyone else, your answer is simply not professional.
@dominicwilliams @Jan people are worried about the replies of some engineers working at DFINITY sometimes. Thanks
Reflecting on the network limitation discussions… I think a key takeaway is the need for clear guidance. To support builders and ensure a harmonious network, I strongly recommend:
- Developing a Comprehensive Best Practice Guide for building scalable dapps on DFINITY, covering architecture, optimization, and network considerations.
This resource would empower devs to create successful apps while minimizing potential network impacts.