Subnets with heavy compute load: what can you do now & next steps

The cost is 10M cycles per second per percent of compute allocation on a 13-node subnet. That’s roughly 26T cycles per month, or about 35 USD. You need to make sure that your canister holds more cycles than it needs to pay for the compute allocation over the freezing threshold. So if your freezing threshold is X seconds and you want to set a Y% compute allocation on a 13-node subnet, your canister must hold at least X × Y × 10M cycles. It is of course smart to give it some extra cycles so that it doesn’t freeze shortly after setting the compute allocation.
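As a quick sanity check of those numbers, here is a minimal sketch of the formula, assuming the 10M cycles per percent per second fee above (`min_cycles` is just an illustrative helper, not a management canister API):

```rust
// Compute allocation fee on a 13-node subnet, as quoted above:
// 10M cycles per percentage point, per second (figure from this thread).
const FEE_PER_PERCENT_PER_SEC: u128 = 10_000_000;

/// Minimum cycles a canister must hold so the compute allocation fee,
/// accrued over the freezing threshold, does not freeze it.
fn min_cycles(freezing_threshold_secs: u128, allocation_percent: u128) -> u128 {
    freezing_threshold_secs * allocation_percent * FEE_PER_PERCENT_PER_SEC
}

fn main() {
    // Example: the default 30-day freezing threshold and a 1% allocation.
    let thirty_days_secs = 30 * 24 * 60 * 60; // 2_592_000 s
    println!("{} cycles", min_cycles(thirty_days_secs, 1));
    // 25_920_000_000_000 cycles ≈ 26T, i.e. the ~35 USD/month quoted above.
}
```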

2 Likes

This proposal seems a bit odd without specifics. The issue isn’t even compute. I’m sure some things can be tweaked, but the clogged-subnet issue is a result of the number of canisters on a subnet at the moment. They could all be spending 1 cycle during their turn, and making them spend 3x for that cycle doesn’t fix anyone’s problem.

If some scheduler issues are fixed, things may get a bit better in terms of scheduling throughput.

I’m pretty sure you’d have to increase compute costs well above 83x to actually make people think twice about choosing the architecture that is causing this. And yes…if timers become 83x more expensive very few will use them. But the more features you add to this mix…suddenly you wake up and your platform is more expensive than anyone can justify.

4 Likes

Update on the new replica version

The new replica version has reached the first busy subnet, fuqsr, and I’m happy to report that the first signs are looking very good. We have not seen any expired ingress messages on that subnet in the hour it has been running the new version.

You can see on the release page of the dashboard which subnets have received the new replica version. If you use canisters on one of the upgraded subnets, please let us know if you notice an improvement.

27 Likes

My friend, I’m afraid this situation has exposed a very rough proposition for you if on-chain AI is your goal. I’m pretty sure that for anything useful that is guaranteed to compute every round and across rounds, you’re going to end up needing the equivalent of a compute allocation of 100, which gets you to $3,500/month really quickly (rough check below) and will demand some kind of extraordinary use case. You can always rent your own subnet to ensure you have priority, but I think that’s going to limit you to 4 concurrent processes for >$16k/month.

I think DTS gets some kind of special priority once it gets going, but getting low latency to kick off your process is going to require something pretty high.
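For reference, the $3,500 figure is consistent with the 10M cycles per percent per second fee quoted earlier in the thread. A rough check (the ~1.33 USD per trillion cycles conversion is an approximation of the XDR rate, not official pricing):

```rust
fn main() {
    // 100% allocation at 10M cycles per percent per second (figure from this thread).
    let cycles_per_month: u128 = 100 * 10_000_000 * 30 * 24 * 60 * 60;
    println!("{cycles_per_month} cycles"); // 2_592_000_000_000_000 = 2_592T
    // At roughly 1.33 USD per 1T cycles (approximate XDR conversion):
    println!("~{:.0} USD/month", cycles_per_month as f64 / 1e12 * 1.33); // ≈ 3447
}
```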

3 Likes

I personally see it much more positively, actually. I really believe that we can get to a place where only dapps with very strong availability requirements need to use compute allocations, and that dapps without compute allocations are typically still responsive. I also think we wouldn’t need huge pricing changes for that. There is not a big problem right now with, e.g., AI workloads, where you do a significant chunk of computation in one go. Small canisters and small messages are harder for the system. I fully agree that this needs to be worked out in more detail, but I hope and expect that we can land in a good spot with replica improvements and targeted price changes that don’t make things hugely more expensive overall.

13 Likes

Thanks for the update Manu, much appreciated. We also have a canister on that subnet, and so far it does not appear to time out, so well done :grin:

1 Like

Of course, today’s protocol will not lead to anything beyond tiny AIs doing some fun things, but I expect the protocol to change to enable experiences with larger AIs.

3 Likes

Hi @Manu, thank you for sharing the update about the new replica version. At RuBaRu, we’ve been very thoughtful about balancing costs and our project vision when selecting the architecture (on-demand scaling, a multi-canister approach, with multiple users sharing the same canister) for building the fully OnChain Creator ecosystem. Since we store rich media like video, audio, and images directly on-chain, we anticipate significant storage demands as a social media platform. However, we expect more query calls than update calls.

I’m curious whether any cost escalations are anticipated for storage, which would help us better estimate scaling costs. From previous discussions, it seems computation costs may rise, while storage costs could remain steady.

This information would be invaluable for us in accurately calculating and forecasting future requirements.

TIA

2 Likes

This is excellent news! I’m looking forward to seeing how this progresses as the changes are rolled out to more subnets. Well done, everybody at DFINITY!

1 Like

It feels super fast for the moment (subnet lhg73). Congrats! :partying_face:

3 Likes

Any idea when yinp6-35cfo-wgcd2-oc4ty-2kqpf-t4dul-rfk33-fsq3r-mfmua-m2ngh-jqe will be updated?

1 Like

Back in business! Thank you :pray:

2 Likes

It sounds a bit strange to ask people with small dapps to pay more just because someone like Yral puts heavy load on a subnet… Why doesn’t AWS ask me for more money because Instagram uses a lot of resources?

Why would we need a cost increase if the nodes are running profitably? A cost increase only makes sense if nodes are losing money. Otherwise we end up in the same situation, except that devs who build here will pay more, or projects like Yral will move away, which is not good at all.

If you increase prices, then Yral paying more will not help me get my app running in any way. I will end up paying more and still have no resources.

The only potential cost increase I see here is for heartbeats. Make everything that runs in a heartbeat cost 10x more than usual. Then devs will think twice before launching 100k canisters with heartbeats in them.

If that is what causes the problems, that is what should be addressed, not everything else.

The other thing is that we clearly need some kind of sharding. We can’t run a heavy app on a single node… I mean, yes, there are 13 nodes, but if all 13 nodes hold the same data, then it’s the same as running everything on 1 node with the others just acting as backups.

Yes, we have subnets for that, but subnets would only make sense if this were somehow automatic: you create a canister and it ends up on the least-used subnet, and if a subnet becomes too busy, some canisters are automatically moved to other subnets without any manual interaction.

Another thing: how are query calls different from heartbeats? Do heartbeats run on all 13 nodes for the same canister, while a query call runs on only 1 node at a time? If so, it makes sense for heartbeats to cost far more than a usual query call, which is free, I believe?

And why do heartbeats need to run every beat? Maybe some canisters only need to run heartbeats every 10 minutes? Let the canister interface provide the period at which it runs. And if the cost of heartbeats increases, devs will choose wisely how often they want to run them.

2 Likes

Are you experiencing any issues on yinp6? From the metrics, it looks like it is not overloaded at all.

2 Likes

If I remember correctly, it’s already possible to do that with timers; heartbeats are the older approach that was available before timers were introduced back at the end of 2022.

As for why timers and heartbeats are update calls that go through consensus: any call that modifies state goes through consensus and is thus an update call.

Query calls do not update state and thus do not need consensus. Doing query calls on a heartbeat or timer wouldn’t be very useful, since they wouldn’t update the state.
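To make the timer-vs-heartbeat point concrete, here is a minimal Rust canister sketch, assuming the `ic-cdk` and `ic-cdk-timers` crates (the 10-minute period and the `runs_so_far` method are just illustrative):

```rust
use std::cell::RefCell;
use std::time::Duration;

thread_local! {
    static RUNS: RefCell<u64> = RefCell::new(0);
}

#[ic_cdk::init]
fn init() {
    // Instead of doing work on every heartbeat (i.e. every round), schedule
    // it at the period the canister actually needs, e.g. every 10 minutes.
    // Note: timers do not survive upgrades; re-register in post_upgrade.
    ic_cdk_timers::set_timer_interval(Duration::from_secs(600), || {
        // Timer callbacks execute as update calls: they go through consensus
        // on all replicas and may therefore mutate state.
        RUNS.with(|r| *r.borrow_mut() += 1);
    });
}

// A query call: read-only, served by a single replica without consensus,
// which is also why running a query from a timer wouldn't be useful.
#[ic_cdk::query]
fn runs_so_far() -> u64 {
    RUNS.with(|r| *r.borrow())
}
```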

2 Likes

That makes sense. Why would you need a query call on a timer?

2 Likes

Well done :clap: ! Confirming that we haven’t seen any expired ingress messages on our side over the past 6 hours on 3hhby

8 Likes

Yes, the ingress expiry issue.

1 Like

That seems to have resolved the issue! Our DApp is back to normal and working smoothly. We’ll keep an eye on it with some extra load testing.

Thanks to @Manu and team!

6 Likes

Will this fix also work on the European subnet?

3 Likes