Subnets with heavy compute load: what can you do now & next steps

High load simply means too many canisters wanting to do too much at the same time for the subnet to keep up. Same as your laptop/desktop when you try to play an FPS while the OS is updating and you’re compressing a huge video. Only a lot worse, because we have subnets where literally 20k canisters have something to do all at once.

Right now, you can sort of guesstimate it from the number of updates, although that is not foolproof: 4 huge updates per round can fully occupy a subnet. The other signal is whether the subnet is experiencing drops in block rate: because instruction costs are not 100% accurate and because of high contention (lots of canisters doing memory writes; or simply the context switching), a round on a highly loaded subnet often takes more than the 400 ms required to maintain a block rate of 2.5 blocks/s, whereas a mostly idle subnet almost always completes rounds in under 400 ms.

We are also looking into exposing more metrics on the public dashboard, such as “scheduler latency” – how many rounds the average canister has to wait before getting scheduled. FWIW, after the recent scheduler improvements, with the exception of one subnet (bkfrj) this number is consistently below 2 rounds. So on most subnets right now the main source of latency is round duration: whenever a round takes a couple of seconds to complete, that’s a couple of seconds times 3 or 4 (because Consensus runs ahead of Execution, so a backlog of blocks builds up) that your ingress message has to wait before it is eventually executed. More consistent round durations are also something that is being worked on right now.
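To make the arithmetic concrete, here is a minimal sketch (plain Rust, with illustrative numbers that are not measurements) of how round duration and the Consensus-ahead-of-Execution backlog combine into the wait an ingress message sees:

```rust
/// Rough, illustrative estimate: a freshly submitted ingress message has to
/// wait for every block already queued ahead of it, each taking one round.
fn ingress_wait_secs(backlog_blocks: u32, round_duration_secs: f64) -> f64 {
    backlog_blocks as f64 * round_duration_secs
}

fn main() {
    // Mostly idle subnet: rounds finish in ~0.4 s, Consensus stays in step.
    println!("idle:   ~{:.1} s", ingress_wait_secs(1, 0.4));
    // Heavily loaded subnet: ~2 s rounds and a 3-4 block backlog.
    println!("loaded: ~{:.1} s", ingress_wait_secs(4, 2.0));
}
```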

Personally, I might prefer the predictability of computation over cycles, as we’re already using CycleOps to manage it. Dynamic pricing feels like a generalization, so with an improved CycleOps, a static policy could still be possible. It’s hard to imagine the IC competing with low-latency blockchains without dynamic pricing. (Thanks for taking feedback over the weekend, by the way!)

1 Like

Then reserve a compute allocation. I’m saying this in all seriousness, no sarcasm, no joking. If you want predictability, get guaranteed resources. Otherwise, the best you can hope for is something like “10x more than the next canister”. Which is not a lot when you’re looking at a subnet with 20k active canisters. And it’s also not “predictable performance”.

Things may improve when we implement automated load balancing via canister migration, but it’s still not going to be the same as guaranteed performance: you may want to limit your canister to certain kinds of subnets (e.g. GDPR or fiduciary), so even with a huge IC you still don’t get to pick among more than a handful of subnets, which may all be busy; and even when a lightly loaded subnet is available, the system will not be able to immediately migrate away all canisters the moment some application hits a huge load spike.

No compute allocation means best-effort. And best-effort means no (or very limited) predictability.

Dynamic pricing is not going to give you guarantees, unless you set up a rule like “pay whatever it takes to give me a full round”. At which point, someone might just bid $1k per message, just to get you to pay $1001. Consider what it would take to guarantee that your transaction is processed by Ethereum or Bitcoin.

I don’t know which low latency blockchains we’re talking about, but most of the ones I’ve seen are glorified ledgers. It’s not an apples-to-apples comparison. The IC is general-purpose decentralized compute.

And sure, we can introduce per-message bidding (we’ve discussed this a number of times). That would make it similar to Ethereum or Bitcoin (and the vast majority of blockchains). You can even implement something like that yourself inside your canister: get a compute allocation (so you don’t have to compete with the rabble) and charge your users (either by forcing them to have an account in your app; or by forcing them to call you through a wallet canister) for the privilege of going first.
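As a rough illustration of that do-it-yourself approach (a sketch only; the method names are hypothetical and the exact ic-cdk API paths vary between versions), a canister that holds a compute allocation could accept bids as attached cycles and serve the highest bidder first:

```rust
use std::cell::RefCell;
use std::collections::BinaryHeap;

// Sketch: callers attach cycles to their request as a bid; work is drained
// highest-bid-first. Not an official pattern; names are made up.
thread_local! {
    static QUEUE: RefCell<BinaryHeap<(u128, String)>> = RefCell::new(BinaryHeap::new());
}

#[ic_cdk::update]
fn submit(job: String) {
    // Accept whatever cycles the caller attached; that amount is the bid.
    let bid = ic_cdk::api::call::msg_cycles_accept128(u128::MAX);
    QUEUE.with(|q| q.borrow_mut().push((bid, job)));
}

#[ic_cdk::update]
fn process_next() -> Option<String> {
    // BinaryHeap is a max-heap, so the highest bidder goes first.
    QUEUE.with(|q| q.borrow_mut().pop()).map(|(_bid, job)| job)
}
```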

I apologize for getting so excited and combative but, as said, either you pay for guaranteed performance (which is very expensive, because you are basically blocking off a large proportion of a subnet for yourself); or you don’t get guarantees. It’s as simple as that. And this is not a shortcoming of the IC (although there are plenty of those). It’s simply impossible to provide “predictable” performance to arbitrarily many canisters with necessarily limited resources.

If you want an analogy, this is like saying “I cannot afford to buy a seafront house; I don’t even want to rent an apartment in it for the whole year, because it’s much too expensive; all I want is that, same as everyone else, whenever I decide to rent one of the apartments, one should be available”. No exaggeration.

4 Likes

Is it planned to use more than 4 cores per subnet? That’s the cheapest way to scale if execution rounds being filled up by other canisters is the bottleneck.

4 Likes

It is definitely an option. We didn’t do it because (i) there’s already high contention due to our setup (e.g. all mmap() calls following a page fault go through a single kernel thread; disk encryption is slow and IIRC it also goes through a single thread); and (ii) more concurrent executions mean higher peak memory usage (and, believe it or not, between replicated executions, queries, multiple versions of the replicated state, P2P artifact pools and the various overhead, we could easily exceed 512 GB of RAM; and if the kernel kills any of our sandboxes, it’s goodbye, deterministic execution).

There are people looking into all of this, it’s just that it’s not as simple as bumping a constant.

4 Likes

The idea was essentially to make compute allocation more flexible. If I understand correctly, we are currently paying a fixed price regardless of the subnet’s load. I believe that sometimes you can’t get the allocation if the resources are oversubscribed. Ideally, that shouldn’t be the case.

A compute allocation is a hard guarantee. It’s quite expensive, but you will get that percentage of a CPU core regardless of subnet load (e.g. if you pay for 50% compute allocation, you get scheduled at least 1 round out of every 2).

@free @Manu thank you for being frank here :pray:

I know that the (DFINITY) system dev/integration team is very experienced and knowledgeable, but IMO this is the most important topic for a cloud services provider: the abstracted functionality (most likely) won’t be more effective or reliable than the underlying layer/system. And from my experience, more heads (with different backgrounds) know more; diversity in experience and approach can be very helpful for solving such complex problems. So I want to ask whether there is any way community members can get involved in order to (try to) help with this.

E.g. there is a Technical Working Group DeAI (and other groups for specific topics); is there any “Base System” Technical Working Group?

There’s a Scalability & Performance Technical Working Group. More community involvement is always appreciated; attendance usually isn’t all that high.

4 Likes

I think the best metric to see whether a subnet has a lot of computation load is the “Million Instructions Executed Per Second” (MIEPS) metric on the subnet page of the dashboard. Many of the busy subnets are > 5 MIEPS, while some others are < 1 MIEPS.

2 Likes

The current high loads have really brought to the surface lots of interesting facts on how IC operates under the hood. Going forward, we must now internalise this information and let it affect what types of applications we try to build. I believe the general perception is that you can build and host “anything” on IC, that it is a platform to replace any AWS serverless function or Cloudflare worker. This, clearly, is not the case. Which is ok. It is just different. In some cases this difference is a deal breaker, in many cases you can work around it. Two examples:

  1. Login using SIWE. Signing in with Ethereum requires using threshold ECDSA and thus makes update calls during the login process. If it takes 20 seconds to log in, that is a 100% dealbreaker. Conclusion: Make sure your app runs on a subnet with low load, or make a compute allocation, or both. The smallest compute allocation, 1%, costs ca. $35 per month, right? (A rough cost check follows after this list.) Which unfortunately also is a 100% dealbreaker for small open source projects. So, either make sure you have a revenue model from day 1 or try to find a subnet with low load.
  2. A visual note taking application. While the user edits, the changes to the document are saved to the backend canister. In this case you can work with optimistic updates in the UI and do the saves in the background. It is not the end of the world if a save takes 20 seconds as long as you have planned for this wait.
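Regarding the ~$35/month figure in item 1, a quick back-of-the-envelope check. Assumptions (from memory, so verify against the current fee table): compute allocation costs roughly 10M cycles per percent per second on a 13-node subnet, and 1T cycles is roughly 1 XDR ≈ $1.33.

```rust
fn main() {
    // Assumed fee: ~10M cycles per percent of compute allocation per second
    // on a 13-node subnet (check the current fee table before relying on it).
    let cycles_per_percent_second: u64 = 10_000_000;
    let seconds_per_month: u64 = 30 * 24 * 3600;
    let cycles_per_month = cycles_per_percent_second * seconds_per_month; // ~26T for 1%

    // Assumed conversion: 1T cycles ~ 1 XDR ~ 1.33 USD.
    let usd = cycles_per_month as f64 / 1e12 * 1.33;
    println!("1% compute allocation: ~{usd:.0} USD / month"); // ~34 USD
}
```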

To avoid making app developers really sad, I think the unpredictability needs to be more clearly communicated. There was some change made to the main website just recently, but it would be wise to often (and in many channels) remind app devs that they cannot expect update calls to finish in 2 seconds.

5 Likes

We badly need more transparency on subnet stats. The current stats, MIEPS and TX/s, aren’t enough. Usually TX/s is a rather flat line of updates per second, sometimes in the hundreds. The fact that it is a flat line suggests that a high number of heartbeats must be drowning out the ingress messages. Ingress message rate would be expected to follow some recognizable fluctuation pattern over 24-hour and 7-day periods. We need to be able to see that.

Any additional information would be helpful such as ingress message rate, size of induction pool, percentage to which the execution rounds are filled, etc.

That will help developers understand when, for how long and why their canister’s user experience might have been impacted.

The question is also why we need a dashboard for this. Why can we not query nodes directly and have them export these stats?

10 Likes

It should be possible to get that funded from the community by donations. Doing that would be easier if there were a way to pay directly for the compute allocation of another canister, instead of sending cycles to the other canister first and then having that canister buy the compute allocation with them: that involves two steps, and the other canister can potentially withdraw the cycles or do something else with them.

I am mostly interested in the problems described in this thread that cannot be solved by compute allocation. There seem to be two kinds:

  1. I want a rather small (affordable) compute allocation, say for example 10%. I am worried about latency for my users’ ingress messages, but my own executions are rather short. That is a common scenario for most user-facing dapps. Say I have multiple noisy neighbours on the subnet which fill up the ingress queue, run heartbeats and timers, and have long executions. Now let’s look at the ingress latency for my users. Over three consecutive execution rounds the scheduler prioritizes ingress, heartbeats, timers and then starts with ingress again. So ingress is prioritized every 3rd round. If I have a 10% allocation, my ingress messages will get prioritized every 30 rounds, or around every 15 seconds. That is waiting time; the execution time and reading the response are still added to it. That is clearly too long. Even a 100% compute allocation will prioritize my ingress messages only every 3rd round, which can still be too slow.

I think that to better support low-latency, user-facing dapps with short executions, we need specialized subnets. We can optimize for the absence of heartbeats/timers and for short executions. For example, we can prioritize ingress 9 rounds out of 10 and we can limit the execution time to something lower than 2 billion instructions, so that multiple canisters can run per core per round.

  2. My own canister uses long executions. That fundamentally contradicts low latency responses from my canister, because a canister is an actor and cannot process two messages in parallel. If that is the scenario then I have to move to a multi-canister architecture and possibly also a multi-subnet architecture. If I have such high execution load I might also have to give up on low latency. And I should be prepared to pay for the required compute allocation or even rent an entire subnet.

The complaints in this thread seem to come from apps in scenario 1. And the root cause of the problem seems to be that we have one-size-fits-all subnet parameters that simply cannot deliver what these dapps need. Trying to deliver low latency and long executions spanning full rounds in the same subnet is not going to work.

3 Likes

I understand that a BEST-EFFORT allocation strategy is a no-go. How much compute allocation should a project reserve, generally? Can you show a canister config?

Not necessarily. On many subnets there is a small number of heartbeats / timers apparently making significant numbers of canister calls. So yeah, the load is flat instead of following a more realistic daily / weekly cycle. But it’s almost all “updates”, not heartbeats or timers. As it turns out, breaking things out by heartbeats vs updates is not all that useful for separating user traffic from automated load. Similarly, an ingress message may only execute a couple of instructions and terminate; or it may make any number of downstream calls (i.e. ingress vs canister calls also doesn’t distinguish between user traffic and automated traffic).

And finally, there are a couple of subnets where, even though all canisters are executed virtually every round, we still have huge backlogs of ingress messages accumulating (and sometimes timing out after 5 minutes). Which looks very much like “heartbeats via ingress”. I.e. ingress messages are very much not the same as user traffic.

That same information would also be of particular interest to a potential attacker. If I were trying to DOS a subnet, I would very much like to know what’s the cheapest / most effective way of achieving this or that. While I don’t believe that security through obscurity is a great approach, I also don’t think that exposing detailed subnet metrics is necessarily the smartest way to go about this.

The replicas do have a /metrics endpoint, which is not publicly exposed, partly for the reasons above; and partly because it produces a 2 MB response that would be rather trivial to abuse.

That round-robin across heartbeats, timers and messages is per-canister, not for the subnet as a whole. This is because a canister may inadvertently (and we had canisters in this situation) have such a heavy heartbeat that all it ever gets to do is execute said heartbeat and nothing else. If your own canister doesn’t run 2B instruction heartbeats (or has no heartbeat at all) you have nothing to worry about.
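To make the “per-canister, not subnet-wide” point concrete, here is a toy model (not replica code; all names invented) of a round-robin cursor that each canister keeps for itself, so a neighbour’s heavy heartbeat has no influence on which of your inputs runs next:

```rust
// Toy model, not replica code: every canister rotates through its own input
// sources, so one canister's heavy heartbeat can only ever starve that same
// canister's other inputs, never a neighbour's.
#[derive(Clone, Copy, Debug)]
enum InputSource {
    Ingress,
    Heartbeat,
    Timer,
}

struct CanisterScheduler {
    next: usize, // per-canister cursor into ORDER
}

impl CanisterScheduler {
    const ORDER: [InputSource; 3] =
        [InputSource::Ingress, InputSource::Heartbeat, InputSource::Timer];

    /// Pick this canister's next non-empty input source, resuming from where
    /// the previous round left off.
    fn next_input(&mut self, has_input: impl Fn(InputSource) -> bool) -> Option<InputSource> {
        for i in 0..Self::ORDER.len() {
            let source = Self::ORDER[(self.next + i) % Self::ORDER.len()];
            if has_input(source) {
                self.next = (self.next + i + 1) % Self::ORDER.len();
                return Some(source);
            }
        }
        None
    }
}

fn main() {
    let mut sched = CanisterScheduler { next: 0 };
    // A canister with no heartbeat and no timers always gets its ingress
    // messages picked, regardless of what its neighbours are doing.
    let pick = sched.next_input(|s| matches!(s, InputSource::Ingress));
    println!("{pick:?}"); // Some(Ingress)
}
```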

Canisters can process multiple requests in parallel, as non-replicated queries. Technically, there is no reason (beyond the added complexity and it needing to be implemented) why a canister couldn’t also process multiple replicated queries concurrently.

If you are talking about mutations, then yes, they must be executed sequentially. And you may require sharding, load balancing and guaranteed resources. And long executions do inevitably cause high latency. But there really isn’t a way around this unless we fundamentally redesign canisters and the IC.

As per the above, this has nothing to do with prioritizing heartbeats, timers and messages in a round-robin fashion (unless you’re shooting yourself in the foot). It is simply about too many best-effort canisters trying to execute too many instructions at the same time.

And it’s not even that so much, except for the bkfrj subnet. All other subnets pretty much consistently manage to execute all heartbeats, timers and messages within a round (and sometimes two). It’s just that, because we cannot perfectly predict how much work and particularly how much contention some operation or other is going to cause, when the subnet is running at near-100% capacity, rounds tend to take more than exactly 400 ms (which is what would be required for a consistent 2.5 blocks/s). And when that happens, Consensus runs slightly ahead of the Deterministic State Machine, meaning that you now have both rounds that take 1-2 seconds instead of 400 ms, and a backlog of 4-5 blocks to go through (so something like 10 seconds) before a newly submitted ingress message can even start being executed.

The Execution team is looking into flattening out spikes in round duration; and we may consider looking into whether it’s possible to hold Consensus back more when the DSM is slow, since a 5 block backlog is not useful in any way.

1 Like

I think it’s important to add that, unless your canister has a lot of work that it can’t do in a single round, it can do all those things in one round, so I think it would still be every 10 rounds, not 30. Separately, of course the canister developer has the choice to use heartbeat or not. I would recommend everybody not to use heartbeat and instead rely on timers that are as infrequent as possible, to avoid unnecessarily loading your own canister. If you follow that approach, then in most rounds you only have ingress messages to process anyway.
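For anyone replacing a heartbeat, a minimal sketch of the infrequent-timer pattern (assuming the ic-cdk and ic-cdk-timers crates; adjust names and intervals to your own canister):

```rust
use std::time::Duration;

// Instead of a heartbeat that wakes the canister every round, schedule work
// only as often as it is actually needed.
#[ic_cdk::init]
fn init() {
    // Run housekeeping once per hour; the canister stays idle in between.
    // Note: timers are not persisted across upgrades, so re-register this in
    // post_upgrade as well.
    ic_cdk_timers::set_timer_interval(Duration::from_secs(3600), || {
        ic_cdk::spawn(periodic_work());
    });
}

async fn periodic_work() {
    // ... whatever the heartbeat used to do ...
}
```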

Edit: I see I basically repeated what @free already said above :slight_smile:

2 Likes

I wouldn’t say it’s a no-go. Because compute allocation is very literally guaranteed, there’s only so much of it. I.e. a subnet only has 300 percentage points of compute allocation (3 virtual threads out of 4) that can be handed out. And, for now, we have fewer than 40 subnets. So even if everybody had the cycles, it would be impossible for everybody to get guaranteed compute. It’s first come, first served.

As for how much compute allocation you need, that depends on your application. Do you really need single round latency? Unlikely, but if so, then reserve 100 (as in 100% of a virtual CPU core). If you’re fine with e.g. running every 5th round when the subnet is under heavy load, then reserve 20.
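And since a canister config was asked for above: here is a minimal sketch of a controller canister reserving an allocation for a target canister through the management canister, using the ic-cdk bindings (module paths differ between ic-cdk versions, so treat this as approximate; the same setting can also be applied from the command line with `dfx canister update-settings --compute-allocation`):

```rust
use candid::{Nat, Principal};
use ic_cdk::api::management_canister::main::{
    update_settings, CanisterSettings, UpdateSettingsArgument,
};

/// Reserve a guaranteed 20% of a CPU core for `target` (i.e. scheduled at
/// least every 5th round). The caller must be a controller of `target`, and
/// the allocation is paid for continuously in cycles.
async fn reserve_compute_allocation(target: Principal) -> Result<(), String> {
    update_settings(UpdateSettingsArgument {
        canister_id: target,
        settings: CanisterSettings {
            compute_allocation: Some(Nat::from(20u64)),
            ..Default::default()
        },
    })
    .await
    .map_err(|(code, msg)| format!("update_settings failed: {code:?}: {msg}"))
}
```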

1 Like

Please no. :pray::pray::pray:

Autonomous agents have the capacity to vastly outnumber interactive users in the future. Let’s not handicap what could be our biggest burner.

I see this as a huge informational win. It always seemed “too good to be true” and lots of people called bull s— for years and a bunch of us true believers didn’t want to hear it or didn’t want to listen.

We now have some very clear lines about the cost and limits of our magic internet machine that actually make sense and the engineering that has been done is a marvel.

Look. Your whiz-bang money printing dapp is likely going to cost you $3500 a month, just like if you were using AWS. But you don’t have a whiz-bang money printing app, do you? No you don’t. So until it becomes one you get to operate on pennies on the dollar with t-ECDSA, vetKeys, attestations, a native currency, backup, 13x replication, and the other 100 features that make the IC amazing.

Taking this architecture and cost model into account from the beginning will feel much better than the opposite: slowly having every magic bean taken away as reality sets in.

I’ve always hated the marketing sheen of unreality. I like having these raw numbers and operating parameters and evidence. I can work with that. :rocket::rocket::rocket:

9 Likes

The aim is not to put a hard cap on batch / background workloads. It’s simply to balance (not even prioritize, I got a bit carried away there) interactive and batch workloads. So (to take a simplistic example) if you have 100 canisters with ingress messages and 300 canisters with heartbeats, you give 1 CPU core to the former and 3 CPU cores to the latter. Or something along those lines.
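Purely to illustrate that simplistic example (this is not how the scheduler is implemented), the split could be made proportional to the number of canisters in each class:

```rust
/// Toy illustration: divide the subnet's execution cores between interactive
/// (ingress-driven) and batch (heartbeat/timer-driven) canisters in proportion
/// to how many canisters fall into each class. Not replica code.
fn split_cores(total_cores: u32, interactive: u32, batch: u32) -> (u32, u32) {
    if interactive == 0 || batch == 0 {
        // Only one class present: it gets all the cores.
        return if batch == 0 { (total_cores, 0) } else { (0, total_cores) };
    }
    let total = interactive + batch;
    // Proportional split, rounded to nearest, at least one core per class.
    let for_interactive =
        ((total_cores * interactive + total / 2) / total).clamp(1, total_cores - 1);
    (for_interactive, total_cores - for_interactive)
}

fn main() {
    // 100 canisters with ingress messages, 300 with heartbeats, 4 cores:
    assert_eq!(split_cores(4, 100, 300), (1, 3));
}
```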

Because the problem we were (and to some extent still are) seeing is that heartbeats are basically starving interactive requests. Before the scheduler improvements, we had subnets where it took us thousands of rounds to go over all canisters, mostly executing heartbeats that did no work; while we had backlogs of ingress messages (that we had been able to handle just fine days before) building up into the thousands and timing out. Had the ingress messages only had to compete against other ingress messages, we would have had no user-visible issues; and it would have only marginally affected the throughput of heartbeats.

Regardless, this was just a discussion, nothing materialized. We’re still thinking through the alternatives that are open to us, now that we also have a better idea of the issue at hand ourselves.

5 Likes