High User Traffic Incident Retrospective - Thursday September 2, 2021

It’s not a general rate limit; it’s a rate limit per subnet. All replicas on a subnet do the exact same replicated workload (ingress message and canister message execution) and share the same query load (round-robin), so it makes sense that they share the same rate limits.

Let’s hope this never happens again on the IC because it looks really bad, given how the technology was touted and how load-balanced static-content competitors could have handled this load.

Yes, there’s still a lot of work to do (including optimization) and this is the first time we’ve experienced this level of traffic on a single subnet. That being said, the scalability claims were made for the IC as a whole, not for single canisters. There is only so far a single-threaded canister can scale before it hits a wall. (And canisters need to be single-threaded because they need to be deterministic. There may well be more elaborate solutions to that, but the IC is not there yet.)

7 Likes

Not yet. What we’re using internally to monitor NNS canisters is a hand-rolled Prometheus / OpenMetrics /metrics query endpoint where we export timestamped metrics (queries handled, Internet Identities created, etc.). Timestamped because you can’t guarantee which replica you will be hitting and some replicas may be behind. Prometheus will drop samples with duplicated or regressing timestamps, so this solves the issue of counters appearing to go backwards.
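For illustration, here is a minimal sketch of what such a timestamped text exposition could look like. The metric names and the endpoint wiring are hypothetical; the only idea taken from the above is attaching an explicit timestamp to each sample so Prometheus can drop duplicated or regressing ones.

```rust
// Minimal sketch of a Prometheus text exposition with explicit timestamps,
// so that a scrape landing on a lagging replica produces samples that
// Prometheus will recognize (and drop) as duplicated or regressing.
// Metric names and the surrounding /metrics endpoint wiring are hypothetical.

struct Sample {
    name: &'static str,
    value: u64,
    // Time at which the replica last updated this counter, in ms since epoch.
    timestamp_ms: u64,
}

fn render_metrics(samples: &[Sample]) -> String {
    let mut out = String::new();
    for s in samples {
        // Prometheus text format: `<name> <value> <timestamp_ms>`.
        // The trailing timestamp lets Prometheus discard stale samples
        // instead of recording a counter "going backwards".
        out.push_str(&format!("{} {} {}\n", s.name, s.value, s.timestamp_ms));
    }
    out
}

fn main() {
    let samples = [
        Sample { name: "queries_handled_total", value: 123_456, timestamp_ms: 1_630_612_800_000 },
        Sample { name: "internet_identities_created_total", value: 42_000, timestamp_ms: 1_630_612_800_000 },
    ];
    print!("{}", render_metrics(&samples));
}
```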

Of course, a lot more than that can be provided by the IC directly, but there is data privacy to be taken into consideration and we’re still working down our priority list.

4 Likes

This 1-core limit only applies to this one subnet and (by total coincidence) was introduced earlier this week. It was due to limitations we hit because of another canister (with updates and queries that read and wrote tens of MB of memory each) and how that interacted with our orthogonal persistence implementation (which uses mprotect to track dirty memory pages, and which we’re already working on optimizing).

The other thing that we’re working on is canister migration (to other subnets), but that’s still in the design stages. We will share a proposal when we have something more solid. Canister migration would have allowed us to move ICPunks to a different subnet, which would not have made a huge difference to ICPunks itself, but would have reduced its impact on the other canisters on the subnet (given that there are 64 cores available for execution).

5 Likes

Thank you for providing so much information so transparently.

Looking forward to the documentation on these topics.

I encountered a problem before where there was a choice:

  1. Double the storage, e.g. by adding a hashmap; the advantage is O(1) complexity when querying.
  2. Don’t double the storage, but then querying is O(N), since you need to traverse the array and apply some filtering.

I don’t know which is better.

1 Like

I don’t think the subnet could have served a lot more update traffic for the ICPunks canister. It was already running into the instructions per round limit (which is there to ensure blocks are executed in a timely fashion, so block rate is kept up; and to prevent one canister from hogging a single CPU core for a very long time while all other canisters have finished processing and all other CPU cores are idle). It could have handled more updates in parallel on other canisters though, had we not introduced this 1 CPU core limit earlier this week (to this one subnet).

As for read-only queries, it could have probably handled about 2x the traffic, but currently the rate limiting is static and it would need to be dynamic (based on actual replica load) in order to scale with the capacity instead of needing to be conservative.

Also, it would be very interesting to know the relation between adding nodes to a subnet and update throughput. How many updates can one subnet handle?

Adding more nodes to a subnet will not increase update throughput. All replicas execute all updates in the same order, so update throughput would be exactly the same, but latency would be increased (because Gossip and Consensus need to replicate ingress and consensus artifacts to more machines and tail latency increases).

It would increase read-only query capacity though, proportionally to the number of replicas (as each query is handled by a single replica).
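As a toy illustration of that asymmetry, the sketch below models the two cases. The per-replica and per-subnet numbers are made up; only the scaling behaviour (queries scale with replicas, updates do not) comes from the explanation above.

```rust
// Toy model of subnet capacity vs. replica count. The numbers are invented;
// the point is that query capacity scales with the number of replicas while
// update throughput does not (every replica executes every update).

fn query_capacity_qps(per_replica_qps: f64, replicas: u32) -> f64 {
    // Each query is handled by a single replica, so capacity adds up.
    per_replica_qps * replicas as f64
}

fn update_throughput_ups(per_subnet_ups: f64, _replicas: u32) -> f64 {
    // All replicas execute all updates in the same order, so adding replicas
    // does not add update throughput (it mostly adds latency).
    per_subnet_ups
}

fn main() {
    for replicas in [13u32, 26, 52] {
        println!(
            "{} replicas: ~{} QPS for queries, ~{} updates/s",
            replicas,
            query_capacity_qps(500.0, replicas),
            update_throughput_ups(900.0, replicas),
        );
    }
}
```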

5 Likes

Also

  1. Does the current cycles model truly reflect the consumption of physical resources (of course, the cycles model not only needs to consider the consumption of physical resources, but also lowering the threshold for developers, etc.), and is there any plan to adjust it later? There was an unreasonably priced opcode in Ethereum’s gas model, which led to a DDoS attack.
  2. Is there an accurate description of all operations that consume cycles, so that developers can follow best practices?
2 Likes

The cycles model does reflect the cost of physical resources, but while it did involve careful thinking, it is based on test loads and estimates, not real-world traffic (which didn’t exist back when the cycles model was produced). As a result, it is very likely to get adjusted, although no clear plans for that are yet in place.

  1. Is there an accurate description of all operations that consume cycles, so that developers can follow best practices?

Here is everything that is currently being charged for and the cycle costs. But I would optimize for performance in general rather than for cycle costs specifically, as cycle costs are quite likely to follow performance at some point in the future (as in, everything that’s slow or causes contention is likely to cost more, and vice versa).

5 Likes

This is exactly what I assumed. In that case, to achieve write performance, I need to deploy canisters to many subnets and do data sharding at the application level. Am I right?

2 Likes

I believe that this could solve the problem. Now the question is how to do proper sharding 🙂

Not necessarily to many subnets, a single subnet might suffice. (Think of subnets as VMs and canisters as single-threaded processes. In this instance.) You can likely scale quite a bit on a single subnet (by adding more canisters) before you need to scale out to other subnets.

Once you start scaling across subnets, if there’s any interaction between canisters (and I would assume there would be at least some for most applications) you have to account for XNet (cross-net) latency, which is more or less 2x ingress latency (the “server” subnet needs to induct the request, the “client” subnet the response). Whereas on a single subnet without a backlog of messages to process this roundtrip communication could even happen within a single round.
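A rough back-of-the-envelope model of that accounting is sketched below. The round time and ingress latency values are placeholders; only the relation (a cross-subnet round trip costs roughly two inductions, i.e. ~2x ingress latency, while a same-subnet round trip can fit in one round) comes from the above.

```rust
// Back-of-the-envelope latency model for a request/response pair between two
// canisters. The numbers are placeholders; the relation between the two cases
// is the point.

fn same_subnet_roundtrip_ms(round_time_ms: u64) -> u64 {
    // With no backlog, request and response can be inducted and executed
    // within (roughly) a single round on the same subnet.
    round_time_ms
}

fn cross_subnet_roundtrip_ms(ingress_latency_ms: u64) -> u64 {
    // The "server" subnet needs to induct the request, the "client" subnet
    // the response: roughly two inductions, i.e. ~2x ingress latency.
    2 * ingress_latency_ms
}

fn main() {
    println!("same subnet:  ~{} ms", same_subnet_roundtrip_ms(1_000));
    println!("cross subnet: ~{} ms", cross_subnet_roundtrip_ms(2_000));
}
```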

3 Likes

The right answer is almost always to double the storage. Almost every DB in the world uses some kind of index to speed up queries. If you are relying on iterating through a list (a table scan in DB speak), your app is doomed to crawl sooner or later. Clustered data and an index can get you there for much less than 2x storage where the data can be ordered, but your writes are going to be heavier. Everything has a trade-off, but it’s usually one worth making for your future sanity and your users’ experience.
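To make the earlier question concrete, here is a minimal sketch of the two options in Rust (the record type and field names are invented): keeping a secondary HashMap index next to the primary Vec trades roughly 2x storage and extra work on insert for O(1) lookups instead of an O(N) scan.

```rust
use std::collections::HashMap;

// Invented record type, for illustration only.
#[derive(Clone)]
struct Nft {
    id: u64,
    owner: String,
}

struct Store {
    // Primary storage, in insertion order.
    nfts: Vec<Nft>,
    // Secondary index: NFT id -> position in `nfts`. Costs extra memory and
    // extra work on every insert, but makes lookups O(1).
    by_id: HashMap<u64, usize>,
}

impl Store {
    fn new() -> Self {
        Store { nfts: Vec::new(), by_id: HashMap::new() }
    }

    fn insert(&mut self, nft: Nft) {
        self.by_id.insert(nft.id, self.nfts.len());
        self.nfts.push(nft);
    }

    // Option 1: O(1) lookup via the index.
    fn get_indexed(&self, id: u64) -> Option<&Nft> {
        self.by_id.get(&id).map(|&i| &self.nfts[i])
    }

    // Option 2: O(N) lookup via a full scan (a "table scan").
    fn get_scan(&self, id: u64) -> Option<&Nft> {
        self.nfts.iter().find(|n| n.id == id)
    }
}

fn main() {
    let mut store = Store::new();
    store.insert(Nft { id: 7, owner: "alice".into() });
    assert!(store.get_indexed(7).is_some());
    assert!(store.get_scan(7).is_some());
}
```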

1 Like

Looking into this, evidence of load shedding on other subnets was discovered. The boundary nodes were running out of file descriptors and thus were unable to open socket connections to serve load. The volume of this was not too high, but it was sufficient to cause the general slowdown you observed. In this case, further investigation is needed to determine what caused this bottleneck (since it could be that threads/connections were piling up on boundary nodes because replicas were being slow to respond).

Load test and tune the rate limits based on more realistic traffic loads.

The boundary nodes will be included in this action item to better tune their performance and prevent this cross-subnet impact.

4 Likes

Thank you for the post!

Could you please give more details on the fact that early warnings (limiting requests) were triggered hours before the actual ‘flood’? More specifically, what was the actual metric and value v* used to limit subnet reqs?

Was v* a configuration mistake? i.e. a ‘shot in the dark’ value? Of course it is understood that everything is ‘beta’ and there was probably no time to benchmark/stress test a ‘staging’ subnet, but if this is the case can you please explain why action (*) was not taken hours before the actual hurricane? After all, ICPunks publicly planned the event.

If v* was actually an educated guess and there was no way to increase the limit, or increasing the limit would not have helped (as @free suggests), can we conclude that a single app deployed on a subnet of 13 fat nodes cannot sustain traffic of ~15k reqs/sec (i.e. the traffic at 20:00)?

As for the action item “Improve documentation on how to scale decentralized applications on the IC.”
Can you give a preview of how this will be achieved? How could ICPunks have deployed its code in a more scalable way?

OT
The ICP-burned-over-time chart (Charts | ic.rocks) does not show any ‘previously unseen’ pattern. I understand that it is an IC-wide metric, but it seems strange that such an exceptional event has not left any trace there?! @wang
/OT

Thank you

(*) increase the limit or remove it to protect other canisters, or liaise with ICPunks to agree on a solution to avoid the ‘difficult’ launch, or…

2 Likes

Traffic was already considerable before 20:00 UTC. Some people got early access to claim their tokens (before it was open to the public, I believe there were 2 previous rounds) and they (plus some eager regular users) were already putting significant load onto the canister before 20:00 UTC. At some point that went over the configured limit, whatever that was, and requests started being dropped.

The boundary node rate limit (which I don’t actually know, but I suppose is something like 1K QPS per subnet) is static. Whatever static limit you set there is bound to be inaccurate, as not all queries (or updates) are equal. E.g. in this case queries were rather lightweight (under 10K Wasm instructions each). Earlier this week and the week before we were seeing queries that were 4-5 orders of magnitude heavier (as in, up to 1-2B instructions, nearly 1 second of CPU time).

Whatever static limit you put in place, it is not going to cover both those extremes. What’s necessary is a feedback loop allowing boundary nodes to limit traffic based on actual replica load.
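A minimal sketch of what such a feedback loop might look like is below. It is entirely hypothetical: the load signal, thresholds, and adjustment policy are invented; the only point taken from the above is that the cap should track actual replica load instead of being a fixed number.

```rust
// Hypothetical sketch of a load-based rate limit: instead of a fixed QPS cap
// per subnet, the boundary node periodically adjusts the cap based on a load
// signal reported by (or measured against) the replicas. All numbers and the
// shape of the load signal are invented for illustration.

struct AdaptiveLimiter {
    min_qps: f64,
    max_qps: f64,
    current_qps: f64,
}

impl AdaptiveLimiter {
    fn new(min_qps: f64, max_qps: f64) -> Self {
        Self { min_qps, max_qps, current_qps: max_qps }
    }

    /// `replica_load` in [0.0, 1.0], e.g. query-thread utilisation or
    /// response latency mapped onto that range.
    fn update(&mut self, replica_load: f64) {
        if replica_load > 0.8 {
            // Back off multiplicatively when replicas are saturated.
            self.current_qps = (self.current_qps * 0.7).max(self.min_qps);
        } else if replica_load < 0.5 {
            // Probe upwards additively when there is headroom.
            self.current_qps = (self.current_qps + 50.0).min(self.max_qps);
        }
    }

    fn allow(&self, observed_qps: f64) -> bool {
        observed_qps <= self.current_qps
    }
}

fn main() {
    let mut limiter = AdaptiveLimiter::new(100.0, 5_000.0);
    for load in [0.9, 0.95, 0.6, 0.3, 0.3] {
        limiter.update(load);
        println!("load {:.2} -> cap {:.0} QPS, allow 800 QPS: {}",
                 load, limiter.current_qps, limiter.allow(800.0));
    }
}
```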

It took us a couple of days to piece together exactly what happened from metrics and logs. There was no action that could be taken ahead of time, particularly with no insight into how the respective dapp was put together (how many canisters, where, etc.).

A subnet is essentially a (replicated) VM. A canister is (for the purposes of this analysis) a single-threaded process as far as transactions are concerned; and a distributed service for read-only queries.
So a canister can execute (almost) as many transactions as can run sequentially on a single CPU core.

And ~1K read-only QPS (which is what the subnet was handling according to the replica HTTP metrics) is entirely reasonable if you consider (a) that there was no HTTP caching in place (every single query needed to execute canister code) and (b) single-threaded execution (including for queries) was in place on this subnet only, as a temporary mitigation for a different issue (high contention in the signal handler-based orthogonal persistence implementation).

And again, both query and transaction throughput depends on how much processing each requires: returning 1KB of static content is very different from scanning 100 MB of heap to compute a huge response. In that context “15k reqs/sec” is somewhat meaningless.

A few of the things I can think of: query response caching (via HTTP expiration headers), spreading load across multiple canisters and/or subnets, single-canister load testing to see what a realistic expected throughput (based on instruction limits) should be. Some of these would benefit from additional tooling or protocol support, most can already be put in practice.
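For the caching point specifically, here is a sketch of attaching an expiration header to a query response. The `HttpResponse` type below is a simplified stand-in, not the exact canister HTTP interface; the idea is just that answers which rarely change can carry a Cache-Control header so intermediaries don’t hit the canister for every request.

```rust
// Simplified stand-in for a canister HTTP response; the field names are
// chosen for illustration and do not claim to match the HTTP gateway types.
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    body: Vec<u8>,
}

// Serve a value that changes rarely with an explicit cache lifetime, so
// repeated requests can be answered from HTTP caches instead of executing
// canister code every time.
fn cached_response(body: &str, max_age_secs: u32) -> HttpResponse {
    HttpResponse {
        status_code: 200,
        headers: vec![
            ("Content-Type".to_string(), "text/plain".to_string()),
            // Anything between the client and the canister that understands
            // HTTP caching may reuse this response for `max_age_secs`.
            ("Cache-Control".to_string(), format!("public, max-age={}", max_age_secs)),
        ],
        body: body.as_bytes().to_vec(),
    }
}

fn main() {
    let resp = cached_response("some rarely-changing content", 3600);
    for (k, v) in &resp.headers {
        println!("{}: {}", k, v);
    }
    println!("status {}", resp.status_code);
}
```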

Queries and query traffic are (for now) free. A canister executing transactions on a single thread can only burn so many cycles within half an hour (which is how long the whole thing took before there was no more need to execute any transactions) even at full tilt.

I hope this answers your questions.

7 Likes

Yes, it helped a lot, especially the technical insights.
The IC is really impressive.

I still cannot understand why nobody warned ICPunks at 4pm that pjljw was bound to be killed by the airdrop, as it was clear from the metrics. ICPunks could have helped in many different ways (maybe even cancelling the airdrop), but I take it this was not a purely technical decision.

A few of the things I can think of: query response caching (via HTTP expiration headers), spreading load across multiple canisters and/or subnets, single-canister load testing to see what a realistic expected throughput (based on instruction limits) should be. Some of these would benefit from additional tooling or protocol support, most can already be put in practice.

I think I understand, but doesn’t that mean the whole burden is then on the app developer? I cannot see how the IC can help my app ‘scale’. Is ‘infinite scaling’ just ‘you can deploy infinite canisters, but…’?

HTTP caching, refactoring/rewriting my app to ‘spread load across multiple canisters and/or subnets’ (microservices?), and local benchmarking are hard.

If I understand correctly, after writing some Motoko and running dfx deploy, I have a ‘thread’ on some machine, so I’d better optimize heavily if I expect usage, as I cannot stop/start an instance to change its type, pay for a bigger dyno, deploy k8s autoscaling, pay for more Lambda concurrency, or …?

But maybe dfx can already help?

3 Likes

It’d also be nice to get some info on these boundary nodes, since they are evidently very important.

Where do they run? Who owns them? How many are there?

4 Likes

A number of reasons: we didn’t know anything about the ICPunks architecture or how they planned to handle the load. Just before 20:00 all ingress messages were being processed normally and there was no drop in finalization rate. Some query messages were being dropped, but that’s neither here nor there. ICPunks themselves, even in the absence of metrics, could have tried loading their own website and checking if it was working as expected or not. And finally, no one was specifically looking at dashboards with the goal of identifying and quickly addressing any issues related to the ICPunks launch.

The oncall engineer did get paged and a group of us did get together to try and piece together what was happening, but we were more interested in the health of the subnet and the IC as a whole rather than the ICPunks drop specifically.

I will point out again that the marketing blurb was about the IC itself being “infinitely scalable” (ignoring the “infinitely” bit), not canisters or even dapps. Certainly not trivially scalable by default (as in, you allocate 1000 CPU cores and you can handle 1000x the traffic). Nothing can do that.

I wouldn’t call them hard, but they do require work, yes. E.g. HTTP caching is simply asking yourself “How often is the value returned by this API going to change? Is it going to be different for different users?”. E.g. the ICPunks canister had a “drop start timestamp” (which returned “2021-09-01 20:00 UTC”), which could have been trivially cached. And the local benchmarking that was necessary was simply to call the “claim” API in a loop from a shell script or whatever and see how many NFTs could be claimed per second. They would have found out that number was ~20 and could have optimized the respective code.
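Here is a sketch of that kind of back-of-the-envelope benchmark: call a claim endpoint in a loop and count how many claims go through per second. The `claim_once` function is a placeholder (the actual call would go through an agent library or raw HTTP); only the measurement loop is shown.

```rust
use std::time::{Duration, Instant};

// Placeholder for whatever actually performs one "claim" call against the
// canister. Here it just sleeps to simulate a ~50 ms round trip; in a real
// test it would block on the actual call and return whether it succeeded.
fn claim_once() -> bool {
    std::thread::sleep(Duration::from_millis(50));
    true
}

fn main() {
    let test_window = Duration::from_secs(10);
    let start = Instant::now();
    let mut claims = 0u32;

    while start.elapsed() < test_window {
        if claim_once() {
            claims += 1;
        }
    }

    let per_second = claims as f64 / start.elapsed().as_secs_f64();
    // If this comes out at ~20 claims/s, that is the realistic ceiling for a
    // single canister executing the claim logic sequentially.
    println!("{} claims in {:?} (~{:.1} claims/s)", claims, start.elapsed(), per_second);
}
```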

I’m not placing blame on ICPunks, BTW. Just pointing out that with some decent best practices in place (which is our job to come up with) a canister developer can significantly improve their dapp with comparatively little work.

There is no library or dashboard where you can tweak a knob that will result in N instances of your canister being immediately deployed (there should be one; or more). But it’s also not impossibly hard to do yourself. All you need is a wallet canister (per subnet; something else that needs improvement) that you can ask to create as many canisters as you need, and then you can install your Wasm code into each of them.

Of course you need to figure out how exactly you’re going to spread load across said canisters (e.g. load 1K NFTs into each canister and have some JavaScript in your FE pick one of them at random; or something more elaborate). But you don’t have to configure a replicated data store, integrate with a CDN, or set up firewalls, and then maintain all of those.
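A sketch of that simpler routing scheme is below. The post above suggests doing this from frontend JavaScript; the Rust here just shows the routing logic, with the canister IDs and the hash-based assignment being illustrative (picking a shard at random would work too).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative shard router: the canister IDs are placeholders and the
// hash-based assignment is one possible policy.
struct ShardRouter {
    shard_canisters: Vec<String>,
}

impl ShardRouter {
    // Deterministically map a user (or NFT id) to one of the shard canisters,
    // so the same key always lands on the same canister.
    fn shard_for(&self, key: &str) -> &str {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let idx = (hasher.finish() % self.shard_canisters.len() as u64) as usize;
        &self.shard_canisters[idx]
    }
}

fn main() {
    let router = ShardRouter {
        shard_canisters: vec![
            "shard-canister-0".to_string(), // each preloaded with e.g. 1K NFTs
            "shard-canister-1".to_string(),
            "shard-canister-2".to_string(),
        ],
    };
    println!("alice -> {}", router.shard_for("alice"));
    println!("bob   -> {}", router.shard_for("bob"));
}
```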

Boundary nodes are basically proxies. There is only a handful of them, managed by Dfinity, and they run in some of the same independent datacenters where the replicas run. More importantly though, anyone can run their own. I know Fleek does (it’s how they manage to have custom domain names for their clients), but I don’t know the details.

3 Likes

If you run your own boundary node, wouldn’t you need to set up your own DNS entry to point to that boundary node?

I’m guessing a URL like https://h5aet-waaaa-aaaab-qaamq-cai.raw.ic0.app/ points to a boundary node, which then forwards the call to the actual IC replica. What I’m unclear about is how the boundary node knows what the domain name (or IP address) of the replica is.

1 Like

The boundary node knows the IP addresses of the NNS replicas. These all host identical copies of the Registry canister (among others) and the Registry lists all subnets, all nodes on each subnet and the mapping of canister ID ranges to subnets. h5aet-waaaa-aaaab-qaamq-cai is a canister ID (DSCVR, in this case) hosted on the pae4o-... subnet. The boundary node then forwards the request to one of the replicas that make up the pae4o subnet. That’s all there is to it, really.
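Conceptually the lookup looks something like the sketch below. The data structures, IDs, and the round-robin replica pick are illustrative, not the actual Registry or boundary-node code; only the flow (canister ID → range → subnet → replica) comes from the explanation above.

```rust
use std::collections::BTreeMap;

// Illustrative model of the routing data a boundary node derives from the
// Registry: canister ID ranges mapped to subnets, and each subnet's replicas.
struct RoutingTable {
    // Keyed by the start of each canister ID range; value = (range end, subnet).
    ranges: BTreeMap<u64, (u64, String)>,
    // Subnet -> replica addresses.
    replicas: BTreeMap<String, Vec<String>>,
}

impl RoutingTable {
    fn subnet_for(&self, canister_id: u64) -> Option<&str> {
        // Find the last range starting at or before this canister ID and
        // check that the ID falls inside it.
        self.ranges
            .range(..=canister_id)
            .next_back()
            .filter(|(_, (end, _))| canister_id <= *end)
            .map(|(_, (_, subnet))| subnet.as_str())
    }

    fn pick_replica(&self, subnet: &str, counter: usize) -> Option<&str> {
        // e.g. round-robin over the subnet's replicas.
        let replicas = self.replicas.get(subnet)?;
        replicas.get(counter % replicas.len()).map(String::as_str)
    }
}

fn main() {
    let mut table = RoutingTable { ranges: BTreeMap::new(), replicas: BTreeMap::new() };
    table.ranges.insert(1_000, (1_999, "pae4o".to_string()));
    table.replicas.insert("pae4o".to_string(), vec!["replica-a".into(), "replica-b".into()]);

    let canister = 1_234; // numeric stand-in for a canister ID like h5aet-...-cai
    if let Some(subnet) = table.subnet_for(canister) {
        println!("canister {} -> subnet {} -> {}",
                 canister, subnet, table.pick_replica(subnet, 0).unwrap());
    }
}
```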

2 Likes

Oh interesting, I didn’t know there was a registry canister.

Is there any documentation on how to run your own boundary node? In Ethereum, anyone can run their own node (or they can choose to use a centralized node provider like Infura, which in this case would be the Dfinity-managed boundary nodes, I guess?).