High User Traffic Incident Retrospective - Thursday September 2, 2021

The right answer is almost always double the storage. Almost every DB in the world uses some kind of index to speed up queries. If you are relying on iterating through a list(tablescan in DB speak) your app is doomed to crawl sooner or later. Clustered data and an index can get you there for data where the data can be ordered for much less than 2x, but your writes are going to be heavier. Everything has a trade off, but there is usually a trade off to make for your future sanity and your user’s experience.

1 Like

Looking into this, evidence of load shedding on other subnets was discovered. The boundary nodes were running out of file descriptors and thus were unable to open socket connections to serve load. The volume of this was not too high, but it was sufficient to cause the general slowdown you observed. In this case, further investigation is needed to determine what caused this bottleneck (since it could be that threads/connections were piling up on boundary nodes because replicas were being slow to respond).

Load test and tune the rate limits based on more realistic traffic loads.

The boundary nodes will be included in this action item to better tune their performance and prevent this cross-subnet impact.

4 Likes

Thank you for the post!

Could you please give more details on the fact that early warnings (limiting requests) were triggered hours before the actual ‘flood’? More specifically, what was the actual metric and value v* used to limit subnet reqs?

Was v* a configuration mistake? i.e. A ‘shot in the dark’ value? Of course it is understood everything is ‘beta’ and probably there was not time to benchmark/stress test a ‘staging’ subnet, but if this is the case can you please explain why action (*) was not taken hours before the actual hurricane? After all, ICPunks publicly planned the event.

If v* was actually an educated guess and there was no way to increase the limit or increasing the limit would not have helped (as @free suggests), can we get to the conclusion that a single app deployed on a 13 fat-nodes subnet cannot sustain a traffic of ~ 15k reqs/sec (i.e. traffic at 20:00)?

As for the action item “Improve documentation on how to scale decentralized applications on the IC.”
Can you please anticipate how to achieve this? How could icpunks have deployed its code in a more scalable way?

OT
The ICP burned over time chart Charts | ic.rocks does not show any ‘previously unseen’ pattern. I understand that it is an IC global metric, but it seems strange that such an exceptional event has not left a trace there?! @wang
/OT

Thank you

(*) increase the limit or remove the limit to protect other canisters, or liaise with icpunks to consensuate a solution to avoid the ‘difficult’ launch, or…

2 Likes

Traffic was already considerable before 20:00 UTC. Some people got early access to claim their tokens (before it was open to the public, I believe there were 2 previous rounds) and they (plus some eager regular users) were already putting significant load onto the canister before 20:00 UTC. At some point that went over the configured limit, whatever that was, and requests started being dropped.

The boundary node rate limit (which I don’t actually know, but I suppose is something like 1K QPS per subnet) is static. Whatever static limit you set there is bound to be inaccurate, as not all queries (or updates) are equal. E.g. in this case queries were rather lightweight (under 10K Wasm instructions each). Earlier this week and the week before we were seeing queries that were 4-5 orders of magnitude heavier (as in, up to 1-2B instructions, nearly 1 second of CPU time).

Whatever static limit you put in place, it is not going to cover both those extremes. What’s necessary is a feedback loop allowing boundary nodes to limit traffic based on actual replica load.

It took us a couple of days to piece together exactly what happened from metrics and logs. There was no action that could be taken ahead of time, particularly with no insight into how the respective dapp was put together (how many canisters, where, etc.).

A subnet is essentially a (replicated) VM. A canister is (for the purposes of this analysis) a single-threaded process as far as transactions are concerned; and a distributed service for read-only queries.
So a canister can execute (almost) as many transactions as can run sequentially on a single CPU core.

And ~1K read-only QPS (which is what the subnet was handling according to the replica HTTP metrics) is entirely reasonable if you consider (a) that there was no HTTP caching in place (every single query needed to execute canister code) and (b) single-threaded execution (including for queries) was in place on this subnet only, as a temporary mitigation for a different issue (high contention in the signal handler-based orthogonal persistence implementation).

And again, both query and transaction throughput depends on how much processing each requires: returning 1KB of static content is very different from scanning 100 MB of heap to compute a huge response. In that context “15k reqs/sec” is somewhat meaningless.

A few of the things I can think of: query response caching (via HTTP expiration headers), spreading load across multiple canisters and/or subnets, single-canister load testing to see what a realistic expected throughput (based on instruction limits) should be. Some of these would benefit from additional tooling or protocol support, most can already be put in practice.

Queries and query traffic are (for now) free. A canister executing transactions on a single thread can only burn so many cycles within half an hour (which is how long the whole thing took before there was no more need to execute any transactions) even at full tilt.

I hope this answers your questions.

7 Likes

Yes, it helped a lot, especially the technical insights.
The IC is really impressive.

I still cannot understand why nobody warned icpunks at 4pm that pjljw was bound to be killed because of the airdrop, as it was clear from the metrics. Icpunks could have helped in many different ways (maybe even cancelling the airdrop), but I take this was not a purely technical decision.

A few of the things I can think of: query response caching (via HTTP expiration headers), spreading load across multiple canisters and/or subnets, single-canister load testing to see what a realistic expected throughput (based on instruction limits) should be. Some of these would benefit from additional tooling or protocol support, most can already be put in practice.

I think I understand, but does not that mean that the whole burden then is on the app development? I cannot see how the IC can help my app ‘scaling’. Infinite scaling is ‘you can deploy infinite canisters, but!’?

HTTP caching, refactor/rewrite my app ‘spreading load across multiple canisters and/or subnets’ (?microservices?) and local benchmarking are hard.

If I understand correctly, after writing some motoko and dfx deploy, I have a ‘thread’ on some machine, so I’d better heavily optimize if I expect usage, as I cannot stop/start instance to change its type, or pay for a bigger dyno, or deploy k8s autoscaling, or pay for more lambda concurrency, or …?

But maybe dfx can already help?

3 Likes

It’d also be nice to get some info on these boundary nodes, since they are evidently very important.

Where do they run? Who owns them? How many are there?

4 Likes

A number of reasons: we didn’t know anything about the ICPunks architecture or how they planned to handle the load. Just before 20:00 all ingress messages were being processed normally and there was no drop in finalization rate. Some query messages were being dropped, but that’s neither here nor there. ICPunks themselves, even in the absence of metrics, could have tried loading their own website and checking if it was working as expected or not. And finally, no one was specifically looking at dashboards with the goal of identifying and quickly addressing any issues related to the ICPunks launch.

The oncall engineer did get paged and a group of us did get together to try and piece together what was happening, but we were more interested in the health of the subnet and the IC as a whole rather than the ICPunks drop specifically.

I will point out again that the marketing blurb was about the IC itself being “infinitely scalable” (ignoring the “infinitely” bit), not canisters or even dapps. Certainly not trivially scalable by default (as in, you allocate 1000 CPU cores and you can handle 1000x the traffic). Nothing can do that.

I wouldn’t call them hard, but they do require work, yes. E.g. HTTP caching is simply asking yourself “How often is the value returned by this API going to change? Is it going to be different for different users?”. E.g. the ICPunks canister had a “dtop start timestamp” (which returned “2021-09-01 20:00 UTC”), which could have been trivially cached. And the local benchmarking that was necessary was to simply call the “claim” API in a loop from a shell script or whatever and see how many NFTs could be claimed per second. They would have found out that number was ~20 and could have optimized the respective code.

I’m not placing blame on ICPunks, BTW. Just pointing out that with some decent best practices in place (which is our job to come up with) a canister developer can significantly improve their dapp with comparatively little work.

There is no library or dashboard where you can tweak a knob that will result in N instances of your canister being immediately deployed (there should be one; or more). But it’s also not impossibly hard to do so. All you need is a wallet canister (per subnet; something else that needs improvement) that you can then ask to create as many canisters as you need and you can install your Wasm code into each such canister.

Of course you need to figure out how exactly you’re going to spread load across said canisters (e.g. load 1K NFTs into each canister and have some Javascript in your FE pick one of them at random; or something more elaborate). But you don’t have to configure a replicated data store, integrate with a CDN; or set up firewalls. And then maintaining those.

Boundary nodes are basically proxies. There is only a handful of them, managed by Dfinity, and they run in some of the same independent datacenters where the replicas run. More importantly though, anyone can run their own. I know Fleek does (it’s how they manage to have custom domain names for their clients), but I don’t know the details.

3 Likes

If you run your own boundary node, wouldn’t you need to set up your own DNS entry to point to that boundary node?

I’m guessing a URL like https://h5aet-waaaa-aaaab-qaamq-cai.raw.ic0.app/ points to a boundary node, which then forwards the call to the actual IC replica. What I’m unclear about is how the boundary node knows what the domain name (or IP address) of the replica is.

1 Like

The boundary node knows the IP addresses of the NNS replicas. These all host identical copies of the Registry canister (among others) and the Registry lists all subnets, all nodes on each subnet and the mapping of canister ID ranges to subnets. h5aet-waaaa-aaaab-qaamq-cai is a canister ID (DSCVR, in this case) hosted on the pae4o-... subnet. The boundary node then forwards the request to one of the replicas that make up the pae4o subnet. That’s all there is to it, really.

2 Likes

Oh interesting, I didn’t know there was a registry canister.

Is there any documentation on how to run your own boundary node? In Ethereum, anyone can run their own node (or they can choose to use a centralized node provider like Infura, which in this case is the Dfinity-managed boundary nodes I guess?).

Understood.

some decent best practices in place (which is our job to come up with) a canister developer can significantly improve their dapp with comparatively little work

To this regard, where one should expect to find this information?
I struggle to divide the different info sources (dfinity library, sdk web, github repos, medium posts, reddit AMAs,…) into ‘domains’, and I cannot see this forum as a scalable (pun intended) place for ‘IC deep engineering’ docs @diegop

Many thanks for your time @free

2 Likes

By the way. Are there any possibility to set which subnet to use during creating/deploying canister? Right now I can create as many canister as I need but I cannot control which subnet will be used for canister. Or I missed something? Who is define which subnet will be used while creating a new canister?
A good post how to build a scalable application on the dfinity whould be great!

1 Like

Ignoring update calls for now, I would love to see the IC itself provide infinite scaling of query calls for canisters. Maxing out the cores on individual replicas will help, but could the IC manage backup read-only replicas that are ready to be deployed ad hoc into subnets experiencing spikes in query traffic? I imagine this would be relatively simple to do, at least theoretically.

As an example, imagine the ICPunks drop. Assuming there were 7 full replicas in the subnet, as query traffic began to approach certain limits, the subnet would request extra read replicas. These could be relatively quickly added into the subnet, using the catch-up package functionality to quickly get the replica onboard. It wouldn’t particpate in consensus, it would be a read-only replica.

As traffic continued to increase, read-only replicas could continue to be added. The IC would have to maintain a pool of these replicas, always ready to be deployed where needed. Once traffic died down, the replicas would be returned to the pool. If the traffic never died down, perhaps the replica would become a permanent part of the subnet.

So subnets might have a fixed number of full consensus replicas, and some number of read-only replicas. This would not slow down consensus, but would scale out query calls infinitely without the developer needing to do anything fancy (even a single canister would automatically scale out queries).

Please consider this, I think it would be a powerful capability.

7 Likes

@lastmjs fwiw, I passed your suggestions along to R&D for wider visibility, to make sure more people saw it.

1 Like

@lastmjs Thanks for the suggestion! Yes it is a very good idea! We had similar thoughts in the same direction, but didn’t prioritize because there are other quicker ways of improving query performance. The priority was (and I believe still is) exploring ways to maximize the utility of existing hardware.

Just a quick update on what we have discovered and fixed:

  1. Boundary nodes were running out of file descriptors which caused many 500s. This has been fixed.
  2. Another bug was identified that caused boundary nodes to still reach out to defective subnet nodes and then return 502s to user (which causes CORS error). The fix will be rolled out tomorrow.
  3. We are still working on improving multithread query execution on subnet nodes. Making progress, but not final yet.
  4. We are working on caching at system level for query calls. This should drastically reduce loads on query execution especially during hot spot events like this. Whether this should be done at boundary nodes side or subnet nodes side is still being explored.
  5. We are still trying to understand a good balance on rate limiting. Parameters have not been adjusted yet.

There are a couple things a canister developer can do to help improve usability:

  1. Implement http_request call with custom response header to control cache expiry.
  2. Code defensively when using query calls. Catch errors & try again later instead of waiting for user to reload the whole page (which makes congestion situation worse).
  3. Understand that a query call may not always execute against the latest state. Some user saw that “remaining punks” number rolled back which gave a bad impression, but this can be avoided by adding a bit of client side logic.

Please watch this space for more updates! Thanks!

6 Likes

Another idea I have, just to throw it out there:

I think we can improve asset canister to index assets by sha256 hash (or some shorter version). This can be optional, but when coupled with webpack to properly rewrite links in HTML, it can offer immutable assets, or in other words, the asset canister can serve them with a very long expiry timestamp. This may help quite a bit with relatively no effort on developer side.

What do people think?

3 Likes

Thanks for the great work!!

1 Like

I have noticed that the performance of some applications on the Internet Computer has been degraded over the weekend (at times making some of them unusable). The problems seemed to start around the ICPBunnies “testnet” yesterday, but have lingered since. Is this caused by the same issues surrounding the ICPunks launch earlier this month?

1 Like

Thanks for the heads up! Im am not aware of anything, so let me ping folks to see if they see something

2 Likes

I’m not aware of any noticeable traffic or workload over the weekend. One of the subnet pjljw (the same one that ICPunks was on) has a long standing problem that required a temporary fix to restrict its query execution to 1 thread. We are aware of the degraded performance, but it shouldn’t affect other subnets, unless ICBunnies happens to be on the same subnet, which would be very unforutnate.

That said, we have a fix (that is tested have very good improvement) ready to deploy to this subnet (hopefully tomorrow) via NNS proposals. We also have query caching implemented and currently under going internal testing. So we should expect IC subnets to have significant improvements at handling user traffic very soon.

8 Likes