High User Traffic Incident Retrospective - Thursday September 2, 2021

By the way, is there any possibility to choose which subnet is used when creating/deploying a canister? Right now I can create as many canisters as I need, but I cannot control which subnet a canister ends up on. Or have I missed something? Who decides which subnet is used when a new canister is created?
A good post on how to build a scalable application on DFINITY would be great!

1 Like

Ignoring update calls for now, I would love to see the IC itself provide infinite scaling of query calls for canisters. Maxing out the cores on individual replicas will help, but could the IC manage backup read-only replicas that are ready to be deployed ad hoc into subnets experiencing spikes in query traffic? I imagine this would be relatively simple to do, at least theoretically.

As an example, imagine the ICPunks drop. Assuming there were 7 full replicas in the subnet, as query traffic began to approach certain limits, the subnet would request extra read replicas. These could be added into the subnet relatively quickly, using the catch-up package functionality to get the new replica onboard. It wouldn’t participate in consensus; it would be a read-only replica.

As traffic continued to increase, read-only replicas could continue to be added. The IC would have to maintain a pool of these replicas, always ready to be deployed where needed. Once traffic died down, the replicas would be returned to the pool. If the traffic never died down, perhaps the replica would become a permanent part of the subnet.

So subnets might have a fixed number of full consensus replicas, and some number of read-only replicas. This would not slow down consensus, but would scale out query calls infinitely without the developer needing to do anything fancy (even a single canister would automatically scale out queries).
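To make this concrete, here is a purely illustrative sketch of the control loop I have in mind. Every type, function, and threshold below is hypothetical; nothing here corresponds to actual IC code.

```typescript
// Hypothetical autoscaling loop for read-only query replicas.
// All names and thresholds are made up for illustration only.

interface Replica {
  syncFromCatchUpPackage(subnetId: string): Promise<void>; // hypothetical fast state sync
}

interface ReadReplicaPool {
  acquire(): Promise<Replica | undefined>; // hand out a standby replica, if any
  release(replica: Replica): void;         // return it to the shared pool
}

interface SubnetMetrics {
  queriesPerSecond(subnetId: string): number; // hypothetical traffic gauge
}

const SCALE_OUT_QPS = 10_000; // made-up threshold
const SCALE_IN_QPS = 2_000;   // made-up threshold

async function autoscaleQueryReplicas(
  subnetId: string,
  pool: ReadReplicaPool,
  metrics: SubnetMetrics,
  readReplicas: Replica[],
): Promise<void> {
  const qps = metrics.queriesPerSecond(subnetId);

  if (qps > SCALE_OUT_QPS) {
    // Traffic spike: borrow a standby replica, fast-sync it with a catch-up
    // package, and let it serve queries only (it never joins consensus).
    const replica = await pool.acquire();
    if (replica) {
      await replica.syncFromCatchUpPackage(subnetId);
      readReplicas.push(replica);
    }
  } else if (qps < SCALE_IN_QPS && readReplicas.length > 0) {
    // Traffic died down: return a read replica to the pool.
    pool.release(readReplicas.pop()!);
  }
}
```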

Please consider this, I think it would be a powerful capability.

7 Likes

@lastmjs fwiw, I passed your suggestions along to R&D for wider visibility, to make sure more people see them.

1 Like

@lastmjs Thanks for the suggestion! Yes, it is a very good idea! We had similar thoughts, but didn’t prioritize them because there are quicker ways of improving query performance. The priority was (and I believe still is) exploring ways to maximize the utility of existing hardware.

Just a quick update on what we have discovered and fixed:

  1. Boundary nodes were running out of file descriptors, which caused many 500s. This has been fixed.
  2. Another bug was identified that caused boundary nodes to still reach out to defective subnet nodes and then return 502s to users (which causes CORS errors). The fix will be rolled out tomorrow.
  3. We are still working on improving multithreaded query execution on subnet nodes. Making progress, but not final yet.
  4. We are working on system-level caching for query calls. This should drastically reduce query execution load, especially during hot-spot events like this one. Whether this should be done on the boundary node side or the subnet node side is still being explored.
  5. We are still trying to find a good balance for rate limiting. Parameters have not been adjusted yet.

There are a couple of things a canister developer can do to help improve usability:

  1. Implement the http_request call with a custom response header to control cache expiry.
  2. Code defensively when using query calls. Catch errors and retry later instead of waiting for the user to reload the whole page (which makes the congestion worse). A rough sketch follows this list.
  3. Understand that a query call may not always execute against the latest state. Some users saw the “remaining punks” number roll back, which gave a bad impression, but this can be avoided with a bit of client-side logic.
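Here is a minimal client-side sketch of points 2 and 3. It assumes nothing about your canister: queryWithRetry wraps whatever async function you use to make the query call (e.g. a method on an @dfinity/agent actor), and updateRemaining keeps a displayed counter from jumping back up when a query happens to hit a replica that is slightly behind.

```typescript
// Retry a query with exponential backoff instead of making the user
// reload the whole page (which only worsens congestion).
async function queryWithRetry<T>(
  query: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await query();
    } catch (err) {
      lastError = err;
      // Back off before retrying so the subnet is not hammered again immediately.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Keep the displayed "remaining" count from ever increasing, so a slightly
// stale query response never makes the number roll back in the UI.
let displayedRemaining = Number.POSITIVE_INFINITY;

function updateRemaining(fromQuery: number): number {
  displayedRemaining = Math.min(displayedRemaining, fromQuery);
  return displayedRemaining;
}

// Example usage (remainingPunks is a hypothetical query method on your actor):
// const remaining = updateRemaining(await queryWithRetry(() => actor.remainingPunks()));
```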

Please watch this space for more updates! Thanks!

6 Likes

Another idea I have, just to throw it out there:

I think we can improve the asset canister to index assets by SHA-256 hash (or some shorter version of it). This could be optional, but when coupled with webpack rewriting the links in HTML accordingly, it would give us immutable assets; in other words, the asset canister could serve them with a very long expiry. This may help quite a bit with essentially no effort on the developer’s side.
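For the webpack half of this, the idea is just standard content hashing: the emitted filename changes whenever the content changes, so the asset canister could serve it as effectively immutable. A minimal webpack 5 config sketch (the asset-canister indexing and long expiry are the parts that don’t exist yet):

```typescript
// webpack.config.ts — content-hashed output names so rewritten links in the
// HTML always point at a unique, immutable asset name.
import type { Configuration } from "webpack";

const config: Configuration = {
  output: {
    // The name changes whenever the file's content changes, so it can be
    // cached with a very long expiry.
    filename: "[name].[contenthash].js",
    assetModuleFilename: "assets/[name].[contenthash][ext]",
  },
};

export default config;
```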

What do people think?

5 Likes

Thanks for the great work!!

1 Like

I have noticed that the performance of some applications on the Internet Computer has been degraded over the weekend (at times making some of them unusable). The problems seemed to start around the ICPBunnies “testnet” yesterday, but have lingered since. Is this caused by the same issues surrounding the ICPunks launch earlier this month?

1 Like

Thanks for the heads up! I am not aware of anything, so let me ping folks to see if they see something.

2 Likes

I’m not aware of any noticeable traffic or workload over the weekend. One subnet, pjljw (the same one that ICPunks was on), has a long-standing problem that required a temporary fix restricting its query execution to 1 thread. We are aware of the degraded performance, but it shouldn’t affect other subnets, unless ICPBunnies happens to be on the same subnet, which would be very unfortunate.

That said, we have a fix (which has been tested and shows a very good improvement) ready to deploy to this subnet (hopefully tomorrow) via NNS proposal. We also have query caching implemented and currently undergoing internal testing. So we should expect IC subnets to become significantly better at handling user traffic very soon.

8 Likes

Thank you for the quick and detailed response.

2 Likes

This happened again with an NFT sale, and minting is currently paused to deal with the bottleneck:

We have made a number of improvements, and more are in progress, to address this performance issue. We have improved the efficiency and parallelism of canister smart contract execution, and we have improved our load shedding at the node level. This will enable us to increase the traffic we allow through the boundary nodes and, ultimately, to fully decentralize the boundary nodes without risking the health of individual nodes, subnets, and the IC overall. We are in the process of adding caching at the boundary nodes, we will be adding caching support in the service worker, and we are investigating adding caching at the replica node layer as well.

Safety and stability are of utmost importance, and while we are disappointed that this subnet cannot handle all the load being thrown at it at this time, we are confident that we can continually improve performance.

7 Likes

From the look of it, query calls were less of a bottleneck this time. Rate limiting still kicked in at peak time (even though the subnet still had capacity, due to the conservative measures we took), so we did hit some 503s, but I expect this to improve very soon.

But your question points to an issue of a different nature; we are still trying to understand what was going on.

5 Likes

I’m confident in you, too!

Perhaps a path for projects to directly submit incident reports (if one doesn’t exist already) would be valuable for events like this.

2 Likes

It looks like there were around 1500 transactions out of the first 7k mint attempts that ended up failing – Bob Bodily from Toniq Labs should have the details.

I think we were running into issues with the ICP ledger canister not being able to handle all of the requests this time around (Punks didn’t hit the ICP ledger canister at all), and we were getting the majority of the errors from the ICP transfer requests. Happy to get on a call with Stephen (at Toniq) and you all if you want more info.

3 Likes

A short update on more recent progress:

  1. Concurrent query execution and its memory usage have been fixed. Ulan Degenbaev wrote an excellent article on this; it is worth a read!

  2. We have also deployed query caching on boundary nodes. All query calls are cached for 1 second by default. Since state is not expected to change within 1 second, this should take a lot of load off when a canister becomes a hot spot. (A rough illustration follows this list.)

  3. With the caching in place, query calls are also rate-limited separately to protect subnet nodes from being DoSed.
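For illustration only, the effect of point 2 is roughly that of a tiny TTL cache keyed by the query request; the actual boundary node implementation is of course more involved, and the names below are hypothetical.

```typescript
// Minimal 1-second TTL cache sketch for query responses (illustrative only).
const TTL_MS = 1_000;

interface CacheEntry {
  response: Uint8Array;
  expiresAt: number;
}

const cache = new Map<string, CacheEntry>();

async function cachedQuery(
  cacheKey: string, // e.g. derived from canister id + method name + argument bytes
  execute: () => Promise<Uint8Array>, // forwards the query to a replica
): Promise<Uint8Array> {
  const now = Date.now();
  const hit = cache.get(cacheKey);
  if (hit && hit.expiresAt > now) {
    // Identical queries within 1 second are served from the cache, so a hot
    // canister no longer triggers query execution for every single request.
    return hit.response;
  }
  const response = await execute();
  cache.set(cacheKey, { response, expiresAt: now + TTL_MS });
  return response;
}
```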

With the above fixes, we expect overall improvements in handling query loads and sudden traffic spikes. That said, we still have some work in progress:

  • Respect the Cache-Control response header returned from the http_request public method of a canister, so that developers have greater control over how static assets are managed and transported (see the sketch after this list).

  • A query cache in the replica, since the replica knows whether a canister’s state has been updated or not.
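To sketch what the first point means for a canister author: the http_request query method returns a response whose headers can include Cache-Control, and the gateway/boundary layer would honor it. The shape below follows the common HTTP-gateway response fields (status_code, headers, body); treat the exact types as illustrative rather than normative.

```typescript
type HeaderField = [string, string];

interface HttpResponse {
  status_code: number;   // e.g. 200
  headers: HeaderField[];
  body: Uint8Array;
}

// Serve a content-hashed (immutable) static asset with a long cache lifetime.
// Short-lived data would use a much smaller max-age instead.
function serveStaticAsset(body: Uint8Array): HttpResponse {
  return {
    status_code: 200,
    headers: [
      ["Content-Type", "image/png"],
      ["Cache-Control", "public, max-age=86400, immutable"],
    ],
    body,
  };
}
```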

As always, please report any problems you notice to help us improve! Thanks!

10 Likes

Hmm, but it would not know whether the canister behaves differently based on time or cycle balance (which can change even if the state didn’t). Some caching might be possible, of course.

Don’t queries also use block time? (Honest question, I don’t know the answer.)

But regardless, due to network latency, queuing, clock drift, hitting a different replica, etc., you can’t really expect query responses that are accurate with respect to sub-second time or cycle balance changes. E.g. you may issue your query at time T and it may execute at T+0.5s; or, because replica clocks are not in sync, the other way around.

So whether those inaccuracies come from the above limitations or from getting a cached response doesn’t really matter; they will still happen.

I am also not sure. The public spec doesn’t indicate that. I would expect it to be the block time of the most recently finalized block or something like that, whether the canister had a state change there or not. But one could feasibly say it can be up to, say, 10s behind in order to have a larger window for caching.

So whether those inaccuracies come from the above limitations or from getting a cached response doesn’t really matter; they will still happen.

Absolutely! As long as caching doesn’t inflate that “acceptable delay” too much. Caching the query response for hours might be bad here.