High User Traffic Incident Retrospective - Thursday September 2, 2021

Thank you for the quick and detailed response.

2 Likes

Happened again with an NFT sale, and minting is currently paused to deal with the bottleneck:

We have made a number of improvements, and more are in progress, to address this performance issue. We have improved the efficiency and parallelism of canister smart contract execution, and we have improved load shedding at the node level. This will enable us to increase the traffic we allow through the boundary nodes and, ultimately, to fully decentralize the boundary nodes without risking the health of individual nodes, subnets, and the IC overall. We are in the process of adding caching at the boundary nodes, we will be adding caching support in the service worker, and we are investigating adding caching at the replica node layer as well.

Safety and stability are of utmost importance, and while we are disappointed that this subnet cannot handle all the load being thrown at it at this time, we are confident that we can continually improve performance.

7 Likes

From the look of it, query calls were less of a bottleneck this time. Rate limiting still kicked in at peak time (even though the subnet still had capacity, due to the conservative measures we took), so we did hit some 503s, but I expect this to improve very soon.

But your question points to an issue of a different nature; we are still trying to understand what was going on.

5 Likes

I’m confident in you, too!

Perhaps a path for projects to directly submit incident reports (if one doesn’t exist already) would be valuable for events like this.

2 Likes

It looks like there were around 1500 transactions out of the first 7k mint attempts that ended up failing – Bob Bodily from Toniq Labs should have the details.

I think we were running into issues with the ICP ledger canister not being able to handle all of the requests this time around (punks didn’t hit the ICP ledger canister at all) and were getting the majority of the errors from the ICP transfer requests. Happy to get on a call with Stephen (at Toniq) and you all if you want more info.

3 Likes

A short update on more recent progress:

  1. The memory usage of concurrent query execution has been fixed. Ulan Degenbaev wrote an excellent article on this, worth a read!

  2. We have also deployed query caching on boundary nodes. All query calls are cached for 1 second by default. Since states are not expected to change within 1 second, this should take a lot of load off a canister when it becomes a hot spot (see the sketch after this list).

  3. With the caching, query calls are also rate limited separately to protect subnet nodes from being DoS-ed.
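To make the 1-second caching concrete, here is a minimal sketch of a TTL-based query cache. This is not the actual boundary node implementation; the cache key format, types, and names below are assumptions for illustration only.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch of a 1-second TTL cache, illustrating the idea behind
/// the boundary-node query caching described above. Not the real implementation.
struct QueryCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, Vec<u8>)>, // key -> (stored_at, response bytes)
}

impl QueryCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return a cached response only if it is younger than the TTL.
    fn get(&self, key: &str) -> Option<&Vec<u8>> {
        self.entries
            .get(key)
            .filter(|(stored_at, _)| stored_at.elapsed() < self.ttl)
            .map(|(_, response)| response)
    }

    /// Store a freshly executed query response.
    fn put(&mut self, key: String, response: Vec<u8>) {
        self.entries.insert(key, (Instant::now(), response));
    }
}

fn main() {
    let mut cache = QueryCache::new(Duration::from_secs(1));
    // The key would plausibly be derived from (canister id, method, argument).
    let key = "canister-abc/http_request/GET /index.html".to_string();

    if cache.get(&key).is_none() {
        // Cache miss: the query would be forwarded to a replica here.
        cache.put(key.clone(), b"<html>...</html>".to_vec());
    }
    // Within the next second, identical queries are served from the cache.
    assert!(cache.get(&key).is_some());
}
```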

With the above fixes, we expect overall performance improvements in handling query loads and sudden surges in traffic. That said, we still have some work in progress:

  • Respect the Cache-Control response header returned from the http_request public method of a canister, so that developers have greater control over how static assets are managed and transported (see the sketch after this list).

  • A query cache in the replica, since the replica knows whether a canister's state has been updated or not (a second sketch below illustrates the idea).
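For the Cache-Control item, the sketch below shows roughly what an http_request query method could look like once the header is respected. It assumes a Rust canister using the ic-cdk and candid crates and the usual HTTP-gateway-style request/response records; the exact field set and attribute macro path may differ depending on your ic-cdk version.

```rust
// Minimal, illustrative canister http_request handler that sets Cache-Control.
// The HttpRequest/HttpResponse record shapes are assumptions based on the
// common HTTP gateway pattern; adapt them to your own project.
use candid::{CandidType, Deserialize};

#[derive(CandidType, Deserialize)]
struct HttpRequest {
    method: String,
    url: String,
    headers: Vec<(String, String)>,
    body: Vec<u8>,
}

#[derive(CandidType, Deserialize)]
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    body: Vec<u8>,
}

#[ic_cdk::query] // older ic-cdk versions may use #[ic_cdk_macros::query]
fn http_request(_req: HttpRequest) -> HttpResponse {
    HttpResponse {
        status_code: 200,
        // Once boundary nodes respect Cache-Control, a static asset could opt
        // into a longer cache lifetime than the 1-second default.
        headers: vec![
            ("Content-Type".to_string(), "text/html".to_string()),
            ("Cache-Control".to_string(), "public, max-age=3600".to_string()),
        ],
        body: b"<html><body>Hello</body></html>".to_vec(),
    }
}
```

With something like this, static assets could opt into long cache lifetimes while dynamic endpoints keep a short max-age or omit the header entirely.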
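For the replica-side query cache item, one plausible (purely illustrative) shape is a cache invalidated by canister state changes rather than by wall-clock time. All names below are made up; this is not the replica's actual design.

```rust
use std::collections::HashMap;

/// Hypothetical replica-side query cache keyed by canister state version.
struct ReplicaQueryCache {
    // (canister_id, method, argument) -> (state_version_when_cached, response)
    entries: HashMap<(String, String, Vec<u8>), (u64, Vec<u8>)>,
}

impl ReplicaQueryCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return a cached response only if the canister's state has not changed
    /// since the response was produced.
    fn get(&self, key: &(String, String, Vec<u8>), current_state_version: u64) -> Option<&Vec<u8>> {
        self.entries
            .get(key)
            .filter(|(cached_version, _)| *cached_version == current_state_version)
            .map(|(_, response)| response)
    }

    fn put(&mut self, key: (String, String, Vec<u8>), state_version: u64, response: Vec<u8>) {
        self.entries.insert(key, (state_version, response));
    }
}

fn main() {
    let mut cache = ReplicaQueryCache::new();
    let key = ("canister-abc".to_string(), "get_balance".to_string(), vec![]);

    // Cache a response produced against state version 42.
    cache.put(key.clone(), 42, b"1000".to_vec());

    assert!(cache.get(&key, 42).is_some()); // state unchanged: cache hit
    assert!(cache.get(&key, 43).is_none()); // state changed: cache miss
}
```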

As always, please help report any problem you notice to help us improve! Thanks!

10 Likes

Hmm, but it would not know whether the canister would behave differently based on time or cycle balance (which can change even if the state didn’t). Some caching might be possible, of course.

Don’t queries also use block time? (Honest question, I don’t know the answer.)

But regardless, due to network latency, queuing, clock drift, hitting a different replica, etc., you can't really expect query responses to be accurate with respect to sub-second time or cycle-balance changes. E.g. you may issue your query at time T and it may execute at T+0.5s; or, because replica clocks are not in sync, the other way around.

So whether those inaccuracies come from the above limitations or from getting a cached response doesn't really matter; they will still happen.

I am also not sure. The public spec doesn’t indicate that. I would expect it to be the block time of the most recently finalized block or something like that, whether the canister had a state change there or not. But one could feasibly say it can be up to, say, 10s behind in order to have a larger window for caching.

So whether those inaccuracies come from the above limitations or from getting a cached response doesn't really matter; they will still happen.

Absolutely! As long as caching doesn’t inflate that “acceptable delay” too much. Caching the query response for hours might be bad here.