High User Traffic Incident Retrospective - Thursday September 2, 2021

A short update on more recent progress:

  1. Concurrent query execution and its memory usage have been fixed. Ulan Degenbaev wrote an excellent article on this, which is worth a read!

  2. We have also deployed query caching on boundary nodes. All query calls are cached for 1 second by default. Since canister state is not expected to change within 1 second, this should remove a lot of load when a canister becomes a hot spot.

  3. With caching in place, query calls are also rate-limited separately to protect subnet nodes from being DoS-ed.
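
To make items 2 and 3 concrete, here is a minimal sketch of a 1-second query cache and a simple token-bucket rate limiter. This is only an illustration, not the boundary node implementation: the names QueryCache and TokenBucket, the cache key layout, and the choice of a token bucket are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (canister id, method name, argument bytes).
CacheKey = Tuple[str, str, bytes]


class QueryCache:
    """Caches query responses for a short, fixed time-to-live (assumed design)."""

    def __init__(self, ttl_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[CacheKey, Tuple[float, bytes]] = {}

    def get_or_compute(self, key: CacheKey,
                       execute: Callable[[], bytes]) -> bytes:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]           # fresh hit: the query is not forwarded
        response = execute()          # miss or expired: forward the query
        self._entries[key] = (now, response)
        return response


class TokenBucket:
    """Allows `rate_per_second` calls on average, with bursts up to `burst`."""

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a cache like this, repeated identical queries to a hot canister within the same second are answered from memory, so only the first one in each window reaches the subnet.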

With the above fixes, we expect overall performance improvements in handling query load and sudden traffic spikes. That said, we still have some work in progress:

  • Respect the Cache-Control response header returned from the http_request public method of a canister, so that developers have greater control over how static assets are managed and transported.

  • Add a query cache in the replica, since the replica knows whether a canister's state has been updated or not.
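
As a rough sketch of what these two items could look like (they are still work in progress, so none of this reflects the actual design): the first function derives a cache lifetime from a Cache-Control value, and the second class keeps a cached response only while the canister state is unchanged. The names ttl_from_cache_control, VersionedQueryCache and state_version are hypothetical, and state_version stands in for whatever signal the replica uses to detect a state update.

```python
from typing import Callable, Dict, Optional, Tuple


def ttl_from_cache_control(header: Optional[str],
                           default_ttl: float = 1.0) -> float:
    """Return a cache lifetime in seconds from a Cache-Control header value.

    Simplified on purpose: only max-age, no-store and no-cache are handled,
    and no-cache is treated as "do not cache" rather than "revalidate".
    """
    if not header:
        return default_ttl
    directives = [part.strip().lower() for part in header.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return 0.0
    for directive in directives:
        name, _, value = directive.partition("=")
        if name.strip() == "max-age" and value.strip().isdigit():
            return float(value.strip())
    return default_ttl


class VersionedQueryCache:
    """Reuses a response for as long as the canister's state version matches."""

    def __init__(self) -> None:
        # (canister id, method, argument) -> (state version, response)
        self._entries: Dict[Tuple[str, str, bytes], Tuple[int, bytes]] = {}

    def get_or_compute(self, canister_id: str, state_version: int,
                       method: str, arg: bytes,
                       execute: Callable[[], bytes]) -> bytes:
        key = (canister_id, method, arg)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == state_version:
            return entry[1]           # state unchanged: cached answer is exact
        response = execute()          # state changed or first call: re-run query
        self._entries[key] = (state_version, response)
        return response
```

Unlike a fixed 1-second window, a cache keyed on state changes never serves a stale answer and can keep an entry for as long as the canister is not updated.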

As always, please report any problems you notice so we can keep improving. Thanks!