High Demand - Performance IC (AIRDROP SNS) - Important :motoko_go:

Something curious happened when the SNS airdrop was launched: the three applications involved (dscvr, distrikt, OP), but especially dscvr, had degraded performance.

Simple questions…

  1. What is the performance limit in cases of high demand?
  2. Is this something developers can address, or is it exclusively up to DFINITY?
  3. Is it only related to the nodes that process requests, and is it solved by adding more nodes?
  4. What is the growth plan to avoid this type of performance degradation?

Thank you very much !!

Plus: the only thing I noticed that was different is that burning increased 2x, as I commented on Twitter.

12 Likes

Yep, it actually made DSCVR unusable… And I guess traffic didn’t reach 1% of a social network such as Twitter.
So… My biggest concern is: is ICP ready for commercial application use?

3 Likes

Great question. I was surprised how slow it became.

2 Likes

You also have to think that growth is progressive and it was an atypical situation… but let’s see what the DFINITY team responds. Thank you!

Right! I was surprised.

Great questions @FranHefner and @gatsby_esp!

I discussed them with @dsarlis and @stefan-kaestle. Here is our combined response.

We will focus on query performance and throughput here because that was the main reason for the degraded user experience during the airdrop.

Q1: What is the performance limit in cases of high demand?

There are several factors that affect the query performance/throughput:

  1. Is the query cacheable or not cacheable? A query is cacheable if its request headers and input parameters do not change frequently.
  2. How many instructions does the query execute?
  3. How much memory does the query read and write?

In the best case, the query is cacheable. Then it is handled by the boundary nodes and the expected throughput is about 100K queries per second per boundary node. The total throughput scales linearly with the number of boundary nodes.

If the query is not cacheable, then the query performance depends on how much work the query does when running, in terms of executed instructions and accessed memory. If the query is optimized and does very little work (e.g. executes a few thousand instructions and touches a few pages), then we can expect a query throughput of ~3000 queries per second per node. On a subnet with 13 nodes, that would translate to about 40K queries per second. This number can increase by 10x with more investment in core IC optimizations.

If the query is not well optimized and executes millions of instructions, then the query throughput decreases proportionally to the performed work. This is similar to query performance in traditional Web2. That’s why it is important that developers stress test and optimize bottlenecks in the query code before a big launch and expected workload peaks.

One area for improvement here is tooling for monitoring and debugging the performance of canisters, which can be developed both by the community and by DFINITY.

Q2: Is this something developers can address, or is it exclusively up to DFINITY?

As explained above, both the query code of canisters and the IC code are critical for performance.

If the query does a lot of work (millions of instructions), then the highest impact is on the developer side. By optimizing the query to execute thousands instead of millions of instructions, the developer can increase the throughput by several orders of magnitude. Typical performance optimizations known from other systems also apply to the IC, for example: 1) avoid random memory accesses in favor of linear access and leverage locality, and 2) cache and index data that is read frequently.
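
To make the indexing point concrete, here is a minimal Rust sketch of the pattern. The canister state, the `Post` type, and the method names are hypothetical and only illustrate the idea, not dscvr’s actual code; it assumes the Rust CDK (`ic-cdk`).

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical post type; real applications would have more fields.
#[derive(Clone)]
struct Post {
    id: u64,
    author: String,
}

thread_local! {
    // Primary storage: all posts keyed by id.
    static POSTS: RefCell<HashMap<u64, Post>> = RefCell::new(HashMap::new());
    // Secondary index: post ids per author, maintained on every write.
    static POSTS_BY_AUTHOR: RefCell<HashMap<String, Vec<u64>>> = RefCell::new(HashMap::new());
}

// Called from an update method whenever a post is created.
fn add_post(post: Post) {
    POSTS_BY_AUTHOR.with(|idx| {
        idx.borrow_mut().entry(post.author.clone()).or_default().push(post.id);
    });
    POSTS.with(|posts| {
        posts.borrow_mut().insert(post.id, post);
    });
}

// Slow version: a full scan touches every post (and every memory page that
// holds one), so the work grows with the total number of posts.
#[ic_cdk::query]
fn posts_by_author_scan(author: String) -> Vec<u64> {
    POSTS.with(|posts| {
        posts
            .borrow()
            .values()
            .filter(|p| p.author == author)
            .map(|p| p.id)
            .collect()
    })
}

// Fast version: one index lookup; the work is proportional to the result size.
#[ic_cdk::query]
fn posts_by_author_indexed(author: String) -> Vec<u64> {
    POSTS_BY_AUTHOR.with(|idx| idx.borrow().get(&author).cloned().unwrap_or_default())
}
```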

On the IC side, we see a potential for at least 10x improvement in query performance with more investment in optimizations.

Q3: Is it only related to the nodes that process requests, and is it solved by adding more nodes?

Queries are handled by individual nodes locally and do not require communication with other nodes of the subnet. Therefore, adding more nodes will improve the query throughput proportionally. However, as mentioned in the previous answer, adding more resources might not be necessary if both the canister code and the IC code continue to be improved.

Q4: What is the growth plan to avoid this type of performance degradation?

The best way to avoid performance degradation is to stress test and profile query performance to find and remove the bottlenecks:

  1. Make sure that most queries are cacheable.
  2. Measure the number of instructions executed by a query using ic0.performance_counter() (see the sketch after this list).
  3. Estimate the number of memory pages accessed by the query and leverage locality and linear access whenever possible.
  4. Consider more optimal data-structures and other well known optimization strategies such as indexing.
  5. If you want to grow beyond O(100K) queries per second, then consider scaling out to multiple subnets.
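
As a rough illustration of point 2, here is a sketch of a query that measures its own instruction count with the Rust CDK. It assumes a recent ic-cdk where the counter is exposed as `ic_cdk::api::performance_counter` (the wrapper around the `ic0.performance_counter` system API; the exact module path can differ between CDK versions), and `do_search` is a hypothetical placeholder for the code being profiled.

```rust
use ic_cdk::api::performance_counter;

#[ic_cdk::query]
fn search(term: String) -> Vec<u64> {
    // Counter type 0: instructions executed so far in the current message.
    let start = performance_counter(0);

    // The actual query logic being profiled (hypothetical helper).
    let results = do_search(&term);

    let used = performance_counter(0) - start;
    // Debug output is visible in the local replica / canister logs.
    ic_cdk::println!("search({}) used ~{} instructions", term, used);
    results
}

fn do_search(_term: &str) -> Vec<u64> {
    // Placeholder for the real search implementation.
    Vec::new()
}
```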

Q5: Is ICP ready for commercial application use?

We are not aware of any fundamental limitation that would prevent commercial applications from using the IC effectively and scaling to a large number of queries per second. As the answers above outline, and as with other systems, a combination of factors contributes to query throughput. An application that is designed to handle the expected load efficiently, by taking advantage of caching on the boundary nodes, keeping queries optimized to do as little work as possible, and potentially using multiple canisters to serve queries and avoid memory bottlenecks on a single canister, should be able to scale to hundreds of thousands of queries per second, especially when also scaling out across subnets.
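
On the multiple-canister point, one common approach is to partition the data across several canisters and route each query deterministically to the canister that owns the key. The sketch below only illustrates that routing idea, not a prescribed design; the `shard_for_key` helper and the assumption that data is partitioned by key hash are hypothetical.

```rust
use candid::Principal;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Given the principals of N data canisters holding disjoint shards, pick the
/// canister responsible for a key. The same mapping must be used on writes so
/// that reads find the data where it was stored.
///
/// Note: DefaultHasher is not guaranteed to be stable across Rust releases; a
/// production system should use a fixed, explicitly chosen hash function.
fn shard_for_key(key: &str, shards: &[Principal]) -> Principal {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    let index = (hasher.finish() as usize) % shards.len();
    shards[index].clone()
}
```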

20 Likes

Wow, very happy with the detailed answer and the time spent explaining it to the community!

Excellent work!! The technical part makes sense. Now that I think about it, with any provider, for example the IC or AWS, if the code is not optimized it will be slow.

What happened in this specific case, then, is that dscvr was not prepared for such a large number of queries in such a short time (the positive view is that users are increasing), and the other view is that dscvr has to improve its code to ensure it does not have degraded performance in these types of scenarios.

Again, thank you very much! I marked your answer as the solution, and it will remain a truly useful answer to the questions that someone new will surely ask.

A hug to the whole team, especially @ulan, @dsarlis and @stefan-kaestle :smiling_face_with_three_hearts:. I think this answer deserves to be on Twitter; with your permission I’ll share it, thanks!

11 Likes