Summary
During the SNS-1 decentralization sale, users experienced high latency and errors when interacting with the NNS front-end dapp. The dapp created a massive load on the NNS subnet, which is deliberately configured to prioritize security and correctness over performance, as it hosts protocol-critical canisters like the registry and the ledger.
The root cause of the incident was that the NNS front-end dapp executes user queries as update calls. Such queries, also known as replicated queries, provide the maximum level of security. However, they are also more expensive than single-replica queries, as each message goes through consensus and is processed by every replica of the subnet.
During the 45-minute sale, the NNS processed 3K decentralization sale participations, but these were just a fraction of the transactions processed: due to all the other ingress messages (replicated queries) coming from users of the NNS front-end dapp, the NNS subnet processed over 120K transactions in this period.
The high rate of replicated queries increased pressure on the ingress message pool and slowed down the processing of actual update messages. Due to several bottlenecks in the ingress message pool, the NNS subnet was processing only ~50 ingress messages per second during the launch. Almost all of these messages were replicated queries coming from automatic retries on the client side.
If the NNS front-end dapp used single-replica queries with certified variables, it would be able to handle more than 10K queries per second.
Unrelated to the main issue above, the high load caused the SNS subnet to stall due to a bug in state serialization. Note that there were other SNS-specific issues, such as duplicate transitions, that will get their own retrospective.
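The cost gap between replicated and single-replica queries can be illustrated with a toy model. This is not ICP code; the replica count and the unit-cost accounting are illustrative assumptions only:

```typescript
// Toy cost model (illustrative, not ICP code): a replicated query (update
// call) goes through consensus and is re-executed by every replica, while a
// single-replica query is served by one replica.

const REPLICAS = 40; // hypothetical subnet size, for illustration only

// Total execution work for n messages sent as replicated queries.
function replicatedCost(n: number, replicas: number = REPLICAS): number {
  return n * replicas; // every replica re-executes each message
}

// Total execution work for n messages sent as single-replica queries.
// Certified variables keep the responses verifiable by the client.
function singleReplicaCost(n: number): number {
  return n; // only one replica answers
}

// The subnet processed over 120K transactions during the sale window.
const messages = 120_000;
console.log(replicatedCost(messages));   // 4800000 work units
console.log(singleReplicaCost(messages)); // 120000 work units
```

In this model the replicated path costs a factor of `REPLICAS` more execution work, before even counting the consensus overhead that every update call additionally incurs.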
Timeline
2022-11-29:
- 16:43 UTC: Ingress traffic on the NNS subnet increases ahead of the SNS-1 decentralization sale launch, presumably induced by users refreshing the NNS front-end dapp.
- 16:46 UTC: Decentralization sale started as Proposal 93763 is executed.
- 16:48 UTC: Reports of performance degradation of the dapp and “503 Service is overloaded, try again later.” responses, due to a filled-up ingress pool and ingress rate limiting.
- 16:54 UTC: API health check alert indicates API HTTP timeouts.
- 17:20 UTC: Proposal 94416 to increase the max ingress per block on NNS to 400 is submitted.
- 17:28 UTC: Proposal 94416 is executed.
- 17:32 UTC: Decentralization sale has ended.
- 17:40 UTC: NNS processed pending messages and returned to normal operation.
What went wrong
- The NNS front-end dapp used expensive replicated queries instead of single replica queries with certified variables.
- The NNS subnet was conservatively configured to allow at most 150 ingress messages per block. Other subnets allow 1000 ingress messages per block.
- The pre-launch stress test of SNS-1 was performed against the backend canisters, but end-to-end tests including the frontend were only done at a smaller scale, so the impact of the replicated queries coming from the frontend went unnoticed.
- Ingress messages sent via Internet Identity contain two BLS signatures, which are currently not aggregated and are very expensive to validate before they are injected into blocks. Since the blocks contained many such messages, the block rate slowed down.
- Automatic retries on the client side did not use exponential backoff and contributed significantly to the load.
- The state serialization bug causing the SNS subnet to stall had already been identified during an unrelated load test on a testnet, and the fix was rolling out to subnets as part of the regular release process; it had not yet been deployed to the SNS and NNS subnets, which are updated last.
What went right
- The NNS subnet did not crash and processed all accepted ingress messages. It returned to normal operation after the sale ended.
Action items
- Consider implementing queries with certified variables instead of replicated queries in the NNS front-end dapp.
- Require all big launches of DFINITY-led projects to perform an end-to-end dry run under load that exceeds the expected load of the launch.
- Optimize ingress message validation for Internet Identity ingress messages.
- Use an increasing delay in agent-js for retried calls instead of firing them off immediately after rejection.
- Implement caching wherever possible on the NNS front-end dapp.
- Optimize SNS APIs for more efficient interaction with the front-end.
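As a sketch of the caching action item, a small client-side TTL cache could absorb repeated reads. All names here are illustrative, not actual NNS dapp code, and the fetch callback is synchronous for brevity; in the dapp it would be an async canister query:

```typescript
// Minimal TTL cache sketch for query results (illustrative names only).
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class QueryCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  constructor(private ttlMs: number) {}

  get(key: string, fetch: () => T): T {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // fresh cached result: no call to the subnet
    }
    const value = fetch(); // miss or expired: perform the real query
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```

With a TTL of even a few seconds, rapidly repeated reads (e.g. many users refreshing the sale status) would hit the cache instead of producing new ingress messages.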
Technical details
The main bottleneck in the processing of ingress messages on the NNS subnet was a configuration parameter conservatively set to allow only 150 ingress messages per block; other subnets have this parameter set to 1000. Another bottleneck was the validation of BLS signatures: all ingress messages sent via Internet Identity contain two BLS signatures, which are currently not aggregated and are very expensive to validate before they are injected into blocks. This caused the block rate of the NNS subnet to drop to ~0.3 blocks per second.
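These two numbers are consistent with the ~50 ingress messages per second observed during the launch:

```typescript
// Back-of-the-envelope check: the ingress cap and the degraded block rate
// together bound the subnet's ingress throughput.
const maxIngressPerBlock = 150; // NNS configuration at the time
const blocksPerSecond = 0.3;    // degraded block rate during the sale
const ingressPerSecond = maxIngressPerBlock * blocksPerSecond;
console.log(ingressPerSecond); // ~45, close to the ~50 messages/s observed
```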
Additionally, the current implementation of the agent-js library automatically retries all failed requests up to 3 times on the client side, without exponential backoff.
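A sketch of what client-side backoff could look like. The function names, base delay, and jitter factor are illustrative assumptions, not the actual agent-js API:

```typescript
// Compute retry delays that double each attempt, capped, with up to 25%
// random jitter so that retries from many clients do not synchronize.
function backoffDelays(retries: number, baseMs = 500, capMs = 8000): number[] {
  return Array.from({ length: retries }, (_, i) => {
    const exp = Math.min(capMs, baseMs * 2 ** i);
    return exp + Math.random() * 0.25 * exp;
  });
}

// Wrap a call so that failures are retried with increasing delays instead
// of being fired off again immediately after rejection.
async function callWithRetry<T>(call: () => Promise<T>, retries = 3): Promise<T> {
  const delays = backoffDelays(retries);
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

Under this scheme, a client that fails three times waits roughly 0.5s, 1s, and 2s between attempts rather than contributing three near-simultaneous extra messages to the ingress pool.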