TL;DR: Recently, the subnet blockchain mpubz on the Internet Computer had a production issue, and (part of) the fix was adjusting the complexity of some System API calls:
| System API Call | Old Complexity | New Complexity |
| --- | --- | --- |
It’s a very sensitive topic for the community, not only from a financial point of view, but also from a technical standpoint. Unfortunately, the complexity adjustments were required to fix the production defect.
Please read on for more details.
Recently, execution rounds on one of the subnet blockchains on the IC started taking a very long time to finish (around 50s vs the target of about 1s). Here is a short recap of the root cause.
A subnet node executes a message. The message handler of the Canister Smart Contract runs a long loop of 1-byte stable reads (see the image below). For each System API call, execution has to exit the WebAssembly module, run native code on the Internet Computer node, and then resume the WebAssembly execution.
Every such 1-byte stable read is far more CPU-intensive than an average WASM instruction. But that was not the issue. To prevent a canister from executing messages forever, every Canister already has limits in place. However, to be deterministic across replicas, those limits are expressed in WASM instructions, not in seconds.
One of those limits is the 5B instructions per message limit. With the 1-byte stable reads in a loop, it takes almost a minute to hit that limit and finish the execution round, while the target is just 1 second per round.
For those instruction limits to make sense, the complexity of each System API call must correctly reflect the actual CPU/memory/disk/network cost on the node.
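To put numbers on that "almost a minute" (a back-of-the-envelope sketch, not production code; the constants come from the figures later in this post):

```rust
/// Rough estimate of how long a message executes before hitting the
/// per-message instruction limit, given an effective execution speed
/// in instructions per second (IPS).
fn seconds_to_hit_limit(instruction_limit: f64, effective_ips: f64) -> f64 {
    instruction_limit / effective_ips
}

fn main() {
    // 5B instruction limit; the 1-byte stable read loop runs at an
    // effective ~100M IPS (measured in the benchmarks below).
    let secs = seconds_to_hit_limit(5e9, 100e6);
    println!("{secs} s"); // 50 s, vs. the ~1 s round target
}
```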
Hence, one of the postmortem items was to measure the actual complexity of the System API calls, and to update them accordingly.
How can the complexity of WASM instructions be compared? The natural choice is the number of WASM Instructions per Second (IPS) as the main metric: first set up a baseline, then run a benchmark for each System API call and compare the results to the baseline.
How to set a baseline? An empty WASM loop with an iteration counter seems a good choice: there is a load of the counter, an increment, a store, a branch… Maybe not a very representative WASM workload, but it's good enough for this not-that-scientific benchmark, and it makes it easy to embed System API calls into the loop.
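As a sketch, the baseline loop is nothing more than the following (hypothetical Rust that compiles down to a WASM loop of a few instructions; `black_box` keeps the compiler from folding the loop away):

```rust
use std::hint::black_box;

/// Baseline benchmark body: an empty loop with an iteration counter.
/// Compiled to WASM, each iteration is a handful of instructions:
/// load the counter, increment, store, compare, branch.
fn baseline(iterations: u64) -> u64 {
    let mut counter = 0u64;
    while counter < iterations {
        counter = black_box(counter + 1);
    }
    counter
}

fn main() {
    assert_eq!(baseline(1_000_000), 1_000_000);
    // In the syscall benchmarks, a System API call such as
    // ic0.stable_size() is embedded right inside the loop body.
}
```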
With a bit of math, the complexity adjustment for each System API call can be calculated:
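The exact formula lives in EXECUTE_UPDATE.md; one plausible reconstruction from the numbers in this post (an assumption, not the official formula): if the instrumented loop counts `counted` instructions per iteration and runs at `bench_ips` versus the baseline's `baseline_ips`, then charging an extra `counted × (baseline_ips / bench_ips − 1)` instructions per call brings the syscall loop back in line with the baseline. Assuming ~10 counted instructions per iteration, this reproduces the +24 adjustment for ic0.stable_size discussed below:

```rust
/// Plausible reconstruction (not the exact formula from
/// EXECUTE_UPDATE.md): charge enough extra instructions per call
/// that the syscall loop's IPS matches the baseline loop's IPS.
fn complexity_adjustment(counted_per_iteration: f64, baseline_ips: f64, bench_ips: f64) -> f64 {
    counted_per_iteration * (baseline_ips / bench_ips - 1.0)
}

fn main() {
    // Assumption: baseline loop body plus one ic0.stable_size call
    // counts as ~10 instructions per iteration.
    // Baseline: 3.4B IPS; with ic0.stable_size inside: 1B IPS.
    let adj = complexity_adjustment(10.0, 3.4e9, 1.0e9);
    println!("+{adj:.0} instructions"); // ≈ +24, matching the post
}
```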
If you’re interested in more details, please see EXECUTE_UPDATE.md on GitHub.
Using production metrics, the average number of instructions executed per second in production is known. The light blue area in the middle of the graph below is therefore the "target": both basic WASM instructions and the more complex System API calls should fall within it.
Please see the first navy blue bar. That's the result of the baseline benchmark with the empty loop: 3.4B Instructions per Second. It falls within the light blue production area, confirming that the baseline benchmark makes sense.
Now see the second result, for the ic0.stable_size System API call. It's a relatively cheap syscall: it just exits the WebAssembly module, grabs the stable memory size, and returns to execution.
The benchmark shows that adding an ic0.stable_size call inside the baseline loop makes the loop 3.4 times slower (3.4B vs 1B IPS). The reason is that the assigned complexity of the ic0.stable_size call is just one WASM instruction, while in reality the operation is much more CPU-intensive.
Applying the formula above, the "complexity adjustment" for ic0.stable_size is +24. In other words, this syscall should be charged 24 more instructions to make it "fair" and aligned with the baseline.
Please see the third bar, the 1-byte ic0.stable_read from the production issue discussed at the beginning. The IPS here drops to just 100M Instructions per Second: the loop with a stable read inside is 34 times slower than the baseline loop. It also means it takes 50 seconds to hit the 5B instructions-per-message limit (5B / 100M), while the target round duration is just 1 second.
Applying the formula again, the complexity of the 1-byte ic0.stable_read should be increased by 442 instructions.
The last bar is an ic0.stable_read of a big chunk of stable memory. Here the situation is completely the opposite: as the complexity of stable reads counts every byte read, the assigned complexity of this syscall is extremely high. Applying the formula, the complexity of the 8K ic0.stable_read should be lowered by 7K instructions to be aligned with the baseline.
The results are polarized: for some System API calls the complexity must be increased, while for others it is too high in comparison with the baseline and must be decreased.
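For a rough sense of scale (assuming, per the per-byte accounting above, that an 8K read is charged roughly one instruction per byte; the exact per-byte cost is an assumption here):

```rust
/// Approximate charged complexity of one 8K ic0.stable_read after
/// the adjustment, assuming ~1 instruction counted per byte read.
fn adjusted_8k_read_cost() -> i64 {
    let counted = 8 * 1024; // ~8192 instructions counted per call
    counted - 7_000 // the post lowers it by ~7K instructions
}

fn main() {
    // Roughly 1.2K instructions charged per 8K read after the fix.
    println!("{} instructions", adjusted_8k_read_cost());
}
```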
The most important result, though, is that the problem is acknowledged, and there are now tools in place to compare the results before and after the adjustments. Around 40 System API calls are benchmarked. To close the production issue, the complexity of 5 syscalls was adjusted as shown at the beginning. Leaving the complexity unadjusted could lead to future production issues and possible denial-of-service attacks.
Please note that the calculated complexity adjustment for the 1-byte ic0.stable_read was +442 instructions, while it was adjusted by just +20. This was done to avoid breaking any production Canisters.
Clearly, this is not the end of the story. Improving the performance of the Internet Computer is a work in progress; it keeps getting faster and safer. There will be some great wins, and unfortunate mistakes, ahead.
It’s an exciting journey, and we’re happy to be here, with you.