Introducing Performance Counter on The Internet Computer

berestovskyy · June 24, 2022, 7:23pm

Availability

WebAssembly: (import "ic0" "performance_counter" (func $ic0_performance_counter (param i32) (result i64)))
– available on all the subnets of the Internet Computer
Motoko: ExperimentalInternetComputer.countInstructions(comp)
– available in the latest Motoko and Motoko Playground. Note: as the Performance Counter doesn’t support measuring computations with inter-canister calls, the Motoko is designed the library to reject async functions to prevent misuse at the moment.
Rust: ic_cdk::api::call::performance_counter(counter_type: u32) -> u64
– available in the latest Rust CDK
The Internet Computer SDK (i.e. dfx):
– available in the latest beta.

What Is It?

The Canisters are encoded in WebAssembly. By executing a Canister method, in fact we execute WebAssembly instructions:

The Performance Counter is a new System API call, so Canisters can ask the Internet Computer how many instructions the Canister executed so far in the current message.

The counter is reset between messages. Note, if you use async calls, each async point corresponds to a new message.

Why Do We Need It?

The Performance Counter is a way to know runtime the complexity, and hence the cost of a piece of code.

We can use this information to profile our Canisters, benchmark our libraries, or even to dynamically charge other Canisters based on exact work done.

Are There Any Alternatives?

There are many profilers, flame graphs etc. The key differences are:

The Performance Counter is an exact number of WebAssembly Instructions, not an estimation.
The information is available runtime, during the execution, so the Canister can take decisions based on that.

Demo

Here is a small demo on Motoko Playground:

Fun fact, as I’m a backend engineer, it’s probably, my first Motoko program

High-level, it’s a bit modified Counter Motoko tutorial, but with two stable counters.

Click Deploy, and then mark Enable Profiling to enable the Flame Graphs.

`incCounter()`

  stable var counter : Nat = 0;
  public func incCounter() : async Nat64 {
    EIC.countInstructions(func() { counter += 1 });
  };

The first public function incCounter() just increases a stable counter. Note, the actual work is wrapped inside the countInstructions(), and it returns the number of instructions consumed by the lambda inside.

In the Candid UI on the left, call incCounter.

It returns 638 WebAssembly instructions. Why so many?

First of all, the ic0.performance_counter() System API call takes 200 instructions for a start. That’s an overhead of exiting the WASM Execution and providing the information to the Canister.

Next, profiling to build the Flame Graph. If we deploy the Demo with no profiling enabled, it will take just 234 Instructions, out which 200 instructions is a fixed overhead of the Performance Counter itself. Profiling is expensive, but invaluable!

Let’s compare the results with the Flame Graph. Hover the mouse over incCounter and it shows 929 WebAssembly instructions estimated by the profiler. Flame Graph includes many other things, like Candid arguments deserialization and function result serialization, so it’s expected to be more than the pure countInstructions() call.

Those are still comparable results. So far so good!

`incBigCounter()`

  stable var bigCounter : Nat = 12345678901234567890;
  public func incBigCounter() : async Nat64 {
    EIC.countInstructions(func() { bigCounter += 1 });
  };

Let’s try the second counter. As you can see, everything is exactly the same, but the initial value of the counter is huge.

Click Call in the Candid UI to call the incBigCounter()

Previously it took 638 WebAssembly Instructions, and now it’s 5330

Flame graphs to the rescue! We see that Motoko has started to use BigInt library underneath, as the initial value was too big to represent in normal form.

That’s a super simple, yet a great example why we need to profile our Canisters with real data structure sizes. The more data flows in, the more instructions it might take to process the requests. Even on such a basic level of a single counter.

Comparing with the Flame Graph estimations, now we start to see a difference. The Flame Graph estimates the call to ~2K Instructions, while in fact it took ~5K. But it will be super clear in the last example.

`readStableMemory()`

  public func readStableMemory() : async Nat64 {
    EIC.countInstructions(func() {
      var o = StableMemory.grow(1);
      var b = StableMemory.loadBlob(0, 65536);
    });
  };

In the readStableMemory() we just grow the stable memory, and read a binary blob from it.

Let’s call it…

Ok, so it took ~66K WebAssembly Instructions, but since the most of the work was done on the backend, on the System API level, the flame graph estimation shows just ~1K.

And that is why we need the Performance Counter! It’s precisely what the Canister will be charged for, it includes all the work done in WebAssembly and on the System API level.

Links

The Internet Computer Interface Specisication:
The Internet Computer Interface Specification | Internet Computer Home
Canister Cycles vs Performance Counter Wiki (or why async functions are tricky):
Comparing Canister Cycles vs Performance Counter - Internet Computer Wiki
This Demo on Motoko Playground:
Motoko Playground - DFINITY

Next Iterations

The Performance Counter interface is extensible, so a few new Counters might be added. At the moment we’re considering to add:

A counter which returns Cycles, not WebAssembly instruction.
A stable counter which will be growing across async calls/responses.

It was a small feature, but many changes across components and teams.
Thanks @roman-kashitsyn @ulan @claudio @chenyan @lwshang and everyone in Execution, Languages, Runtime, SDK teams!

icme · June 24, 2022, 10:54pm

Thanks for this great tool! Looking forward to using it.

Can you elaborate on how much overhead we can expect when writing to the heap in Motoko vs. writing to/incrementing variables Rust? Has this same performance test been done for the equivalent Rust code?

rckprtr · June 25, 2022, 12:03am

Extremely excited for this. What versions of the CDK-RS and the DFINITY Client do we need to use this?

ic-cdk = “0.5.2”

Im thinking it will be DFX “0.10.2” ? Just tried 0.10.1 and got asm module has an invalid import section. Module imports function 'performance_counter' from 'ic0' that is not exported by the runtime."

Thank you,
Rick

chenyan · June 25, 2022, 5:10am

Some clarifications:

performance_counter doesn’t support measuring computations with inter-canister calls. This is a limitation of the system API itself, not a limitation of Motoko. We design the library to reject async functions to prevent misuse.

The deserialization and serialization is used to decode input arguments and encode return values. It happens for all Candid methods, not specific for profiling. One limitation of the Motoko library function countInstructions is that it cannot measure the cost of Motoko runtime behaviors, such as encoding/decoding arguments and GC. But we can capture this in the flamegraph. So the comparison of these two numbers is a bit misleading, they measure different things: countInstructions measures the cost only for the function body, while flamegraph measures the whole canister call.

Quite the contrary, if the Rust canister is writing data directly to stable memory, it has more overhead, because the stable memory API is expensive, as we can see from the last example. Also Motoko base library has support for accessing stable memory directly, we choose not to do this for stable variables precisely because of the overhead.

A bit more background on why the flamegraph is not using performance_counter yet:

performance_counter resets for each message, so we cannot use it to measure functions with inter-canister calls
The flamegraph is generated by instrumenting Wasm binary. If we use performance_counter, the instrumented instruction itself will be counted towards the cost in the flamegraph, skewing the measurement result. But this can be mitigated by subtracting the instrument cost.
It seems the only thing we didn’t measure correctly in the flamegraph is the bulk memory instructions, which is easy to fix in the instrumentation code.

berestovskyy · June 25, 2022, 6:08am

hey @chenyan,
You’re right, I just rephrased all your points but one:

As you said below, the performance counter does not capture public function prologue/epilogue, so those 638 instructions definitely can’t be attributed to Candid. That’s the Flame Graph profiling, and if you try to deploy the Demo without profiling enabled, it takes just 234 instructions, out of which 200 is the Performance Counter itself.

UPD: I get your point, and both statements seems to be true: 1) the serialization captured on the graph is related to the Candid and 2) the Performance Counter captures the flame graph profiling.

So, I rephrased this again. Thanks a lot for clarifications, @chenyan

berestovskyy · June 25, 2022, 6:37am

Hm, probably, the manifest is still not updated… It’s available in the latest SDK beta: Release 0.10.2-btcbeta.0 · dfinity/sdk · GitHub

The following worked for me:

DFX_VERSION="0.10.2-btcbeta.0" sh -ci "$(curl -fsSL https://smartcontracts.org/install.sh)"

berestovskyy · June 25, 2022, 6:50am

Sorry, it was not quite correct comparison. Some might expect that counter += 1 takes like 5 instructions (load-inc-store), not 638. So my goal was to explain why it’s so many, not comparing it to Rust.

So, I clarified in the original post two points:

And the last point, Motoko allocates stable variables on the heap, so a bit of overhead in comparison with local variables will always be there.

and:

Note, if we deploy the Demo with no profiling enabled, it will take just 234 Instructions, out which 200 instructions is a fixed overhead of the Performance Counter itself. So, no worries about Motoko performance or overhead.

Hope that addresses your concerns, @icme

rckprtr · June 25, 2022, 4:06pm

Awesome works

[Canister rrkah-fqaaa-aaaaa-aaaaq-cai] Performance Data 1 467,420
[Canister rrkah-fqaaa-aaaaa-aaaaq-cai] Performance Data 2 420,468,819

ic_cdk::println!(
        "Performance Data 1 {:?}",
        ic_cdk::api::call::performance_counter(0)
    );

I put Data 2 at the bottom of a function that causes quite a bit of issues, very cool.

chenyan · June 30, 2022, 11:30pm

Thanks for the update. I also update the flamegraph to reflect the real system API cost.

As an example, here is flamegraph for readStableMemory:

Screen Shot 2022-06-30 at 4.25.48 PM

The cost of anon-func-17.27, which is one stack above countInstructions, is the real cost of the function body we want to measure. You will notice that the number measured by the performance counter is slight higher, because it includes the cost of the profiling instructions itself. If you deploy the same code without profiling, you will see the outputted number will be very close to what flamegraph measured.

lastmjs · July 6, 2022, 7:32pm

I’m having a really hard time using the performance counter in Motoko, considering it only allows a synchronous lambda. In Rust and Azle (which uses Rust under-the-hood) the performance counter API is working fantastically. I am able to capture all of the Wasm instructions up to the point of calling performance_counter(0), and it will capture all work (I assume) including Candid deserialization of the parameters (apparently the Rust CDK performance counter does not capture the instructions for Candid deserialization).

But in Motoko it seems like a whole lot of instructions are being cut out, and the lambda makes it very difficult to use the performance counter nimbly in many different situations.

The current API makes it difficult for me to do in-depth benchmarks across Azle, Rust, and Motoko since the Motoko EIC.countInstructions leaves so much out compared to the Rust CDK. I also really want the garbage collector to be in the performance counter for Motoko, as Azle is also garbage collected and I would assume its gc is included in the performance counter (it uses Rust CDK under-the-hood).

I would like to request the API be changed to match the Rust CDK API, just exposing a simple performance_counter function that returns the number of instructions up to that point.

chenyan · July 6, 2022, 9:21pm

Nope. It allows you to call performance_counter anywhere, but it doesn’t mean you always get the correct result. After each await, the counter resets to zero. So the only meaningful way to measure the instructions is within a synchronous lambda. Unless you change the Rust CDK, it doesn’t capture the work for Candid either.

lastmjs · July 6, 2022, 9:26pm

You can store the result of a previous performance_counter call just before an await boundary correct? If this is true then it seems relatively simple to add up all instructions across await boundaries with the Rust CDK, but not with the Motoko sync lambda.

I think it would be much more flexible to expose the same API in Motoko and let the developer decide when to call it. Storing a value before the await boundary doesn’t seem that difficult to do, but the sync lambda seems to prevent a lot of use.

chenyan · July 6, 2022, 10:02pm

Theoretically yes, but unless you do this at the Wasm level, you still cannot cover all costs.

let start = call performance_counter();
...
global_counter += call performance_counter() - start;
let res = canister.method(...).await?;
...
global_counter += call performance_counter();

In the above code, the costs we didn’t cover are: 1) the system API cost for making inter-canister calls; 2) Candid serialization cost for canister.method; 3) the Candid cost for the main method where this code resides; 4) The Candid deserialization cost for res is counted though.

In Motoko, you can use Prim.performanceCounter(0) to get the equivalent of Rust. Just be careful to know what exactly you are counting.

lastmjs · July 7, 2022, 4:08pm

This could be exactly what I was looking for. How do I access this though (what is Prim)? I don’t see this in the Motoko Reference: Home | Internet Computer or the base library: motoko-base/src at master · dfinity/motoko-base · GitHub

lastmjs · July 7, 2022, 5:15pm

Is Prim from an emoji? motoko-base/ExperimentalInternetComputer.mo at master · dfinity/motoko-base · GitHub

domwoe · July 7, 2022, 5:35pm

import Prim "mo:prim";

Prim.performanceCounter(0);

lastmjs · July 7, 2022, 5:40pm

Oh that’s better, thanks!

lastmjs · July 7, 2022, 5:43pm

I’m getting this:

main.mo:8.1-8.22: import error [M0010], package "Prim" not defined

lastmjs · July 7, 2022, 5:46pm

I had to do import Prim "mo:⛔";

domwoe · July 7, 2022, 5:51pm

I’m sure I used that in the past. Funny…

Topic		Replies	Views
Ic_cdk::api::instruction_counter in Motoko? Getting Started	4	445	November 20, 2023
Per-Call Instruction Counter (performance counter type 1) Developers	0	318	June 2, 2023
Performance testing best practices Developers	3	595	December 20, 2023
Introducing " motoko-benchmarking-ts", a TypeScript module for Motoko canister benchmarking Programs & Applications	0	33	August 7, 2024
Best Practices for Optimizing Canister Performance Rust	1	47	December 30, 2024