Hey there,
in our project we’re observing an interesting performance anomaly. We’re storing encrypted messages in canisters and tried to identify the best batch sizes for reads and writes. Checking different batch read window sizes (reading 10, 100, 1000, or 10000 messages at once), we got these measurements:
```
> ./test.exs read_bench
...snip...
Name              ips        average  deviation         median         99th %
read 10        344.41        2.90 ms    ±18.94%        2.85 ms        4.25 ms
read 1000      100.74        9.93 ms    ±25.17%        9.60 ms       16.75 ms
read 100        21.30       46.96 ms     ±1.89%       46.90 ms       49.54 ms
read 10000       1.51      660.43 ms     ±2.61%      653.82 ms      697.74 ms
...snip...
```
So there is an outlier here when reading messages with a batch size of 1000. Reading a single batch of 100 messages takes ~47 ms on average, but reading a single batch of 1000 messages takes only ~10 ms on average! So we’re getting more messages in less time! We’re always calling the same function in the canister, zone_availability_canister/src/DiodeMessages.mo at master · diodechain/zone_availability_canister · GitHub:
```motoko
public func get_messages_by_range(store: MessageStore, min_message_id: Nat32, in_max_message_id: Nat32) : [Message] {
  let max_message_id = Nat32.min(in_max_message_id, get_max_message_id(store));
  if (min_message_id > max_message_id) {
    return [];
  };

  Iter.range(Nat32.toNat(min_message_id), Nat32.toNat(max_message_id))
  |> Iter.map(_, func (i : Nat) : Message {
    let offset = get_message_offset_by_id(store, Nat32.fromNat(i));
    get_message_by_offset(store, offset);
  })
  |> Iter.toArray(_);
};
```
Which is wrapped by the actor:
```motoko
public shared query func get_messages_by_range(min_message_id: Nat32, max_message_id: Nat32) : async [DiodeMessages.Message] {
  DiodeMessages.get_messages_by_range(dm, min_message_id, max_message_id);
};
```
And the bench code checks the number of returned messages, so we know the results are correct.
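One idea we had for narrowing this down (just a sketch, not in the repo; the name `get_messages_by_range_instrumented` is made up, and `countInstructions` comes from `mo:base/ExperimentalInternetComputer`) is an instrumented variant of the query that also returns the number of instructions the canister itself spends, so we could separate canister execution from candid encoding, transport and agent overhead:

```motoko
import IC "mo:base/ExperimentalInternetComputer";

// Sketch only: this would live in the same actor as the existing query,
// reusing the `dm` store and the DiodeMessages module already imported there.
public shared query func get_messages_by_range_instrumented(min_message_id: Nat32, max_message_id: Nat32) : async ([DiodeMessages.Message], Nat64) {
  var result : [DiodeMessages.Message] = [];
  // countInstructions runs the closure and reports the instructions it used.
  let instructions = IC.countInstructions(func() {
    result := DiodeMessages.get_messages_by_range(dm, min_message_id, max_message_id);
  });
  (result, instructions);
};
```

If the instruction count scales roughly linearly with batch size while the wall-clock times jump around, that would point at something outside the canister execution rather than at the Motoko code itself.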
Going deeper into this and trying to isolate the breaking point, it seems that once we cross a certain response size this “turbo anomaly” kicks in.
It’s wild that performance suddenly improves 10x once we cross a certain threshold. So my question is: what can we do to dig deeper into this and find the root cause? I was initially thinking there might be some caching in the replica, but this effect is not from repeated requests. Could this be a local-only dfx replica issue, or a Motoko compiler anomaly?
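To check whether the threshold really depends only on response size, one thing we could try (again just a sketch, not in the repo; `get_dummy_payload` is a made-up name) is a query that returns a configurable number of dummy blobs of a configurable size, and then bench it with the same batch sizes while bypassing DiodeMessages entirely:

```motoko
import Array "mo:base/Array";
import Blob "mo:base/Blob";

// Sketch only: returns `count` blobs of `size` zero bytes each, so we can
// benchmark pure response-size effects without touching the message store.
public shared query func get_dummy_payload(count : Nat, size : Nat) : async [Blob] {
  let blob = Blob.fromArray(Array.tabulate<Nat8>(size, func (_ : Nat) : Nat8 { 0 }));
  Array.tabulate<Blob>(count, func (_ : Nat) : Blob { blob });
};
```

If the same ~10x jump shows up with the dummy payload, the anomaly would be in how the local replica / agent handles large query responses rather than in our Motoko code.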
Also, is this test even representative of what we will experience once deployed on a real network?
Any suggestions from an ecosystem expert on how to debug this further would be amazing!
Thanks!