High load simply means too many canisters wanting to do too much at the same time for the subnet to keep up. It's the same as your laptop/desktop when you try to play an FPS while the OS is updating and you're compressing a huge video. Only a lot worse, because we have subnets where literally 20k canisters have something to do all at once.
Right now, you can sort of guesstimate it from the number of updates (which is not foolproof: 4 huge updates per round can fully occupy a subnet), plus whether the subnet is experiencing drops in block rate. Because instruction costs are not 100% accurate and because of high contention (lots of canisters doing memory writes, or simply the context switching), a round on a highly loaded subnet often takes more than the 400 ms required to maintain a block rate of 2.5 blocks per second, whereas a mostly idle subnet almost always completes rounds in under 400 ms.
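To make that heuristic concrete, here's a rough sketch in Python. The thresholds and the example numbers are illustrative assumptions on my part, not anything the protocol or the dashboard defines:

```python
# Rough "is this subnet loaded?" heuristic based on public dashboard numbers.
# All thresholds below are illustrative assumptions, not protocol constants.

TARGET_BLOCK_RATE = 2.5                       # blocks per second a healthy subnet aims for
TARGET_ROUND_MS = 1000 / TARGET_BLOCK_RATE    # i.e. 400 ms per round

def looks_loaded(observed_block_rate: float, avg_updates_per_round: float) -> bool:
    """Guesstimate whether a subnet is under high load.

    observed_block_rate: blocks/s as reported on the dashboard.
    avg_updates_per_round: average update (ingress) messages executed per round.
    """
    # A sustained drop below the target rate means rounds are taking longer
    # than 400 ms, usually because execution can't keep up with the load.
    block_rate_dropping = observed_block_rate < 0.9 * TARGET_BLOCK_RATE

    # A high update count is suggestive but not conclusive: 4 huge updates
    # can occupy a round just as thoroughly as hundreds of small ones.
    many_updates = avg_updates_per_round > 100

    return block_rate_dropping or many_updates

# Example: a subnet doing 2.1 blocks/s with ~300 updates per round.
print(looks_loaded(2.1, 300))  # True
```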
We are also looking into exposing more metrics on the public dashboard, such as "scheduler latency": how many rounds the average canister has to wait before getting scheduled. FWIW, after the recent scheduler improvements, this number is consistently below 2 rounds on every subnet except one (bkfrj). So on most subnets right now the main source of latency is round duration: whenever a round takes a couple of seconds to complete, your ingress message has to wait that couple of seconds times 3 or 4 before it's eventually executed (because Consensus runs ahead of Execution, a backlog of blocks builds up). More consistent round durations are also something that is being worked on right now.
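As a back-of-the-envelope illustration of that effect (the backlog depth of 3-4 rounds is the figure quoted above; the round durations are hypothetical examples, not measurements):

```python
# Back-of-the-envelope ingress latency estimate.
# The backlog depth (how far Consensus runs ahead of Execution) is taken
# from the 3-4 rounds mentioned above; round durations are hypothetical.

def ingress_latency_s(round_duration_s: float, backlog_rounds: int = 4) -> float:
    """Approximate time an ingress message waits before being executed."""
    return round_duration_s * backlog_rounds

# Healthy subnet: ~0.4 s rounds -> roughly 1.6 s before execution.
print(ingress_latency_s(0.4))  # 1.6

# Loaded subnet: 2 s rounds -> roughly 8 s before execution.
print(ingress_latency_s(2.0))  # 8.0
```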