ToniqLabs uploads GBs and GBs of assets to the IC every week. When we upload, there is a good chance that the subnet crashes, or is at least severely impacted by data upload. Here is a screenshot of Internet Computer Network Status when we started to upload some assets. Notice that finalization rate goes down, cycles used goes way up, and state increases over time. The problem is that oftentimes it is so taxing on a subnet that it just dies (errors), and requires pulling the state to figure out where your upload left off, and then restarting the upload.
The subnet is not crashing, it is merely rate limiting updates. Because of the many memory writes, the orthogonal persistence implementation accumulates lots of “heap deltas” (modified heap pages) and, in order to avoid running out of RAM and crashing, stops handling any more transactions until the next checkpoint.
This affects the latency of any other transactions executing at the same time (an execution round takes 2 seconds instead of under 1) and the interval between checkpoints (these happen every 500 rounds by default, with those 500 rounds stretched over a longer period), but not much else.
See above, this is merely an effect of executing large ingress messages. Not entirely sure why uploading a couple of MB of images requires 4B instructions, but that’s what I’m seeing. (These metrics cover the whole subnet, not just your canister(s), but there was very little other traffic on that subnet before your uploads, so I’ll assume here that what I’m seeing is due to your uploads.)
Also an effect of the large ingress messages and large number of instructions. Are you doing image handling in your canister? I.e. rebuilding an image out of tiles? That may account for the CPU usage.
This is merely because of the uploads. Your canister is holding more and more images. I’m seeing a gradual increase from 1.25 GB to 1.5 GB over the first half of the 3 hour interval covered by the charts above (uploads, I suppose), then a step increase to 3 GB.
The errors are ingress messages timing out. Ingress messages have a maximum TTL of 5 or 6 minutes; because the subnet stops handling transactions for up to 5 minutes, and the messages themselves each take a couple of seconds to execute, I’m seeing 5 ingress messages expiring right around the time (and every time) that transaction execution resumes.
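To make the expiry mechanism concrete, here is a back-of-the-envelope sketch in Python using the rough figures from the observations above (a ~5 minute TTL, a pause of up to ~5 minutes, ~2 s per message); these are illustrative numbers, not protocol constants:

```python
# Rough model of why ingress messages expire around the time execution
# resumes. All figures are the approximate values observed above.
INGRESS_TTL_S = 5 * 60   # assumed ~5 minute max TTL for an ingress message
PAUSE_S = 5 * 60         # subnet stops executing updates for up to ~5 minutes
EXEC_PER_MSG_S = 2       # each large upload message takes ~2 s to execute

def expires(submitted_at: float, queue_position: int) -> bool:
    """Does a message submitted `submitted_at` seconds into the pause,
    sitting at `queue_position` in the queue, exceed its TTL?"""
    wait = (PAUSE_S - submitted_at) + queue_position * EXEC_PER_MSG_S
    return wait >= INGRESS_TTL_S

# A message submitted right as the pause begins waits the full TTL: doomed.
print(expires(submitted_at=0, queue_position=0))    # True
# A message submitted near the end of the pause gets executed in time.
print(expires(submitted_at=290, queue_position=0))  # False
```

So the messages that expire are exactly the ones that were sitting in the queue for (nearly) the whole pause, which matches seeing a small batch of expiries each time execution resumes.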
There should be no need for that. I understand that transactions hanging for 5 minutes and timing out is annoying as heck, but apart from the timed out transactions, all others should reliably execute and succeed (unless, of course, the canister itself fails them). But regardless, all updates that return success will have resulted in the write going through; all updates returning errors (timeout or trap) will have resulted in no changes to the state. There should be no need to do anything beyond retrying failed updates. If you find that’s not the case, then it’s definitely an issue that requires investigation.
I hope my comments above covered the first part (what’s happening). Regarding the second (what can be done to fix it), a couple of things can.
For one, you can rate limit your uploads, so the subnet doesn’t run out of heap delta so early in the 500-round checkpoint interval. This is merely a workaround and not something you, as a canister developer, should have to deal with. We need to figure out better ways of handling huge heap deltas. I know we’re already working on offloading them to disk instead of keeping them in RAM. Maybe it would also be feasible to have a dynamic checkpointing interval instead of a fixed one, as long as it’s deterministic (e.g. “every 500 rounds, or whenever the heap delta reaches the configured limit”); IIUC this is also connected to the DKG interval, so I’ll ask the Consensus team about it.
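A client-side throttle along these lines could look like the following Python sketch. `upload_chunk` is a hypothetical stand-in for whatever update call your agent library makes to the asset canister, and the chunk size and delay are tuning knobs, not protocol constants:

```python
import time

CHUNK_SIZE = 1_900_000   # stay safely under the ~2 MB ingress payload limit
DELAY_S = 2.0            # pause between chunks; tune against observed latency

def upload_chunk(chunk: bytes) -> None:
    """Placeholder for the real update call to your asset canister."""
    pass

def throttled_upload(blob: bytes, delay_s: float = DELAY_S) -> int:
    """Upload `blob` in chunks, sleeping between sends so the heap delta
    accumulates more slowly across the checkpoint interval.
    Returns the number of chunks sent."""
    sent = 0
    for offset in range(0, len(blob), CHUNK_SIZE):
        upload_chunk(blob[offset:offset + CHUNK_SIZE])
        sent += 1
        time.sleep(delay_s)
    return sent
```

The delay spreads the memory writes over more rounds; the right value depends on how many other canisters on the subnet are writing at the same time.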
Second, regarding cycle burn rate, you can try figuring out why your canister needs 4B instructions to handle what is presumably 2 MB worth of images. It’s not going to change anything about the heap delta and the resulting suspension of transaction execution (with its huge latency and expired ingress messages), but it should make it cheaper to upload assets.
Third, you can better keep track of the status of each ingress message you send. If it fails for any reason, retry it. If it succeeds, you’re all done. There should be no need to pull the state to figure out what went through and what didn’t.
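The bookkeeping suggested here can be sketched in a few lines of Python. `send_update` is a hypothetical stand-in for the real agent call; the key property (stated above) is that a successful update means the write committed, while a timeout or trap means no state change, so retrying is always safe:

```python
import time

def upload_with_retries(chunks, send_update, max_attempts=5):
    """Track every upload message by id and retry anything not confirmed.

    `chunks` maps chunk id -> payload; `send_update(chunk_id, payload)`
    stands in for the real update call and returns True only on confirmed
    success. Returns the set of ids that never succeeded (empty = all done).
    """
    pending = set(chunks)
    for attempt in range(max_attempts):
        for chunk_id in sorted(pending):
            try:
                if send_update(chunk_id, chunks[chunk_id]):
                    pending.discard(chunk_id)  # success => write committed
            except Exception:
                pass  # timeout or trap => no state change, safe to retry
        if not pending:
            break
        time.sleep(2 ** attempt)  # back off before the next sweep
    return pending
```

With this in place there is nothing to reconstruct from canister state after a failure: the client already knows exactly which messages went through.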
If I understand correctly, a canister update that changes the state of a single boolean value from true to false could take 5 minutes if a canister doing heavy ingestion on the same subnet is concurrently causing heap deltas to be created.
Perhaps rate limiting should not be left in the hands of the developer if one dapp is able to impact the performance of an entire subnet. It would be very simple to mount a denial-of-service attack against any canister resident on the same subnet simply by doing a large upload.
Is there a performance monitoring canister on a subnet, or a system variable that can be queried, that indicates what the performance looks like on the subnet? Similar to a ping? It would be nice to warn users: expect delays. One could always measure the nanoseconds between calls in a loop, but wouldn’t that risk exacerbating the problem?
It may be true that “the subnet did not crash”, but a user’s or dapp owner’s perspective may be quite different and less nuanced.
100% correct @northman. We actually put out a fix for this in this recent release (see the item “Execution: Introduce per canister heap delta limit to share the heap delta capacity in a fairer manner between canisters.”) That change basically adds rate limiting of the generated heap delta on a per-canister basis. So if a single canister does many writes then it should no longer prevent the whole subnet from making progress. Instead, just that canister may be prevented from running for several rounds.
Also, there’s ongoing work to use mmap-ed memory for heap deltas, so the OS can easily offload that memory to disk as necessary and the maximum heap delta can be dictated by available disk rather than available RAM (so there will effectively be no max heap delta limit anymore). This is just another of the IC’s growing pains.
@bob11, any chance you could share the relevant canister code (and what kind of traffic the canister was handling at the time)? The Motoko team want to understand why the canister was touching so much memory (in order to find a general solution to the underlying problem, whatever it turns out to be).