I don’t know what’s going to happen once I hit the heap memory limit (4 GB), and I’m curious if anyone here has hit it, or tested what it’s like to tip-toe on the edge and see what happens right when one goes over.
A few questions.
1. What happens to the existing data in the canister?
2. What happens to the state of the canister from that call (trap & rollback)?
   a. What happens if I’m attempting not to add to, but to modify a number (say, increment a counter) and I hit the cycles limit?
   b. What happens if I query a canister and the intermediate data structures produced by that query make that canister hit the heap size limit?
3. Is this error catchable?
4. What happens if a canister hits the heap size limit during an upgrade (i.e. the new wasm is larger than the previous wasm)?
I’m building IC data storage tooling for applications, so I’d like to both enable developers and guard them from having to face these scenarios. Before starting the cycle burn to do this testing, any tips or stories from community members or internal DFINITY team members who have experienced or tested this would be super helpful.
Expectation: If an update call would take the canister over the limit, the update call will simply fail, so the user who made the update call will get a reject.
It would be cool if someone could test this with a toy canister.
I did make a similar test for numbers rolling over and there the behaviour was also that the update call would be rejected and the data would remain valid.
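A minimal Motoko sketch of such a toy canister (hypothetical and untested at the limit; call `grow` repeatedly and watch the counter climb toward 4 GB):

```motoko
import Prim "mo:⛔";
import Buffer "mo:base/Buffer";

actor {
  // Each retained chunk pins ~1 MB on the heap so the GC can't reclaim it.
  let chunks = Buffer.Buffer<[var Nat8]>(0);

  public func grow() : async Nat {
    chunks.add(Prim.Array_init<Nat8>(1_000_000, 0)); // ~1 MB of mutable bytes
    Prim.rts_heap_size() // report live heap size after the allocation
  };

  public query func heapSize() : async Nat {
    Prim.rts_heap_size()
  };
}
```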
You are actually limited to around 2 GB of heap memory if you want to use the upgrade hooks, because the deserialization process needs roughly twice the memory space taken by your actual data in order to function properly. So if your canister state grows beyond ~2 GB, your canister becomes unable to upgrade, because it will fail during post_upgrade.
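For context, a minimal sketch of the stable-variable pattern this limit applies to (names illustrative). Around the upgrade, the live heap structure and the serialized stable copy briefly coexist, which is where the ~2x memory requirement comes from:

```motoko
actor {
  // Live, heap-allocated working state.
  var entries : [(Text, Blob)] = [];

  // Copy that the system serializes into stable memory across the upgrade.
  stable var stableEntries : [(Text, Blob)] = [];

  system func preupgrade() {
    // Serialization of stableEntries happens after this hook runs,
    // while the heap copy is still alive.
    stableEntries := entries;
  };

  system func postupgrade() {
    // Same duplication on the way back in, during deserialization.
    entries := stableEntries;
    stableEntries := [];
  };
}
```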
Memory errors like this are not catchable, as they are unrecoverable panics from the runtime. From what I have tested so far, the canister just rolls back to its previous working state.
In looking at the docs for create_canister, it looks like one can attempt to avoid this by setting the memory_allocation parameter.
From the docs:

> `memory_allocation` (nat)
>
> Must be a number between 0 and 2^48 (i.e. 256 TB), inclusively. It indicates how much memory the canister is allowed to use in total. Any attempt to grow memory usage beyond this allocation will fail. If the IC cannot provide the requested allocation, for example because it is oversubscribed, the call will be rejected. If set to 0, then memory growth of the canister will be best-effort and subject to the available memory on the IC.
>
> Default value: 0
Therefore, it looks like calls are rejected if the memory_allocation threshold is set and passed, but calls trap + fail if this is not set and the canister heap memory capacity is exceeded.
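For illustration, here is a hedged sketch of setting memory_allocation from within Motoko by calling the management canister’s update_settings (record fields per the interface spec; the calling canister must be one of its own controllers for this to succeed, and `reserveMemory` is a name I made up):

```motoko
import Principal "mo:base/Principal";

actor Self {
  // Management canister, typed with just the method we need.
  let IC = actor "aaaaa-aa" : actor {
    update_settings : shared {
      canister_id : Principal;
      settings : {
        controllers : ?[Principal];
        compute_allocation : ?Nat;
        memory_allocation : ?Nat;
        freezing_threshold : ?Nat;
      };
    } -> async ();
  };

  public func reserveMemory(bytes : Nat) : async () {
    await IC.update_settings({
      canister_id = Principal.fromActor(Self);
      settings = {
        controllers = null;         // leave other settings unchanged
        compute_allocation = null;
        memory_allocation = ?bytes; // reserve (and pay for) this much up front
        freezing_threshold = null;
      };
    });
  };
}
```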
I do have a few follow-up questions though so that I can use the memory_allocation parameter most effectively.
Is memory_allocation more directly related to Prim.rts_heap_size, to Prim.rts_memory_size, or to something else entirely, like the management canister’s canister_status method?
I’m trying to figure out a good way to monitor when a canister will hit this memory_allocation or run out of canister/heap memory entirely. For the time being, let’s simplify things and say that the subnet itself is not a limiting factor (i.e. there is more canister space available on the particular subnet).
Also, is it possible to derive the remaining canister memory from within the canister (i.e. through prim)? I’d prefer not to involve inter-canister calls if possible, and would also prefer not to have to keep a running count of all the bytes that enter my canister via update calls.
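For what it’s worth, here’s a sketch of the kind of self-monitoring I mean, using the Prim counters mentioned above (the soft limit is my own assumption, taken from the ~2 GB upgrade-hook discussion earlier in the thread):

```motoko
import Prim "mo:⛔";

actor {
  // Soft limit under the ~2 GB upgrade-hook ceiling (assumption).
  let softLimit : Nat = 1_900_000_000;

  // Expose the RTS counters as a query so an external monitor can poll
  // memory usage without inter-canister calls or manual byte counting.
  public query func memoryStats() : async { heap : Nat; memory : Nat; headroom : Int } {
    let heap = Prim.rts_heap_size();
    {
      heap = heap;                         // live heap bytes
      memory = Prim.rts_memory_size();     // total allocated wasm memory bytes
      headroom = (softLimit : Int) - heap; // goes negative past the soft limit
    }
  };
}
```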
Actually the memory_allocation docs are a bit misleading.
As Roman replied to the other question about compute allocation, it accrues charges over time, like storage cost. Therefore, if you set a higher value for compute allocation, it’s expected that your canister will run out of cycles sooner than if you use the default 0 (which means best-effort scheduling for your canister), and even more so if you’re doing a load test where you presumably also burn a bunch of cycles.
Similar things can be said if you set a high memory allocation. Even if you don’t use all the memory you’ve set as your allocation, the system will charge your canister for it (because it has been reserved on your behalf and cannot be taken by anyone else).
If no memory allocation is set, then we fall back to best effort, meaning that if your canister attempts to increase its memory usage, it’ll succeed if there’s enough memory capacity left in the subnet. Your canister will be charged for the memory it’s using at any given time, with the risk that if the subnet is quite full, you might not be able to claim the memory you need at some later point.
So the initial canister memory is not 4 GB; it’s tiny (I don’t know exactly how much), and setting a value for memory_allocation means that you actually reserve and pay for it from the beginning.
We have an issue with one of our asset canisters here: its heap memory is full (more than 4.2 GB) and now it is unresponsive. So I’m afraid that we have effectively lost all the assets that are on this canister, as even light queries don’t work (only status from dfx works).
Thank you for your sacrifice and contributions to the community. This canister will not have died in vain.
Curious though, it sounds like the unresponsive mess is due to overflowing the heap during an upgrade, and not during an update/query call (which would have just trapped).
It sounds then like all Motoko devs should monitor Prim.rts_heap_size inside a canister and ensure it never surpasses 2 GB.
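A hedged sketch of what that guard might look like (the threshold and the `append` API are assumptions, not an established pattern):

```motoko
import Prim "mo:⛔";
import Array "mo:base/Array";
import Error "mo:base/Error";

actor {
  // ~1.9 GB: a conservative margin under the 2 GB upgrade-hook ceiling (assumption).
  let softLimit : Nat = 1_900_000_000;

  var log : [Blob] = [];

  public func append(value : Blob) : async () {
    // Reject writes once the heap nears the soft limit, preserving enough
    // headroom for preupgrade/postupgrade serialization to still succeed.
    if (Prim.rts_heap_size() > softLimit) {
      throw Error.reject("heap soft limit reached; split or offload before writing");
    };
    log := Array.append(log, [value]);
  };
}
```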
There is no upgrade involved here. The canister is a data bucket from a modified Rust bigmap-poc implementation. For our current needs, it is not supposed to be upgraded.
There is support for “splitting” a data canister in bigmap-poc. It’s not the most optimal split, but it exists.
Seems like there is a bug in the code that calculates the canister size and thus the splitting never started?
Or are you not running the periodic maintenance checks within bigmap-poc, @dymayday?
I’m not sure I understand how the heap would overflow during an upgrade and you wouldn’t get a trap. If you ever try to access more than 4GiB of heap, there should be a trap no matter what (as the Wasm heap is limited to 4GiB at the moment). Essentially, any message execution that somehow attempts to access beyond 4GiB should trap, regardless of this being an update/query call or upgrading the canister.
@dymayday well that’s unfortunate. Without the “maintenance” call running, this is bound to happen, since the amount of data in the canister will keep growing forever, until this limit is hit.
I wasn’t aware that the cycle limit was being hit. Back when the code was developed (and that was a while ago), the cycle limits were quite different.
It would probably be worth fixing the maintenance method before anything else. Do you know what needs to be done to fix it, or do you need help?
If it’s not a secret, which app is this?
I’m working on Distrikt.app.
I’m picking up where the previous dev left off, so I’m still fuzzy on the implementation, tbh. But I would much appreciate your help on the matter, for sure!
I haven’t had this happen or tested it out, but I’ve had several discussions with others that mention the dangers of an error occurring during the preupgrade system method, which could result in a canister no longer being upgradeable.
I don’t know anything about @dymayday’s issue, but from what he’s said, my hypothesis (not verified) is that this cycles-limit overflow originally occurred during the maintenance method, meaning that the canister was no longer being repartitioned. Still live and functioning, the canister then kept filling up with data until it surpassed the heap memory limit, at which point it was no longer responsive.
Another thing that could have happened (but didn’t in this case) is that the canister could have overflowed the heap and trapped during the preupgrade method. This would mean that the data serialization process during upgrades was overflowing the heap, so every upgrade attempt would trap, resulting in a canister that was no longer upgradeable. The canister would then keep filling up until it is no longer responsive.
It would be great if there were some canister data recovery mechanism, such that a canister which is no longer upgradeable due to heap overflow limitations can have its entire state downloaded by a controlling principal of the canister.
This is probably something that should be abstracted and native to the replica and not a library or method that is included after the fact, as I would imagine many developers will find themselves in this exact same state as @dymayday somewhere down the line.
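In the meantime, a sketch of what a library-level stopgap could look like — with the big caveat that it only helps while queries still execute, which is exactly why native replica support would be better (all names here are hypothetical):

```motoko
import Array "mo:base/Array";
import Nat "mo:base/Nat";

// Owner-gated, chunked export of raw state, so data can be paged out of a
// canister that can no longer upgrade. Chunk size should stay well under
// the message size limits.
actor class Exporter(owner : Principal) {
  var data : [Nat8] = []; // stand-in for the canister's real serialized state

  public shared query ({ caller }) func exportChunk(offset : Nat, len : Nat) : async [Nat8] {
    assert (caller == owner); // only the designated principal may read raw state
    if (offset >= data.size()) { return [] };
    let end = Nat.min(offset + len, data.size());
    Array.tabulate<Nat8>(end - offset, func(i : Nat) : Nat8 { data[offset + i] })
  };
}
```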
This is something that has been discussed internally in the past and we even have a feature request about it. Unfortunately, we haven’t been able to prioritise it given other work that has been occupying the relevant teams but it’s something I hope we can get higher on the list in the not so far future.