When preupgrade runs in a fragmented heap, there may not be enough capacity to push all the information into stable variables and externalize them into a monolithic buffer (which is then copied to the stable memory space). We are aware of these two problems:
- the need to run an explicit GC before preupgrade (and possibly after)
- the need to use a streaming technique to write stable variables to stable memory.
I am working on the former and @claudio is about to tackle the latter.
I knew there was an intermediate buffer, but I didn’t know it would take that much space… From <800 MB to “heap out of bounds” (i.e. 3.2 GB used) seems quite extreme.
But yeah, those two workarounds sound good. Do you happen to have an ETA on either of those solutions? (No rush but would be helpful to plan.) Also, would you recommend users migrate to using the ExperimentalStableMemory Motoko library, which I assume avoids these problems?
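For reference, by the ExperimentalStableMemory route I mean something like this minimal sketch (assuming the base library's grow/storeBlob/loadBlob API; the fixed offset and lack of bookkeeping are simplifications for illustration):

import StableMemory "mo:base/ExperimentalStableMemory";
import Blob "mo:base/Blob";

actor Sketch {
    public func save(b : Blob) : async () {
        // Ensure at least one 64 KiB page of stable memory is allocated.
        if (StableMemory.size() == 0) {
            ignore StableMemory.grow(1);
        };
        // Write the blob at offset 0; it persists across upgrades without
        // going through stable-variable serialisation in pre/postupgrade.
        StableMemory.storeBlob(0, b);
    };

    public func load(len : Nat) : async Blob {
        // Read len bytes back from offset 0.
        StableMemory.loadBlob(0, len)
    };
};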
That doesn’t look like a Motoko allocation failure to me, which produces a different error message (IIRC); it looks more like an attempt to address wasm memory that hasn’t been pre-allocated by a memory grow.
Is there any chance you could share the offending code or a smaller repro?
This issue still seems to be unresolved. When I upgrade with the same wasm (Motoko), a canister with little memory usage upgrades successfully, but a canister with 70 MB of memory usage cannot be upgraded, and the error “Canister qhssc-vaaaa-aaaak-adh2a-cai trapped: heap out of bounds” is reported. I have run into this problem several times.
Thanks for the report! Though 70 MB doesn’t seem to be so high that the heap would be in danger due to serialisation of stable variables. Would you mind giving us the moc version used and a few hints about how the heap structure differs between the “little memory” and the “70M memory”? Ideally, if this is a test canister, a reproducible example of canister code + message sequence leading to the problem would allow us to come up with the fastest bug fix.
Is there any chance your data structure has a lot of internal sharing of immutable values, i.e. is a graph, not tree, of data? The serialisation step can turn a graph into a tree by duplicating shared values, which might exhaust memory on deserialisation.
If all the stable vars are left unchanged, the upgrade can be done successfully.
This issue occurs if a stable var (a trie structure) containing thousands of entries is deleted. It seems to trigger a reorganization of the whole stable memory, because we found that canister memory increased to more than 100 MB during the upgrade.
I wrote an ICRC1 contract, f2r76-wqaaa-aaaak-adpzq-cai, which now uses 27 MB of memory with 18,590 records in a Trie.
I didn’t change any stable vars and this issue occurred when I upgraded. It seems to be related to the number of records.
Upgrade code (is the problem that a single array element cannot hold too much data?):
// upgrade
private stable var __drc202Data: [DRC202.DataTemp] = [];
system func preupgrade() {
    __drc202Data := [drc202.getData()];
};
system func postupgrade() {
    if (__drc202Data.size() > 0) {
        drc202.setData(__drc202Data[0]);
        __drc202Data := [];
    };
};
In memory, a Blob is represented by a 4-byte pointer to a byte vector, and references to the same blob share the vector via an indirection.
Your tries use those blobs as keys in a number of places (the fields of TempData). When multiple references to the same blob are serialized during upgrade, each reference to the blob is serialized to its own vector of bytes in the stream, losing all the sharing that was present in the in-memory representation.
I think that may be the problem here. It’s a known issue that the current stable representation of data can blow up due to loss of sharing. Unfortunately, I think you may have been bitten by this.
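To make the effect concrete, here is a hypothetical illustration (not taken from the thread): a single blob referenced from many records occupies its 32 bytes once on the heap, but once per reference in the serialised stream.

import Array "mo:base/Array";
import Blob "mo:base/Blob";

actor Illustration {
    // One 32-byte blob, allocated once on the heap.
    stable var sharedKey : Blob =
        Blob.fromArray(Array.tabulate<Nat8>(32, func (_ : Nat) : Nat8 { 0xAB }));

    // 10_000 records that all point to the *same* blob in memory (a graph).
    // During upgrade, stable-variable serialisation writes the 32 bytes once
    // per record, so the stream holds roughly 10_000 x 32 bytes, not 32 bytes.
    stable var records : [Blob] =
        Array.tabulate<Blob>(10_000, func (_ : Nat) : Blob { sharedKey });
};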
If that is the problem, the only workaround I can see would be to introduce another map that maps each TxId to a small integer, storing the large blob just once as a key, and then using the small integer as the key in the other tries.
The key is a 32-byte blob, using a trie structure, and inside the trie the key is 4 bytes (generated using Blob.hash).
Does this cause any problem? 32-byte keys are common in blockchain applications.
Our keys are not incremental integers; they are generated according to a rule that requires 32 bytes to avoid collisions. As you know, BTC and ETH use such rules.
If this is the reason, what can be done to improve it?
Should I use the ExperimentalStableMemory tool? But there is no didc-like toolkit in Motoko to convert structured data to a Blob.
Right, but I think the leaves of the trie will actually still store the full 32-byte (not bit) key, to resolve collisions. So each trie will contain its own reference to the key blob, which is then unshared on upgrade. But maybe I’m barking up the wrong tree.
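For reference, this is roughly how a key is built for the base library's Trie: the 4-byte hash drives the path through the trie, but the full blob is kept alongside the value in the leaf to resolve hash collisions (a sketch assuming base's Trie and Blob modules):

import Trie "mo:base/Trie";
import Blob "mo:base/Blob";

// The trie branches on the 32-bit hash, but the leaf still stores
// the complete 32-byte blob so that colliding keys can be told apart.
func key(b : Blob) : Trie.Key<Blob> = { hash = Blob.hash(b); key = b };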
I don’t disagree that what Motoko provides is insufficient and we actually want to improve it in future. But that won’t happen soon.
Given the current state, I can make these observations:
There are a few more stable Tries indexed on Blobs, and each transaction, of which there could be many, stores a sender and receiver Blob, many of which I’d imagine are shared between transactions (same sender or receiver).
Could you not maintain a two-way map from Blob to BlobId internally, and then just use the small BlobId to reference that particular Blob from other data structures (like a database table and key)? That would, I hope, mitigate the explosion on upgrade.
If easy to arrange, maintaining the large data in stable memory directly would also avoid the issue, but that’s probably a lot more work.
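To make that concrete, here is a rough sketch of the interning idea (the names BlobId, intern, idOfBlob and blobOfId are mine, not part of DRC202):

import Trie "mo:base/Trie";
import Blob "mo:base/Blob";
import Nat32 "mo:base/Nat32";

actor Interning {
    type BlobId = Nat32;

    func blobKey(b : Blob) : Trie.Key<Blob> = { hash = Blob.hash(b); key = b };
    func idKey(i : BlobId) : Trie.Key<BlobId> = { hash = i; key = i };

    // Each 32-byte blob is stored exactly once; everything else refers to it
    // by its small BlobId, so serialisation cannot duplicate the blob.
    stable var nextId : BlobId = 0;
    stable var idOfBlob : Trie.Trie<Blob, BlobId> = Trie.empty();
    stable var blobOfId : Trie.Trie<BlobId, Blob> = Trie.empty();

    // Return the existing id for a blob, or assign and record a fresh one.
    func intern(b : Blob) : BlobId {
        switch (Trie.find(idOfBlob, blobKey(b), Blob.equal)) {
            case (?id) { id };
            case null {
                let id = nextId;
                nextId += 1;
                idOfBlob := Trie.put(idOfBlob, blobKey(b), Blob.equal, id).0;
                blobOfId := Trie.put(blobOfId, idKey(id), Nat32.equal, b).0;
                id
            };
        };
    };
};

Transactions would then store the small BlobId of the sender and receiver instead of the full Blob, and look the Blob up via blobOfId only when it is actually needed.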
It may work, but it doesn’t fundamentally solve the problem. I don’t think a 32-byte blob is a big overhead, compared to the 32 bytes per int on the EVM. This feels like maintaining a database, but there is no database engine in a canister, and it is impossible for the subnet to allocate enough computational resources to each canister.
So I would like to make a suggestion.
Be upfront about reality: a canister is first and foremost a blockchain construct; it is not a server or a database. It is not the same as the EVM, but it is blockchain-based, so draw on Solidity’s years of experience and solutions. Provide more support for key-value map structures and restrict full-table traversal and comparison. There are enough dapps on ETH to prove that most business logic that relies on full-table matching and comparison can, with technical modifications, be achieved through key-value reads and writes. The IC is more scalable than ETH, so there is an opportunity to support this better.
Mechanisms can be designed to guide developers into a new paradigm. For example, a controller could be allowed to deploy a group of canisters on a subnet, with calls between the canisters in this group treated as internal calls that support atomicity. Developers could then use a proxy-impl-data (MVC-like) pattern to separate the entry point, business logic, and data into different canisters; in practice, only the impl canister would need to be upgraded frequently.
After some experimentation with upgrades, I managed to provoke the same error. It is a bug in the deserialization code, which requires more Rust/C stack than is available, as well as a better stack guard.
A fix is under review that should mitigate the failures, if not rule them out.