I maintain a small canister (kind of a toy social network) and I am getting more and more concerned about potential data loss and the ZERO support for canister data backups. So far, I was able to back up the data by serializing the entire canister state and exporting it via a query, but that obviously doesn’t scale. My canister recently started supporting picture uploads, and my poor man’s backup solution hit its limits pretty quickly. So now there is no way for me to have a full backup of my canister anymore, and that’s pretty scary.
Of course I could start implementing my own backup solution: extracting the data in chunks (since one query can only return a limited amount of data) and then implementing a restoration system that would consume the backup data via small ingress messages and reassemble the full state. But it feels like I’m starting to solve problems which do not even exist on Web 2.0 instead of actually building a decentralized service.
My question is: what is DFINITY’s roadmap regarding canister backups, and what are the deadlines? Thanks!
I’ve thought a good bit about this as well. I’ve considered wrapping a data class in an object that makes upgrades and backups super easy. I’ve been told the ability to upload/download canister state is coming and that should drastically reduce the vulnerability here, but I don’t know the timeline.
Thanks @nomeata! Not sure I understand the partial writes concern, but maybe I’m missing something. My canister only touches the stable memory on upgrades, that’s it. So before I start the backup, I explicitly call a function that calls pre_upgrade, which atomically dumps the heap into the stable memory. After that, the stable memory is not expected to change. The restoration likewise updates the stable memory first and then atomically deserializes the heap from it in a separate message. So I’m not sure why halting is needed, assuming I never upgrade and back up at the same time?
PS: There is obviously a possible race during the restoration in which some updates might get lost, but the restoration is not an ordinary operation and is only required when some serious data loss or corruption has happened, so losing updates is not a real concern here.
It sounds to me like you’ve done most of the work already! Why not take the time and make a small package / lib / blog post out of it, and maybe @skilesare can help out with a bounty from ICDevs for your troubles. win-win
I’m not sure it’s worth it. It’s still a poor man’s backup solution: I’m currently reading and writing the stable memory page by page (because query payloads are limited in size). I’ve set the page size to about 1 MB. I also need to base64-encode the data. Now every query takes about half a second for one page! Restoring the backup will obviously take even longer. So for canisters with a heap of hundreds of megabytes, let alone gigabytes, this won’t really work.
I’m currently reading and writing the stable memory page by page […] I’ve set the page size to about 1 MB
I remember Rick from dscvr saying something along the same lines. I believe they figured out that you start with a larger page size, and if the call succeeds you continue; if not, you retry the query with a smaller page size.
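To make the retry idea concrete, here is a minimal sketch of it in plain Rust. It is not dscvr’s actual code: `fetch` is a hypothetical stand-in for the query call, the `limit` parameter simulates the reply size cap, and halving the page size on failure is just one possible back-off rule.

```rust
/// Hypothetical stand-in for a query call: it fails whenever the requested
/// page is larger than the (simulated) reply size limit.
fn fetch(data: &[u8], offset: usize, len: usize, limit: usize) -> Result<Vec<u8>, String> {
    if len > limit {
        return Err(format!("reply of {} bytes exceeds limit of {}", len, limit));
    }
    let end = (offset + len).min(data.len());
    Ok(data[offset..end].to_vec())
}

/// Downloads `data` page by page, halving the page size after every failed
/// fetch. Returns the downloaded bytes and the page size that finally worked.
fn download_adaptive(
    data: &[u8],
    start_page: usize,
    limit: usize,
) -> Result<(Vec<u8>, usize), String> {
    let mut page = start_page.max(1); // a zero-sized page would never make progress
    let mut out = Vec::with_capacity(data.len());
    while out.len() < data.len() {
        match fetch(data, out.len(), page, limit) {
            Ok(chunk) => out.extend_from_slice(&chunk),
            Err(_) if page > 1 => page /= 2, // retry the same offset with a smaller page
            Err(e) => return Err(e),         // even a 1-byte page failed: give up
        }
    }
    Ok((out, page))
}
```

In a real client, `fetch` would be the agent call to the canister, and only size-related errors should trigger the back-off.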
As to converting to base64, is that really necessary? If you go the Rust way, you should be able to query a vec of bytes, right?
Yes, that’s what I started with. The problem is that the blob returned by dfx is still encoded (to be printable, I guess), and the encoded form is larger than what I can pass to dfx as a command line arg when I use it to restore the state (e.g. on my local replica).
Probably worth replacing the shell script with a proper backup client using the Rust agent; then no binary data needs to be pretty-printed.
It seems that the interface (dump_to_stable, fetch_page, upload_page, load_from_stable) is actually generic enough so that this tool could be used by anyone.
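For anyone following along, here is a rough sketch of how those four calls fit together, written as plain Rust with no IC dependencies. Only the four method names come from the thread; the `Canister` struct, the tiny `PAGE_SIZE`, and the newline-joined "serialization" are assumptions made purely for illustration.

```rust
const PAGE_SIZE: usize = 4; // tiny for illustration; the thread uses about 1 MB

/// In-memory stand-in for a canister (hypothetical, not the author's code).
#[derive(Default)]
struct Canister {
    heap: Vec<String>, // stand-in for the live canister state
    stable: Vec<u8>,   // stand-in for stable memory
}

impl Canister {
    /// Serializes the heap into stable memory and returns the page count.
    /// Toy serialization: entries must not contain newlines.
    fn dump_to_stable(&mut self) -> usize {
        self.stable = self.heap.join("\n").into_bytes();
        (self.stable.len() + PAGE_SIZE - 1) / PAGE_SIZE
    }

    /// Returns one page of stable memory (empty if `page` is out of range).
    fn fetch_page(&self, page: usize) -> Vec<u8> {
        let start = (page * PAGE_SIZE).min(self.stable.len());
        let end = (start + PAGE_SIZE).min(self.stable.len());
        self.stable[start..end].to_vec()
    }

    /// Writes one page into stable memory, growing it if needed.
    /// Assumes the restore targets a fresh canister with empty stable memory.
    fn upload_page(&mut self, page: usize, bytes: &[u8]) {
        let start = page * PAGE_SIZE;
        let end = start + bytes.len();
        if self.stable.len() < end {
            self.stable.resize(end, 0);
        }
        self.stable[start..end].copy_from_slice(bytes);
    }

    /// Rebuilds the heap from stable memory in a single step.
    fn load_from_stable(&mut self) {
        let text = String::from_utf8_lossy(&self.stable).into_owned();
        self.heap = text.lines().map(str::to_string).collect();
    }
}
```

A backup is then `dump_to_stable` followed by one `fetch_page` per page; a restore is one `upload_page` per page followed by `load_from_stable`.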
As for safeguards: I’d probably add one where, for example, dump_to_stable bumps and returns a counter that’s then passed to fetch_page, to prevent you from accidentally downloading half the pages from an older backup and the other half from a newer one. This could happen if you query very quickly after the dumping, or if some other admin dumps or upgrades while you download. Extra bonus points if that token happens to be a hash of the whole stable memory (it can be calculated by dump_to_stable while writing); then you can verify the downloaded image. Similar for uploading.
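A sketch of the hash-as-token variant, again in plain Rust. The `Backup` struct and the error message are hypothetical, and FNV-1a is used only to keep the example dependency-free; a real implementation would presumably use a cryptographic hash such as SHA-256.

```rust
/// FNV-1a, a dependency-free stand-in for a real cryptographic hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Hypothetical holder of the dumped image and its token.
struct Backup {
    stable: Vec<u8>,
    token: u64,
}

impl Backup {
    /// Stores a new image and returns its hash, which doubles as the token.
    fn dump_to_stable(&mut self, state: &[u8]) -> u64 {
        self.stable = state.to_vec();
        self.token = fnv1a(&self.stable);
        self.token
    }

    /// Serves a page only if the caller's token matches the current image.
    fn fetch_page(&self, token: u64, offset: usize, len: usize) -> Result<Vec<u8>, String> {
        if token != self.token {
            return Err("stale token: the image changed since dump_to_stable".to_string());
        }
        let start = offset.min(self.stable.len());
        let end = start.saturating_add(len).min(self.stable.len());
        Ok(self.stable[start..end].to_vec())
    }
}
```

The client keeps the token from `dump_to_stable`, passes it to every `fetch_page`, and at the end hashes the assembled image and compares it with the token. Two dumps of identical state share a token, which is harmless because the pages are identical too.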
Maybe I’m missing something obvious, but why do you serialize the canister state to stable memory first before reading from and writing to it? Why not just operate on the canister state (in wasm linear memory) directly?
Why don’t you just use another canister for the backup?
Your data is already backed up by at least 7 nodes in the subnet. If you think something bad could happen to it, just flush all the data to another canister.
First, it is just more secure (7 nodes on the IC vs. 1 EC2 instance on AWS, or even your personal PC).
Second, it can be done in a permissionless manner. For example, you could enable your users to back up (or even publish in the first place) their articles if they care about their persistence. You could deploy a personal backup canister for each of them (on demand and not for free), where they could store them.
Moreover, one day you’ll definitely reach the point where your single-canister setup is not enough to store all the data your app has. This way you could get ahead of that situation.
At some point there was talk of forking canisters. I don’t remember how that ended up, but it would be insanely useful for backups and for scaling. It is much easier to copy a canister and delete the first half of the data on one copy and the second half on the other than to make roughly 4GB/2MB = 2000 inter-canister calls to move data from one to the other.
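For reference, the call count above is a ceiling division. The sketch below takes the 4GB and 2MB figures from the post as given; `calls_needed` is a hypothetical helper, and whether the units are decimal or binary changes the result slightly.

```rust
/// Number of calls needed to move `total_bytes` when each call can carry
/// at most `bytes_per_call` bytes (ceiling division).
fn calls_needed(total_bytes: u64, bytes_per_call: u64) -> u64 {
    assert!(bytes_per_call > 0, "payload size must be positive");
    (total_bytes + bytes_per_call - 1) / bytes_per_call
}
```

With decimal units, `calls_needed(4_000_000_000, 2_000_000)` gives 2000; with binary units (4 GiB over 2 MiB) it gives 2048.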