Canister backup

Dear DFINITY team,

I maintain a small canister (kind of a toy social network) and I’m getting more and more concerned about potential data loss and the ZERO support for canister data backup. So far, I was able to back up the data by serializing the entire canister state and exporting it via a query, but that obviously doesn’t scale. My canister recently started supporting picture uploads, and my poor man’s backup solution ran into its limits pretty quickly. So now there is no way for me to have a full backup of my canister anymore, and it’s pretty scary.

Of course I could start implementing my own backup solution: extracting the data in chunks (since one query can only return a limited amount of data) and then implementing a restoration system, which would consume the backup data via small ingress messages and assemble the full state again. But it feels like I’m starting to solve problems which do not even exist on Web 2.0 instead of actually building a decentralized service.

My question is: what is DFINITY’s roadmap regarding canister backups, and what are the deadlines? Thanks!

8 Likes

It is better than a toy. It is bad ass. Don’t lose my messages! I’m making mad ICP (.07 so far…woot!)

If you run across this message and haven’t checked it out, it is at https://6qfxa-ryaaa-aaaai-qbhsq-cai.ic0.app/. (I understand a desire to not self promote so I’ll do it for you) :slight_smile:

I’ve thought a good bit about this as well. I’ve considered wrapping a data class in an object that makes upgrades and backups super easy. I’ve been told the ability to upload/download canister state is coming and that should drastically reduce the vulnerability here, but I don’t know the timeline.

2 Likes

Haha, @skilesare thanks so much for the kind words! 0.07 ICP is not that much though: Taggr’s frequenters made enough money not just for a cup of coffee, but for a good bottle of wine :grin:

Back to our problem. I’ve just done the following:

  1. Implement an update function dumping the state to stable memory (in my case - I just call my pre_upgrade)
  2. Implement a query reading the stable memory page by page.
  3. Implement an update function writing the stable memory page by page.
  4. Implement an update loading the state from the stable memory (in my case - I just call post_upgrade).
  5. Implement a bash script automating all this blasphemy.
  6. Wait for an official way to back up the state :innocent:
5 Likes
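
The four canister-side steps in the list above can be sketched as a plain in-memory model. This is only an illustration under loud assumptions: the `Canister` struct, the `Vec<String>` heap, the length-prefixed encoding and the tiny `PAGE_SIZE` are all hypothetical, and a byte vector stands in for real stable memory (no IC system API is called). The method names follow the interface named later in the thread.

```rust
/// Tiny page size so the example spans several pages; the thread uses about 1 MB.
pub const PAGE_SIZE: usize = 16;

/// In-memory stand-in for a canister. `stable` models stable memory as a byte vector.
pub struct Canister {
    pub heap: Vec<String>, // hypothetical application state
    pub stable: Vec<u8>,
}

impl Canister {
    /// Step 1: serialize the heap into stable memory (the role pre_upgrade plays in the post).
    pub fn dump_to_stable(&mut self) {
        let mut out = Vec::new();
        for item in &self.heap {
            // Length-prefixed encoding: 4-byte little-endian length, then the bytes.
            out.extend_from_slice(&(item.len() as u32).to_le_bytes());
            out.extend_from_slice(item.as_bytes());
        }
        self.stable = out;
    }

    /// Number of pages a client has to fetch.
    pub fn page_count(&self) -> usize {
        (self.stable.len() + PAGE_SIZE - 1) / PAGE_SIZE
    }

    /// Step 2: read one page; the last page may be shorter than PAGE_SIZE.
    pub fn fetch_page(&self, page: usize) -> Vec<u8> {
        let start = page.saturating_mul(PAGE_SIZE).min(self.stable.len());
        let end = start.saturating_add(PAGE_SIZE).min(self.stable.len());
        self.stable[start..end].to_vec()
    }

    /// Step 3: write one page at its offset, growing stable memory if needed.
    pub fn upload_page(&mut self, page: usize, data: &[u8]) {
        let start = page * PAGE_SIZE;
        let end = start + data.len();
        if self.stable.len() < end {
            self.stable.resize(end, 0);
        }
        self.stable[start..end].copy_from_slice(data);
    }

    /// Step 4: rebuild the heap from stable memory; returns false on malformed data.
    pub fn load_from_stable(&mut self) -> bool {
        let mut rest: &[u8] = &self.stable;
        let mut items = Vec::new();
        while !rest.is_empty() {
            if rest.len() < 4 {
                return false;
            }
            let len = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]) as usize;
            if rest.len() - 4 < len {
                return false;
            }
            match String::from_utf8(rest[4..4 + len].to_vec()) {
                Ok(s) => items.push(s),
                Err(_) => return false,
            }
            rest = &rest[4 + len..];
        }
        self.heap = items;
        true
    }
}
```

A client (the bash script in step 5) would call `dump_to_stable` once, loop `fetch_page` over `0..page_count()`, and for a restore replay the pages through `upload_page` into an empty target before calling `load_from_stable`.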

Yep…that is what I was thinking.

That’s actually a pretty neat way, nicely piggy-backing on what’s needed for the upgrade story anyways.

If it’s not already there, I would add safeguards against corruption (partial writes or reads, dumping to stable memory while downloading, etc.). But otherwise a good solution!

1 Like

Yeah….we have a HALT mode that stops all updates other than backup and updateHALT from running. It would work here as well.
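
A halt switch like the one described here could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Update` enum, the `Service` struct and the `SetHalt` variant (standing in for `updateHALT`) are hypothetical names, not the actual implementation mentioned in the post.

```rust
/// Hypothetical update methods of a small service.
pub enum Update {
    AddPost(String),
    Backup,
    SetHalt(bool),
}

pub struct Service {
    pub halted: bool,
    pub posts: Vec<String>,
    pub backups_taken: u32,
}

impl Service {
    /// Dispatches an update; while halted, only Backup and SetHalt are accepted.
    pub fn handle(&mut self, update: Update) -> Result<(), String> {
        match update {
            Update::SetHalt(flag) => {
                self.halted = flag;
                Ok(())
            }
            Update::Backup => {
                // Placeholder for the real backup work.
                self.backups_taken += 1;
                Ok(())
            }
            // Every state-changing update is rejected while the flag is set.
            Update::AddPost(_) if self.halted => Err("canister is halted".to_string()),
            Update::AddPost(text) => {
                self.posts.push(text);
                Ok(())
            }
        }
    }
}
```

The point of routing every update through one guard is that the state cannot change between the dump and the last page download.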

Thanks @nomeata! Not sure I understand the partial writes concern, but maybe I’m missing something. My canister only touches the stable memory on upgrades, that’s it. So before I call the backup, I explicitly call a function that calls pre_upgrade, which atomically dumps the heap into the stable memory. After that, it’s not expected to change. The restoration also updates the stable memory first and then atomically deserialises the heap from it in a separate message. So I’m not sure why halting is needed, assuming I never upgrade and back up at the same time?

PS: There is obviously a possible race during the restoration where some updates might get lost, but the restoration is not an ordinary operation and is only required when some serious data loss or corruption has happened, so losing updates is not a real concern here.

1 Like

It sounds to me like you’ve done most of the work already! Why not take the time and make a small package / lib / blog post out of it, and maybe @skilesare can help out with a bounty from ICDevs for your troubles. win-win :wink:

3 Likes

Sounds like a good grant project…hint hint :wink:

I’m not sure it’s worth it. It’s still a poor man’s backup solution: I’m currently reading and writing the stable memory page by page (because query payloads are limited in size). I’ve set the page size to about 1 MB. I also need to base64-encode the strings. Now every query takes about half a second for one page! Restoring the backup will obviously take even longer. So for canisters with a heap of hundreds of megabytes, let alone gigabytes, this won’t really work.

1 Like

I’m currently reading and writing the stable memory page by page […] I’ve set the page size to about 1 MB

I remember Rick from dscvr saying something along the same lines. I believe they figured out that you start with a larger page size: if the call succeeds you continue, and if not you retry the query with a smaller page size.
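
The retry idea can be sketched as a small client-side loop. This is a guess at the approach, not dscvr’s actual code: `download_adaptive`, its parameters, the halving step and the `fetch(offset, len)` callback (standing in for the real query call) are all hypothetical.

```rust
/// Downloads `total_len` bytes through `fetch(offset, len)`, starting with a large
/// page size and halving it whenever a call fails, down to `min_page`.
pub fn download_adaptive<F>(
    total_len: usize,
    initial_page: usize,
    min_page: usize,
    mut fetch: F,
) -> Result<Vec<u8>, String>
where
    F: FnMut(usize, usize) -> Result<Vec<u8>, String>,
{
    let floor = min_page.max(1);
    let mut page = initial_page.max(floor);
    let mut out = Vec::with_capacity(total_len);

    while out.len() < total_len {
        // Never ask for more than what is left.
        let want = page.min(total_len - out.len());
        match fetch(out.len(), want) {
            Ok(chunk) => {
                if chunk.len() != want {
                    return Err(format!("expected {} bytes, got {}", want, chunk.len()));
                }
                out.extend_from_slice(&chunk);
            }
            Err(e) => {
                // Give up once the smallest allowed page size has failed too.
                if page <= floor {
                    return Err(e);
                }
                page = (page / 2).max(floor);
            }
        }
    }
    Ok(out)
}
```

Once a page size works, the loop keeps using it for the rest of the download, so the failed probes are only paid at the start.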

As for converting to base64, is that really necessary? If you go the Rust way, you should be able to query a vec of bytes, right?

2 Likes

Yes, that’s what I started with. The problem is that the blob returned by dfx is still encoded (to be printable, I guess), and the encoded blob is then larger than what I can pass to dfx as a command line arg when I use it to restore the state (e.g. on my local replica).

It’s probably worth replacing the shell script with a proper backup client using the Rust agent; then no binary data needs to be pretty-printed.

It seems that the interface (dump_to_stable, fetch_page, upload_page, load_from_stable) is actually generic enough so that this tool could be used by anyone.

As for safeguards: I’d probably add safeguards where, for example, dump_to_stable bumps and returns a counter that’s then passed to fetch_page, to prevent you from accidentally downloading half the pages from an older backup and the other half from a newer one. This could happen if you query very quickly after the dumping, or if some other admin dumps or upgrades while you download. Extra bonus points if that token happens to be a hash of the whole stable memory (it can be calculated by dump_to_stable while writing); then you can check the downloaded image. Similar for uploading.
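
A minimal sketch of that token idea follows, with several assumptions labeled: `Snapshot`, `Token` and `verify` are hypothetical names, stable memory is modeled as a byte vector, and a hand-rolled FNV-1a checksum stands in for a real cryptographic hash such as SHA-256, which an actual tool would want.

```rust
/// FNV-1a 64-bit checksum; a non-cryptographic stand-in for a real hash.
pub fn checksum(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Returned by dump_to_stable and required by fetch_page.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token {
    pub generation: u64,
    pub checksum: u64,
}

/// In-memory model of the dumped image plus a dump counter.
pub struct Snapshot {
    stable: Vec<u8>,
    generation: u64,
}

impl Snapshot {
    pub fn new() -> Self {
        Snapshot { stable: Vec::new(), generation: 0 }
    }

    /// Stores a new image, bumps the counter and hashes the image.
    pub fn dump_to_stable(&mut self, state: &[u8]) -> Token {
        self.stable = state.to_vec();
        self.generation += 1;
        Token { generation: self.generation, checksum: checksum(&self.stable) }
    }

    /// Refuses to serve pages for a token from an older dump.
    pub fn fetch_page(&self, token: Token, page: usize, page_size: usize) -> Result<Vec<u8>, String> {
        if token.generation != self.generation {
            return Err("stale token: the image was dumped again".to_string());
        }
        let start = page.saturating_mul(page_size).min(self.stable.len());
        let end = start.saturating_add(page_size).min(self.stable.len());
        Ok(self.stable[start..end].to_vec())
    }
}

/// Client-side integrity check of the assembled download.
pub fn verify(token: Token, image: &[u8]) -> bool {
    checksum(image) == token.checksum
}
```

With this shape, a download that straddles two dumps fails loudly on the first stale `fetch_page` call, and a complete download can still be checked against the checksum in the token.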

2 Likes

Great inputs, thanks a lot @nomeata! If I get a bit more time, I’ll write a small agent-rs based tool to avoid the encoding issues and definitely add the integrity check. Then I’ll open-source it.

1 Like

It may be worth adding the ability for icx to stream the data from stdin instead of requiring it be passed as a parameter, for Bash script usability.

Maybe I’m missing something obvious, but why do you serialize the canister state to stable memory first before reading from and writing to it? Why not just operate on the canister state (in wasm linear memory) directly?

Not sure how you imagine this? Literally reading the heap page by page? But the heap might be changing under your feet while you’re doing it, right?

Hey there!
Why don’t you just use another canister for the backup?
Your data is already backed up by at least 7 nodes in the subnet. If you think something bad could happen to it, just flush all the data to another canister.

First, it is just more secure (7 nodes on IC vs. 1 EC2 instance on AWS, or even your personal PC).
Second, it can be done in a permissionless manner. For example, you could enable your users to back up (or even publish in the first place) their articles if they care about their persistence. You could deploy a personal backup canister for each of them (on demand and not for free), where they could store them.

Moreover, one day you’ll definitely reach the point where your single-canister setup is not enough to store all the data your app has. This way you could front-run this situation.

At some point there was talk of forking canisters. I don’t remember how that ended up, but it would be insanely useful for backups and for scaling. It is much easier to copy a canister and delete the first half of the data on one copy and the second half on the other than to do 4 GB / 2 MB = 2000 inter-canister calls to move data from one to another.

2 Likes

It actually doesn’t matter whether the platform clones the whole canister to another subnet or you do that with inter-canister calls. For the network it is exactly the same amount of load.

But cloning based on inter-canister calls is available right now, and it allows you to spread that load over time.