The IC vs Arweave/IPFS for Distributed Data Storage

Following up on a debate that came up on Twitter about the tradeoffs between using the IC for distributed data storage and using something like Arweave/IPFS. I think this is an interesting topic, so I wanted to share some more of my thoughts on it. A basic question raised by @lastmjs was: why would you use something like Arweave (or IPFS) as a data storage layer when the IC provides both storage and computation?

Two possible answers I can think of:

  1. You don’t have to deal with WASM memory limits, or even subnet capacity limits, on Arweave/IPFS. On the IC, you can work around these limits by breaking up your data across multiple canisters, but then you have to make inter-canister query calls, which are currently slow. Your code also tends to get a lot more complex when you have to do this. Hopefully this situation will improve soon, but I still think it counts as a tradeoff as of today.
  2. Some applications need to store lots of data that isn’t frequently accessed. Highly regulated industries like health care and finance often have to keep 5-10 years of customer data for compliance purposes but then never touch it. The major cloud providers offer specialized, low-cost storage options for this use case, such as Amazon S3 Glacier. However, all nodes on the IC have a consistent and fairly beefy server spec, since they are designed for both general purpose storage and computation. I don’t really understand how the IC can compete as a low-cost storage option with such expensive servers.

To be clear, I believe that the IC is the best general purpose blockchain solution. As far as I know, there are no other general purpose blockchain solutions for both scalable storage and computation. However, I am highly skeptical of any “one size fits all” solution. A typical tradeoff that comes whenever you attempt to engineer a more general purpose system is that you tend to give up some specific edge case optimizations. I think the IC is no different. I think the IC will need to be complemented by other technologies to fill some gaps, and I think that some form of distributed disk storage like Arweave/IPFS will likely be one of them.

Would love to hear thoughts from others on this. What are some other pros/cons of augmenting the in-memory data storage native to the IC with a distributed disk storage layer like Arweave/IPFS? Will composing the two technologies be a useful pattern in the future or will all data and all smart contracts eventually move towards living entirely on the IC? Are there any developers here who have faced this choice and want to share their experience?


Very good points. I agree on point 2. I would give much less weight to point 1 though, since improvements are relatively imminent. Canisters will scale up to possibly 300 GB (possibly more in the future), and any simple storage solution would abstract away the multi-canister complexities, offering a simple API for storage. I don’t see why an S3-like API wouldn’t be possible today, or at least a functioning prototype.
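
To make the "abstract away the multi-canister complexities" idea concrete, here is a minimal sketch in plain Python. It is not IC code: `Bucket` stands in for a storage canister, and `ShardedStore`, `BUCKET_CAPACITY`, and the `put`/`get` names are all hypothetical, chosen only to show how an S3-like facade could hide which canister holds which chunk.

```python
from dataclasses import dataclass, field

# Hypothetical per-bucket capacity. Kept tiny so the behavior is easy to follow;
# a real canister limit would be many orders of magnitude larger.
BUCKET_CAPACITY = 8  # bytes


@dataclass
class Bucket:
    """Stands in for one storage canister with a fixed capacity."""
    chunks: dict = field(default_factory=dict)
    used: int = 0

    def free(self) -> int:
        return BUCKET_CAPACITY - self.used


class ShardedStore:
    """S3-like put/get facade that hides which bucket holds which chunk."""

    def __init__(self):
        self.buckets = [Bucket()]
        self.index = {}  # key -> ordered list of (bucket_id, chunk_id)

    def put(self, key: str, data: bytes) -> None:
        # Note: overwriting a key does not reclaim its old chunks in this sketch.
        locations = []
        offset = 0
        while offset < len(data):
            bucket = self.buckets[-1]
            if bucket.free() == 0:
                # Current bucket is full: "spin up" another one.
                bucket = Bucket()
                self.buckets.append(bucket)
            piece = data[offset:offset + bucket.free()]
            chunk_id = len(bucket.chunks)
            bucket.chunks[chunk_id] = piece
            bucket.used += len(piece)
            locations.append((len(self.buckets) - 1, chunk_id))
            offset += len(piece)
        self.index[key] = locations

    def get(self, key: str) -> bytes:
        # Reassemble the blob from its chunks in their original order.
        return b"".join(self.buckets[b].chunks[c] for b, c in self.index[key])
```

The caller only ever sees `put` and `get`; the chunk index is the piece that would have to live somewhere authoritative, and each chunk fetch would correspond to a cross-canister call in a real deployment, which is where the latency concern from point 1 comes back in.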

So if you’re choosing today which to use, I agree the IC is probably not the right choice for distributed storage if you’re trying to go into production. But in the long run, assuming costs can be managed, the IC may provide an elegant mass storage solution.


Nice post. On point 2: for data that needs to be stored securely and is accessed very infrequently (e.g., AWS S3 Glacier), I think Swarm provides the best decentralized storage solution.


I’ll add that if you need your data to be accessible via contextual computation, then you may want to consider the IC as well…the other platforms can’t do that. I.e., files whose access is gated by social, financial, or other data context.

Maybe other platforms can do some simple gating of content, but they also probably have to duplicate the data to make it accessible to those different role sets via a different encryption per role. The IC can host the highest res copy of the file and hand out an infinite number of context sensitive views of that data via real time computation.


Agreed. Maybe a third point or just a better overall point for 1 is that as of today, some of the tooling is just further along for some of these other technologies. For example you can stream data directly from S3 to IPFS. This sort of thing can, should, and likely will be done on the IC too, as you mentioned. But if you just want to get an application up quickly today and you have a lot of data, using something like IPFS could be more of an out of the box solution for pure storage.

Edit: Thought I should also mention that Fleek already has a lot of cool tools for working with S3 and IPFS too. They can also help you deploy frontend canisters to the IC in like three button clicks. Would be exciting to see them come out with more support for backend canisters.


I also remember reading somewhere that there are plans for different subnet types, such as a subnet type optimized for storage, so that storage in canisters on those subnets will be cheaper.


Let me make sure I understand this. Basically what you’re saying is that I can store a single copy of some (potentially sensitive) data on the IC and then make that single copy available to different end users with different levels of access. I can limit the fields they are allowed to view, serve anonymized versions of the sensitive fields, etc., or I can allow access to the sensitive data if the end user/application has permission. In any event, I can verify who is requesting the data and then compute an appropriate copy to hand out in real time. Am I getting this right? Also, can I process those different access requests in parallel as long as they are query calls?

Correct…or, if it is media, you can manipulate the media.
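
The single-copy, many-views idea confirmed above can be sketched roughly as follows. This is plain Python rather than canister code, and every name in it (`RECORD`, `POLICY`, `mask`, `view_for`, the role names) is hypothetical. The truncated hash used for masking is only a stand-in for whatever anonymization a real application would need.

```python
import hashlib

# One stored copy of a (hypothetical) sensitive record.
RECORD = {
    "name": "Alice Example",
    "email": "alice@example.com",
    "balance": 1250,
    "country": "CH",
}

# Hypothetical policy: which fields each role sees in the clear, and which masked.
# Fields listed in neither set are omitted from that role's view.
POLICY = {
    "owner": {"clear": {"name", "email", "balance", "country"}, "masked": set()},
    "analyst": {"clear": {"balance", "country"}, "masked": {"name", "email"}},
    "public": {"clear": {"country"}, "masked": set()},
}


def mask(value) -> str:
    """Replace a value with a short, stable pseudonym.

    An unsalted hash is NOT real anonymization; it is used here only
    to make the masked fields visibly different from the originals.
    """
    return hashlib.sha256(str(value).encode()).hexdigest()[:8]


def view_for(role: str, record: dict) -> dict:
    """Compute the view of `record` that `role` is allowed to see."""
    # Unknown roles fall back to the most restrictive view.
    rules = POLICY.get(role, POLICY["public"])
    view = {}
    for key, value in record.items():
        if key in rules["clear"]:
            view[key] = value
        elif key in rules["masked"]:
            view[key] = mask(value)
    return view
```

Because `view_for` only reads the record and builds a new dict, many such requests could in principle be served side by side without touching the stored copy, which matches the question about handling them as query calls.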


Yes. Although you’ll probably be using Rust for all that.

Interesting. Why do you say that?

That is very, very cool. Would probably be useful if I collect a list of advantages of using the IC for data storage and add those to the original post as well. I’ll work on that.


I mean, technically, if IC storage costs are still too high, we could all vote to lower it via an NNS proposal.

Not sure about the tokenomics implications of that.


But wouldn’t it be wasteful to store some kinds of data on the IC currently?

For example, a video streaming service would store multiple encoded copies of a video at different resolutions. And imagine millions of people uploading videos daily. That kind of data shouldn’t live in RAM, as it currently does in IC canisters. It would be unnecessarily costly, and the disk layer of the hardware would go unused.

The way I currently look at IC canister storage is like database storage. If I were bringing current architecture ideals to the IC, the data that requires quick edits or lookup but is mostly textual or numeric in nature would go to canister storage. Things that would have gone into my database.

But data that needs large storage and streaming capabilities, like video or images, is probably better served from a different hardware storage mechanism (disks). Ideally, this is where storage subnets could come in. Hardware connected to the network with lower processing capability but loads of storage would make up these subnets.
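
A rough sketch of that split, in plain Python: small, database-like values stay in "canister" storage, while large blobs go to a separate bulk layer and only a content hash is kept alongside the key. `TieredStore`, `INLINE_LIMIT`, and the two dicts are hypothetical stand-ins, and the SHA-256 addressing is just similar in spirit to content-addressed systems, not any real IPFS or Arweave identifier format.

```python
import hashlib

# Hypothetical cutoff: payloads larger than this go to the bulk tier.
INLINE_LIMIT = 1024  # bytes


class TieredStore:
    def __init__(self):
        self.canister = {}  # stands in for canister (database-like) storage
        self.bulk = {}      # stands in for a disk-backed layer, keyed by content hash

    def put(self, key: str, data: bytes) -> str:
        """Store `data` and return which tier it landed in."""
        if len(data) <= INLINE_LIMIT:
            self.canister[key] = ("inline", data)
            return "inline"
        digest = hashlib.sha256(data).hexdigest()
        self.bulk[digest] = data
        # Only the small, fixed-size reference is kept in canister storage.
        self.canister[key] = ("ref", digest)
        return "ref"

    def get(self, key: str) -> bytes:
        kind, value = self.canister[key]
        if kind == "inline":
            return value
        data = self.bulk[value]
        # Verify the bulk layer returned what was originally stored.
        if hashlib.sha256(data).hexdigest() != value:
            raise ValueError("bulk data does not match stored hash")
        return data
```

Keeping the hash next to the key means the cheaper layer does not have to be trusted: anything it returns can be checked against the reference before it is handed on.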

Just thinking aloud. Still learning about this stuff. Please correct me if my understanding is incorrect.


A storage subnet definitely makes sense.

I wonder how much of the IC replica binary uses RAM versus disk? I know the node requirements ask for a lot of RAM, but wouldn’t stable memory get orthogonally persisted to disk? Not sure what is stored where currently.

I would definitely prefer to have everything stored in IC in an ideal world. After all, that’s a major selling point of the IC blockchain… it’s not only compute, it’s compute + storage.
