Enquiry: How many Terabytes of Data is DFINITY hosting on AWS?

For research purposes, we would like to know how many terabytes of data DFINITY is hosting on Amazon Web Services.

Questions

  • Why is DFINITY hosting data on AWS?
  • What kind of Data?
  • What are the Operational Costs?

Background:

  1. Access denied on ledger canister wasm - #3 by sat
  2. Transparency within the Dfinity Foundation - #79 by Leamsi

We would like answers from DFINITY. If we are mistaken about this, we would be happy to learn, whether DFINITY does or does not use AWS.

6 Likes

As you can read from Saša’s answer, we do use AWS. Since the file that was no longer accessible was more than a year old and the new retention period is 6 months, I would guess that ~50% less storage is now used. AFAIK it’s mostly build artefacts that you can produce yourself from the public repos.

I doubt the costs will be made public. Just out of curiosity: why do you want to know about AWS usage specifically?

6 Likes

It’s a public blockchain, and we’re asking out of curiosity and for research purposes. Just to clarify again:

Is DFINITY hosting 500 TB of data on AWS, and is that a backup of IC state?

The parts I personally know of are not a backup of IC state. It’s build artefacts from the big monorepo.

3 Likes

Let me preface this by saying, I understand there are multiple reasons why any organization would need to use AWS. It’s obviously heavily used by almost anyone in the web space and has a ton of use cases.

With that said, in my opinion, it doesn’t “look” good when a foundation uses a company/service they are directly competing against. When I (as a user/non-developer) see marketing material saying the IC can do everything AWS can do, but better, and then see the foundation who is promoting the IC using AWS, it makes me wonder “Why aren’t they using their own service if it’s so great?”

Like I said, I understand there can be any number of reasons, technical and non-technical why this may be occurring. I think ICP tech is awesome and am trying to use it whenever/wherever I can. But again, as a layperson, it doesn’t make the foundation “look” like they are willing to put their money where their mouth is and dump AWS to use their own product if it has all the capabilities of AWS right now. Just my opinion.

12 Likes

Guys, if anyone needs to store a backup of blockchain state or blockchain history, there are a couple of projects that are almost production-ready and designed for exactly THIS use case:

https://www.kyve.network/ - for trustless, decentralized data upload to Arweave via Irys(bundlr): https://irys.xyz
It’s already live :)

Such a solution lets you permanently archive critical information, such as blockchain state/history, in a trustless, decentralized way. Pay once, store forever.
But I have a feeling the community doesn’t know about these technologies.

2 Likes

It’s important to be honest about why officials choose to hide their dependence on AWS while also bashing AWS. Where is the trust if you choose to hide it?

1 Like

Well said. It is important to disclose off-chain dependencies and be more transparent about the network.

2 Likes

I’m not fully familiar with the whole “DFINITY storing data on AWS” topic. But in my opinion it’s not that weird, as long as they don’t run the IC or store on-chain data on AWS.

I assume Microsoft also uses services from Google or Amazon and vice versa while they are also competitors in the hosting space.

3 Likes

It doesn’t currently have all the capabilities of AWS. I am a layperson and I know that. There is potential that it will in the coming years or decade.

1 Like

I’ve seen games built on the IC, web sites, social media dapps, a data storage app, digital marketplaces, etc. The only thing I can think of off the top of my head that I haven’t seen is a store that sells/ships physical goods.

I’m genuinely curious, what can’t it do that AWS can? Like I said originally, I’m sure there are some things, and that’s why Dfinity would be using it, but I don’t know what they are.

Totally fair opinion and I agree with most of it. I think once storage subnets are a thing this could change pretty quickly.

Related question: Do you think such data should be hosted on-chain? While the IC is an extremely high-availability system, is it really optimal to use it to store everything? I’ve heard of more than a few situations (not related to the IC) where the service that was down was also hosting things that were needed to get it back up and running. Or incidents where the status page is down when the service is down.

Thank you for the links! Do you know how much these storage options cost? I couldn’t find it after a bit of looking around…

While I see what you’re getting at, I don’t think it’s fair to say ‘choose to hide’. The data on AWS is relatively unimportant (build artefacts are recoverable from the source code) and AWS going down does not affect mainnet at all.

Store very large amounts of data, and make it cheap. The latest node hardware spec demands 32TB disks. Assuming all of this is available to store data on-chain (it isn’t, but we’ll skip over that for now) and given the 36 subnets we have right now, the capacity of the IC is 36 * 32TB = 1152TB, or roughly 1100TB. I don’t think DFINITY should hog more than half the capacity of the IC.

Also, cost is a factor. I don’t know AWS costs, but since there is less replication and (some of) their systems are specifically built to store data, it is a lot cheaper. Assuming AWS is 5x cheaper, that would free up ~2.4M USD per year to fund additional development (600TB * 5$ / GB / year is ~3M USD, versus ~0.6M USD at a fifth of that price).
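For anyone who wants to check the back-of-the-envelope numbers above, here is a minimal Python sketch. Everything in it is an assumption taken from the post rather than an official figure: 36 subnets, 32TB of usable disk per subnet, 600TB of stored data, 5$ per GB per year, AWS being 5x cheaper, and decimal units (1TB = 1000GB). The function names are made up for illustration.

```python
# Rough estimate only; every input is an assumption quoted from the post above.
GB_PER_TB = 1000  # decimal units, assumed for simplicity


def ic_capacity_tb(subnets: int = 36, disk_tb: int = 32) -> int:
    """Upper bound on capacity if every subnet could use a full 32TB disk."""
    return subnets * disk_tb


def yearly_cost_usd(stored_tb: float, usd_per_gb_year: float) -> float:
    """Yearly cost of keeping `stored_tb` terabytes at a flat per-GB price."""
    return stored_tb * GB_PER_TB * usd_per_gb_year


def yearly_savings_usd(stored_tb: float = 600.0,
                       usd_per_gb_year: float = 5.0,
                       cheaper_factor: float = 5.0) -> float:
    """Difference between the flat price and a provider `cheaper_factor`x cheaper."""
    expensive = yearly_cost_usd(stored_tb, usd_per_gb_year)
    cheap = expensive / cheaper_factor
    return expensive - cheap


if __name__ == "__main__":
    capacity = ic_capacity_tb()
    print(f"Assumed IC capacity: {capacity} TB")
    print(f"Share taken by 600TB: {600 / capacity:.0%}")
    print(f"Cost at 5$/GB/year: {yearly_cost_usd(600, 5.0):,.0f} USD")
    print(f"Savings if 5x cheaper: {yearly_savings_usd():,.0f} USD")
```

Changing `cheaper_factor` or the per-GB price shows how sensitive the ~2.4M USD figure is to those two guesses.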

With that I was also asking why nobody is asking about GCP or Azure. (AFAIK we don’t use these at all, but why do people only care about AWS?)

11 Likes

Hey @Severin, thanks for the reply! I think data should be stored wherever makes the most business sense for the organization storing it. If that’s AWS right now for Dfinity, then go with it. My comments are purely based on optics. If it’s not economically viable to store the amount of data on the IC that you need to store because of current limitations, so be it. But it just doesn’t “look” good (and this is just my opinion) when marketing materials make it seem like AWS is legacy tech and people can replace it right now with the IC. From the two images below from https://deck.internetcomputer.org, that’s the impression I got.


Also, I fully understand marketing includes things that are not currently possible and will be available at some point down the road. I respect that. Personally, I think if you were using a provider other than AWS, even fewer people would care.

To me, this is just because AWS is the service that has the most “visibility”, it’s a company/service that most people have heard of in the media, even just in passing, and it’s someone that Dfinity is/will be competing against, even in their marketing material. I’m sure GCP or Azure have had outages, but when AWS has one, it’s “newsworthy” just because of the name.

I wholeheartedly agree with allocating resources where they’re necessary. Definitely seems like a good tradeoff. As I said, my whole viewpoint and comments are based on optics, nothing more. Thanks again!

3 Likes

Arweave is ~3.5$/GB FOREVER. Kyve and Irys will add a fixed fee on top of this, but I’m pretty sure the total is below 5$/GB.
You pay just once and the data is stored forever on Arweave.

Storing things forever is a nice promise, but it’s pretty much the opposite of the use case DFINITY is using AWS for. We create vast volumes of build artifacts through CI jobs. These need to exist for a little while for testing and development, but most of them can be discarded soon after.

We’ve already transitioned other temporary storage, such as build previews for internetcomputer.org and so on, to hosting on the Internet Computer, but to date there still isn’t a great solution for the kind of scale that these ephemeral multi-GB files call for.

11 Likes

Have you seen where the Arweave nodes are hosted at https://viewblock.io/arweave/nodes? AWS and Hetzner look pretty common.

Also, from the Arweave yellow paper: “The Arweave protocol avoids making it an obligation to store everything, which in turn allows each node to decide for itself which blocks and transactions to store.” My understanding of that statement is that any node provider can choose not to store your information if they don’t want to. Just like most other protocols.

3 Likes

ICP isn’t using AWS.

@Apollon,

The Internet Computer doesn’t rely on AWS; however, we do utilize AWS S3 as a data store for build artifacts. It’s important to clarify that this reliance on AWS is not absolute. We could employ any S3-compatible data storage solution with an HTTPS interface. The choice of AWS S3 is primarily for convenience. Notably, in recent weeks, we have begun pushing IC release artifacts to GitHub as well, and we may explore other storage options in the future.

Currently, we have approximately 500TB of IC build artifacts stored on AWS S3. Unfortunately, we cannot disclose the cost publicly due to confidentiality reasons.

@let4be,

One unique aspect of the IC is the privacy of on-chain data. While this may or may not be the primary differentiator in the future, it is a key consideration for us now. The decentralized nature of subnet nodes, spread across independent nodes globally, ensures the safety and integrity of data against malicious actors. Sharing block data, such as through backups to platforms like Arweave, has irreversible consequences. Once data becomes public, there’s no turning back. To maintain this privacy, we create subnet backups on private machines, accessible only to a select few individuals. Even I do not have access to these machines. Simultaneously, we are actively exploring better methods to ensure privacy and data backups. Storing encrypted data on public blockchains is not a viable option, as it would offer minimal value.

@hehe,

It’s essential to clarify that we are not attempting to hide our reliance on AWS. As mentioned earlier, AWS is a tool in our toolkit, serving as a temporary solution until we transition away from it entirely. Currently, our dependence on AWS is relatively minimal compared to many other blockchain projects.

8 Likes

Blockchain data needs to be backed up, and then the backup is stored on private machines, are you serious?

1 Like

Hahaha. “Disaster Recovery” is a disaster.

1 Like