In what way is persisted data protected from being obtained, e.g. by a malicious actor breaking into my datacenter and gaining physical access to the nodes running in my racks? It might sound like a far-fetched scenario, but I'm eager to understand the underlying mechanics of how secrecy is ensured with the way orthogonal persistence is implemented.
This is a good point, and also a tough problem. I think to compete with the large service providers, persisted data would need to be encrypted, so that if someone got access to the canister all they would get is encrypted data.
Well, Dominic did mention multiple times that they will be using SEV-ES, so this bodes well for confidential computation and data-at-rest capabilities that would prevent even someone with physical access to the hardware from snooping on the data.
SEV-ES gives some guarantees that the machine is running the intended code and that the machine operator, or anyone with physical access, can't peek at the contents of the encrypted VM. In theory it gives you both integrity and confidentiality.
Different users may feel more or less confident in the degree of protection from SEV features. (There have been quite a few CPU vulnerabilities in recent years.) Even aside from CPU flaws, this will need a lot of new software from Dfinity that may have bugs, and that has no track record.
But apparently Dom feels confident it will prevent data leaks.
The thing that confuses me is this: if Dfinity trusts that SEV protects against malicious DCs, why do they need the expensive consensus mechanism? You could just run the code on any one machine, and if you trust SEV then you have a guarantee that the code is executed correctly.
Dom seems to be taking the position that SEV can be trusted for confidentiality but not for integrity, which is hard to understand. If you don’t control the software that sees the plaintext state, how do you know it keeps it confidential?
So is this question resolved, or still outstanding? I need complete confidentiality.
As far as I can tell there are no confidentiality guarantees. Whichever anonymous operator is running the node where your canister happens to get scheduled can read your data.
It gets even more crazy with Badlands introducing many more amateur operators.
Wow - that is BAD!
This could stop me from using IC for the project.
Have you or anyone else had a chance to think of a solution to this situation?
Sounds like this is not even being considered by anyone?
So it seems like this problem has not been addressed and solved yet, is this correct?
So is the solution to encrypt the data yourself before it is persisted?
Maybe there is another solution?
I don’t work on IC or for Dfinity. I agree it’s a huge concern about using it for anything but very low value services or serving public data.
Yes, very difficult for me, because I like the IC and wanted to use it, but cannot for this project.
But I will be able to use the IC for a different project later, one that is not as confidential as this one.
I left a reply on Dom's tweet about SEV-ES; I'll post the response here if he replies.
I think deeply about these topics and have discussed them in many places. DM me if you need help.
The only path right now is implementing end-to-end encryption yourself.
There is the encrypted-notes-dapp example in the Dfinity repo that shows one way to do it.
Your inputs and outputs are on the subnet log as well, so you need end-to-end encryption. One method to rule this out would be to have some kind of multiparty compute that behaves like t-ecdsa and can operate on encrypted data, or in an enclave.
I'd say we are a couple of years from that. In the meantime there is lots of tooling to build.
e2e encryption is great for the limited situations where it works, although obviously it precludes most interesting server side processing.
In this example it seems IC is just used as dumb untrusted storage. That works, but wouldn’t it work equally well on a public cloud?
@lastmjs I did skim and search your huge post history but could not find a previous answer. Would you be so kind as to link to the posts where you address this?
I’ll just post from a DM since it would probably be useful to all:
If you need extreme levels of certainty, then the IC probably won’t work for your use-case yet. Even with secure enclaves, there is the possibility of the node operators breaking in. And consider that each subnet ideally has many independent node operators, so you run the risk of each node operator breaking in independently of the others.
A few technologies may help in the future:
- Node shuffling/rotation
- Secure enclaves
- Secure multi-party computation
- Homomorphic encryption
Only the first two are likely to be implemented any time soon. The last two are still developing technologies in general.
So, you have to consider the risks in light of the current architecture. On AWS you have two main legal entities that have access to data: yourself and Amazon. I believe legal agreements are in place that would help prevent abuse by Amazon, but technically they can peek in many cases.
On the IC, canisters don't have individual legal agreements with node operators. There may be a legal agreement between DFINITY and node operators, but I'm not sure if that will remain in the future.
You can encrypt data client-side, but the main difficulty there is storing the encryption key securely and with a good UX. Where do you put the encryption key? And if the data is encrypted inside of the canisters, it’s very difficult to compute on the data. So you break many use cases.
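To make the client-side encryption idea a bit more concrete, here is a minimal sketch in Rust, assuming the aes-gcm crate; the function names (encrypt_for_upload, decrypt_after_download) are just placeholders of mine, not anything from an IC SDK. The canister, and therefore every node operator, would only ever see the ciphertext and nonce. How the 32-byte key is stored and synced across devices is exactly the open question above, and this sketch leaves it open.

```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm, Key, Nonce,
};

/// Encrypt `plaintext` client-side before it is ever sent to a canister.
/// Returns (nonce, ciphertext); both may be stored on the IC, the key must not be.
fn encrypt_for_upload(key_bytes: &[u8; 32], plaintext: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let key = Key::<Aes256Gcm>::from_slice(key_bytes);
    let cipher = Aes256Gcm::new(key);
    // A fresh 96-bit nonce per message; it is not secret, only unique.
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let ciphertext = cipher.encrypt(&nonce, plaintext).expect("encryption failed");
    (nonce.to_vec(), ciphertext)
}

/// Decrypt data fetched back from the canister, again purely client-side.
fn decrypt_after_download(key_bytes: &[u8; 32], nonce: &[u8], ciphertext: &[u8]) -> Vec<u8> {
    let key = Key::<Aes256Gcm>::from_slice(key_bytes);
    let cipher = Aes256Gcm::new(key);
    cipher
        .decrypt(Nonce::from_slice(nonce), ciphertext)
        .expect("decryption failed")
}
```

And as said above, once data only exists as ciphertext inside the canister, computing on it server-side is mostly off the table.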
There is actually another way of making your data hidden from individual malicious node operators.
This is not some proven algorithm with published papers behind it, so please take it with a grain of salt. And it does not prevent node operators from reading your data, but it makes it much harder to do. With this protocol applied, node operators have to cooperate in order to read your data.
Imagine you have a data entry, and a set (a cluster) of K canisters providing identical functionality. All of these canisters implement the following two functions:
type Chunk = Vec<u8>; // opaque chunk bytes (placeholder representation)
type ChunkId = u64;   // random chunk id (placeholder representation)

#[update]
fn upload_chunk(chunk: Chunk, chunk_id: ChunkId) {
    // persists the chunk, indexing it with the provided chunk id
}

#[query]
fn get_chunk(chunk_id: ChunkId) -> Chunk {
    // looks up and returns the chunk stored under the provided chunk id
    todo!()
}
When you want to upload some data (a file, for example), first you split this data into N chunks. Then you apply an erasure code function (e.g. Reed-Solomon or Fountain codes) to these chunks. This function gives you two sets of chunks: the original chunks and parity chunks. The idea is to only use the parity chunks (so in the case of Reed-Solomon you want to generate N of them; in the case of Fountain codes, you want to skip the first N of what it generates).
These parity chunks are special. Each such chunk by itself tells you very little about your original data, but if you collect N of them, you will (with high probability) be able to reconstruct the original data. If you can't collect all N parity chunks, then you won't be able to reconstruct your data.
So, once you have these parity chunks, you split them into K groups and send each group to a distinct canister in the cluster (so each canister only receives N/K chunks from you). To generate chunk ids you just use random values. Of course, you have to remember which chunk ids correspond to the file you've just uploaded (you can store this index locally for maximum security, but it is also possible to create a separate "indexing" canister where users store these indices).
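To illustrate the split-and-encode step, here is a rough sketch in Rust using the reed-solomon-erasure crate (my choice of library, nothing prescribed above); the function name encode_for_upload, the zero-padding scheme, and the random u64 ids are all placeholder decisions, and real code would also have to remember the original length and the (canister id, chunk id) mapping in the index.

```rust
use rand::random;
use reed_solomon_erasure::galois_8::ReedSolomon;

/// Split `data` into N data shards, derive N parity shards from them,
/// and return ONLY the parity shards, each tagged with a random chunk id.
fn encode_for_upload(data: &[u8], n: usize) -> Vec<(u64, Vec<u8>)> {
    assert!(!data.is_empty() && n > 0);
    let shard_len = (data.len() + n - 1) / n;

    // N data shards (zero-padded to equal length) followed by N empty parity shards.
    let mut shards: Vec<Vec<u8>> = data
        .chunks(shard_len)
        .map(|c| {
            let mut s = c.to_vec();
            s.resize(shard_len, 0);
            s
        })
        .collect();
    shards.resize(2 * n, vec![0u8; shard_len]);

    // Compute the parity shards from the data shards in place.
    let rs = ReedSolomon::new(n, n).unwrap();
    rs.encode(&mut shards).unwrap();

    // Discard the data shards; only the parity shards leave the client.
    shards
        .into_iter()
        .skip(n)
        .map(|parity| (random::<u64>(), parity))
        .collect()
}
```

Each (chunk id, parity chunk) pair would then be split into the K groups and sent to the cluster via upload_chunk, as described above.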
In total what do we have:
- you stored the file in a cluster of canisters (distributed across different subnets, on different physical machines);
- canisters only store parity chunks, and these chunks are mangled by the erasure code algorithm, so there is no way for a node operator to even partially understand what's inside the chunks they hold on their subnet;
- since you know the chunk ids and the canister ids where each of these chunks live, you can fetch them back and reconstruct your data locally (see the sketch right after this list).
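A matching sketch of the local reconstruction, under the same assumptions as the encoding sketch (reed-solomon-erasure crate, zero-padding, and a local index that remembered the original length); reconstruct_from_parity is again just a name I made up:

```rust
use reed_solomon_erasure::galois_8::ReedSolomon;

/// Rebuild the original bytes from the N parity shards fetched back via `get_chunk`.
/// `original_len` has to come from the local index kept at upload time.
fn reconstruct_from_parity(parity: Vec<Vec<u8>>, original_len: usize) -> Vec<u8> {
    let n = parity.len();
    let rs = ReedSolomon::new(n, n).unwrap();

    // The N data shards are unknown (None); the N parity shards are known (Some).
    let mut shards: Vec<Option<Vec<u8>>> = std::iter::repeat(None)
        .take(n)
        .chain(parity.into_iter().map(Some))
        .collect();

    // Fills in the missing data shards; succeeds because only N of 2N shards are missing.
    rs.reconstruct(&mut shards).unwrap();

    // Concatenate the recovered data shards and drop the zero padding added at encode time.
    let mut data: Vec<u8> = shards[..n]
        .iter()
        .flat_map(|s| s.as_deref().unwrap().iter().copied())
        .collect();
    data.truncate(original_len);
    data
}
```

Fetching the shards themselves would just be batches of get_chunk calls against the cluster, driven by the locally stored (canister id, chunk id) index.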
Now, no individual node provider can access the information you've stored. In order to reconstruct it, they would need to obtain all N chunks; in other words, they would need to cooperate with other subnets. The more canisters you have in the cluster, the more malicious nodes would have to cooperate in order to reconstruct your data, and the harder the problem becomes.
Yes, this is still possible to do, but it is much, much harder now, especially if subnet/canister shuffling and/or hardware encryption are implemented. From this point it is no longer a question of “can a node operator access my data?”, but instead “does my data cost more than the cost of identifying and cooperating with K-1 other malicious node operators?”.
The only problem that I see with this solution is that there are also boundary nodes, which can be malicious and can accumulate different chunks until they are able to reconstruct the data. This could be fixed if the channel between the user and the subnet node is encrypted; honestly, I don't know whether that's the case.
This is an interesting idea and helpful input to the process of solving the problem. At some point, this problem must be solved so we can use the IC with maximum confidence in data privacy.
Why is data privacy necessary on IC? You don’t get complete data privacy on other blockchains or traditional clouds.
It is necessary for certain applications. If you want to build a decentralized system for X, and X is any industry where competitors might use the system, then you need a solution to convince CTOs that there is little to zero chance of others gaining access to their data. Without solving this problem you are stuck with the use cases you already have on today's blockchains: basically DeFi and maybe some social networks where all data is public.