AMD SEV Virtual Machine Support

Summary

Enable node images to be run as virtual machines, improving data center adoption while continuing to support privacy-protecting subnets.

Status
In Progress

What you can do to help

  • Ask questions
  • Propose ideas

Key people involved
@khushboo-dfn1 , @bogdanwarinschi , @jplevyak , @hcb , Rudiger Kapitza

Relevant Background

We need to enable our node images to be run as virtual machines to aid data center adoption and support privacy-protecting subnets. AMD SEV will allow us to do this with minimal R&D effort. It can provide confidentiality against a potentially buggy or malicious hypervisor and against other code that happens to coexist on the physical server. AMD SEV transparently encrypts the memory of each VM with a unique key and provides an attestation report that helps the VM’s owner verify that the state is untampered and is running on a genuine AMD SEV platform. This is especially relevant for the IC, where IC-OS is deployed on remote servers that are not under our control. It can reduce the amount of trust that needs to be placed in the hypervisor and the administrator of the host system.

Right now, there are 3 flavors of SEV:

  1. AMD SEV: Encrypt in-use data (protect data in memory)
  2. AMD SEV-ES (Encrypted State): Encrypt VM register state
  3. AMD SEV-SNP (Secure Nested Paging): Provide strong memory integrity protection

Currently, our EPYC2 (Rome) machines have SEV enabled, and SEV-ES can be enabled with a custom kernel. Whether we need full-fledged SEV-SNP is still to be evaluated. SEV-SNP is available on EPYC3 (Milan) servers.
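
As a rough illustration of checking which SEV flavors a host has enabled, here is a minimal sketch that reads the Linux `kvm_amd` module parameters from sysfs. It assumes a Linux host with KVM; the `sev` and `sev_es` parameter files are the usual way to check this, while `sev_snp` is assumed to exist only on newer kernels. The helper names `read_flag` and `sev_support` are hypothetical.

```python
from pathlib import Path

# Assumption: on a Linux host with the kvm_amd module loaded, these sysfs
# parameter files report whether KVM has each SEV flavor enabled.
KVM_AMD_PARAMS = Path("/sys/module/kvm_amd/parameters")


def read_flag(param_dir: Path, name: str) -> bool:
    """Return True if the parameter file exists and reads as enabled."""
    try:
        value = (param_dir / name).read_text().strip()
    except OSError:
        # Module not loaded, or this kernel does not know the parameter.
        return False
    return value in ("1", "Y", "y")


def sev_support(param_dir: Path = KVM_AMD_PARAMS) -> dict:
    """Map each SEV flavor to whether the host reports it as enabled."""
    return {name: read_flag(param_dir, name) for name in ("sev", "sev_es", "sev_snp")}


if __name__ == "__main__":
    for flavor, enabled in sev_support().items():
        print(f"{flavor}: {'enabled' if enabled else 'not enabled'}")
```

A missing file is treated as "not enabled" so the same script can run on Rome and Milan hosts without special cases.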

To have SEV functionality, we need the following requirements to be met:

  1. Agent providing secret: For each VM, one agent (HW token, separate machine, etc.) that knows a secret whose disclosure is equivalent to breaking SEV.
  2. Storage encryption: Each VM needs to encrypt its persistently stored data.
  3. Verified boot: Chain to verify that VM runs authorized software.
  4. Signed software builds: Authenticate the software towards verified boot.
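
To make requirements 3 and 4 more concrete, here is a minimal sketch of a measured boot chain, assuming each stage pins the SHA-256 digest of the stage it hands control to and the verifier trusts only the root measurement. This is not the IC-OS boot design; `Stage`, `build_chain` and `verify_chain` are hypothetical names. In a real deployment the root measurement would be covered by a build signature, which is omitted here because the Python standard library has no public-key signature primitive.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    payload: bytes
    next_digest: str  # hex SHA-256 measurement of the following stage, "" for the last

    def measure(self) -> str:
        """Hash the payload together with the pinned digest of the next stage."""
        h = hashlib.sha256()
        h.update(len(self.payload).to_bytes(8, "big"))
        h.update(self.payload)
        h.update(self.next_digest.encode())
        return h.hexdigest()


def build_chain(named_payloads):
    """Link stages back to front so each one pins its successor's measurement."""
    stages = []
    next_digest = ""
    for name, payload in reversed(named_payloads):
        stage = Stage(name, payload, next_digest)
        stages.append(stage)
        next_digest = stage.measure()
    stages.reverse()
    return stages, next_digest  # the second value is the root measurement


def verify_chain(stages, trusted_root: str) -> bool:
    """Walk the chain from the trusted root; any modified stage breaks it."""
    expected = trusted_root
    for stage in stages:
        if not hmac.compare_digest(stage.measure(), expected):
            return False
        expected = stage.next_digest
    return expected == ""
```

Because every measurement covers the next stage's digest, changing any later stage changes the root, so a verifier only needs to know one trusted value.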

High-level tasks:

  1. Bring up a QEMU image on an existing SEV machine and understand the tooling.
  2. Bring up an IC-OS node with AMD SEV
  3. Encryption of persistent storage
  4. Attestation (Does SEV-SNP help such that we don’t need an external attestation provider?)
  5. Verified Boot

Acceptance criteria:

  1. Keys and canister state have confidentiality and cryptographic integrity protection
  2. Ensure that host and guest isolation exists
  3. Only attested software is run as the Guest

10 Likes

There are many issues with the security of secure hardware enclaves. What mitigations for side channel attacks are being considered?

2 Likes

Updated Summary and relevant people in the intro

2 Likes

Will this provide cryptographic guarantees that the replica software running from within the enclave has not been tampered with? Will we be able to know that node operators have not installed a modified replica? I’m especially thinking about MEV (miner extractable value), and wondering if these kinds of attestations would be able to prevent MEV (which I think would be a very good thing for the IC).

4 Likes

Will this provide cryptographic guarantees that the replica software running from within the enclave has not been tampered with?

I believe that is the intent.

Will we be able to know that node operators have not installed a modified replica?

Good question. I do not know.

@khushboo-dfn1 can you confirm?

2 Likes

That’s correct @lastmjs. The goal is to provide attestation to identify any tampering of the replica by a malicious actor.

4 Likes

@lastmjs: Thanks for the questions, some answers given already, let me add a few.

Side channels: it is ultimately impossible to prevent a malicious actor with control over the host system from tracing certain aspects of enclave execution through side channels. We can and will apply counter-measures such as “data-independent code and memory access patterns” (e.g. for cryptographic primitives, to avoid exposing secret key material as side channel patterns), but we must admit that the applicability of such techniques is narrow and cannot protect the system in its entirety.
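
To illustrate what “data-independent code” means in the simplest case, here is a minimal sketch contrasting an early-exit byte comparison with one whose control flow does not depend on where the inputs differ. The function names are hypothetical, and Python gives no real timing guarantees; the point is only the shape of the code. Production code would use a vetted primitive such as `hmac.compare_digest` from the standard library.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time reveals the position of the first mismatch."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner the earlier the mismatch occurs
    return True


def data_independent_equal(a: bytes, b: bytes) -> bool:
    """Visit every byte and fold differences with OR, so no branch depends on the data."""
    if len(a) != len(b):
        return False  # only the length is revealed, which is usually public
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0


def stdlib_equal(a: bytes, b: bytes) -> bool:
    """The standard-library comparison designed to resist timing analysis."""
    return hmac.compare_digest(a, b)
```

All three functions return the same answers; they differ only in how much the execution trace can tell an observer about the secret input.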

Tampering: We will have cryptographic guarantees against tampering with either the software or the data in storage (and of course also encryption at rest). This also includes software upgrades.
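
As a rough sketch of tamper detection for stored data, assume each storage block gets an HMAC-SHA256 tag computed under a key that only the guest knows, with the block index mixed in so blocks cannot be silently swapped. This is an illustration, not the IC-OS storage design: `seal_block` and `check_block` are hypothetical names, encryption is left out, and a per-block tag alone does not detect rollback of a block to an older valid version.

```python
import hashlib
import hmac


def seal_block(key: bytes, index: int, data: bytes) -> bytes:
    """Compute an integrity tag that binds the data to its block index."""
    message = index.to_bytes(8, "big") + data
    return hmac.new(key, message, hashlib.sha256).digest()


def check_block(key: bytes, index: int, data: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it without an early exit."""
    return hmac.compare_digest(seal_block(key, index, data), tag)
```

For example, a tag produced for block 0 fails verification if the same bytes are presented as block 1, or if a single byte of the data is changed.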

7 Likes

I wonder if node shuffling would help to prevent side channel attacks. The fact that node assignments to subnets are essentially static means a malicious node operator has a possibly indefinite amount of time to perform a side channel attack on an identified hosted canister. If nodes were to randomly rotate amongst subnets, could this prevent certain side channel attacks?

I think node shuffling is paramount for security, and side channel attacks on enclaves might be one enhancement they provide.

7 Likes

Shuffling would only help if moving between machines of different authority, i.e. different data centers. As it stands however, in all likelihood every data center will contribute at least a single machine to each subnet (more or less). That means shuffling would not help.

1 Like

Won’t there be more data centers than replicas in subnets? What are we thinking the end number of data centers will be, 100s or 1000s? And subnet replication is really low right now, 7-34 (I’m hoping it can be pushed up into the 100s or 1000s).

Also, there are independent node operators within data centers I believe. So shuffling among node operators could prove helpful. Also, I am not sure on the nature of side channel attacks, but I imagine they would be rather targeted to one or a few machines at a time, thus even if you moved a canister to another machine in the same data center, you might save it from an attack in progress (though the canister it was replaced with would then be attacked, but who knows if the attack would need to be specific to the canister).

4 Likes

There is no reason to believe that a side channel attack would be limited to a single machine, rather than be deployable on an identified target. Randomized moving will only help you if an attacker-in-waiting has no way of knowing when to strike, however canister/subnet/machine membership needs to be public information for routing purposes.

2 Likes

This probably goes without saying, but solving this to some satisfactory degree is very important to any number of Enterprise use cases. Cloud computing has demonstrated the Enterprise’s ability to shift their philosophy about where their data sits and who has access, but it hasn’t been an easy or fast transition.

It is possible for me to break into an azure data center and access Enterprise data, but it is highly unlikely. Microsoft could access the data but they have significant financial reasons not to do so and to protect it at all costs.

All of this to say that the more roadblocks that can be put up, the better. Perfection doesn’t have to be achieved. If node shuffling helps in some scenarios then it is helpful (not to mention the other security advantages that @lastmjs has mentioned elsewhere).

On the financial side, perhaps these enclave operators need another layer of financial incentive to protect against compromise? In an ideal world you wouldn’t want them securing more financial value than they have at stake. That seems hard to maintain.

4 Likes

Has anyone looked at multi-party computation in addition to secure enclaves? From what I’ve learned, that would be a killer combination, perhaps providing BFT guarantees on data privacy, or some kind of threshold guarantee. Some threshold of node operators would have to perform a side channel attack to be able to decrypt the multi-party computation and access the private data. Seems the hardest part is scaling MPC?
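
To illustrate the threshold idea in the question above, here is a minimal sketch of Shamir secret sharing, one common building block for MPC (it is not MPC itself, and nothing in this thread says the IC would use it). A secret is split into shares such that any `threshold` of them recover it. The field prime and the names `split_secret` and `recover_secret` are assumptions chosen for the example.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime, used here as the field modulus


def split_secret(secret: int, threshold: int, shares: int):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points) -> int:
    """Lagrange-interpolate the polynomial at x = 0 from the given points."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With a threshold of 3 out of 5, any three share holders can reconstruct the secret, while fewer than three learn nothing about it. The modular inverse via `pow(den, -1, PRIME)` requires Python 3.8 or newer.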

6 Likes

Where are things now regarding this topic? Is there a timeframe for moving forward? It seems E2E encryption would help mitigate loss from attacks, but that only helps for data at rest. We need to get to a place where on-chain analysis of sensitive data is possible with minimal risk. This is absolutely necessary for many enterprises (like mine, which is in healthcare) to leverage the distributed compute power and other features of the IC.

6 Likes

Thank you for asking. I have been meaning to post an update.

The current status is that the team is still working and investigating options (including trying existing tools and systems), but it is still too early for anything concrete. There are other projects where a lack of updates signifies that the team working on them may have been pulled away to other priorities, but this is one of the cases where the team is still hard at work and simply does not feel it has made solid enough progress to share yet.

3 Likes

I would love to get more insight into the history of secure enclave/TEE development within DFINITY. This tweet makes it seem like the feature was nearly ready 1 year ago: https://twitter.com/dominic_w/status/1304576423705767936

This seems to happen often: a tweet shows excellent progress on a feature, but on digging in, the feature turns out to be barely working or still under development.

What’s going on?

4 Likes

Totally agree. Again, the e2e solution that Timo presented for the recent hackathon is the only real advancement that I’ve seen with regard to secure data sharing linked to II. I wonder if the next big advancement here will come from a hackathon as well.

I hope figuring out this issue is mission critical for DFINITY, because if they don’t, enterprises will not move from the traditional stack/cloud, since they would be assuming a higher risk of data exposure than in their current state.

Would love to have more detailed insight into where things are currently.

2 Likes

I do recall the one demo Dom mentioned in the tweet. Here is my recollection:

  1. There was work for this in Fall 2020 (but not production-tested)

  2. Engineers were pulled off this project to focus on the one priority: launch the network. Many projects and features rightly halted to focus on launching the network, which became the #1 priority within the org. Launching of course also meant “have sufficient testing and reliability to make sure it never goes down”, which required lots of teams and projects like “disaster recovery” (which came in handy this weekend and was very complex to sufficiently test). The org increasingly hyper-focused from Q2 2020 to Genesis.

  3. Since Genesis in May 2021, some projects have begun to pick up steam as the network becomes more stable and performant (and confidence grows).

8 Likes

Love this inside look, thanks!

6 Likes

You are very much welcome!

6 Likes