AMD SEV Virtual Machine Support

Summary

Enable node images to be run as virtual machines, improving data center adoption while continuing to support privacy-protecting subnets.

Status
In Progress

What you can do to help

  • Ask questions
  • Propose ideas

Key people involved
@khushboo-dfn1 , @bogdanwarinschi , @jplevyak , @hcb , Rudiger Kapitza

Relevant Background

We need to enable our node images to be run as virtual machines to aid data center adoption and support privacy-protecting subnets. AMD SEV will allow us to do this with minimal R&D effort. It can provide confidentiality against a potentially buggy or malicious hypervisor and against other code that happens to coexist on the physical server. AMD SEV transparently encrypts the memory of each VM with a unique key and provides an attestation report that helps the VM's owner verify that the state is untampered and is running on a genuine AMD SEV platform. This is especially relevant for the IC, where IC-OS is deployed on remote servers that are not under our control. It can reduce the amount of trust that must be placed in the hypervisor and the administrator of the host system.

Right now, there are 3 flavors of SEV:

  1. AMD SEV: Encrypt VM memory (protect in-use data)
  2. AMD SEV-ES (Encrypted State): Additionally encrypt the VM's CPU register state
  3. AMD SEV-SNP (Secure Nested Paging): Additionally provide strong memory integrity protection

Currently, our EPYC2 (Rome) machines have SEV enabled, and SEV-ES can be enabled with a custom kernel. Whether we need full-fledged SEV-SNP is still to be evaluated. SEV-SNP is available on EPYC3 (Milan) servers.

To have SEV functionality, we need the following requirements to be met:

  1. Agent providing secret: For each VM, one agent (HW token, separate machine, etc.) that knows a secret whose disclosure would be equivalent to breaking SEV.
  2. Storage encryption: Each VM needs to encrypt its persistently stored data.
  3. Verified boot: Chain to verify that VM runs authorized software.
  4. Signed software builds: Authenticate the software towards verified boot.
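
To make requirements 3 and 4 a bit more concrete, here is a minimal sketch of the core check behind verified boot: measure the image and compare the measurement against a list of authorized builds. Everything here is an assumption for illustration: the names `measure` and `is_authorized`, the use of SHA-256, and the placeholder image bytes. A real SEV deployment would rely on the measurement in the platform's signed attestation report and on a signed release manifest, not on a plain in-memory set.

```python
import hashlib
import hmac
from typing import Iterable


def measure(image: bytes) -> str:
    """Return a hex digest standing in for a boot measurement (assumed SHA-256)."""
    return hashlib.sha256(image).hexdigest()


def is_authorized(image: bytes, allowed_digests: Iterable[str]) -> bool:
    """Check whether the image's measurement matches any authorized digest."""
    digest = measure(image)
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(digest, allowed) for allowed in allowed_digests)


# Hypothetical allow-list; in practice it would come from signed release metadata.
release_image = b"ic-os build (placeholder bytes)"
allowed = {measure(release_image)}

print(is_authorized(release_image, allowed))
print(is_authorized(release_image + b" tampered", allowed))
```

The point of the sketch is only that any change to the image changes its measurement, so an unauthorized build fails the check.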

High-level tasks:

  1. Bring up a QEMU image on an existing SEV machine and understand the tooling.
  2. Bring up an IC-OS node with AMD SEV
  3. Encryption of persistent storage
  4. Attestation (Does SEV-SNP help such that we don’t need an external attestation provider?)
  5. Verified Boot

Acceptance criteria:

  1. Keys and canister state have confidentiality and cryptographic integrity protection
  2. Ensure that host and guest isolation exists
  3. Only attested software is run as the Guest

There are many issues with the security of secure hardware enclaves. What mitigations for side channel attacks are being considered?

Updated Summary and relevant people in the intro

Will this provide cryptographic guarantees that the replica software running from within the enclave has not been tampered with? Will we be able to know that node operators have not installed a modified replica? I’m especially thinking about MEV (miner extractable value), and wondering if these kinds of attestations would be able to prevent MEV (which I think would be a very good thing for the IC).

Will this provide cryptographic guarantees that the replica software running from within the enclave has not been tampered with?

I believe that is the intent.

Will we be able to know that node operators have not installed a modified replica?

Good question. I do not know.

@khushboo-dfn1 can you confirm?

That’s correct @lastmjs. The goal is to provide attestation to identify any tampering of the replica by a malicious actor.

@lastmjs: Thanks for the questions, some answers given already, let me add a few.

Side channels: it is ultimately impossible to prevent a malicious actor with control over the host system from tracing certain aspects of enclave execution through side channels. We can and will apply countermeasures such as “data-independent code and memory access patterns” (e.g. for cryptographic primitives, to avoid exposing secret key material through side channel patterns), but we must admit that the applicability of such techniques is narrow and that they cannot protect the system in its entirety.
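
As an illustration of what “data-independent code” means, here is a small sketch contrasting a comparison that exits at the first mismatch with one that always processes every byte. The function names are made up for this example. Note the caveat: pure Python gives no real timing guarantees, so this only shows the control-flow idea; for actual use, the standard library's `hmac.compare_digest` is the appropriate call.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Returns at the first mismatch, so running time depends on the secret."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # early exit reveals the position of the mismatch
    return True


def data_independent_equal(a: bytes, b: bytes) -> bool:
    """Visits every byte and never branches on secret-dependent values."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # accumulate differences without an early exit
    return diff == 0


key = b"secret-key-material"
print(data_independent_equal(key, b"secret-key-material"))
print(data_independent_equal(key, b"secret-key-materiaX"))
print(hmac.compare_digest(key, b"secret-key-material"))
```

Both functions return the same answers; the difference is only in which instructions execute, which is exactly what a side channel observer can see.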

Tampering: We will have cryptographic guarantees against tampering with either the software or the data in storage (and of course also encryption at rest). This also includes software upgrades.

I wonder if node shuffling would help to prevent side channel attacks. The fact that node assignments to subnets are essentially static means a malicious node operator has a possibly indefinite amount of time to perform a side channel attack on an identified hosted canister. If nodes were to randomly rotate amongst subnets, could this prevent certain side channel attacks?

I think node shuffling is paramount for security, and protection against side channel attacks on enclaves might be one more benefit it provides.

Shuffling would only help if canisters moved between machines under different authorities, i.e. in different data centers. As it stands, however, in all likelihood every data center will contribute at least one machine to each subnet (more or less). That means shuffling would not help.

Won’t there be more data centers than replicas in subnets? What are we thinking the end number of data centers will be, 100s or 1000s? And subnet replication is really low right now, 7-34 (I’m hoping it can be pushed up into the 100s or 1000s).

Also, I believe there are independent node operators within data centers, so shuffling among node operators could prove helpful. I am not sure about the nature of side channel attacks, but I imagine they would be rather targeted at one or a few machines at a time. So even if you moved a canister to another machine in the same data center, you might save it from an attack in progress (though the canister that replaced it would then be attacked, and who knows whether the attack would need to be specific to the canister).

There is no reason to believe that a side channel attack would be limited to a single machine rather than be deployable on any identified target. Randomized moving will only help if an attacker-in-waiting has no way of knowing when to strike. However, canister/subnet/machine membership needs to be public information for routing purposes.

This probably goes without saying, but solving this to some satisfactory degree is very important to any number of Enterprise use cases. Cloud computing has demonstrated the Enterprise’s ability to shift its philosophy about where its data sits and who has access, but it hasn’t been an easy or fast transition.

It is possible for me to break into an Azure data center and access Enterprise data, but it is highly unlikely. Microsoft could access the data, but they have significant financial reasons not to do so and to protect it at all costs.

All of this to say that the more roadblocks that can be put up, the better. Perfection doesn’t have to be achieved. If node shuffling helps in some scenarios, then it is helpful (not to mention the other security advantages that @lastmjs has mentioned elsewhere).

On the financial side, perhaps these enclave operators need another layer of financial incentive to protect against compromise? In an ideal world you wouldn’t want them securing more financial value than they have at stake. That seems hard to maintain.

Has anyone looked at multi-party computation in addition to secure enclaves? From what I’ve learned, that would be a killer combination, perhaps providing BFT guarantees on data privacy, or some kind of threshold guarantee. Some threshold of node operators would have to perform a side channel attack to be able to decrypt the multi-party computation and access the private data. Seems the hardest part is scaling MPC?
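
To illustrate the threshold idea only, here is a minimal sketch of Shamir secret sharing, which is a common building block for MPC and not a full MPC protocol. The field modulus, the function names `split` and `recover`, and the 3-of-5 parameters are all assumptions chosen for this example and do not reflect any IC design.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime used as the field modulus (assumed choice)


def split(secret: int, threshold: int, shares: int):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover(points) -> int:
    """Lagrange interpolation at x = 0 to reconstruct the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


parts = split(secret=123456789, threshold=3, shares=5)
print(recover(parts[:3]) == 123456789)
print(recover(parts[2:]) == 123456789)
```

With fewer than the threshold number of shares, the interpolation yields an unrelated value, which is the property that would force several node operators to collude before private data could be reconstructed.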
