Shuffling node memberships of subnets: an exploratory conversation

Summary

The shuffling of node membership in subnets has been brought up a few times by the community as a way to address security risks within the IC. Truth be told, researchers at the DFINITY foundation have touched on this topic occasionally within other discussions, but I do not think we have had a deep, focused conversation with the community about it. We want to correct this.

That is why we decided to create this thread in the /roadmap category as an exploratory conversation: to understand and discuss the topic with folks, and to assign a researcher to poll the community and focus on it.

For this conversation, Manu Drijvers (@Manu) will be the lead representing DFINITY R&D’s understanding and the owner of this conversation. Manu is a cryptographer and the engineering manager of the DFINITY Consensus team.

He will both bring the community’s ideas back to the R&D team and help communicate in the other direction; he will help bridge the gap, if you will. In the past, this conversation has been scattered across too many mediums and people, so it did not have time to breathe.

Related links:

How you can help

  • Propose your ideas
  • Ask questions
  • Like comments
11 Likes

I guess one of my first questions is: why not across subnets?

5 Likes

Thanks for kicking this off @diegop!

As Diego mentioned, I would be super curious to hear the community’s ideas and opinions on this topic. Currently, the internet computer does not regularly change subnet membership, while we’ve heard some people suggest regularly swapping nodes in and out of subnets. I’d love to hear what people prefer (more static vs more dynamic subnet membership) and how frequently they think nodes should be rotated. I’d also be curious to hear what type of attacks they are most concerned about and how those could be avoided using such mechanisms.

I guess one of my first questions is: why not across subnets?

@skilesare, I think Diego meant switching up subnet membership, so essentially “across subnets”: a node could be taken out of one subnet, and added to another subnet.

5 Likes

This is exactly right. I have edited the post and title to better reflect the intent.

Thank you @skilesare !

2 Likes

Ethereum’s fast shuffling requires a stateless client model, where all the state stored in a shard/subnet is authenticated with Merkle trees, and witnesses are then used when carrying out new computations.
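For readers unfamiliar with the stateless-client idea, here is a minimal sketch (plain Python, purely illustrative; neither Ethereum nor the IC works exactly like this) of authenticating state with a Merkle tree and checking a witness against the root:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a binary Merkle tree (number of leaves must be a power of two)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_witness(leaves: list[bytes], index: int) -> list[bytes]:
    """Sibling hashes needed to recompute the root starting from leaves[index]."""
    level = [h(leaf) for leaf in leaves]
    witness = []
    while len(level) > 1:
        witness.append(level[index ^ 1])  # the other child of our parent
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return witness

def verify(root: bytes, leaf: bytes, index: int, witness: list[bytes]) -> bool:
    """A 'stateless' verifier only needs the root, one leaf, and its witness."""
    acc = h(leaf)
    for sibling in witness:
        acc = h(acc + sibling) if index % 2 == 0 else h(sibling + acc)
        index //= 2
    return acc == root

state = [b"account:1", b"account:2", b"account:3", b"account:4"]
root = merkle_root(state)
assert verify(root, state[2], 2, merkle_witness(state, 2))
```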

A question I have is how long it would take for a node to download all the state of a subnet in Dfinity. This will impose practical limits on how fast shuffling can happen.

Will you guys consider a hybrid model?

4 Likes

Yes, please offer us more information on the bandwidth and computational requirements of spinning down a node from one subnet and spinning that same node up in another subnet. That will help inform us about the resource hit node shuffling would incur.

4 Likes

A slight nudge

If I may be so bold as to nudge the conversation a bit in this direction…

I think the intent behind shuffling memberships is very important to drill down on and get clarity and consensus about. This may sound like a tautology, so allow me to explain.

At DFINITY, I have been very fortunate to work with world-class cryptographers (not my field of expertise)… and one of the results of this interaction is that, for a while, we were talking with different mental models. The main difference is that cryptographers speak in terms of “attacks” and “breaking X.”

To satirize, a common conversation with a cryptographer can go like this:

Example 1
Diego: “Is protocol X secure?”

Cryptographer: “No, there is a paper that shows that if someone does {{describes the world’s most complicated attack}} then the protocol is broken. So it is considered Broken.”

Diego: “Oh”

Example 2

Diego: “Is protocol Y secure?”

Cryptographer: “All I know is that there is no known attack to break it that is easier than brute force over 20 years”

Diego: “Is that the crypto version of YES?”

Why it matters in this case

It matters because one motivation Jan, @Manu, and I had for creating this thread is that we realized (being intellectually honest) that we were not sure which attack node shuffling was addressing. This could very well be because we have not been paying close enough attention to the community (which we want to rectify) and it is obvious to other folks, but it was not to us.

Allow me to present some possible candidates of what we thought we heard:

  1. The attack is that someone wants to tamper with canister X.
  • The idea is that node shuffling will make it harder for nodes to collude and tamper with canister X. In this case, if canister X lives in a subnet Y with 7 node providers (while the IC has 100 node providers), then without shuffling an attacker only needs a conspiracy among those 7 node providers, whereas with shuffling they effectively have to contend with the full pool of 100. This is a reasonable take. The next follow-up would be: if this is the goal, why don’t we move the canister around instead of the nodes? Not saying this is easier than changing node membership, but it shows how we were not sure of the attack being mitigated. If the attack were clear, different creative solutions could be presented.

2. The attack vector is to corrupt subnet Y.

  • In this case, shuffling canisters does not help; the attack is about corrupting a particular subnet. Then the question I had was this (I admit my thinking is primitive here): if we assume that 1/3 of the nodes are malicious and node memberships are being shuffled constantly, then won’t subnet Y eventually end up with a supermajority of malicious nodes? Could shuffling produce a situation where some subnets have 0 malicious nodes and some subnets have 2/3 malicious nodes, and, by the laws of probability, eventually end up corrupting subnet Y? (See the rough simulation sketch after this list.)

3. The concern is less “an attack” and more a concern about the number of nodes in most subnets. The goal is to increase resilience.

  • In this case, node shuffling is seen as a way to increase the number of nodes that need to agree. If this is the case, is shuffling node membership a quicker path to that resilience than simply increasing the node count per subnet by tapping all of the node providers in the IC?

4. All of the above!

  • Very possible it is all of the above. This is why we started the thread: an honest recognition that we did not know which attacks to address. If we knew, then we could see whether there are easier ways to address them… or, conversely, there may already be easier attack vectors that need to be prioritized higher. We were not sure, to be honest.
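To make point 2 a bit more concrete, here is a rough Monte Carlo sketch in Python with made-up parameters (the node counts and subnet size are not real IC numbers): if exactly 1/3 of all nodes are malicious and one subnet is re-drawn uniformly at random at every shuffle, how often does a single draw end up with a 2/3 malicious supermajority?

```python
import random

TOTAL_NODES = 300            # assumed network size, not real IC numbers
MALICIOUS = TOTAL_NODES // 3 # exactly 1/3 of all nodes are malicious
SUBNET_SIZE = 13             # assumed subnet size
SHUFFLES = 100_000           # independent random re-draws of one subnet

nodes = [True] * MALICIOUS + [False] * (TOTAL_NODES - MALICIOUS)
bad_draws = sum(
    1
    for _ in range(SHUFFLES)
    if sum(random.sample(nodes, SUBNET_SIZE)) * 3 >= 2 * SUBNET_SIZE  # >= 2/3 malicious
)
print(f"shuffles where the subnet ends up >= 2/3 malicious: {bad_draws / SHUFFLES:.4%}")
```

Any single draw is unlikely to cross the threshold, but over enough shuffles some draw eventually will, which is exactly the trade-off the question above points at; how quickly that happens depends heavily on the subnet size and the shuffle frequency.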

Hope that makes sense!

11 Likes

Curious if @Manu agrees with what I wrote above.

Always very possible I took away something different than what he intended when he and I chatted. :upside_down_face:

3 Likes

I think it is likely that community members all have to go through a process of rectifying the confusion around what a node is vs what a canister is. I myself had a bit of a brain shift reading your post. I wasn’t thinking clearly about it, and I had specific issues I was most concerned with that were clouding my thinking. Great post!

I’m concerned about one malicious actor having infinite time to rip open a canister and look at the data in my canister. This is a problem until we have secure enclaves (and, with side-channel attacks, maybe even after we have them). This is #1 in your post.

#2 seems like maybe it could just be fixed by some transparency and some ability for the user to pick their subnet? Maybe not…and I probably don’t understand the node/subnet distinction well enough to comment.

#3. Resilience is great and should be a long-term goal. I remember seeing some early designs where users were going to get to pick the number of nodes that would run their canister (with costs adjusted accordingly). I understand that was removed for now to reduce complexity, but I can see how that would be a valuable choice for users to have.

6 Likes

I do! I’m hoping to get a discussion around what type of attacks people are afraid of and what type of attacks shuffling membership could help avoid.

I think it is likely that community members all have to go through a process of rectifying the confusion around what a node is vs what a canister is.

Good point, let’s start with clear terminology.
Canister smart contracts (or just “canisters”) are the pieces of software that run on the internet computer. The internet computer consists of many subnets, and the canisters are distributed over the different subnets. That means that every canister on the internet computer in fact runs on one of the subnets of the IC.

There are many physical computers powering the internet computer. I call these nodes or replicas. Nodes/replicas are grouped together to form a subnet. So if nodes 1, 2, 3 and 4 together form subnet A, then we say nodes 1, 2, 3, and 4 are the members of subnet A. The members of a subnet together build a blockchain as a means of reaching consensus on which messages for the canisters on this subnet should be executed. The replicated state of a subnet is the state that every member node of the subnet should have, which contains the state of every canister on this subnet.
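As a toy illustration of that terminology (a sketch only; the names and types are made up and are not the actual replica code):

```python
from dataclasses import dataclass, field

@dataclass
class Canister:
    canister_id: str
    state: bytes = b""   # the canister's own data, part of the subnet's replicated state

@dataclass
class Node:              # a physical machine, also called a replica
    node_id: str

@dataclass
class Subnet:
    subnet_id: str
    members: list[Node] = field(default_factory=list)       # e.g. nodes 1-4 form subnet A
    canisters: list[Canister] = field(default_factory=list)

    def replicated_state(self) -> dict[str, bytes]:
        """The state every member node must hold: the state of every hosted canister."""
        return {c.canister_id: c.state for c in self.canisters}
```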

4 Likes

Great questions. A new node can always catch up to the latest state of a subnet; this is already part of the internet computer today. The “catch-up package” (or “CUP”) signs the full replicated state, so a node newly joining a subnet can securely learn the hash of the replicated state. I did a talk on CUPs a while back; you can find it here. From that, there is a state-sync protocol, allowing a node to securely obtain the full replicated state from other nodes on the subnet.

I think the biggest performance cost is that a node that joins a subnet needs to obtain the full replicated state of a subnet. Currently, there is one subnet with a replicated state of 50GB, another one with 30GB, one with 10GB, and many subnets with a replicated state of a couple of GBs. Obtaining 50GB of data from other nodes consumes quite some bandwidth, and that’s bandwidth we could otherwise use for larger consensus blocks (= more update message throughput) or to answer more query calls. The replicated state is currently allowed to grow to 300GB, so things can get even larger.

To calculate an example: if we assume a 300GB replicated state, and nodes have 10Gb/s bandwidth, then a node could download that state in 4 minutes. It can pull different parts of the state from different nodes simultaneously, so if it’s pulling data from 15 other subnet members, they would all only spend ~7% of their bandwidth on helping the new node catch up for the duration of 4 minutes.
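Here is that back-of-the-envelope calculation as a small script you can tweak; the 300 GB state, 10 Gb/s links, and 15 helping peers are the assumptions stated above:

```python
STATE_GB = 300      # replicated state size in GB (the current upper limit mentioned above)
LINK_GBPS = 10      # node bandwidth in gigabits per second
PEERS = 15          # other subnet members the joining node pulls state from

download_seconds = STATE_GB * 8 / LINK_GBPS                       # 2400 Gb / 10 Gb/s
per_peer_share = (STATE_GB * 8 / PEERS) / (LINK_GBPS * download_seconds)

print(f"full state sync: {download_seconds / 60:.0f} minutes")
print(f"bandwidth each helping peer spends during that window: {per_peer_share:.1%}")
```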

So in summary: I would say that we can relatively quickly add new nodes to a subnet and take another one out. If we’re willing to pay a bit of a bandwidth penalty we can change many members of a subnet every day, and overall the approach of shuffling members seems feasible.

Now my question to everybody here: what do you see as the goal of shuffling subnet membership? How exactly is it more secure than not shuffling subnet members?

9 Likes

Ahhh, I had completely missed this particular intent. It has nothing to do with consensus but with data privacy while Secure Enclaves are not yet available.

Thank you @skilesare

4 Likes

I have been working on a little post for the past couple of days, but there seems to be far too much to discuss, and I don’t want to try to put it all into one perfectly long blog post here. I’ll just add comments ad hoc.

Welcome to my life, Jordan :upside_down_face:

6 Likes

Attack vector: Node operator collusion

One major attack vector is the collusion that is possible between node providers. Node providers are currently publicly known, and even if they weren’t, it is feasible that node providers could find each other through other means: personal networks, Google searches, relationships with data centers, etc. 7 colluding nodes could delete every canister on their subnet.

The fact that each subnet has a relatively low replication factor (compared with other blockchains) makes it relatively easy for node providers to find each other and prepare for an attack. For example, on a 7-node subnet, 3 colluding nodes can halt the subnet, and 5 colluding nodes could perform a potentially undetected attack and have full access to state changes and possibly more (if I am wrong on the math or the capabilities, let me know; I think I am generally correct here).

If I am a node provider, I only need to find two other node providers to cause havoc on a 7-node subnet. Obviously, increasing the replication factor would help, but shuffling may help us achieve higher levels of security with lower levels of replication, which is ideal for cost and other reasons. And I think it would be wise to add as many feasible mitigations as possible to the Internet Computer so that it can be incredibly secure.
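To make the arithmetic from the previous paragraphs concrete, here is a small sketch of those thresholds under the standard BFT assumption n ≥ 3f + 1 (a simplification, not the IC’s actual implementation): slightly more than 1/3 of a subnet can halt it, and roughly 2/3 can take it over.

```python
def thresholds(n: int) -> tuple[int, int]:
    """Smallest colluding-node counts that (a) can halt a subnet and (b) can take it
    over, under the standard BFT bound f = (n - 1) // 3."""
    f = (n - 1) // 3
    halt = f + 1          # the remaining honest nodes can no longer form a 2f+1 quorum
    takeover = 2 * f + 1  # the colluders alone form a quorum and can finalize anything
    return halt, takeover

for n in (7, 13, 28, 40):
    halt, takeover = thresholds(n)
    print(f"{n:2d}-node subnet: {halt} colluders can halt it, {takeover} can take it over")
```

For a 7-node subnet this gives 3 and 5, matching the numbers above.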

Since subnet membership is basically fixed at subnet inception, node providers have an indefinite amount of time to start colluding and preparing for an attack. And since canisters currently run in plain text on the nodes, and other information about them is known (I believe you can easily find out which subnet a canister belongs to?), it is relatively easy for node providers to even target specific canisters with their attacks.

Hiding the canisters from the node operators I think is a separate problem that can be mitigated with secure enclaves and possibly other technologies or techniques. But even if node operators don’t know what canisters they are running, they can perform an indiscriminate attack with regard to canisters and just attack the subnet. If they’re lucky, they’ll be able to get a juicy reward from canisters within their subnet, and maybe even affect other subnets that are depending on canisters within their subnet.

Though subnets are islands of consensus, it seems very unlikely that one subnet shutting down would not affect other subnets, since canisters will start to depend on other canisters in other subnets.

Shuffling the nodes would help to destroy node operator relationships within subnets. As soon as a relationship was formed, it could just as soon be destroyed.

Now, to make this worth it, the network would need many, many node operators; the more the better, I would think. I will discuss this in further comments.

16 Likes

Attack vector: Miner Extractable Value (MEV)

Another possible attack vector is Miner Extractable Value (MEV), a problem that plagues Ethereum as we speak.

My understanding of MEV is this: Ethereum clients are modified to look for lucrative transactions going to certain smart contracts (for example, Uniswap). The modified clients (let’s call them MEV clients) are designed to find these transactions, reorder them, and place the MEV client owner’s own transaction in front, a transaction that would take advantage of the knowledge gained by viewing the order and gas prices of all transactions targeting certain smart contracts. Front running I believe is the general term for this type of behavior. There may be other forms of MEV, but the above is my understanding.

So basically, a miner can take advantage of its position of knowledge and its reordering power to make money. This causes various issues and, IMO, is not desirable and should be avoided. It’s a big problem on Ethereum, and they’re having a hard time dealing with it. It seems they’ve basically accepted it as unavoidable and now treat it as a necessary evil.
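As a toy illustration of that reordering (hypothetical Python pseudocode; the “dex” target and the 1,000 threshold are made up), a modified client scans the pending pool and slips its own copy of a lucrative trade in front:

```python
from dataclasses import dataclass

@dataclass
class Tx:
    sender: str
    target: str      # e.g. a hypothetical "dex" contract
    amount: float

def build_block(mempool: list[Tx], my_address: str) -> list[Tx]:
    """An honest client orders by fee/arrival; this 'MEV client' front-runs instead."""
    ordered = []
    for tx in mempool:
        if tx.target == "dex" and tx.amount > 1_000:          # lucrative trade spotted
            ordered.append(Tx(my_address, "dex", tx.amount))  # our copy goes in first
        ordered.append(tx)
    return ordered

mempool = [Tx("alice", "dex", 5_000), Tx("bob", "wallet", 10)]
print(build_block(mempool, "mev_miner"))
```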

The question is, would this be possible on the Internet Computer? I believe yes, assuming a subnet had 2/3+ of nodes running a modified replica designed to take advantage of certain canisters.

How can this be mitigated? Secure enclaves are one solution, assuming we can get attestations from the enclave that the replica has not been tampered with.

Another solution is…you guessed it, node shuffling! If as a node operator you never know which subnet you’ll belong to, and thus which canisters you’ll be hosting, it would become difficult to install the correct modified replica to take advantage of MEV. And 2/3 of the other nodes in the subnet would also need to install this software, so even if you convinced a cohort of buddies to join you in running the modified replica, you’d have to hope you’d all be shuffled into the same subnet that hosts the canisters the modified replica is designed for.
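To give a feel for how shuffling raises the bar here, a quick sketch with made-up numbers of the chance that a fixed cohort of colluders running the modified replica all get seated in the same randomly drawn subnet:

```python
from math import comb

TOTAL_NODES = 500   # assumed network size
SUBNET_SIZE = 13    # assumed subnet size
COHORT = 9          # colluders needed for a 2/3 supermajority in a 13-node subnet

# Probability that one uniformly random subnet of SUBNET_SIZE contains the whole cohort.
p = comb(TOTAL_NODES - COHORT, SUBNET_SIZE - COHORT) / comb(TOTAL_NODES, SUBNET_SIZE)
print(f"chance a single random draw seats all {COHORT} colluders together: {p:.3e}")
```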

I’m not an expert on MEV and my understanding could be off, but as I see it node shuffling would provide an additional layer of protection against MEV. Combine that with secure enclaves that have code attestations, and I believe it would become nearly impossible for MEV to exist on the IC.

11 Likes

Attack vector: Secure enclave side-channel attacks

Secure enclaves will be an excellent addition to the security and privacy of the Internet Computer, helping to hide the execution and storage of canisters from node operators. This will make it hard for node operators to collude, considering that if you don’t know which canisters you are hosting, it is hard to perform a targeted attack against an individual canister. It will also make the IC more private, as it will become difficult for node operators to reveal canister data that is intended to be private and accessed only by authorized parties through the canister’s function interface.

Unfortunately, based on all of my learning on this subject over the past couple of years, the consensus is that secure enclaves are not perfectly secure and probably never will be. There seems to be a catch-all class of attacks known as side-channel attacks. Side-channel attacks in the context of secure enclaves are basically indirect attacks that use information such as power consumption, sound, electromagnetic radiation, and possibly other sources of information to read the supposedly “secure” or “private” information within the enclave.

This is where my knowledge really breaks down… I am not sure how long these attacks take to carry out. But my intuition tells me the attacks would not be simple and would take a long time to perform, like hours, days, or weeks. Sophisticated equipment may be necessary, and the attack may need to be very specific, e.g. aimed at a particular canister, so once again an attacker who does not know which canisters are running on a node is at a disadvantage.

Assuming the above is all true or close to it, node shuffling again helps! As soon as an attacker has all of their equipment set up, the canister they thought they might be targeting might have been whisked away. And even if the attacker is indiscriminate, the fact that the canisters running within the replica are always changing would make it harder for them to determine the patterns they might need to perform the attack.

Instead of a node operator knowing which canisters are running on their nodes, and having indefinite time to perform a side-channel attack, we could greatly narrow that window, and hopefully make it shorter than a conceivable attack would take.
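The “narrowing the window” argument boils down to a simple comparison, sketched here with entirely hypothetical numbers: rotation only helps if a node’s residence time in a subnet is shorter than the time the attack needs.

```python
ROTATION_DAYS = 7    # hypothetical: a node is rotated out of its subnet weekly
ATTACK_DAYS = 30     # hypothetical: setup + observation time a side-channel attack needs

if ATTACK_DAYS > ROTATION_DAYS:
    print("the targeted state moves away before the attack can complete")
else:
    print("rotation is too slow; the whole attack fits inside one membership period")
```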

8 Likes

Attack vector: Uneven distribution of Byzantines, or lack of security as a network effect

This attack vector (if you can even call it that) perhaps embodies the core of my arguments for why node shuffling is necessary for the security of the Internet Computer.

Right now, security is not a network effect of the IC. As more nodes are added to the IC, the IC as a whole does not become more secure. Adding nodes to the IC only increases the throughput of the IC, and/or the security of individual subnets. Each subnet is an island of security, and is only in charge of securing itself. It does not inherit security from the security of other subnets, only in that ingress messages from those subnets would be more secure, and egress messages to those subnets would be more secure.

This is not ideal, because the distribution of Byzantines could become too concentrated within individual subnets, even if the BFT requirements of the IC as a whole hold. Phrased differently, even if 1/3 or fewer of all IC nodes are malicious, individual subnets can currently have more than 1/3 malicious nodes. Individual subnets do not inherit the security of the IC as a whole; they are left to fend for themselves.

With node shuffling, assuming proper randomness that gives a proper probabilistic distribution of nodes across subnets (I am assuming this is possible), if the IC as a whole had comfortably fewer than 1/3 malicious nodes, then with high probability each subnet would also stay below 1/3 malicious nodes (how high depends on the subnet size).

Security as a network effect thus seems like a very desirable property to have, and it is lacking from the current Internet Computer.

11 Likes

And here are a few previous discussions on this topic by myself and a few others:

1 Like

Designed appropriately, I think the laws of probability would prevent this exact situation from happening. I remember earlier DFINITY materials which explained that, with a sufficiently large committee membership, the random beacon driving probabilistic slot consensus (or some other complicated mechanism) would ensure with outrageous (I think the more correct term is overwhelming) probability that no individual committee would be more than 1/3 malicious. The committee size was something like 400, and committees were chosen out of the total number of nodes participating in consensus. Each committee was in charge of validating for a certain period of time, or something like that.
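That “overwhelming probability” claim can be sanity-checked with a hypergeometric tail calculation. Below is a rough sketch (toy parameters, not the original DFINITY analysis); note that it assumes the adversary controls strictly less than 1/3 of all nodes (25% here), since the gap between the global fraction and the per-committee 1/3 threshold is what makes the tail shrink as the committee grows.

```python
from math import comb

def tail_prob(total: int, malicious: int, size: int) -> float:
    """P(strictly more than 1/3 of a uniformly random committee of `size` is malicious)."""
    denom = comb(total, size)
    return sum(
        comb(malicious, k) * comb(total - malicious, size - k) / denom
        for k in range(size // 3 + 1, size + 1)   # all k with k > size/3
    )

TOTAL, MALICIOUS = 600, 150   # toy network where the adversary holds 25% of all nodes
for size in (13, 40, 100, 400):
    print(f"committee of {size:3d}: P(> 1/3 malicious) = {tail_prob(TOTAL, MALICIOUS, size):.2e}")
```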

4 Likes