Thanks everybody for the insightful comments! I understand that it’s a lot of effort to write all these thoughts down in detail, but I think this really helps the discussion. To keep an overview, so far people have brought up the following possible advantages of shuffling:
- node operator collusion (a majority of the nodes of a subnet are malicious and tamper with the subnet state)
- security as a network effect (very related to the point above)
- miner extractable value / front running
- attacking enclaves
- censorship resistance
I’ll first reply to the ones that I find less convincing, and then comment on the ones that I think are the best arguments in favor of shuffling.
miner extractable value / front running
A block maker in any blockchain can order messages in any way they like. That is because we haven’t agreed on an ordering yet; the whole point of placing messages on a blockchain is to reach agreement on an ordering. This means that a malicious block maker can always try to order messages in a way that is favorable to itself. A concrete example of such an attack: if the block maker sees a big sell order to some dex coming in, it can quickly place a sell order of its own and make sure that it appears in the block before the other sell order. Note that this attack only requires a single malicious block maker. It does not need collusion between multiple node operators, because the MEV block maker will still propose a block that only contains valid messages, so all other replicas will accept this block. They cannot know that the MEV block maker is actually front running, because there is no agreed-upon ordering of messages yet.
Another solution is…you guessed it, node shuffling! If as a node operator you never know which subnet you’ll belong to, and thus which canisters you’ll be hosting, it would become difficult to install the correct modified replica to take advantage of MEV. And 2/3 of the other nodes in the subnet would also need to install this software, so even if you convinced a cohort of buddies to join you in running the modified replica, you’d have to hope you’d all be shuffled into the same subnet that hosts the canisters the modified replica is designed for.
See my point above: this attack can be done by a single malicious replica, you don’t need a majority. The idea that the node operator/replica doesn’t know which subnet it belongs to and which canisters it hosts is very difficult to realize: a replica must only accept messages to canisters that are hosted on its own subnet, so it couldn’t properly validate blocks if it didn’t know which canisters it is running.
How can this be mitigated? Secure enclaves is one solution, assuming we can get attestations from the enclave that the replica has not been tampered with.
I think this is a promising way to address MEV, and the foundation is already investigating how this can be done. So if we run this in some TEE and require attestations, it will be way harder to run a malicious block maker. Other countermeasures can be taken in the dapp itself. For example, if you are a dex, you could think of batching transactions and using secure randomness to shuffle transactions and execute them in a random order, making front running much harder.
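The dapp-level countermeasure above could be sketched roughly as follows. This is a minimal Python illustration, not how any existing dex does it: the `Order` type, `shuffle_batch`, and the `seed` parameter are hypothetical names, and the sketch assumes that the seed comes from a randomness source the block maker cannot predict at the time it orders the messages, and that it is only revealed after the batch is closed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical order record, for illustration only."""
    order_id: str   # assumed unique within a batch
    side: str       # "buy" or "sell"
    amount: int


def shuffle_batch(orders: list[Order], seed: bytes) -> list[Order]:
    """Return the batch in an execution order derived from the seed.

    Each order is ranked by SHA-256(seed || order_id), so the result does
    not depend on the position the block maker gave the order in the block.
    """
    def rank(order: Order) -> bytes:
        return hashlib.sha256(seed + order.order_id.encode()).digest()

    return sorted(orders, key=rank)


if __name__ == "__main__":
    batch = [Order("a", "sell", 100), Order("b", "sell", 5), Order("c", "buy", 7)]
    print([o.order_id for o in shuffle_batch(batch, b"example-seed")])
```

Because the execution order is a function of the seed and the order contents only, placing a message earlier in the block gives no advantage within a batch. It does not stop a block maker from delaying an order into a later batch.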
Attacking enclaves
That’s an interesting point that I hadn’t thought about yet. I’m not an expert on these things, but I would imagine that the best way to get info out of the TEE via side channels is to target the encryption key, not to attack canisters directly. Changing subnets would not change your TEE, so shuffling wouldn’t “reset” the side channel attack. I’ll try to get some of the experts involved in this one.
censorship resistance
Right, so an attacker targeting a certain canister can figure out which subnet that canister is on and attack nodes from that subnet. @MalcolmMurray, what type of attack are you thinking of here? And how quickly do you imagine nodes would be shuffled? If a couple of nodes are swapped out say every hour or so, and the attack is just a simple DoS attack, then I’m not sure shuffling helps: an attacker could easily keep up with the subnet membership changes because it’s easy to target a new node with a DoS attack. For attacks that take a long time to perform, shuffling could indeed help.
Now, I think the main ones are left.
node operator collusion and security from the network effect
This is a super important one and I think the core of this discussion. The argument of “security from the network effect”, i.e. the bigger the ICP grows, the more secure it becomes, is obviously super compelling.
So the Internet Computer could constantly switch nodes between subnets, such that, for example, a node never spends more than 1 week on a given subnet. If we further assume that attacking a node takes > 1 week, then the adversary can no longer target specific canisters/subnets, which is a nice property to have.
Designed appropriately, I think the laws of probability would prevent this exact situation from happening. I remember earlier DFINITY materials which explained that with sufficient numbers of committee membership, the random beacon driving probabilistic slot consensus or some other complicated mechanisms would ensure with outrageous (I think the more correct term is overwhelming) probability that no individual committee would be more than 1/3 malicious. The committee size was something like 400, and committees were chosen out of the total number of nodes participating in consensus. Each committee was in charge of validating for a certain period of time or something like that.
Right, you’re referring to our old whitepaper. This is where things get more tricky. We now essentially sample a new subnet membership every week, and here we want to then draw from the security of the overall number of nodes. Suppose our total pool of nodes is very large (so I can get away with using binomial distributions). Let’s do some computational examples. Suppose up to a fraction probability_corrupt of all nodes are corrupt and we sample a subnet of size subnet_size. Then we compute (using the binomial distribution) how likely it is that we select a subnet with more than 1/3 of the nodes being corrupt, which I call p_insecure. Below you see some examples; here is the Google Sheet I used.
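For anyone who wants to reproduce the calculation without the spreadsheet, here is a small Python sketch of the same binomial tail. It reuses the names probability_corrupt, subnet_size and p_insecure from above and makes the same simplifying assumption: the node pool is large enough that each sampled node is corrupt independently with probability probability_corrupt.

```python
from math import comb, log2


def p_insecure(subnet_size: int, probability_corrupt: float) -> float:
    """Probability that strictly more than 1/3 of a sampled subnet is corrupt.

    Binomial model: each of the subnet_size nodes is corrupt independently
    with probability probability_corrupt (large-pool assumption).
    """
    # Smallest corrupt count that exceeds one third of the subnet.
    first_bad = subnet_size // 3 + 1
    return sum(
        comb(subnet_size, k)
        * probability_corrupt**k
        * (1 - probability_corrupt) ** (subnet_size - k)
        for k in range(first_bad, subnet_size + 1)
    )


if __name__ == "__main__":
    # Example subnet sizes chosen for illustration only.
    for n in (13, 28, 40):
        p = p_insecure(n, 0.1)
        print(f"subnet_size={n:3d}  p_insecure={p:.3e}  log2={log2(p):.1f}")
```

For subnet_size = 28 and probability_corrupt = 0.1 this comes out close to 2^-12.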
So what you see is that if for example we assume 1/10th of all nodes in the IC are malicious, and we randomly select a 28 node subnet, there is a 2^-12 probability that the subnet is unsafe, because more than 1/3rd of the nodes are corrupt. This is a small chance, so that is good, but if we regularly choose new subnet members, then every time the sampling needs to be successful. If we have 50 subnets and do this every week, we do this ~2500 times per year, and each one of those must be successful. I think this is the main price of the reshuffling. If we don’t reshuffle, we don’t need to get it right 2500 times per year, but just once, which is obviously a better chance, so theoretically we could tolerate more corruptions overall. How do people feel about these numbers?
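To put a rough number on "every sampling needs to be successful", the per-sample probability can be compounded over a year. The sketch below is only an approximation: `p_any_insecure` is a hypothetical helper name, and it assumes every sampling is independent, which is not exactly true when subnets draw from the same finite pool of nodes.

```python
def p_any_insecure(p_insecure: float, num_subnets: int, reshuffles_per_year: int) -> float:
    """Probability that at least one sampled subnet membership is insecure in a year.

    Assumes all num_subnets * reshuffles_per_year samplings are independent
    and each fails with probability p_insecure.
    """
    samples = num_subnets * reshuffles_per_year
    return 1 - (1 - p_insecure) ** samples


if __name__ == "__main__":
    p = 2**-12  # per-sample figure for a 28-node subnet with 1/10 corrupt nodes
    weekly = p_any_insecure(p, num_subnets=50, reshuffles_per_year=52)
    static = p_any_insecure(p, num_subnets=50, reshuffles_per_year=1)
    print(f"weekly reshuffling: {weekly:.3f}")
    print(f"sampled once:       {static:.3f}")
```

Under these assumptions, weekly reshuffling of 50 subnets gives a bit under a one-in-two chance per year of hitting at least one insecure membership, against roughly 1% when each subnet is sampled only once.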
So to recap, based on these numbers, I think we can say:
- shuffling nodes is nice against “adaptive” adversaries that target one subnet, assuming it takes some time to corrupt a node
- shuffling nodes is not so nice in the sense that we select new subnet memberships every time, and each time we must select a secure set of nodes. We can get unlucky and put too many malicious nodes into one subnet, which can then collude to break the subnet
- static subnets are weak against adaptive adversaries
- static subnets are better against static adversaries