Let's solve these crucial protocol weaknesses

As a bit of food for thought, here’s a brief overview of an idea we (in and around the Message Routing team) have been throwing around to significantly increase (data and computation) throughput on the IC. It is just an idea, with a couple of big question marks and probably a bunch of pitfalls we aren’t even aware of yet. And if it does turn out to be feasible, it will require a huge effort to implement.

Finally, we’re not claiming this is entirely our idea or that no one else has had more or less the same thoughts. These kinds of things grow organically and directly or indirectly influence each other.

The basic idea is nothing revolutionary and it has even been brought up a few times in this thread: don’t put the full payload into blocks, only its hash; and use some side channel for the data itself. That side channel could be IPFS running within / next to the replica; the replica’s own P2P layer; some computation running alongside the replica (e.g. training an AI model on a GPU); and so on and so forth. Once “enough” replicas have the piece of data, you can just include its hash / content ID in the block; the block is validated by said majority of replicas (essentially agreeing that they have the actual data) and off you go.
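To make the shape of this a bit more concrete, here is a minimal sketch of what a hash-only block payload and its availability-based validation could look like. Everything in it (type names, the attestation map, the quorum parameter) is made up for illustration and is not the replica’s actual API.

```rust
// Hypothetical sketch, not actual replica code: a block references payloads by
// content hash only, and a proposal is considered valid (availability-wise)
// once a quorum of replicas attest to holding the corresponding data locally.

use std::collections::{HashMap, HashSet};

type ContentHash = [u8; 32];
type ReplicaId = u64;

/// Block payload containing only content hashes instead of the raw data.
struct HashOnlyPayload {
    content_hashes: Vec<ContentHash>,
}

/// Tracks which replicas claim to hold which pieces of data, e.g. collected
/// from availability attestations gossiped over the P2P layer (hypothetical).
struct AvailabilityTracker {
    holders: HashMap<ContentHash, HashSet<ReplicaId>>,
}

impl AvailabilityTracker {
    /// A piece of data counts as available once at least `quorum` replicas
    /// attest to holding it.
    fn is_available(&self, hash: &ContentHash, quorum: usize) -> bool {
        self.holders
            .get(hash)
            .map_or(false, |replicas| replicas.len() >= quorum)
    }

    /// A hash-only block is valid (from the availability perspective) only if
    /// every referenced payload has reached the quorum.
    fn validate_payload(&self, payload: &HashOnlyPayload, quorum: usize) -> bool {
        payload
            .content_hashes
            .iter()
            .all(|h| self.is_available(h, quorum))
    }
}
```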

You could use this to upload huge amounts of data onto a subnet; transfer large chunks of data across XNet streams; run arbitrarily heavy and long-running (but deterministic) computations alongside the subnet; and probably a lot more that we haven’t even considered.

The obvious issue here is that combining many such hashes into a block (or blockchain) could make it so that not enough replicas have all of them. E.g. imagine a 13-replica subnet and a block containing 13 hashes for 13 pieces of data; each replica has 12 of the pieces and each replica is missing a different one. Going about this naively (e.g. including all 13 hashes in the block), you’ll end up stalling the subnet until at least 9 of the replicas have collected all the data needed to proceed. This may easily mean minutes of latency before the block can even start executing. Even being exceedingly cautious and saying “at least 11 of the 13 replicas must have all the data” can leave the other 2 replicas constantly state syncing, never able to catch up. And what if 3 replicas are actually down? I’m sure there’s a way forward (again, implying some trade-off or other: liveness, latency, safety); I just haven’t found it. The best I have so far is to use this as an optimization: if “enough” replicas have all the data, they propose and validate a block containing just hashes; if not, they continue including the actual data into blocks and advance with much reduced throughput (as now).
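To pin down what that optimization might look like, here is a rough sketch of the block maker’s decision between the hash-only fast path and today’s full-payload fallback. Again, all names, types and the availability map are hypothetical and purely illustrative.

```rust
// Hypothetical sketch of the fallback described above: the block maker
// proposes a hash-only block only when enough replicas already hold every
// referenced payload; otherwise it falls back to inlining the data, trading
// throughput for liveness.

use std::collections::{HashMap, HashSet};

type ContentHash = [u8; 32];
type ReplicaId = u64;

enum BlockPayload {
    /// Fast path: reference payloads by hash, relying on the availability quorum.
    HashesOnly(Vec<ContentHash>),
    /// Slow path: inline the full data, as blocks do today.
    FullData(Vec<(ContentHash, Vec<u8>)>),
}

/// Decide which kind of block to propose, given gossiped availability info.
fn propose_payload(
    pending: &[(ContentHash, Vec<u8>)],
    holders: &HashMap<ContentHash, HashSet<ReplicaId>>,
    availability_quorum: usize, // e.g. 9 out of 13 replicas in the example above
) -> BlockPayload {
    let all_available = pending.iter().all(|(hash, _)| {
        holders
            .get(hash)
            .map_or(false, |replicas| replicas.len() >= availability_quorum)
    });

    if all_available {
        BlockPayload::HashesOnly(pending.iter().map(|(hash, _)| *hash).collect())
    } else {
        // Fall back to today's behaviour: put the actual bytes into the block.
        BlockPayload::FullData(pending.to_vec())
    }
}
```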

As said before, if you have ideas on how to make this work, or completely different ideas of your own that you want to talk about, please start a separate thread. I’ll be happy to chat. Just keep in mind that there’s a long way from idea to working implementation; and that there’s always – always – a trade-off and a price to pay: sometimes it’s worth it, more often than not it isn’t.
