Let's solve these crucial protocol weaknesses

Hi @lastmjs, thanks for summarizing on this forum the list of ICP weaknesses from your Twitter post. I don’t think Twitter is a good venue for in-depth discussion, so I’m glad you moved them here.

While others may be able to suggest ways to address these weaknesses, I want to digress a bit and focus on the background of your Twitter post, namely whether AO is a sound protocol that allows for infinitely scalable (and secure) computation without the trade-offs made in ICP.

But first, I’ll take a detour in a completely different direction and discuss how exactly we “don’t trust, verify”. Please bear with me, and you will understand why I want to do this before explaining what AO is about.

IMHO, blockchain is all about verification, so “how to verify” is the single most crucial thing when it comes to understanding a blockchain protocol. And “how to verify” has certainly evolved over the years:

  1. Run a full node that syncs all block history since genesis and, for each block, verifies the inputs in it, performs the required computation, and verifies the outputs. Note that a full node is also required in order to participate in blockchain consensus, where it is expected to perform both verification and computation (in addition to agreeing on the ordering of inputs).
  2. But running a full node is very demanding, and ordinary users do not have the resources to run one. When a user only wants to verify, they can instead run a light client that syncs all block headers since genesis (or since a recent, publicly known checkpoint) and, for any received data that requires verification (a full block, or part of a block with a Merkle proof), checks whether the Merkle root matches a known block header. A Bitcoin SPV client also computes UTXOs for the user, but an Ethereum light client only verifies data without executing actual smart-contract computations. So a light client assumes that a verified block has the consensus of the majority of full nodes in a blockchain, and that the result of computation is correct. (A minimal sketch of the Merkle check appears right after this list.)
  3. ICP goes one step further by reducing the implementation of a light client down to verifying a single BLS signature. If the signature checks out, it is assumed that the result has the consensus of all full nodes in a blockchain (namely, one IC subnet) and that the result of computation is correct. (A sketch of this check also follows the list.)
  4. Layer 2 (L2) roll-ups took a different direction on verification, because they can seek help from a smart-contract-enabled layer 1 (L1). Instead of running a consensus protocol and replicated computation, L2s usually run a centralized server to process user transactions. The results (after running through many blocks on the central server) are “rolled up” into a short piece of evidence and posted on the L1 for verification.
    • For zk-rollups, verification is done by an L1 smart contract checking a zero-knowledge proof against the evidence.
    • For optimistic roll-ups, verification is only required when a challenger submits a challenge to the L1 smart contract, which then re-computes everything and checks whether the evidence from the L2 was correct. It is assumed that the L2 node has already staked tokens in the L1 contract and will be punished if verification fails, so there is an incentive not to be malicious. The absence of a challenge for a certain (usually long) period of time is silently taken as a positive signal of “being verified”.
  5. Besides roll-ups, there are other layer 2 protocols (Ordinals, Ethscriptions, etc.) that avoid running consensus. They usually run off-chain computation whose inputs are taken from an L1’s block data. More often than not, such inputs are not verified before they are admitted into an L1 block, because the L1 lacks the capability to perform the required verification. So it is up to the L2’s off-chain computation to decide which inputs are legitimate and to compute results based only on the “correct” ones. These protocols still offer verifiability by asking users to run a full node plus an indexer. Due to version differences and bugs, different people running their own choice of indexer may arrive at different states even when given the same set of inputs, so resolving conflicts sometimes requires extra work and social consensus.
  6. Yet other protocols run off-chain computation without requiring a full node. These are usually point-to-point protocols where it is sufficient (with the help of an L1 light client) to verify a transaction using only the data presented by the parties involved. Examples include payment channels (the Lightning protocol), the RGB protocol, and so on. They still specify their own methods of verification and offer analysis of why the protocol is secure.
  7. Last but not least, verifiability also plays a crucial role in cross-chain communication. Usually this takes the form of a “bridge” (where the only communication is token custody), with smart contracts on both chains, each securing the assets on its respective chain. Each contract needs to verify whether a transfer request from the other chain is authentic. This is usually done by running either full nodes or light clients, and often a third chain is introduced in between, because consensus is needed to reduce the work required of the smart contracts (which are often less capable than off-chain computation). There are many examples: Polkadot, Cosmos, and even ICP fits in here, except that ICP greatly reduces the complexity of verifying inter-subnet communication down to verifying signatures.
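To make point 2 concrete, here is a minimal sketch of the Merkle check a light client performs. It is simplified on purpose (Bitcoin, for instance, uses double SHA-256 and its own leaf encoding), and the function names are mine, not any real client’s API:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_merkle_proof(leaf: bytes,
                        proof: list[tuple[bytes, str]],
                        expected_root: bytes) -> bool:
    """Walk from a leaf hash up to the root, hashing in each sibling.

    `proof` is a list of (sibling_hash, side) pairs, where `side` says
    whether the sibling sits to the 'left' or 'right' of our node.
    """
    node = sha256(leaf)
    for sibling, side in proof:
        node = sha256(sibling + node) if side == "left" else sha256(node + sibling)
    # A light client trusts only the synced block headers; the Merkle
    # root inside a header is the anchor everything is checked against.
    return node == expected_root
```

The point is the asymmetry: the prover hands over O(log n) hashes, and the verifier never has to touch the rest of the block.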
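And for point 3, ICP compresses even that into a single signature check. Below is a rough illustration using py_ecc’s IETF-style BLS API; the key generation is only there to make the snippet self-contained, and the real IC scheme is a threshold BLS over BLS12-381 with its own message format, so treat this as the shape of the check, not the substance:

```python
# pip install py_ecc
from py_ecc.bls import G2ProofOfPossession as bls

# Stand-ins: on the IC, a subnet's public key is well known, and the
# signed "message" is (roughly) the root of a certified state tree.
secret_key = bls.KeyGen(b"\x01" * 32)       # illustrative key material
subnet_public_key = bls.SkToPk(secret_key)
message = b"certified state root"
signature = bls.Sign(secret_key, message)

# The entire "light client" is this one check: if it passes, the result
# is assumed to carry the consensus of the whole subnet.
assert bls.Verify(subnet_public_key, message, signature)
```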

Now let’s look at AO, which is still a project at an early stage, with a not-so-comprehensive spec and much still in flux. I had the opportunity to chat with one of AO’s founders in a WeChat group, and after some heated debates, here are my takes. Please take them with a big grain of salt, because so much has not been specified at all, and it is all subject to change.

  • AO allows any deterministic computation to take place off-chain in the form of “a process”, where the chain is Arweave (AR), which offers only decentralized storage and no on-chain smart-contract capability. So AR cannot offer computation verification the way some other L1s can. All inputs to a process must first be recorded on AR, so there is a permanent record of all historical inputs.
  • One or more CUs (compute units) work together to perform the computation required by a process, turning inputs into outputs, with the outputs put on AR as well. They do not run a consensus protocol (or at least are not required to), so their computation results may not be trustworthy. Still, these results or outputs are recorded on chain and can be verified if anyone chooses to do so (we’ll get to the “how” a bit later).
  • However, things become tricky here: when a process A receives an input from another process B, even though the input is recorded on chain, it is not immediately clear that the input from B can be “trusted” by the CUs computing for process A. The unofficial answer I got from the AO team was that CUs computing for process A do not verify any input beyond checking that it is already recorded on AR.
  • This may lead to a situation where the CUs for A record a “wrong” output on chain from running process A, either because of a bug or because they are malicious. But this can be remedied by an optimistic verification mechanism via staking and slashing.
  • It is assumed that a special group of CUs runs a process that manages staking and slashing. They would respond to challenge requests from anyone and check whether the recorded outputs of a process were really computed from its inputs. Needless to say, they would have to replay the entire input history of a process in order to verify it, because no intermediate state is saved anywhere (a sketch of this replay follows the list).
  • Now the remaining question is who would challenge. It is assumed that the project team running CUs for process A would have an interest in also running at least one CU for process B, because A’s correctness depends on B’s output. And because AO is an open-membership protocol, anyone is free to run any CU. According to the unofficial discussion, A’s stake would also be slashed if A ever accepted a “wrong” input from B without verification, so there is an even greater incentive for A’s team to help verify B. Extending this logic, teams whose projects have dependencies on each other would cross-verify one another.
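To pin down what such a challenge check would involve, here is a minimal sketch of the replay-style verification, under my own assumptions about the data shapes (AO’s actual formats are still in flux, and `ProcessLog`, `step`, etc. are hypothetical names of mine):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ProcessLog:
    inputs: list[Any]             # full input history, as recorded on AR
    recorded_outputs: list[Any]   # the outputs CUs wrote back to AR

def replay_and_check(log: ProcessLog,
                     step: Callable[[Any, Any], tuple[Any, Any]],
                     initial_state: Any) -> bool:
    """Re-run the entire history from genesis, because no intermediate
    state is saved anywhere, and compare each output to what the CUs
    recorded. Inputs that came from *other* processes are taken at face
    value here, which is exactly the big assumption discussed below.
    """
    state = initial_state
    for inp, recorded in zip(log.inputs, log.recorded_outputs):
        state, output = step(state, inp)   # deterministic transition
        if output != recorded:
            return False                   # grounds for slashing
    return True
```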

If I may make an analogy, AO is basically a distributed network of off-chain indexers, each responsible for itself (by computing for its own process, and also by offering to run CUs for the processes that may give it inputs). All events (inputs and outputs) are permanently recorded on a storage chain, AR. AO’s security relies on optimistic challenges plus staking and slashing, managed by a group of CUs instead of by a smart-contract-enabled layer 1.

So, is AO’s computation verifiable? Since everything is recorded and the computation is assumed to be deterministic, then yes, it is.

How much effort would it take to fully verify the computation of a single transaction? I think it is equivalent to running a full node that syncs the entire history of every process. First of all, every CU must stake, so to verify their respective memberships you need the full state of the special staking-manager process, which I’ll assume will be the main token ledger of AO. Then, since almost all processes will take input from or give output to the main ledger, they all become dependencies of one another. A full verification therefore requires computing all dependencies, which eventually involves the entire global state of AO (the sketch below illustrates how the closure blows up). Given that AO’s goal is to “absurdly scale”, this kind of full verification would be impossible to achieve.
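Here is a toy illustration of that dependency argument. The topology is invented, but it captures the claim: once everything touches the main ledger, the transitive closure of what you must replay is the whole network:

```python
from collections import deque

def verification_closure(deps: dict[str, set[str]], start: str) -> set[str]:
    """Everything `start` transitively depends on must also be replayed."""
    seen, queue = {start}, deque([start])
    while queue:
        proc = queue.popleft()
        for dep in deps.get(proc, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Toy topology: every process talks to the main ledger, and the ledger
# has, by construction, received messages from every process.
processes = [f"proc_{i}" for i in range(5)]
deps = {p: {"ledger"} for p in processes}
deps["ledger"] = set(processes)

print(verification_closure(deps, "proc_0"))
# -> all six names: the ledger plus every process (set order may vary)
```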

Now that full-node verification is out of the question, can we run a light-client equivalent for verification purposes? That would mean verifying the recorded outputs of a single process against its recorded inputs, while assuming all recorded cross-process communication is correct.

But this is a very big assumption to make, so big that I’m not even sure it is secure anymore. Just compare it to cross-chain communication as discussed in point 7 above. It is immediately obvious that AO’s design takes a radical approach to verifying the equivalent of “cross-chain messages”. AO’s processes are not even chains, since there is no consensus. Where protocols like Polkadot and Cosmos took extra care to design a secure cross-chain message-exchange mechanism, AO leaves it to optimistic challenges and slashing. Where optimistic layer 2s like Arbitrum are cautious enough to insist on a 7-day withdrawal delay in order to limit the potential damage of “wrong outputs”, AO wants every inter-process message exchanged immediately, since it can be “verified later”.

It is also unclear whether the CUs in this stake-managing process would run a consensus protocol. If not, their own computation would require challenging as well, and that falls flat due to the recursion: the process that adjudicates challenges would itself need someone to adjudicate challenges against it. AO is a young protocol making bold claims, and I hope its team pays more attention to verification and its practicality, because they directly affect security.

My conclusion so far is that ICP chose one set of trade-offs in order to be scalable, and AO chose a different set. Neither has been entirely proven practical yet, but at least we have some assurance that when an IC canister receives a message from another canister, it is “secure” within the safety parameters of a (subnet’s) consensus protocol. With AO, it is “optimistic”, remember?

PS: it is also unclear how a “roll-back” would work when a mismatch is discovered and various stakes get slashed. The AO team roughly mentioned that something like multiple versions or branches could coexist, but I’m not sure when and how a re-computation is triggered, because it would necessarily have to follow the dependency order. It is almost like branching off an old checkpoint, except you don’t actually know when or where, since the team insists that “AO has no global state”.
