Let's solve these crucial protocol weaknesses

Hi @lastmjs, thanks for summarizing on this forum the list of ICP weaknesses from your Twitter post. I don't think Twitter is a good venue for in-depth discussion, so I'm glad you moved them here.

While others may be able to make suggestions to address these weaknesses, I want to digress a bit and focus on the background of your Twitter post, namely, whether AO is a sound protocol that allows for infinitely scalable (and secure) computation without the trade-offs made in ICP.

But first, I'll take a detour in a completely different direction and discuss how exactly we "don't trust, verify". Please bear with me, and you will understand why I want to do this before I explain what AO is about.

IMHO, blockchain is all about verification. So "how to verify" is the single most crucial thing when it comes to understanding a blockchain protocol. And "how to verify" has certainly evolved over the years:

  1. Run a full node to sync all block history since genesis, and for each block, verify inputs in it, perform required computation, and verify outputs. Note that a full node is also required to participate in blockchain consensus, where it is expected to perform both verification and computation (in addition to agreeing on the ordering of inputs).
  2. But running a full node is very demanding, and ordinary users do not have the resources to run one. Instead, when a user only wants to verify, they can run a light client that syncs all block headers since genesis (or since a recent and publicly known checkpoint), and for any received data that requires verification (which can be a full block or part of a block with a merkle proof), checks whether the merkle hash matches known block headers (see the sketch after this list). A Bitcoin SPV client would also compute UTXOs for the user, but an Ethereum light client only verifies data without running actual smart contract computations. So a light client assumes that a verified block has the consensus of the majority of full nodes in a blockchain, and that the result of computation is correct.
  3. ICP goes one step further by reducing the implementation of a light client down to verifying a BLS signature. If the signature checks out, it is assumed that the result has the consensus of all full nodes in a blockchain (namely, one IC subnet), and the result of computation is correct.
  4. Layer 2 (L2) roll-ups took a different direction when it comes to verification, because they can seek help from a smart-contract-enabled layer 1 (L1). Instead of running a consensus protocol and replicated computation, L2s usually run a centralized server to process user transactions. The results (after running through many blocks on the central server) are "rolled up" into a short piece of evidence and put on L1 for verification.
    • For zk-rollups, the verification is done by a smart contract on L1 checking a zero-knowledge proof against the evidence.
    • For optimistic roll-ups, the verification is only required when a challenger submits a challenge to the L1 smart contract, and the L1 smart contract will re-compute everything and verify if the evidence from L2 was correct. It is assumed that the L2 node had already staked tokens in the L1 contract, and would be punished if the verification fails. So there is incentive not to be malicious. Lack of challenge for a certain (and usually long) period of time is silently taken as a positive signal of “being verified”.
  5. Besides roll-ups, there are other layer 2 protocols (Ordinals, Ethscription, etc.) that avoid running consensus. They usually run off-chain computation with inputs taken from block data on an L1. More often than not, such inputs are not verified before they are admitted into an L1 block, because the L1 lacks the capability to perform the required verification. So it is up to the L2's off-chain computation to decide that some inputs are legitimate and others are not, and to compute results based only on "correct" inputs. These protocols still offer verifiability by asking users to run a full node + indexer. Due to version differences and bugs, different people running their own choice of indexers may arrive at different states even when given the same set of inputs. So it sometimes requires extra work and social consensus to resolve conflicts.
  6. Yet still, there are other protocols that run off-chain computation and do not require running a full node. They are usually point-to-point protocols where it is sufficient (with the help of an L1 light client) to verify a transaction using only data presented by the parties involved in it. Examples are payment channels (the Lightning protocol), the RGB protocol, and so on. They still specify their own methods of verification, and offer analysis of why the protocol is secure.
  7. Last but not least, when it comes to cross-chain communication, verifiability also plays a crucial role. Usually this takes the form of a "bridge" (where the only communication is token custody), with smart contracts on both chains, each securing assets on its respective chain. They need to verify whether a transfer request from the other chain is "authentic". This is usually done by running either full nodes or light clients, and often a third chain is introduced in between, because consensus is needed to reduce the work required of smart contracts (which are often less capable than off-chain computation). There are many examples: Polkadot, Cosmos, and even ICP fits in here, except that ICP greatly reduces the complexity of verifying inter-subnet communication down to verifying signatures.
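
To make point 2 a bit more concrete, here is a minimal sketch of what a light client actually does when checking inclusion. All names and the exact encoding (SHA-256, a left/right flag per sibling) are my own illustrative choices, not any specific chain's format:

```rust
use sha2::{Digest, Sha256};

fn sha256(data: &[u8]) -> [u8; 32] {
    Sha256::digest(data).into()
}

/// Walk a merkle path from `leaf` up to the root; `is_left` marks whether
/// each sibling sits to the left of the running hash. The light client
/// accepts the data only if the recomputed root matches a block header it
/// has already synced and verified.
fn verify_inclusion(
    leaf: &[u8],
    siblings: &[[u8; 32]],
    is_left: &[bool],
    known_root: &[u8; 32],
) -> bool {
    let mut acc = sha256(leaf);
    for (sib, left) in siblings.iter().zip(is_left) {
        let mut buf = Vec::with_capacity(64);
        if *left {
            buf.extend_from_slice(sib);
            buf.extend_from_slice(&acc);
        } else {
            buf.extend_from_slice(&acc);
            buf.extend_from_slice(sib);
        }
        acc = sha256(&buf);
    }
    &acc == known_root
}
```

Note that nothing here re-executes any computation: the light client only checks that the data is committed to by headers it trusts via consensus.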

Now let’s look at AO, which is still a project in its early stage with a not-so-comprehensive spec, and much still in flux. I had the opportunity to chat with one of AO’s founders in a WeChat group. After some heated debates, here are my takes. Please take this with a big grain of salt because so much hasn’t been specified at all, and they are all subject to change.

  • AO allows any deterministic computation to take place off-chain in the form of "a process", where the chain is Arweave (AR), which offers only decentralized storage and no on-chain smart-contract capability. So AR cannot offer computation verification like some other L1s. All inputs to a process must first be recorded on AR, so there is a permanent record of all historic inputs.
  • One or more CUs (computing units) work together to perform the computation required of a process, taking inputs to outputs, where outputs are put on AR as well. They do not run a consensus protocol (or at least they are not required to), so their computation results may not be "trustworthy". Still, such results or outputs are recorded on chain, and can be verified if anyone chooses to do so (we'll see about the "how" a bit later).
  • However, things become tricky here, because when a process A receives an input from another process B, even though the input is recorded on chain, it is not immediately clear that the input from B can be "trusted" by the CUs computing for process A. The unofficial answer I got from the AO team was that CUs computing for process A do not verify any input beyond checking that it is already recorded on AR.
  • This may lead to a situation where the CUs for A record a "wrong" output on chain from running process A, either because of a bug, or because they are malicious. But this can be remedied by an optimistic verification mechanism via staking and slashing.
  • It is assumed that a special group of CUs run a process that manages staking & slashing; they would respond to challenge requests from anyone, and check whether the recorded outputs of a process were really computed from its inputs. Needless to say, they would have to run through the entire input history of a process in order to verify, because no intermediate state was saved anywhere (see the sketch after this list).
  • Now the remaining question is who would challenge. It is assumed that the project team running CUs for process A would have an interest in running at least one CU for process B as well, because A's correctness depends on B's output. Also, because AO is an open-membership protocol, anyone is free to run any CU. According to unofficial discussion, A's stake would also be slashed if A ever accepted a "wrong" input from B without verification, so there is even greater incentive for A's team to help with verifying B. If we expand this logic, teams would cross-verify each other whenever their projects have dependencies.
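
To pin down my understanding of the flow, here is a heavily simplified, hypothetical sketch of what a challenge would have to do; none of these types or names come from the AO spec:

```rust
struct RecordedProcess {
    inputs: Vec<Vec<u8>>,          // all historic inputs, as stored on AR
    claimed_outputs: Vec<Vec<u8>>, // the outputs the CUs recorded on AR
}

// Stand-in for the process's deterministic transition function.
fn deterministic_step(state: &mut Vec<u8>, input: &[u8]) -> Vec<u8> {
    state.extend_from_slice(input);
    state.clone()
}

/// Returns the amount to slash (the full `stake` on any mismatch).
/// Because no intermediate state is saved anywhere, the verifier must
/// replay from the very first recorded input.
fn handle_challenge(p: &RecordedProcess, stake: u64) -> u64 {
    let mut state = Vec::new();
    for (input, claimed) in p.inputs.iter().zip(&p.claimed_outputs) {
        let actual = deterministic_step(&mut state, input);
        if &actual != claimed {
            return stake; // mismatch found: slash
        }
    }
    0 // outputs check out; no slashing
}
```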

If I’m allowed to make an analogy, AO is basically a distributed network of off-chain indexers, each responsible for themselves (by computing for their own process, and also by offering to run CUs for processes that may give them inputs). All events (inputs & outputs) are permanently recorded on a storage chain AR. AO’s security relies on optimistic challenging + staking + slashing, which is managed by a group of CUs instead of a smart-contract enabled Layer 1.

So, is AO’s computation verifiable? Since everything is recorded, and computation is assumed to be deterministic, then yes it is.

How much effort would it take to fully verify the computation of a single transaction? I think it is equivalent to running a full node that syncs the entire history of all processes. First of all, every CU must stake, so to verify their respective memberships, you need the full state of the special staking manager process, which I'll assume will be the main token ledger of AO. Then almost all processes will take input from or give output to the main ledger, so they all become dependencies of one another. Therefore a full verification requires computing all dependencies, which will eventually involve the entire global state of AO (see the sketch below). Given that AO's goal is to "absurdly scale", this kind of full verification would be impossible to achieve.
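
The blow-up is just a transitive closure. A hypothetical sketch (all names are mine) of the set of processes you would have to replay:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// The processes you must replay to fully verify one process form the
/// transitive closure of its input dependencies. If (nearly) every
/// process touches the main token ledger, the closure is (nearly) the
/// whole network.
fn verification_closure(
    deps: &HashMap<String, Vec<String>>, // process -> processes it takes input from
    start: &str,
) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back(start.to_string());
    while let Some(p) = queue.pop_front() {
        if seen.insert(p.clone()) {
            for d in deps.get(&p).into_iter().flatten() {
                queue.push_back(d.clone());
            }
        }
    }
    seen
}
```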

Now that full-node verification is out of the question, can we run a light-client equivalent for verification purposes? This would mean verifying the recorded outputs of a single process against its recorded inputs, assuming all recorded cross-communication between processes is correct.

But this is a very big assumption to make, so big that I'm not even sure it is secure any more. Just compare it to cross-chain communication as noted in point 7 above, where I discussed verification. It is immediately obvious that AO's design takes a radical approach to verifying the equivalent of "cross-chain messages". AO's processes are not even chains, since there is no consensus. Where protocols like Polkadot and Cosmos took extra care in designing a secure cross-chain message exchange mechanism, AO leaves it to optimistic challenges & slashing. Where optimistic Layer 2s like Arbitrum are cautious enough to insist on a 7-day withdrawal delay in order to limit the potential damage of "wrong outputs", AO wants every inter-process message to be exchanged immediately, since they can be "verified later".

It is also unclear whether the CUs in this stake-managing process would run a consensus protocol. If not, their own computation would require challenging as well, which falls flat due to the recursion. AO is a young protocol making bold claims, and I hope they pay more attention to verification and its practicality, because both directly affect security.

My conclusion so far is that ICP chose to make one set of trade-offs in order to be scalable, and AO chose a different set. Neither is yet entirely proven to be practical, but at least we have some assurance that when an IC canister receives a message from another canister, it is "secure" within the safety parameters of a (subnet) consensus protocol. With AO, it is "optimistic", remember?

PS: it is also unclear how a "roll-back" would work when a mismatch is discovered and various stakes get slashed. The AO team roughly mentioned that something like multiple versions or branches could co-exist, but I'm not sure when and how a re-computation is triggered, because it would necessarily have to follow the dependency order. It is almost like branching off of an old checkpoint, except you don't actually know when or where, since the team insisted that "AO has no global state".

67 Likes

This response is absolutely fantastic. Paul you are a legend.

Do we have docs anywhere comparing the Internet Computer Protocol with other computing paradigms in crypto right now?

It would be extremely helpful to take the various categories of blockchains identified here and put them into a nice comparison table somewhere with detail comparing verification/consensus mechanisms across chains.

12 Likes

Thank you for your response which I will be studying more in-depth later.

In a world of ZK and FHE VMs, doesn't this call into question ICP's focus on replicated computation for verifiability?

Would not AO be set up very nicely for this future?

2 Likes

Yes, it would. But I don't think it completely diminishes the value of replicated computation, which is still the most straightforward and perhaps the most cost-effective(?!). And as explained below, ZK verification would require consensus too.

Unfortunately no. ZK does not give AO an immediate boost, because its verification would necessarily remain optimistic due to its async nature. If you think process A can verify inputs from B by ZK, yes, but that also depends on B's own inputs and their dependencies. I can't possibly imagine A verifying the entire dependency closure of B (with its entire historic inputs) through ZK. You have to assume some "verified" checkpoints exist somewhere. How do you arrive at these "verified" checkpoints? Through consensus. Ditching consensus is unwise (for example, you almost certainly need it for the stake manager).

12 Likes

It seems like AO is flexible enough to build consensus into it in a much more modular fashion…this is one of my main thoughts/wonderings, is that AO might be building a more flexible protocol from the beginning, whereas ICP has chosen a much more rigid design.

From your point of view though, I wonder if you think they would have to essentially come up with the consensus design of ICP then?

Is chainkey really that beneficial here? Verifying multiple signatures is another way to go about it. Would it be that bad to deploy your process to multiple CUs, ask for their public keys, and verify the results? Of course, you brought up the inter-process verification…

Do we really need this pBFT-style ICC for verifiable compute? Might not staking with simpler verification be good enough in the end?

Especially if it allows a much more expressive computational environment from the very beginning?

Yes but can’t Arweave provide that just fine? At least for client → server, request/response, assuming no inter-process communication? And I wonder how hard the inter-process communication consensus would be in this case, perhaps it’s just on Arweave again.

Am I misunderstanding the role of Arweave in consensus here on AO? There is consensus on inputs to processes stored on Arweave correct?

Are you saying AR would be the chain that runs ZK verification? Yes, it would work, with AR being the bottleneck. This is essentially taking the same roll-up approach taken by the Ethereum community, how is AO then different than multiplying ZK roll-ups by 10-thousand times on Ethereum?

Yet still, inter-L2 communication is very crucial. It has to be the L1 that offers verification, IMHO, because L1 is the synchronization point.

If I may be even more blunt, I don't think optimistic verification is secure in an async setting without introducing a sufficient challenge period (which is sound, but goes directly against the scalability claim). By async setting, I mean either the actor programming model offered by ICP or AO, or the inter-L2 communication between ETH roll-ups.

ZK verification in an async setting would still either require a single L1 at the base layer, or rely on the replicated computation (because ZK verification is still computation) of each chain/subnet/shard running a consensus protocol.

10 Likes

Thank you for starting that thread lastmjs!

My assumption was always that because consensus is very expensive, to make economic sense the stateful part of an app should run on the IC, while everything that is stateless probably should not. So my initial take was that Web3 would imply a novel architecture regarding how we build apps.

In particular, serving or processing large files on the IC doesn't seem to make economic sense. I'm wondering if that is what you are talking about when you speak of vertical/single-app limits? Any specific examples of limits you can share to help me understand your point better?

Thanks!

5 Likes

It's awesome that the ICP community can have discussions like this.

Blockchain tech has come a long way in the past couple of years - other blockchains (not going to throw mud here) are patting themselves on the back for implementing 1990s levels of compute… Yet we are here asking if hundreds of GB of storage, http outcalls, timers, chainkey integrations and vetkeys are enough.

It still blows my mind that we have several full-blown apps running fully on chain.

That said, it's healthy to reflect on what we can do better to make it easier for the next generation of developers. For me, when building 221Bravo.App, the one issue we hit quite often is the instruction limit. The app is quite data intensive - indexing tens of thousands of transactions every day.

As I've said before, I don't think every business process needs to be replicated X number of times. There are plenty of (non-mission-critical) tasks that I would happily run on a single trusted server - for example, running a force-directed map on transaction data. If a node misbehaves, it's not a massive deal.

It would be great to choose how many times your smart contract is replicated. Got a ledger canister - max out the replication… got a canister monitoring cycles usage… stick it on a low-rep subnet.

Looking back at the ICP developments over the last 24 months, I’m confident we will overcome any bumps and the developer experience will continue to get better :grinning:

11 Likes

This may be a practical reality currently, perhaps always, but it is not the vision put forth over years from DFINITY. I would like to pursue that vision if it is at all possible.

2 Likes

These limits get hit often in a variety of circumstances and preclude many many things from being built…there are so many things.

wrt instruction limits this is my mental model, please correct me if I’m wrong:

At the moment, the entire subnet's state, including the state of all canisters running on it, is certified and committed in each block. The rationale behind this design was to reduce overhead: batching state certification and notarization for a large group of canisters, potentially numbering in the tens of thousands, is more efficient than doing it on a per-canister basis.

In this context the instruction limit serves mainly two purposes:

  1. Achieving steady block times; without it, the finalization rate would only be as fast as the slowest-running canister.
  2. Ensuring the subnet can be checkpointed at regular intervals.

This architecture makes sense, as most canisters won’t be bottlenecked by execution, but in some cases we’d rather have subnets with fewer, but less restricted, canisters.

DTS was implemented to mitigate this limitation, allowing message execution to span multiple rounds, but it doesn't completely address the issue and comes with its own drawbacks.
Take a 100B-instruction message as an example: a modern CPU is capable of processing such a workload very quickly, but with DTS it would take almost a minute, which is terrible for the end user.
This is because each round can still only process a finite number of WASM instructions for the sake of the block rate, and there are hundreds of milliseconds between rounds, during which no progress is made.
The number of rounds a single message can be spread over also can't exceed the checkpointing interval.
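
Rough arithmetic behind the "almost a minute" figure (the ~2B-instruction per-round slice and ~1 s of effective round time are my assumptions, not official parameters):

```latex
\frac{100 \times 10^{9}\ \text{instructions}}{2 \times 10^{9}\ \text{instructions/round}} = 50\ \text{rounds},
\qquad
50\ \text{rounds} \times {\sim}1\ \text{s/round} \approx 50\ \text{s}
```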

What if a new subnet type were introduced where, instead of having constrained execution rounds, each canister could run for as long as it needs to (maybe with some loose safeguards to prevent infinite loops)?
In order to make this happen, the "subnet" shouldn't attempt to certify its entire state at a fixed interval, but should instead:

  • Do the usual gossiping and ordering of messages
  • Induct messages into canister queues
  • Exhaust the canister queue at its latest certified height + 1
  • Sign the canister state
  • Gossip finalization shares on a per canister basis
  • Certify canister state at latest height with enough shares

With this approach:

  • Each canister has its own finality, determined by the usual BFT latency plus the time it takes to complete execution. It isn't bounded by other canisters, nor does it penalize them.
  • UX is improved: long-spanning messages are executed as fast as the hardware can process them, so response times for end users improve drastically.
  • Long-spanning messages only prevent checkpointing of individual canisters instead of the entire subnet. While still not ideal, the number of instructions a message must execute to make this happen would be orders of magnitude higher, and it is still less disruptive.

A per-canister certification model would also make it easier to load-balance the network by moving single canisters around, which in some cases is preferable to subnet splitting, and possibly even opens up the possibility of an adjustable replication factor in the future.
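
To visualize it, here is a hypothetical data model for what would be certified per canister under this scheme; the fields are my own guesses, not an existing IC structure:

```rust
/// Hypothetical per-canister certification record.
struct CanisterCertification {
    canister_id: Vec<u8>,
    height: u64,                  // this canister's own height, not the subnet's
    state_hash: [u8; 32],         // hash over this canister's state alone
    threshold_signature: Vec<u8>, // assembled from per-canister finalization shares
}
```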

4 Likes

In one of my discussions, if I remember correctly, the idea was thrown around that a CU could wait for multiple CUs to post a message before accepting it. Even if the team did not say that, could this be a way of performing the verification?

My discussion seemed to paint AO in a very flexible light, where a threshold of correct outputs could be waited on before trusting. Chainkey might help with the verification latency and complexity, but this could be done with some registry of public keys, or just by asking for the public key and having the right to slash, could it not?

1 Like

[I will be trying my best here to stay focused and dispassionate, so please excuse any accidental slip-up. I’m a software engineer and do not have any PR training. Or inclination.]

There are literally very simple, straightforward solutions to most of the points raised here.

E.g. you can have a single-replica subnet (or 4 replicas in the same data center, if you want); it is literally one NNS proposal away. With some tweaks to Consensus timing, it should give you amazing latency. You will just have to cross your fingers and hope that no one cares enough to hack into your one replica, and that the data center doesn't go up in smoke.

Similarly, you could set arbitrarily high instruction limits. Or only checkpoint once a day; then you could have DTS executions spanning hours. Just hope there are fewer than 4 concurrent DTS executions at once, and that enough replicas make it to the end of the day, so you won't have to wait an extra day or two to finally get that checkpoint.

Or make an NNS motion proposal to reduce fees 10x. Again, trivial.

The point I’m trying to make in my stunted, patronizing way (and you have my sincere apologies for that) is that virtually all of the limits currently in place are the result of some compromise: fees vs tokenomics; instruction limits vs liveness; latency vs availability / tamper resistance; and so on.

There are also cases where the limitations are due to lack of resources and/or prioritization. E.g. we have some pretty clever ideas for cheap, scalable storage, side-channel computation and huge payload limits, all-in-one; but before piling on more weight, we need a more scalable persistence layer; and messaging model; and more efficient certification; and lower DSM overhead.

In some cases, we may not prioritize the right things. And in many cases we may not have the right solution at all (I know I don’t). So we very much appreciate constructive feedback.

I just don't find that more of everything, with lower latency, at lower cost, all at once, is realistic, even with infinite engineering resources behind it. You must make trade-offs, and some of them are not as obvious as the ones I listed above. One tiny example: it would be trivial to set up a small-replication or geographically concentrated subnet. But how much can other subnets trust it (e.g. not to mint cycles out of thin air)? How much can users trust it? How does a user even know whether they're interacting with a canister running on a 40-replica subnet vs. on a 1-replica subnet? If they somehow can tell they're interacting with a DEX running on a 40x subnet, how do they know its backend isn't running on a single-replica subnet? And whatever solution you come up with to address all of that, will it be worth the huge added complexity to the protocol?

I've seen some pretty clever solutions to specific issues proposed in this thread, many of which I'm not ashamed to admit I would have never thought of myself. They usually miss some aspects that would make them difficult or impossible to implement as suggested, but much of that is because it's really hard to communicate the full picture across this medium (and I know I'm definitely not putting in nearly as much effort in this direction as others), so those proposing them are not as neck-deep in the protocol and its implementation as I am. I only know many of these constraints/limitations myself because I came up with similar ideas, brought them up, and someone kindly pointed out to me what the issue was.

So I'm afraid I don't have a good answer to many of the points made here. I definitely don't want to claim any authority or higher standing than anyone contributing to this thread. But I do know that, for the most part, there are no simple solutions. Whenever you make something look simple, you have to pay for it elsewhere. And it's often not obvious until long after.

I'm happy to talk specific ideas or solutions. And likely so are others. But I don't think this thread is the right place; they would just drown in the flood of messages. So if you do have a specific idea that you think is worth discussing, and don't mind me shooting it down and picking through the rubble, please start a new thread and ping me. We'll see where it goes.

40 Likes

A lot of the purpose of me doing all of this is to gather enough support across enough people across DFINITY and outside of it, to get these major issues prioritized.

I hope the decision-makers will take all of this feedback and prioritize finding revolutionary solutions to these problems, as I know they are non-trivial, and some may be impossible with current consensus technologies, without more major trade-offs.

6 Likes

Just one clarification about instruction limits: it's the limit for a single message. It doesn't limit how many instructions a task/endpoint can perform: you can chunk up the computation into several async calls, and the computation can span as many rounds as needed without hitting the per-message instruction limit (see the sketch below). Of course, we need better language/runtime support to make this idea more end-user friendly, but I don't think the instruction limit fundamentally limits how much computation we can perform on the IC.
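
A minimal sketch of that chunking pattern in Rust (the method name, chunk size, and helpers are mine, and the exact ic_cdk API varies by version):

```rust
use ic_cdk::api::call::notify;

const CHUNK: usize = 10_000; // items per message; tune to stay under the limit

fn work_len() -> usize { 1_000_000 }   // stand-in: total items to process
fn process_item(_i: usize) { /* stand-in: the actual per-item computation */ }

#[ic_cdk::update]
fn process_all(offset: usize) {
    let end = work_len().min(offset + CHUNK);
    for i in offset..end {
        process_item(i); // bounded work per message
    }
    if end < work_len() {
        // Fire-and-forget self-call: the continuation arrives as a fresh
        // message in a later round, so no single message hits the limit.
        let _ = notify(ic_cdk::id(), "process_all", (end,));
    }
}
```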

17 Likes

This is a bit stale after Paul’s amazing post, but I’ll post it for the record:

This is a response to @lastmjs ’s thoughtful post on DFINITY ’s Internet Computer and the comparison with #ao. (https://x.com/lastmjs/status/1766471613594083348?s=20)

tldr: The IC is awesome, DFINITY has built something amazing, it interops with ao, let's go build.

:point_down: (see below)

I've also spent the last week or so investigating ao, to the extent that I wrote what is probably the first ANS-104-compliant data item signed with a trustless t-ecdsa key and broadcast from the IC. This is a precursor to being able to run the mu, su, and cu components of ao on the IC. (https://x.com/afat/status/1765820285700247707?s=20)

I've spent the last 3 years building on the IC for myself, for Origyn (we build a super NFT for robust, permissionable, data-rich certificates of authenticity), and shepherding a number of devs through development on the IC through ICDevs.org. The 3 years before that, I worked alongside and sometimes with DFINITY in anticipation of their launch, with a focus on how the IC could be used to enable web3 for the enterprise. Now with @Pan_Industrial we're making that a reality.

I've had the pleasure of watching @lastmjs do some amazing things on the IC, including bringing TypeScript to the IC. In 2019, when I found out that JavaScript, the language of the Internet, despite all its flaws, was not going to be an initial option for building smart contracts on the IC, I was a bit disheartened and felt a huge opportunity was being left on the table for initial adoption. Jordan has almost singlehandedly closed that gap. We do differ a bit on what we think the IC will ultimately be best for, which you can read more about in this thread if that interests you: I just hit the instruction limit! - #9 by skilesare. The tldr is that Jordan and I are hacking away at different approaches to the IC, both of which I think are necessary for the IC to have a maximum surface area for success. This thread will push my agenda a bit unapologetically, as I think the best way for people in the ecosystem to grow and find new ideas is to continually discuss and reevaluate our assumptions.

To the ao engineers that will read this and cuss me for my naive understanding: please correct where I've made poor assumptions or have a wrong understanding. I know what a feat it is to actually ship, and you all should be incredibly proud of what you've accomplished up to this point. (Same for my friends at DFINITY, where I'm still learning and not much of an expert on the actual replica.)

The first consideration to mention is that, at this point, ao is a protocol with a single reference implementation that has limited security guarantees. The IC is a functional system that has been securing $X B worth of value for almost 3 years. Currently, ao is aspirationally sophisticated but functionally inferior in almost every way. Not unexpected, as it launched a week and a half ago. Set proper expectations if you want to evaluate the two. It will be enlightening to check back in three years and see where each is.

What is the best we can hope for if we are trying to get computers across the globe to reach consensus about calculations over previously agreed-upon state? Likely it is available state (s1) + transform (t) + zero-knowledge proof of correctness (zk) = state (s2). I am unfamiliar with any solution outside of this that could deliver distributed computation faster and with more finality. The crux of a debate around ao vs. the IC is that this optimal solution is inside the bounds of the declared ao protocol. It does not exist yet, and it is likely that if it did, the zkWASM prover would force latency far above that of a parallel ao implementation that made the same trade-off as the IC. That trade-off is that, according to the proof of the IC (see Achieving Consensus on the Internet Computer | by DFINITY | The Internet Computer Review | Medium), we can assume correctness given 2/3 honest nodes. There are likely more optimal trade-offs than the ones the IC makes, but the selected trade-off and other combinations can all exist under the ao protocol definition. Therefore it just really isn't appropriate to compare all possible ao implementations to one specific implementation (the IC's selection on this graph).

There are other graphs with different axes that are just as important, but I think this frames the debate the best. I'll propose a theory later as to why I think the IC and EVM-on-IC might be able to shift right out of the ao space, but for now let's assume that all these networks sit inside the possible ao protocol network space.

Now that we have a frame I’ll get back to Jordan’s comments.

My concern with the approach Jordan has taken is that his goal is an "unbounded scalable virtual machine." I believe that, barring a quantum revolution, this is not achievable without making a concession somewhere along the continuum of latency, storage, liveness, or security. Trying to shove either of these systems into that standard will result in disappointment.

What the internet computer has achieved is an "unbounded and scalable strata for actor-based computation." Ironically, this is also what ao seems to be claiming to be, with its focus on the actor-based model of computation. Classic web2 does not fit into that actor paradigm. It has evolved into a microservices architecture where each service has unrestricted, fast access to unbounded data stores. We just won't get this on the IC or ao. Your actors will have to be selective about what data they compute over. You scale by replicating these actors across 'subnets' where each actor operates independently. This takes a fundamentally different data storage and query architecture than classic web2. It isn't more efficient (and by definition likely can't be), but it is the pattern the world uses to create anti-fragile, sovereign entities that burst into the diverse universe we have around us. The trade-off of moving away from the fast, microservice architecture backed by centralized monolithic data structures is a fundamental shift in ownership, the movement of value to the edges of the network, and disruption of centralizing entities that too often take advantage of their power to extract excessive value. Actors are slower from a latency perspective, but better from a parallelization perspective. An EVM has nice atomic properties, but it is a monolith, and the person holding the keys to scheduling the transactions gets the same centralized kind of power (MEV extraction) that we're pushing back against in web2.

The entire current L2 diaspora is much less about decentralization than it is about value extractors trying to convince you that you should let them order your transactions, extract the MEV, and fund their investors' kids' trust funds. Actor networks do their own ordering, run their own markets, and keep their value at the edge of the network.

The belabored point here, which I bring up often when Jordan pushes on the tech to try to deploy web2 tech to the IC, is that, first, I'm rooting for him, because I too want many of those conveniences; but second, the ability to do those web2 things should be viewed as a gateway drug and not an endgame. Because of the speed of light, the cost to produce computational proofs, and developers' ability to always push a tech just a bit further than it was designed to support, no platform will ever out-compete web2 at web2. You just can't get to consensus and a trustless state as fast as they can get proprietary state. As long as we try to make the IC about replacing existing things we don't like with something on the IC with THE SAME ARCHITECTURE, we're doomed to fail. The future for ao or the IC is all blue oceans, and if we manage to replace some old web2 stuff, it will be alien to how we see those services today. Twitter on the IC just isn't going to be better than Twitter at twittering. Ever. And GraphQL on ao will never be as fast, performant, or cheap as GraphQL running on web2 infrastructure.

I'll now go through the deficiencies of the IC he mentioned and discuss how many of those are features and not bugs for the long-term viability of not just the IC, but ao, and any system ultimately trying to bring web3 values to a broad audience. And why that is ok.

  1. Instruction limit of a few seconds

This is an actor-model issue, and I don't see how ao would handle it any differently than the IC. Because your function is s1 + t + (c | zk) = s2, your actors are generally single-threaded (c being consensus). Maybe there is a multithreaded actor model out there, but if not, even with a single node where zk provides the correctness, you have to process the messages in order, so you can't get to s4 without going through s1, s2, and s3 somewhere in the network. If a machine needs s5, it has to wait for someone to deliver the proof of s1, s2, and s3. If s2 takes 5 minutes, your actor isn't doing anything else for 5 minutes. If you try to do something in s5 that s2 changed out from underneath you, your s5 will trap and your state won't change (i.e., at the end of s2 you send tokens, so they aren't there anymore when your s5 transaction gets there).

How do you counteract this? The same way databases do this kind of thing. You lock what you expect to have rights to change and process over many rounds (a minimal sketch follows below). You can do this today with tools like ICRC-48 (icrc48.mo/standard.md at main · PanIndustrial-Org/icrc48.mo · GitHub) that split tasks across rounds so that other processes can run (as long as they don't run into your locks). ao doesn't get the benefit of the doubt here that they'll just be able to pull something out that makes them magically able to parallelize many transactions. The spec specifies that transactions must be processed in the order assigned by the su. In fact, I think a cu that was trying to do this would have to issue itself messages routed through the mu/su to split this work into chunks. If these messages have to settle to arweave before they can be properly sequenced and consumed by the cu, then it is likely fundamentally impossible for ao to outperform the IC without resorting to significantly superior hardware. Thus, with the su built into the IC via crypto math and the insistence on high machine specs, I predict that many IC applications exist outside the possible ao network space due to the internet latency of message passing. If the cu and the su are the same process and the cu looks ahead to see if it can continue uninterrupted, then maybe… but then you have a single-node cu/su, and I'm not sure how you give that any kind of security and trust guarantees (maybe zk?).

Unless the ao guys weigh in differently here, I don't think it is as simple as "CUs can run whatever computations they are capable of running. If a process specifies that it needs a certain amount of computation, only CUs willing to perform that should pick up that process." The CUs are still bound by the ordering of the su and must finish message 4 before moving to message 5. Your scalable global computer is blocked on a thread until it finishes and can't serve other requests. This is going to be new software built from the ground up whether you are using ao or the IC. Unfortunately, no free lunch here when trying to create a new Internet with web3 properties.
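
To make the locking pattern above concrete, a minimal sketch (the names are mine, not ICRC-48's actual interface): mark the state your multi-round task will mutate, so concurrent messages trap instead of interleaving with it:

```rust
use std::cell::RefCell;
use std::collections::HashSet;

thread_local! {
    // Accounts currently locked by a long-running, multi-round task.
    static LOCKS: RefCell<HashSet<u64>> = RefCell::new(HashSet::new());
}

fn acquire(account: u64) -> bool {
    LOCKS.with(|l| l.borrow_mut().insert(account))
}

fn release(account: u64) {
    LOCKS.with(|l| { l.borrow_mut().remove(&account); })
}

fn transfer(from: u64) {
    // Refuse to touch an account a multi-round task has locked.
    assert!(acquire(from), "account busy: locked by a long-running task");
    // ... do the work, possibly across several self-calls ...
    release(from);
}
```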

  2. Message limit of 2/3 MiB
  3. High latencies

These two issues are a speed-of-light issue when you want nodes across the globe to agree on consensus. Even with zk, you'll need to ship the proof and the state diff many miles, which leads to latency. Maybe a pure zk solution where you only need one node to agree could get around this? But if you want your data as part of the proof, it takes time to process and distribute.

There are cryptographic schemes (file hash) that let you skip some of this, and none of them are precluded from the IC. You can just as easily ship the IC an IPFS/arweave hash and then serve that file from a gateway to avoid uploading the whole thing. The key here is that if you do that, you can't compute over the content. ao will suffer from the same thing: unless you give it a file hash in your message, the cu doesn't have access to your file to process over it. According to the ao spec, if you do give it a file hash, it has to load in the file before it can process the message. I can't imagine this is quick unless the cu keeps all possible files cached. When you upload a file to the IC, you're putting every byte through consensus. This is unnecessary for most cases. I love having the bytes in my canister because it opens things up for composable third-party services down the road, but we aren't quite there yet.

Most query latency on the IC is a function of certifying that the data coming out of the query was agreed upon by the smart contract at some point. This system is still evolving, but as far as I can tell, the ao paper makes no suggestion about how the cu is going to certify a response to a requestor. Each cu is going to have to roll its own way to tell a client that it isn't lying. There will either be a signature of some kind, a slashing mechanism (which theoretically requires significant latency, on the order of days in the optimism world, before you can rely on the result), or a zk proof of the calculation output by the cu, which will add significant processing time.

The IC does need better patterns for serving data and information that doesn’t need this certification and agreement. But that is more of a UX and architecture issue than a failing of the IC. You’re getting a guarantee that the contract ran correctly and that all nodes agree that the query you just got back in <400ms is valid. If you don’t need that, route around it, but I’m not sure, if you do need that, how you get it with less latency. More processing power will drive it down a bit, but eventually, you run into the speed of light.

As an aside, the architectural solution for uploading large files is to upload each chunk in parallel to different subnets (sketched below). This helps with serving the chunks as well, if you use a client-side aggregator. The solution is UX + architecture, built as a new, ground-up design.
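
A sketch of that parallel-upload architecture (`upload_chunk` is a stand-in for whatever agent call your client actually makes; the chunk size and routing are illustrative):

```rust
use futures::future::join_all;

/// Stand-in for the client's real agent call to a storage canister
/// reachable at `target` (ideally each target is on a different subnet).
async fn upload_chunk(_target: usize, _index: usize, _chunk: Vec<u8>) {
    // e.g. an update call to the storage canister holding this chunk
}

async fn upload_parallel(file: Vec<u8>, targets: usize) {
    let chunk_size = 1 << 20; // ~1 MiB per chunk, under the message size limit
    let futs = file
        .chunks(chunk_size)
        .enumerate()
        .map(|(i, c)| upload_chunk(i % targets, i, c.to_vec()));
    join_all(futs).await; // chunks travel to different subnets concurrently
}
```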

  4. High costs

Arweave's price for data uploads is significantly higher than the IC's (ArDrive quotes $3-8/GB, but I think I saw a tweet mentioning $30 the other day). You theoretically get 200 years of storage for that, and items under 100kB are free (most ao messages). Thankfully, storage tends to follow a Moore's-law curve, and this will decline over time. I know storage subnets and static storage are also being considered for the IC, which may help bring the cost down. I guess it depends on your perspective: with costs of around $1/GB to upload and $5/GB/year, the IC is so comically cheaper than any other web3 platform out there that has compute over content that I've always considered it absurdly low cost. You probably don't want to try to build the next YouTube, but it is pretty cheap to host a few videos for a business.

  5. Rigid network architecture

In 2019 I saw and had discussions with DFINITY about plans for dial-up, dial-down subnets. With the movement on the Swiss-only subnet and discussions around AI on the IC, I think it is only a matter of time before we see much more flexibility here. In the meantime, the 2 choices are likely an ergonomic and UX issue while the network is growing. I don't think there is actually anything keeping you from compiling a 4-node subnet via the NNS if you can get the votes. With UTOPIA and the 'badlands' discussions a couple of years ago, there is obviously flexibility available. I wholeheartedly agree that a clearer path illuminated on how to get from here to there would be great, but I also recognize that is very hard to do when these markets change as much as they do.

I will note that while "It's very permission-less and flexible, there is no need for a central authority like the NNS DAO" sounds appealing, it begs the question: if a DAO is not providing assurances that the network can meet the demands of users, then who is? The largest complaint in web3 has been usability, due to wildly volatile gas/transaction fees. Heterogeneous is great as long as the interface to the service consumer is reliable, consistent, and performs as expected. ao is going to run headfirst into this problem with every solution deployed, while the IC has the answer of reverse gas already deployed.

Going back to our diagram, there are a number of architectures that ao could be built on, and the IC is in that blue shaded area as one that would work inside the ao universe. In fact, I'm not betting on one emerging soon that works technically better than the IC for ao. That doesn't mean that ao won't become the standard, because, as we've all seen, this industry is insanely memetic, and arweave and ao seem to have pretty good market traction and are viewed in a pretty positive light. Basically, if ao does become the dominant world-computer protocol, the IC will likely be the best way to deploy on that protocol, because all the batteries to reach trustless execution of a mu, su, and cu are included.

  6. Centralizing DAO governance

Jordan’s concerns about the NNS are valid at the present. I’ve thought a lot about how things are set up and I’m firmly in the ‘I trust DFINITY for now camp,’ but I see the need for eventual decentralization and understand that this currently is driving projects away from the platform.

Fortunately, this is a political issue and not a technical one, and one can currently draw a line from today to a universe where DFINITY does not have majority voting power on anything. I wish that pathway were better illuminated today, but I also recognize that the community is NOT ready to take on all the things that go on in the NNS. We should start getting ready now, though. I think it will take at least one, possibly two, other organizations as well funded as DFINITY to step up and become experts in the network topology and replica development (to the point of probably having replicas in other languages to diversify away code bugs), and to have enough operating capital to have full-time employees focused on these things. We seem a ways from these entities emerging. Enough TVL and enough value delivery, and they will emerge. This leads me out of the IC's deficiencies and into its advantages.

One of my biggest disappointments and frustrations over the last three years is how much of this debate has ended up being memetic. Of course I should have known this, but it still surprised me. Attention economics end up being as important as tokenomics. For whatever reasons and with hindsight we can see that the memes the IC launched with didn’t hit the market. It probably isn’t helpful to look backwards too much.

We are extraordinarily fortunate that DFINITY was funded to the extent it was, because despite memetic whiffs (which are damn hard to get right) on the "let's replace aws" narrative and the "nodes on the corporate cloud are dangerous" one, which no one seemed to care about despite it being a serious issue, they have shipped and shipped and shipped. Http_outcalls, t-ecdsa, improved key sharing, and on and on.

We find ourselves at a point where the Internet Computer has an immense amount of superior tech ready to be deployed to a web3 world. One only needs to spend a few minutes wandering around the floor of EthDenver to realize how many entire enterprises and L2s have been launched to attempt to solve one of the 50 problems that the IC has already solved to a threshold of sufficient security and comparatively unreal performance. If we are being memetic, why aren't we leaning into these memes? The latest meme seems to be AI. One day the IC may provide the most secure and trustworthy AI solution, and everyone is excited about it, but in the next 5 years I doubt anyone is going to out-AI OpenAI, Google, Amazon, Tesla, and Apple. We have a long way to go for public proof on that meme.

Right now we have built the best btc bridge with sufficient security. We have most of the best ETH L2 in place with sufficient security. We can build the best ao mu, su, cu pipeline with sufficient security, despite testnet. Let's do that and launch those memes. Seven-day optimistic withdrawal? Here is a 1-minute solution. $100 gas for a zk roll-up? Here is a reverse-gas model that costs a couple of cents. You have a data availability problem? Here is a compute + data solution that solves your problem with self-funding of eternal data storage. Need a crypto-secure scheduler? Ours just works, up to 500 tx/second per subnet.

Yes, sufficient security isn't sufficient in the long run, and we need better decentralization. We need an illuminated pathway to get there (the market seems to be fairly forgiving here, as many of the L2s operate on laughable multisig schemes). It is likely time to say "look… due to VC, funding, and market pressure we had to ship before the vision was complete… we aren't finished yet… here is the pathway, and as soon as we're finished with the tech we'll activate it, but we are going to build this tech, and if you need beyond-sufficient decentralization before that, come and take it (an invitation, not a threat)…". If "they" can't say this for some regulatory or securities reason, then we need to scream it from the rooftops for them. At some point the rubber will meet the road, and it will either happen gracefully, or it won't. Yes, for now there will be a segment that will demand pure permissionlessness from the jump, and this is going to be a non-starter for them. They have other avenues.

What if, when the rubber meets the road and we need to move beyond sufficient security, the foundation doesn't go where you want to go? Say in 2030 the new King of Switzerland and Lesser Europe is holding a literal gun to Dom's head and demanding the figurative keys to the IC be handed over to a {insert your second least favorite form of government} government. (Exaggerated scenario, because DFINITY has shown a remarkable amount of good faith to this point, even if they've had to navigate some regulatory, political, and memetic hoops with a bit less grace than would have been preferred.) What do we do? How many people is this scenario keeping from leveraging the insane tech the IC offers today?

The answer likely lies in architecture and proactive game-theory planning. External t-ecdsa-based UTOPIAs enabling social recovery and forkability? Contracts that regularly checkpoint and publicly stash state in forkable schemes? Co-execution between ao and the IC? All kinds of options are on the table, but I hate to disappoint the community… this is 100% on us. DFINITY is finishing the base layer. In the meantime, we're underfunded, unorganized, in over our heads, and completely reliant on ourselves to figure it out. Easy-peasy. LFG.

I’m going to keep encouraging Jordan to push on the IC for as much performance as we can squeeze out of physics. I’m going to keep focusing on actor-based infrastructure to drive a web3+ world that catapults humanity forward. We all should look to better understand the decisions and priorities DFINITY is pushing as they’ve done a damn good job on the platform so far. We all will only move beyond the web2 world together if we marshal 10x value platforms for the users. If ao can help, let’s use it. No one is better positioned to accelerate what they are trying to accomplish because we already have 70% of it, and it’s working. In prod. :rocket::rocket::rocket:

#TICADT - The Internet Computer Already Does That

Edit: fixed an image

47 Likes

The IC is used to process decentralized, trustless (verifiable) state data, and as @PaulLiu points out, verification across decentralized nodes is what limits computing efficiency more than anything else.

What @lastmjs is actually proposing is a non-verifiable, unlimited cloud computing resource, which is not the goal of the IC.

But I believe it can be implemented on top of the IC as an L2 solution - the IC would govern this unlimited, growing L2 cloud computing network, which hosts applications that do not require decentralized verification.

6 Likes

One tiny example: it would be trivial to set up a small-replication or geographically concentrated subnet. But how much can other subnets trust it (e.g. not to mint cycles out of thin air)? How much can users trust it? How does a user even know whether they're interacting with a canister running on a 40-replica subnet vs. on a 1-replica subnet? If they somehow can tell they're interacting with a DEX running on a 40x subnet, how do they know its backend isn't running on a single-replica subnet? And whatever solution you come up with to address all of that, will it be worth the huge added complexity to the protocol?

I completely get this. The knock-on risks to other subnets / core ICP processes are certainly something to consider.

I don't, however, buy into the 'user' argument. Given the flexibility of ICP, a 'bad' dev can do plenty of wrong things on even the most replicated subnet. For example, the frontend might just make a fetch request to a web2 server, which could be doing anything really. The DEX backend might have methods that aren't in the DID file, allowing admins to do 'secret' stuff. The devs might just decide to push an upgrade which rugs the whole app before anyone knows what's hit them. None of these issues are linked to the number of nodes on the subnet. Bad devs will be bad regardless.

Knowing what code a canister is running is really relevant here, and is something I've mentioned before as being on my wishlist. I'd love to go onto the ICP dashboard and have the canister code shown like the DID file is. I appreciate this is not a 2-minute job :slight_smile:

6 Likes

I’m particularly intrigued by your perspective on navigating instruction limits for complex computations, such as AI model inferences. Your clarification that the instruction limit applies per message and not to the entirety of a task’s computation opens a doorway to potentially more efficient AI model execution on the IC, despite existing constraints.

Considering the specific challenge of AI inference, which often requires significant computational resources—for example, executing a single transformer layer for a modest AI model—how do you envision effectively utilizing the IC’s architecture to support more complex AI applications? Given the constraints you’ve outlined and the need for dynamic data partitioning to manage instruction limits effectively, could you elaborate on strategies for minimizing added latency while ensuring the computation remains feasible and efficient? Specifically, how might we approach the design of such systems to leverage async calls and chunked computation in a way that aligns with the IC’s unique infrastructure, and what developments in language/runtime support do you foresee as necessary to make these solutions more accessible to end users?

Furthermore, how do these strategies reconcile with the need to maintain performance and reliability, especially when considering the additional layers of complexity introduced by managing state across potentially fragmented computation tasks?

7 Likes