Hashed Block Payloads

This thread is to discuss a very nascent idea to significantly increase the data and computation throughput of ICP. This idea may help to resolve concerns around instruction and message size limits.

Instruction and message size limits are two of the crucial protocol weaknesses pointed out in this thread: Let's solve these crucial protocol weaknesses

These weaknesses preclude many use cases from being implemented on ICP. A solution would hopefully be a great boon to the protocol.

Discussion was started in this thread at this comment: Let's solve these crucial protocol weaknesses - #44 by free

7 Likes

Further context can be found in the following comments, we will continue the discussion in this thread:

Some have voiced concerns that this would make ICP less “on-chain”. @free is very concerned with the technical feasibility, and there are many unknowns and some unsolved problems.

My last comment was introducing the idea of data availability sampling as found in the Ethereum ecosystem (EIP-4844, dank sharding, EigenDA, Celestia) as a potential solution to allowing all replicas in a subnet to know that a large amount of data is available…something like that.
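The intuition behind data availability sampling can be made concrete with a back-of-the-envelope calculation. This is a generic sketch, not any specific DAS design (EIP-4844, Celestia, etc.): if a malicious block maker withholds some fraction of a blob's chunks, a replica that queries a few random chunks detects the withholding with probability that grows exponentially in the number of samples.

```python
def detection_probability(missing_fraction: float, num_samples: int) -> float:
    """Probability that at least one of `num_samples` uniformly random
    chunk queries (sampling with replacement) hits a withheld chunk.
    Purely illustrative; real DAS schemes combine this with erasure
    coding so that withholding even a small fraction breaks decoding."""
    return 1.0 - (1.0 - missing_fraction) ** num_samples
```

With erasure coding, an adversary must withhold a large fraction of chunks (e.g. half) to make the data unrecoverable, so even ~30 random samples per replica make undetected withholding astronomically unlikely.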

Let’s continue to discuss.

1 Like

Hi Jordan
You’ll be happy to hear that we started working on this a short while ago. We’re currently designing an approach that increases throughput by putting references to large ingress messages into blocks, instead of the full ingress messages, without introducing security holes. Needless to say, we want to provide value quickly, with as simple an implementation as possible. As a first step we aim to keep the same message/block limits, which should already increase throughput significantly, and then later explore allowing larger messages. We’ll report on our approach when we’ve made a bit more progress and have something to present.
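To illustrate the general idea (this is a minimal sketch, not DFINITY's actual design; all names are made up): a block can carry a fixed-size hash reference plus a declared size instead of the message bytes, and any replica that fetches the bytes out of band can verify them against the reference.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PayloadRef:
    """Stand-in for an ingress message carried by reference: the block
    holds only this, while the full bytes are distributed to replicas
    out of band (hypothetical scheme for illustration)."""
    digest: bytes  # SHA-256 of the full message
    size: int      # declared size, so block/message limits can still be enforced

def make_ref(message: bytes) -> PayloadRef:
    return PayloadRef(hashlib.sha256(message).digest(), len(message))

def verify(ref: PayloadRef, message: bytes) -> bool:
    """A replica that fetched the bytes checks them against the reference."""
    return len(message) == ref.size and hashlib.sha256(message).digest() == ref.digest
```

The reference is 32 bytes regardless of message size, which is why keeping the existing block limits can still raise effective throughput: the block space cost of a message no longer scales with its length.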

15 Likes

I am happy to hear this! Very exciting and good luck!

To clarify, will this allow increasing the instruction limit in addition to the message size limit? Or is it focused only on the message size limit?

And will the incoming and outgoing payload of a message be increased?

Using references instead of the full messages can be seen as a basis for explorations into increasing some bounds.

Increasing the instruction limit is an orthogonal problem, imho (and I’m definitely not an expert on that matter), in particular I don’t think anyone would like to have longer rounds…

In general, the limits for ingress messages and their results as well as many other bounds have dependencies in the upper layers, since they influence what needs to be stored, execution time etc. Therefore it will require careful investigations to determine the trade-offs involved and find the right thresholds. At this point in time I cannot promise when and what improvements are ahead, but rest assured, many people at DFINITY care about and work on such restrictions.

3 Likes

I just replied wrt the instruction limit in the other thread.

2 Likes

The approach I was suggesting for longer computation does not involve increasing the instruction limit. To some extent it’s also orthogonal to putting references instead of payloads into blocks.

Similar to HTTP outcalls, every replica is instructed by a canister (via the management canister) to do some long-running computation independently, outside of the Deterministic State Machine. Assuming said computation is deterministic (which may require using canister code as opposed to arbitrary third-party binaries), the replicas reach consensus on the result and include it in a block.

Where references come in is that they would allow for arbitrarily large results (e.g. training an AI model). Although based on your earlier message (and my own idle thoughts on the matter) enforcing the same payload limits on references is a lot easier to achieve. (My own line of thinking on the matter was that references could be used under ideal conditions, e.g. when all replicas are healthy, making progress and have the payload; and you could always fall back on including the payloads into blocks when conditions are less than ideal. Which would make it harder to deal with larger payloads.)

6 Likes

@free I thought your idea would also increase instruction limits as you (seemingly) allude to in these quotes.

But now in this thread it seems like that’s not what would happen? Can you help me to understand?

Increasing the instructions limits and being able to run arbitrarily heavy and long computations are related but not exactly the same thing.

Increasing instruction limits would mean that you can run longer computations inside canisters. The idea brought up in this thread is more tailored towards other forms of computation outside of canisters (e.g. running a workload on a GPU) and once that computation is done and agreed upon (which can happen on the side while the replica keeps executing other (canister) messages) we can return the result somehow back to your canister.

For reasons that have been brought up in the other thread (and in the past), increasing instruction limits beyond the checkpoint boundaries is not an option in the foreseeable future (at least not an easy one that we can even imagine how to pull off).

4 Likes

I’m making an assumption that I’d like to verify. If we pass an ingress message via ref, the canister doing the execution still has to load the message into memory, and each additional byte uses some number of instructions. Is this correct? So there would still be some maximum dictated by the max execution limit?
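To make the question concrete, here is the arithmetic it implies. All numbers below are placeholders, not actual IC instruction pricing: if loading each byte of a payload into canister memory costs some number of instructions, the per-message instruction limit alone caps the usable message size, no matter how the bytes reached the replica.

```python
def max_message_bytes(instruction_limit: int,
                      per_byte_cost: int,
                      fixed_overhead: int = 0) -> int:
    """Upper bound on payload size implied purely by copy cost.
    Illustrative only: per_byte_cost and fixed_overhead are
    hypothetical parameters, not real IC cost constants."""
    budget = instruction_limit - fixed_overhead
    return max(0, budget // per_byte_cost)
```

For example, a 5B-instruction budget at a hypothetical 1,000 instructions per byte would cap messages at 5 MB even if blocks could reference arbitrarily large blobs.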

I think the analogy with the HTTP outcalls is really on point (if I understand correctly). Basically, in the same way that a canister “instructs” the replicas to send a request to a web2 API and then agree on the response, a canister could instruct other “worker nodes” to execute arbitrary (deterministic) logic and then agree on the result.

I’m wondering if these worker nodes should be part of the IC (in the sense of sitting next to the replicas, forming some kind of node on steroids) or if they could be part of other networks that are specialized in particular tasks.

In the latter case the IC could leverage existing (and coming soon) networks that are focused on AI training, storage, long running computation or whatever else. Is this in conflict with the 100% on-chain principle? I guess it would still be fine if this “off-chain” computation is run on multiple workers and the result agreed upon by the canister. The assumption would still be that a certain amount of the workers is honest. Any thoughts on this?

3 Likes

If you wanted to delegate work to other machines, there would be no need for said machines to be replicated at all or to use consensus. I.e., you would be doing something very close to a regular HTTP outcall. When the response materialized, you could reach consensus on it either because it’s e.g. a block signed by some other blockchain, or possibly even based on something as simple as an SSL certificate of someone you trust.

If you want the kind of replication and trust model that a subnet provides, then the work would have to be done by the nodes running the replicas themselves. Could be something as simple as a very long running computation (minutes, an hour); or you could have a subnet made up of nodes with GPUs and each of them could train an AI model independently and then (assuming the training is deterministic) agree on the result (same as they agree on the response after an HTTP outcall).

4 Likes

Why is running a computation on multiple servers (that are not replicas) and then agreeing on the result different from running it on the replica nodes?

I see three options to extend the computational capabilities of the IC (with GPUs, extra storage, or whatever other resources needed):

  • replicas become “supernodes”
  • specialized subnets
  • deep integration with other specialized networks

The third option would enable the IC to take advantage of all the progress of the rest of the ecosystem. While I agree that the IC is years ahead from many points of view, it’s difficult to believe that it will always be best at everything.

So why can’t we delegate some specific work to other machines (that might do it “better”) while maintaining a “similar” trust model to the one of the subnets?

2 Likes

Because it’s a different trust model: would you trust Bitcoin or Ethereum if it ran on 13 anonymous nodes? Also, if this is a replicated computation outside of the IC, how can one even tell that it’s replicated (as opposed to e.g. one machine with 13 IP addresses)? So it would be very much an HTTP outcall: send a request to some machine (single or part of some other blockchain network); wait for a response; induct the response onto the subnet because you trust some signature on the response.

The machine you sent the request to may do all the work itself or be part of some consensus protocol that’s unrelated to the IC, but there is no way for the IC to tell what actually happened. You, as the author of the canister that made the HTTP outcall, may choose to trust the response (e.g. because it’s an Ethereum block or whatnot), but the IC itself has no way of deciding whether it should trust an arbitrary response from an arbitrary machine / network.

4 Likes

Yes I agree, it might be as simple as sending an HTTP outcall to another network.

I’m not suggesting we run the computation on n random servers (which might actually be a single one, as you said), but on another network which, even though it works differently from the IC and has a different trust model, provides some desired guarantees (as long as the assumptions of its model hold).

A canister integrating with such a network would, in order to operate as expected, depend not only on more than 2/3 of the IC replicas being honest, but also on the assumptions of the other network holding. But I don’t think this is a deal breaker. Indeed, the fact that the trust models of Bitcoin and Ethereum differ from the IC’s doesn’t mean we shouldn’t integrate with them.

So as long as another network provides a service with some desired properties (under some reasonable assumptions), why shouldn’t we leverage it to complement the capabilities of the IC? Am I missing something?

Anyways, sorry for going off the topic of the thread…

1 Like

I’m not saying we should not integrate with other networks. Just that that would very literally be an HTTP outcall. Something you can already do today (as long as you stay under a 2 MB payload limit). It can be tweaked / optimized to make a single outgoing request instead of 13; and accept a single response (instead of 13) that is verified using some other method (e.g. a signature from the other network), but that won’t change the fact it’s an HTTP outcall as opposed to an arbitrarily long computation running on the IC (canister code or something fancier).

4 Likes

Consensus already has a system of alternate block proposal from different block proposers, ranked randomly into “slots”, slot 0 being the highest priority block maker, slot 1 the first backup, etc.

Maybe this existing system can be leveraged to mitigate the problem you describe. For example, say only the slot 0 block maker is allowed to include blobs into its proposal, slot 1 and higher only make traditional block proposals without any blobs, where “blob” means data included by reference. And notaries ignore block proposals for which they don’t have all blobs (i.e. treat them as if the block proposal hadn’t arrived). That way the chain would continue in your scenario working with slot 1 or higher blocks until enough notaries have the blobs that appear in a slot 0 block.

In this way, our consensus proofs guarantee liveness.

It can of course happen that the chain continues with 2/3 of nodes and 1/3 never catches up but that has always been the case, blobs or not.
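The proposed rule can be sketched as a simple predicate. This is my illustration of the post above, with invented names rather than actual protocol terminology: only the rank-0 block maker may include blob references, and a notary treats a blob-carrying proposal whose blobs it lacks as if it had never arrived, so the chain can always fall back on blob-free backup proposals.

```python
def should_notarize(rank: int, carries_blobs: bool, notary_has_blobs: bool) -> bool:
    """Notarization rule sketched in the post above (hypothetical names):
    - only the rank-0 proposal may carry blob references;
    - a proposal with blobs the notary lacks is treated as not received."""
    if carries_blobs and rank > 0:
        return False  # backup (rank >= 1) proposals must be blob-free
    if carries_blobs and not notary_has_blobs:
        return False  # act as if the proposal never arrived
    return True
```

Under this rule a notary never blocks on missing data: it either has the rank-0 proposal's blobs, or it notarizes a later-rank blob-free proposal, which is what preserves liveness.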

3 Likes

I believe we did discuss this option too. The downsides are that (1) given enough blobs it’s unlikely for the rank 0 block to ever be accepted; (2) each payload must still fit into the block (or risk stalling indefinitely) and (3) it’s arguable whether one can describe it as graceful degradation (if canister developers start building apps with the expectation of tens to hundreds of MB/s, falling back on 4 MB/s is a pretty serious hit).

Also, if you do end up with 1/3 of the replicas behind (and the other 2/3 using significantly more than 4 MB/s just because they can) it may be quite hard to catch up.

I’m not dissing your idea, it’s quite good. But I do think it needs quite a bit of refinement to be useful in most practical scenarios.

2 Likes

I don’t agree. As soon as the IC starts relying on third parties for something as important as computation and storage, where’s the added value of the IC?!

Really, guys, I’m not sure whether you don’t care or just don’t see things from a marketing and added-value perspective. Many people in the community, and many investors, are here because of the ON CHAIN ideology. And now, because a few people are pushing for web2 computing power on the IC immediately, we are going to pivot and, instead of improving, start going backwards?

I really hope DFINITY team members are aligned with Dominic Williams and his vision and will achieve the impossible, no matter what. I don’t want you, DFINITY team members, to be pushed so hard by a (really small) group of developers into “solving” some limitations as soon as possible. Take your time; the revolutionary product is the one that wins in the end. If we do the same as every other decentralized computing project, we will end up on the same playing field, competing with too many projects.

“It is better to have a large part of a small market with growth potential ( crypto cloud, on chain storage, on chain computing, on chain zk proof verification) than enter on a market that is initially larger ( off chain compute, off chain decentralized storage) but with too much competition”
Sam Altman

2 Likes