Llama 3 8b is running on-chain!

Hey folks,

My interest in decentralizing data (and bringing a new model of data ownership) and distributing compute goes back years, and I had wanted to build a non-trivial app on the ICP for a while now.

I took a few days during the summer to learn a new language and experiment with implementing a real-world LLM in a distributed way on the ICP. Llama 3 8b inference is now running live on the ICP, including the tokenizer, which I think makes it the first fully on-chain 8b LLM. The inference computation itself runs through consensus as well, bringing a new level of guarantees with regard to the results.

The current implementation is live on mainnet, but it is not public given the number of cycles being burned and the latency.

As a product it’s not usable yet though!

  • The current implementation burns 0.168 TC per predicted token, which translates to about $260K / 1M tokens. Compare that with $15 / 1M output tokens for gpt-4o.
  • The result above was achieved without much optimization, and without SIMD for instance. During a matmul (which is where transformers spend most of their execution time), we currently spend 22 cycles per multiply/accumulate. We could hope to improve that to 2 cycles per multiply/accumulate, but that would still translate into $26K / 1M tokens. To achieve something usable and competitive, we will very likely need GPUs and to price the cycles they use close to zero.
  • Loading the weights into the blocks is very expensive and cost a total of 33 TC (F16-quantized tensors were used; on-chain computation uses float32). I think we will need to think a bit more about how we want to leverage data on the ICP, and probably develop the notion of 'data nodes', if we want the ICP to be competitive in data-centric markets (which include AI).
  • Last but not least, the latency is 200 seconds! I'm actively looking into why that is and will improve it soon.
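As a back-of-the-envelope sanity check on the cost figures above: a forward pass of an 8B-parameter transformer performs roughly one multiply/accumulate per weight per token, so the TC-per-token figure can be estimated from the cycles-per-MAC number. The USD rate per TC below (~$1.3) is my assumption, not a figure from this post:

```rust
/// Estimated TC (trillions of cycles) per generated token, assuming
/// roughly one multiply/accumulate per weight per token.
fn tc_per_token(params: f64, cycles_per_mac: f64) -> f64 {
    params * cycles_per_mac / 1.0e12
}

/// USD cost per 1M tokens at a given USD-per-TC rate (rate is an assumption).
fn usd_per_million_tokens(tc_per_tok: f64, usd_per_tc: f64) -> f64 {
    tc_per_tok * usd_per_tc * 1.0e6
}

fn main() {
    // 8B-parameter model, measured 22 cycles per multiply/accumulate.
    let t = tc_per_token(8.0e9, 22.0);
    // ~0.176 TC/token — in the same ballpark as the measured 0.168 TC/token.
    println!("{t:.3} TC/token");
    // At an assumed ~$1.3/TC this lands near the quoted $260K / 1M tokens.
    println!("${:.0} / 1M tokens", usd_per_million_tokens(t, 1.3));
    // At the hoped-for 2 cycles/MAC, the cost drops roughly 11x.
    println!("optimized: ${:.0} / 1M tokens",
             usd_per_million_tokens(tc_per_token(8.0e9, 2.0), 1.3));
}
```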

Regarding the implementation, I've developed a set of generic 'yblocks' that can perform different operations and connect to each other using inter-canister calls. The current model is distributed over 34 canisters. Feel free to check it out: GitHub - gip/yllama.oc and GitHub - gip/yllama.rs: Inference with Transformers in Pure Rust.
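Since the weights are shipped as F16 but computed in float32 (as noted above), and stable Rust has no native f16 type, pure-Rust inference needs to widen half-precision values by hand. A bit-level decode is one way to do it; this is my illustrative sketch, not code from the repos:

```rust
/// Decode one IEEE 754 half-precision (f16) value, given as raw bits, to f32.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1F) as u32;
    let frac = (bits & 0x3FF) as u32;

    let out = if exp == 0 {
        if frac == 0 {
            sign << 31 // signed zero
        } else {
            // Subnormal: renormalize the fraction into f32's normal range.
            let mut e: i32 = 127 - 15 + 1;
            let mut f = frac;
            while f & 0x400 == 0 {
                f <<= 1;
                e -= 1;
            }
            (sign << 31) | ((e as u32) << 23) | ((f & 0x3FF) << 13)
        }
    } else if exp == 0x1F {
        // Infinity or NaN: keep the payload, widen the exponent field.
        (sign << 31) | (0xFF << 23) | (frac << 13)
    } else {
        // Normal number: rebias the exponent from 15 (f16) to 127 (f32).
        (sign << 31) | ((exp + 127 - 15) << 23) | (frac << 13)
    };
    f32::from_bits(out)
}

fn main() {
    assert_eq!(f16_bits_to_f32(0x3C00), 1.0);            // one
    assert_eq!(f16_bits_to_f32(0xC000), -2.0);           // negative two
    assert_eq!(f16_bits_to_f32(0x0001), 2f32.powi(-24)); // smallest subnormal
    println!("f16 decode ok");
}
```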

Learnings and next steps

  • Manually handcrafting models was fun the first time, but it is not a very effective approach; optimizing compilers and tooling will have to be developed if we want the ICP to become an execution target for AI models
  • I’d love to open the implementation for more people to use but for that I would need a lot more cycles - if someone is willing to pitch in and help with cycles please let me know
  • The canister programming model is simple to reason about, and the tools and libraries (like ic_cdk) are easy enough to use
  • Memory management inside a canister is a little hard to understand. I wasn't able to use the stable-memory solutions: I hit crashes and couldn't find the right abstraction
  • It's cool to see inference being run through consensus; I think that could have real-world applications soon
  • My next focus should be on the tooling and on the data ownership piece, where crypto can really shine

Any questions, let me know.


One question to help me understand:

How fast/slow is it for a standard, simple prompt/response? Very curious.

Measured in seconds? Tens of seconds? More?


On a CPU with enough RAM (64GB), Llama 3 8b inference can do about 1 token / second at best.

On the ICP, we can go up to 2GB per canister, and partitioning the model across canisters was what I wanted to explore. The added overhead from inter-canister communication (canisters only exchange ~16KB of data per call, so that part is reasonable), from consensus, and from the fact that Wasm isn't optimized for this workload brings the latency to ~100s. So, as I said, it's not usable as such: except for batch use cases, waiting 100s for an answer is a no-go (even 10s is too much). Optimizing the implementation is definitely possible.
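To make the partitioning constraint concrete: with a 2GB cap per canister, the weights alone dictate a minimum shard count. Rough arithmetic only; the actual yblock split in the repo may differ:

```rust
/// Minimum number of canisters needed just to hold the weights,
/// given the storage size per parameter and a per-canister memory cap.
fn min_canisters(params: u64, bytes_per_param: u64, cap_bytes: u64) -> u64 {
    let total = params * bytes_per_param;
    (total + cap_bytes - 1) / cap_bytes // ceiling division
}

fn main() {
    let cap = 2u64 * 1024 * 1024 * 1024; // 2 GB per canister
    // F16 weights: 8e9 params * 2 bytes = 16 GB -> at least 8 canisters.
    println!("f16: {} canisters", min_canisters(8_000_000_000, 2, cap));
    // float32 in memory: 32 GB -> at least 15 canisters, before counting
    // activations and overhead, so a 34-canister split leaves headroom.
    println!("f32: {} canisters", min_canisters(8_000_000_000, 4, cap));
}
```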


Thank you so much for this and for presenting at the DeAI Technical Group.

A few questions:

  1. How would you get down to 2 cycles per multiply/accumulate?
  2. How are you spreading the model between 34 canisters? Can you pinpoint what the different canisters do? I noticed that each canister has the same .did file; I assume they differ, though.
  3. Would optimization or SIMD significantly improve performance?