My interest in decentralizing data (and bringing a new model of data ownership) and distributing compute goes back years, and I had been wanting to build a non-trivial app on the ICP for a while now.
I took a few days during the summer to learn a new language and experiment with implementing a real-world LLM in a distributed way on the ICP. Llama 3 8b inference is now running live on the ICP, including the tokenizer, which I think makes it the first fully on-chain 8b LLM. The inference itself also runs through consensus, which brings a new level of guarantees about the results.
The current implementation is live on mainnet, but is not public given the amount of cycles being burned and the latency.
As a product it’s not usable yet though!
The current implementation burns 0.168 TC per token predicted, which translates to $260K / 1M tokens. Compare with $15 / 1M output tokens for gpt-4o.
The result above was achieved without much optimization, and without SIMD for instance. During a matmul (which is where transformers spend most of their execution time), we currently spend 22 cycles per multiply/accumulate. We could hope to improve that to 2 cycles per multiply/accumulate, but that would still translate into $26K / 1M tokens. To achieve something usable and competitive, we will very likely need GPUs and to price the cycles they use close to zero.
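To make the multiply/accumulate figure concrete, here is a rough sketch (my own illustration, not the project's actual code) of the kind of dot-product inner loop a matmul reduces to, in scalar form and in a wasm simd128 form; the function names and layout are assumptions.

```rust
// Illustrative only: a matmul hot loop reduces to dot products like these.
// The scalar version is roughly what unoptimized wasm executes; the simd128
// version does four multiply/accumulates per iteration (requires building
// with `-C target-feature=+simd128`).

/// Scalar dot product: one multiply/accumulate per iteration.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..a.len() {
        acc += a[i] * b[i]; // bounds checks + no vectorization => many cycles per mul/acc
    }
    acc
}

/// Same dot product using wasm simd128 intrinsics: 4 multiply/accumulates per iteration.
#[cfg(target_arch = "wasm32")]
#[target_feature(enable = "simd128")]
unsafe fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;
    let n = a.len() / 4 * 4;
    let mut acc = f32x4_splat(0.0);
    for i in (0..n).step_by(4) {
        let va = v128_load(a.as_ptr().add(i) as *const v128);
        let vb = v128_load(b.as_ptr().add(i) as *const v128);
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);
    for i in n..a.len() {
        sum += a[i] * b[i]; // scalar tail for lengths not divisible by 4
    }
    sum
}
```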
Loading the weights into the blocks is very expensive: it cost a total of 33 TC (F16-quantized tensors were used; computation on-chain uses float32). I think we will need to think a little more about how we want to leverage data on the ICP, and probably develop the notion of ‘data nodes’, if we want the ICP to be competitive in data-centric markets (which include AI).
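For readers unfamiliar with how weights get into a canister: they have to be streamed in over many update calls, each well under the ~2MB message limit. The sketch below shows the general shape of such an upload endpoint, dequantizing F16 to f32 on arrival; the method name, the single-map layout, and the use of the `half` crate are my assumptions, not the project's actual interface.

```rust
// A minimal, assumed sketch of streaming F16 weight shards into a block canister.
use std::cell::RefCell;
use std::collections::HashMap;

use candid::{CandidType, Deserialize};
use half::f16;

thread_local! {
    // Tensors kept in the canister heap, stored as f32 for on-chain compute.
    static TENSORS: RefCell<HashMap<String, Vec<f32>>> = RefCell::new(HashMap::new());
}

#[derive(CandidType, Deserialize)]
struct WeightChunk {
    tensor: String,
    data_f16_le: Vec<u8>, // little-endian f16 payload, kept well below the ~2MB message limit
}

/// Append one chunk of an F16-quantized tensor, converting to f32 on arrival.
#[ic_cdk::update]
fn upload_chunk(chunk: WeightChunk) {
    let floats: Vec<f32> = chunk
        .data_f16_le
        .chunks_exact(2)
        .map(|b| f16::from_bits(u16::from_le_bytes([b[0], b[1]])).to_f32())
        .collect();
    TENSORS.with(|t| {
        t.borrow_mut()
            .entry(chunk.tensor)
            .or_default()
            .extend(floats)
    });
}
```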
Last but not least, the latency is 200 seconds! I’m actively looking into why that is and will improve it soon.
Handcrafting models manually was fun the first time, but it is not a very effective approach; optimizing compilers / tooling will have to be developed if we want the ICP to become an execution target for AI models.
I’d love to open up the implementation for more people to use, but for that I would need a lot more cycles - if someone is willing to pitch in and help with cycles, please let me know.
The canister programming model is simple to reason about, and the tools / libraries (like ic_cdk) are easy enough to use.
The memory management inside a canister is a little hard to understand; I wasn’t able to use the stable-memory solutions (crashes, and no access to the right abstraction).
It’s cool to see inference being run through consensus; I think that could have real-world applications soon.
My next focus should be on the tooling and on the data-ownership piece, where crypto can really shine.
On a CPU with enough RAM (64GB), Llama 3 8b inference will be able to do 1 token / second at best.
On ICP, we can go up to 2GB per canister, so partitioning the model was what I wanted to explore. The added overhead from the communication between canisters (canisters only exchange ~16KB of data, so that part is reasonable), from consensus, and from the fact that wasm isn’t optimized for this workload brings the latency to ~100s. So, as I said, it’s not usable as such: except for batch workloads, waiting 100s for an answer is a no-go (even 10s is too much). Optimizing the implementation is definitely possible.
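To illustrate what the partitioning looks like in practice, here is a hedged sketch of chaining the forward pass across block canisters; the method names and orchestration are my assumptions, not the project's actual API. A Llama 3 8b hidden state is 4096 f32 values, which is where the ~16KB per inter-canister message comes from.

```rust
// Assumed sketch of an orchestrator chaining block canisters for one token's forward pass.
use candid::Principal;

#[ic_cdk::update]
async fn forward(token_id: u32, block_canisters: Vec<Principal>) -> Vec<f32> {
    // Hypothetical: the first canister embeds the token into a 4096-float hidden state (~16KB).
    let (mut hidden,): (Vec<f32>,) = ic_cdk::call(block_canisters[0], "embed", (token_id,))
        .await
        .expect("embed call failed");

    // Each subsequent canister runs its share of transformer blocks and returns the
    // updated hidden state; only ~16KB crosses each canister boundary, but every hop
    // adds inter-canister and consensus latency.
    for canister in &block_canisters[1..] {
        let (next,): (Vec<f32>,) = ic_cdk::call(*canister, "block_forward", (hidden,))
            .await
            .expect("block_forward call failed");
        hidden = next;
    }
    hidden // a logits canister would turn this into next-token probabilities
}
```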
Thank you so much for this and for presenting at the DeAI Technical Group.
A few questions:
How would you reduce it to 2 cycles per multiply/accumulate?
How are you spreading the model between 34 canisters? Can you pinpoint the different canisters? I noticed that each canister has the same .did file. I assume that they are different, though.
Would optimization or SIMD significantly improve performance?
I’m not sure how fast we can get to 2 cycles per multiply/accumulate, but basically that would involve performance engineering to understand what wasm does, and possibly rewriting the inner loop.
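For the measurement side of that performance engineering, the system API exposes an instruction counter that can be read from inside a call. The sketch below is my own assumed benchmark endpoint, not the project's code; it shows how the per-multiply/accumulate cost could be estimated on-chain.

```rust
// Assumed benchmark sketch: estimate wasm instructions per multiply/accumulate.
#[ic_cdk::query]
fn bench_dot(n: u64) -> u64 {
    let a = vec![1.0f32; n as usize];
    let b = vec![2.0f32; n as usize];

    let start = ic_cdk::api::performance_counter(0); // instructions executed so far in this call
    let mut acc = 0.0f32;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    let end = ic_cdk::api::performance_counter(0);

    std::hint::black_box(acc); // keep the accumulator live so the loop isn't optimized away
    (end - start) / n // instructions per multiply/accumulate (cycles charged scale with this)
}
```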
To reduce deploy complexity, I made the design decision to have one type of canister called yblock that can be configured to do different things (tokenizer, main forward, block forward, logits).
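A hedged sketch of what that single configurable canister type could look like is below; the role names match the list above, but everything else is my assumption rather than the actual yblock code. Because every instance exposes the same interface, they all share one .did file.

```rust
// Assumed sketch of one canister type configured into different roles at install time.
use std::cell::RefCell;

use candid::{CandidType, Deserialize};

#[derive(CandidType, Deserialize, Clone, Copy, PartialEq)]
enum Role {
    Tokenizer,
    MainForward,
    BlockForward,
    Logits,
}

thread_local! {
    static ROLE: RefCell<Option<Role>> = RefCell::new(None);
}

/// The role passed at install time decides which code path this canister runs.
#[ic_cdk::init]
fn init(role: Role) {
    ROLE.with(|r| *r.borrow_mut() = Some(role));
}

#[ic_cdk::update]
fn forward(input: Vec<f32>) -> Vec<f32> {
    let role = ROLE.with(|r| r.borrow().expect("role not configured"));
    match role {
        Role::BlockForward => input, // this block's transformer layers would run here
        Role::Logits => input,       // projection from hidden state to vocabulary logits would run here
        _ => ic_cdk::trap("forward is not supported for this role"),
    }
}
```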
All canisters are below. If you click on the first link and then on ‘tensor_list’, you will be able to see which tensors are on that block. You won’t be able to run the model, since that requires update calls and they are not public.
SIMD has improved things quite a bit already, and my sense is that the main limiting factor is now the protocol / wasm overhead. I’m not sure though, and I think performance engineering on a real ICP node is required to figure it out.
Good question… I’ve submitted a grant proposal, and if it is accepted I will be able to continue the work and maintain Llama 3 on-chain. If it’s not accepted, I may ask the community to pitch in some TC to make access public and keep the canister going… until it runs out of cycles.
The project is open-source, so anyone should be able to run their own canisters as well (setup may not be straightforward though; I’ll update the docs).