My interest in decentralizing data (and bringing a new model of data ownership) and distributing compute goes back years, and I had been wanting to build a non-trivial app on the ICP for a while now.
I took a few days during the summer to learn a new language and experiment with implementing a real-world LLM in a distributed way on the ICP. Llama 3 8b inference is now running live on the ICP, including the tokenizer, which I think makes it the first fully on-chain 8b LLM. The inference itself also runs through consensus, which brings a new level of guarantees about the results.
The current implementation is live on mainnet, but is not public given the amount of cycles being burned and the latency.
As a product it’s not usable yet though!
The current implementation burns 0.168 TC per token predicted, which translates to $260K / 1M tokens. Compare with $15 / 1M output tokens for gpt-4o.
The result above was achieved without much optimization, and without SIMD for instance. During a matmul (which is where transformers spend most of their execution time), we currently spend 22 cycles per multiply/accumulate. We could hope to improve that to 2 cycles per multiply/accumulate, but that would still translate into $26K / 1M tokens. To achieve something usable and competitive, we will very likely need GPUs and to price the cycles they use close to zero.
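To make the multiply/accumulate figure concrete, here is a rough sketch (my own illustration, not the project's actual code) of the kind of dot-product inner loop a matmul reduces to, in scalar form and in a wasm simd128 form; the function names and layout are assumptions.

```rust
// Illustrative only: a matmul hot loop reduces to dot products like these.
// The scalar version is roughly what unoptimized wasm executes; the simd128
// version does four multiply/accumulates per iteration (requires building
// with `-C target-feature=+simd128`).

/// Scalar dot product: one multiply/accumulate per iteration.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..a.len() {
        acc += a[i] * b[i]; // bounds checks + no vectorization => many cycles per mul/acc
    }
    acc
}

/// Same dot product using wasm simd128 intrinsics: 4 multiply/accumulates per iteration.
#[cfg(target_arch = "wasm32")]
#[target_feature(enable = "simd128")]
unsafe fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;
    let n = a.len() / 4 * 4;
    let mut acc = f32x4_splat(0.0);
    for i in (0..n).step_by(4) {
        let va = v128_load(a.as_ptr().add(i) as *const v128);
        let vb = v128_load(b.as_ptr().add(i) as *const v128);
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);
    for i in n..a.len() {
        sum += a[i] * b[i]; // scalar tail for lengths not divisible by 4
    }
    sum
}
```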
Loading the weights into the blocks is very expensive: it cost a total of 33 TC (F16-quantized tensors were used; computation on-chain uses float32). I think we will need to think a little more about how we want to leverage data on the ICP, and probably develop the notion of ‘data nodes’, if we want the ICP to be competitive in data-centric markets (which include AI).
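For readers unfamiliar with how weights get into a canister: they have to be streamed in over many update calls, each well under the ~2MB message limit. The sketch below shows the general shape of such an upload endpoint, dequantizing F16 to f32 on arrival; the method name, the single-map layout, and the use of the `half` crate are my assumptions, not the project's actual interface.

```rust
// A minimal, assumed sketch of streaming F16 weight shards into a block canister.
use std::cell::RefCell;
use std::collections::HashMap;

use candid::{CandidType, Deserialize};
use half::f16;

thread_local! {
    // Tensors kept in the canister heap, stored as f32 for on-chain compute.
    static TENSORS: RefCell<HashMap<String, Vec<f32>>> = RefCell::new(HashMap::new());
}

#[derive(CandidType, Deserialize)]
struct WeightChunk {
    tensor: String,
    data_f16_le: Vec<u8>, // little-endian f16 payload, kept well below the ~2MB message limit
}

/// Append one chunk of an F16-quantized tensor, converting to f32 on arrival.
#[ic_cdk::update]
fn upload_chunk(chunk: WeightChunk) {
    let floats: Vec<f32> = chunk
        .data_f16_le
        .chunks_exact(2)
        .map(|b| f16::from_bits(u16::from_le_bytes([b[0], b[1]])).to_f32())
        .collect();
    TENSORS.with(|t| {
        t.borrow_mut()
            .entry(chunk.tensor)
            .or_default()
            .extend(floats)
    });
}
```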
Last but not least, the latency is 200 seconds! I’m actively looking into why that is and will improve it soon.
Handcrafting models manually was fun the first time, but it is not a very effective approach; optimizing compilers / tooling will have to be developed if we want the ICP to become an execution target for AI models.
I’d love to open up the implementation for more people to use, but for that I would need a lot more cycles - if someone is willing to pitch in and help with cycles, please let me know.
The canister programming model is simple to reason about, and the tools / libraries (like ic_cdk) are easy enough to use.
The memory management inside a canister is a little hard to understand; I wasn’t able to use the stable-memory solutions (crashes, and no access to the right abstraction).
It’s cool to see inference being run through consensus; I think that could have real-world applications soon.
My next focus should be on the tooling and on the data-ownership piece, where crypto can really shine.
On a CPU with enough RAM (64GB), Llama 3 8b inference will be able to do 1 token / second at best.
On ICP, we can go up to 2GB per canister, so partitioning the model was what I wanted to explore. The added overhead from the communication between canisters (canisters only exchange ~16KB of data, so that part is reasonable), from consensus, and from the fact that wasm isn’t optimized for this workload brings the latency to ~100s. So, as I said, it’s not usable as such: except for batch workloads, waiting 100s for an answer is a no-go (even 10s is too much). Optimizing the implementation is definitely possible.
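To illustrate what the partitioning looks like in practice, here is a hedged sketch of chaining the forward pass across block canisters; the method names and orchestration are my assumptions, not the project's actual API. A Llama 3 8b hidden state is 4096 f32 values, which is where the ~16KB per inter-canister message comes from.

```rust
// Assumed sketch of an orchestrator chaining block canisters for one token's forward pass.
use candid::Principal;

#[ic_cdk::update]
async fn forward(token_id: u32, block_canisters: Vec<Principal>) -> Vec<f32> {
    // Hypothetical: the first canister embeds the token into a 4096-float hidden state (~16KB).
    let (mut hidden,): (Vec<f32>,) = ic_cdk::call(block_canisters[0], "embed", (token_id,))
        .await
        .expect("embed call failed");

    // Each subsequent canister runs its share of transformer blocks and returns the
    // updated hidden state; only ~16KB crosses each canister boundary, but every hop
    // adds inter-canister and consensus latency.
    for canister in &block_canisters[1..] {
        let (next,): (Vec<f32>,) = ic_cdk::call(*canister, "block_forward", (hidden,))
            .await
            .expect("block_forward call failed");
        hidden = next;
    }
    hidden // a logits canister would turn this into next-token probabilities
}
```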
Thank you so much for this and for presenting at the DeAI Technical Group.
A few questions:
How would you reduce it to 2 cycles per multiply/accumulate?
How are you spreading the model between 34 canisters? Can you pinpoint the different canisters? I noticed that each canister has the same .did file. I assume that they are different, though.
Would optimization or SIMD significantly improve performance?
I’m not sure how fast we can get to 2 cycles per multiply/accumulate, but basically that would involve performance engineering to understand what wasm does, and possibly rewriting the inner loop.
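For the measurement side of that performance engineering, the system API exposes an instruction counter that can be read from inside a call. The sketch below is my own assumed benchmark endpoint, not the project's code; it shows how the per-multiply/accumulate cost could be estimated on-chain.

```rust
// Assumed benchmark sketch: estimate wasm instructions per multiply/accumulate.
#[ic_cdk::query]
fn bench_dot(n: u64) -> u64 {
    let a = vec![1.0f32; n as usize];
    let b = vec![2.0f32; n as usize];

    let start = ic_cdk::api::performance_counter(0); // instructions executed so far in this call
    let mut acc = 0.0f32;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    let end = ic_cdk::api::performance_counter(0);

    std::hint::black_box(acc); // keep the accumulator live so the loop isn't optimized away
    (end - start) / n // instructions per multiply/accumulate (cycles charged scale with this)
}
```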
To reduce deploy complexity, I made the design decision to have one type of canister called yblock that can be configured to do different things (tokenizer, main forward, block forward, logits).
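A hedged sketch of what that single configurable canister type could look like is below; the role names match the list above, but everything else is my assumption rather than the actual yblock code. Because every instance exposes the same interface, they all share one .did file.

```rust
// Assumed sketch of one canister type configured into different roles at install time.
use std::cell::RefCell;

use candid::{CandidType, Deserialize};

#[derive(CandidType, Deserialize, Clone, Copy, PartialEq)]
enum Role {
    Tokenizer,
    MainForward,
    BlockForward,
    Logits,
}

thread_local! {
    static ROLE: RefCell<Option<Role>> = RefCell::new(None);
}

/// The role passed at install time decides which code path this canister runs.
#[ic_cdk::init]
fn init(role: Role) {
    ROLE.with(|r| *r.borrow_mut() = Some(role));
}

#[ic_cdk::update]
fn forward(input: Vec<f32>) -> Vec<f32> {
    let role = ROLE.with(|r| r.borrow().expect("role not configured"));
    match role {
        Role::BlockForward => input, // this block's transformer layers would run here
        Role::Logits => input,       // projection from hidden state to vocabulary logits would run here
        _ => ic_cdk::trap("forward is not supported for this role"),
    }
}
```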
All canisters are below. If you click on the first link and then on ‘tensor_list’, you will be able to see which tensors are on that block. You won’t be able to run the model, since that requires update calls and they are not public.
SIMD has improved things quite a bit already, and my sense is that the main limiting factor is now the protocol / wasm overhead. I’m not sure though, and I think performance engineering on a real ICP node is required to figure it out.
Good question… I’ve submitted a grant proposal, and if it is accepted I will be able to continue the work and maintain Llama 3 on-chain. If it’s not accepted, I may ask the community to pitch in some TC to make access public and keep the canister going… until it runs out of cycles.
The project is open-source, so anyone should be able to run their own canisters as well (setup may not be straightforward though; I’ll update the docs).