Llama2.c LLM running in a canister!

icpp · October 4, 2023, 11:06pm

@evanmcfarland
I have not done a cost/token comparison yet. I plan to do that though.

icpp · October 4, 2023, 11:18pm

@paulous ,
Thanks for that reference. It looks really interesting, but I am afraid porting it to the IC will be a big task. I think it would require running PyTorch in a canister, and that is still far away.

It will be interesting to see, though, whether some of their ideas make it into the pure C/C++ LLMs we can run in a canister today.

paulous · October 4, 2023, 11:28pm

No worries, I just saw it on a blog and it looked interesting, so I thought I would pass it by the experts. Thanks for the feedback.

Mercury · October 6, 2023, 4:42am

Btw, TheBloke is funded by a16z, and bakes some of the best GGUF (GGML) models out there.

patnorris · October 21, 2023, 7:10pm

Hi @paulous , if you like you can give Mistral 7B a try here: https://x6occ-biaaa-aaaai-acqzq-cai.icp0.io
Note that the model is not running on-chain but is downloaded onto your device and runs there (you thus need a laptop with 16GB RAM to run it smoothly). To use Mistral 7B, you can log in and select it under User Settings. Then, on the DeVinci AI Assistant tab, click on Initialize to download it (this will take a moment on first download). Once downloaded, you can chat with it. Please let me know if you have any feedback. Cheers

icpp · October 22, 2023, 2:19am

@patnorris
This is very cool.

I am a big proponent of solutions like this, in browser or in canister, to avoid having to call out to web2.

I am planning to buy a big machine now🙂

ZackDS · October 22, 2023, 8:08am

Any specific specs you are looking at? For the big machine, I mean.

patnorris · October 22, 2023, 10:14am

@icpp thank you and agreed, there should be good alternatives to the common Web2 players. And I’m looking forward to the ones we’ll build.

Yes, running the LLM in the browser is currently limited by which device the user has. Having it in a canister would be more accessible. I could see great hybrid solutions as well.

Same here. The high-end machines can even run Llama2 70B. So if you go for one of those, let me know and I’ll add the 70B version for you to try (same if anyone else wants to give that a shot, just let me know).

icpp · October 23, 2023, 1:02am

@patnorris ,
How much RAM is needed for the 70B model?
Any other specs to keep in mind?

patnorris · October 23, 2023, 8:13am

@icpp I’m reading that the 70B model requires 64GB RAM. I haven’t found any exact specs for the GPU but as far as I understand it’s needed to run the model smoothly (even though CPU only might work), so best to have that.

qwertytrewq · November 15, 2023, 5:33pm

@icpp At https://icgpt.icpp.world/ Llama2 isn’t available, only TinyStories.

Is it already possible to compile (and self-host) Llama2 using the icpp_llm repo (GitHub - icppWorld/icpp_llm: on-chain LLMs), or is it not ready yet either?

icpp · November 15, 2023, 8:25pm

Hi @qwertytrewq ,
you can indeed self-host using the icpp_llm repo.

That is actually the code for Llama2:

- It is the backend of ICGPT, but it is loaded with models trained on the TinyStories dataset.
- You can deploy it yourself and load it with other trained models.
- The pre-trained Llama2 models from Meta can be loaded, but do not yet fit. I am working on it and also keeping a very close eye on the underlying core code at GitHub - karpathy/llama2.c: Inference Llama 2 in one file of pure C

qwertytrewq · November 15, 2023, 8:27pm

What do you mean by “fit”?

If it can be loaded, then it can be used, can’t it?

icpp · November 16, 2023, 12:21am

When installing an LLM in a canister, there are two steps:

1. Deploy the wasm
2. Upload the trained model (the weights)

Step 1 is always the same, and that fits without any problem in a canister.

Step 2, however, is a different story. Uploading the model weights is done by running a Python script that is provided in the icpp_llm repo. This script must be pointed at a model.bin file on your disk. It will read it in chunks and send each chunk to an endpoint of the canister where the LLM is running. When the canister endpoint receives those bytes, they are stored in a std::vector, which is orthogonally persisted. That vector grows dynamically, and that is where the fit comes into play.

Note that this upload is done only once. If the upload succeeds, then the model fits!
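As a rough illustration of the chunked upload described above, here is a minimal Python sketch. It is not the script from the icpp_llm repo: `read_chunks`, `upload_model`, `FakeCanister` and its `upload_bytes` method are hypothetical names, the chunk size is arbitrary, and the canister is replaced by an in-memory buffer that simply appends bytes the way a growing vector would.

```python
import os
import tempfile
from typing import Callable, Iterator


def read_chunks(path: str, chunk_size: int) -> Iterator[bytes]:
    """Yield the file's bytes in consecutive pieces of at most chunk_size."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def upload_model(path: str, send_chunk: Callable[[bytes], None], chunk_size: int) -> int:
    """Send every chunk in order and return the total number of bytes sent."""
    total = 0
    for chunk in read_chunks(path, chunk_size):
        send_chunk(chunk)  # hypothetical stand-in for the call to the canister endpoint
        total += len(chunk)
    return total


class FakeCanister:
    """In-memory stand-in (assumption): appends received bytes to one buffer."""

    def __init__(self) -> None:
        self.weights = bytearray()

    def upload_bytes(self, chunk: bytes) -> None:
        self.weights.extend(chunk)


def demo() -> bool:
    """Upload a temporary 10,000-byte file and check it arrives intact."""
    data = os.urandom(10_000)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        canister = FakeCanister()
        sent = upload_model(path, canister.upload_bytes, chunk_size=4096)
        return sent == len(data) and bytes(canister.weights) == data
    finally:
        os.remove(path)
```

Calling `demo()` exercises the whole path with a throwaway file. In a real deployment `send_chunk` would be whatever function performs the update call to the canister, and the upload fails once the receiving buffer can no longer grow.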

Small models, with e.g. millions of parameters, fit just fine, but models with several billion parameters will hit the canister limit.
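A back-of-the-envelope check makes the gap concrete. The sketch below assumes weights stored as 32-bit floats (4 bytes per parameter) and uses the 4 GiB address space of 32-bit Wasm as a generous upper bound. The actual usable canister limit is not stated in this thread and is likely lower, since run-time buffers also need room. The parameter counts are illustrative.

```python
BYTES_PER_FLOAT32 = 4
# Assumption: 4 GiB, the full 32-bit Wasm address space, as an upper bound.
ASSUMED_LIMIT_BYTES = 4 * 1024**3


def weights_size_bytes(n_params: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> int:
    """Raw size of the weights, ignoring file headers and run-time state."""
    return n_params * bytes_per_param


def fits(n_params: int, limit_bytes: int = ASSUMED_LIMIT_BYTES) -> bool:
    """True if the raw weights alone stay below the assumed limit."""
    return weights_size_bytes(n_params) < limit_bytes


if __name__ == "__main__":
    for label, n_params in [("15M", 15_000_000), ("110M", 110_000_000), ("7B", 7_000_000_000)]:
        size_gib = weights_size_bytes(n_params) / 1024**3
        print(f"{label}: {size_gib:.2f} GiB, fits={fits(n_params)}")
```

Under these assumptions a model with millions of parameters needs well under 1 GiB, while a 7B-parameter model needs about 28 GB for the weights alone, far beyond the bound.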

If you want to dig deeper to see where the memory goes once everything is fully loaded, this is a study I did. I find it very interesting and believe there is still a lot of room for improvement.

icpp · November 24, 2023, 3:42pm

Hi All,

I'd like to announce that the frontend is now also open source under the MIT License.

You can find it at GitHub - icppWorld/icgpt: ICGPT

And it is deployed on mainnet as ICGPT

I hope some of you will try deploying this as we continue our deAI efforts. Let me know if you run into any issues.

ildefons · December 14, 2023, 10:49pm

(First, excuse me for cross-posting.) I would like to introduce MotokoLearn, a Motoko package meant to facilitate on-chain training and inference of machine learning models where having a GPU is not a requirement: Introduction of MotokoLearn V0.1: on-chain training and inference of machine learning models

Samer · April 2, 2024, 1:07pm

@icpp any updates?

Starting to get really excited about on-chain deAI as I learn more

icpp · April 3, 2024, 2:14am

@Samer ,

Thanks for your interest!
I share your enthusiasm about AI on chain, and we’re working on several items.

- Experimental demonstration apps working within the current limits of the IC. Target release during April and May.
- Received a grant to port llama.cpp. Target release is in September.
- A pipeline to easily deploy and configure your AI components, and be ready to scale once IC capabilities like query charging, 64-bit WASM, and ultimately GPU become available.

I have teamed up with @patnorris on some of these items.

CodingFu · May 25, 2024, 12:16pm

This is so cool! I can’t believe you can run such a model on IC.

I wonder what the cycle cost per 1M tokens is?
Also, maybe someone knows when we will be able to deploy beefier canisters?

icpp · May 25, 2024, 10:13pm

The costs right now are pretty high. A story that generates about 100 tokens consumes 0.076 TCycles ≈ $0.10.

That will come down drastically though once we can do the inference as a query call and when the compute/AI improvements of the DFINITY Roadmap become available.

I expect to already deploy a real chat LLM in a few months, but latency will be poor. Implementing it now though, so we are ready when the IC scales up.
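To relate this to the per-1M-token question, the figure quoted above (about 100 tokens for 0.076 TCycles, roughly $0.1) can be scaled linearly. This is only a sketch: it assumes cost grows in proportion to the number of tokens, which ignores prompt length and any fixed per-call overhead, and `scale_cost` is a hypothetical helper.

```python
def scale_cost(cost_per_sample: float, tokens_per_sample: int, target_tokens: int) -> float:
    """Linearly extrapolate a measured cost to a different token count."""
    return cost_per_sample / tokens_per_sample * target_tokens


# Figures from the post: ~100 tokens cost 0.076 TCycles, or about $0.1.
tcycles_per_million = scale_cost(0.076, 100, 1_000_000)
usd_per_million = scale_cost(0.1, 100, 1_000_000)

if __name__ == "__main__":
    print(f"~{tcycles_per_million:.0f} TCycles per 1M tokens")
    print(f"~${usd_per_million:.0f} per 1M tokens")
```

Under that linear assumption the current cost works out to roughly 760 TCycles, or about $1,000, per 1M tokens.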
