Llama2.c LLM running in a canister!

@evanmcfarland
I have not done a cost/token comparison yet. I plan to do that though.

1 Like

@paulous ,
Thanks for that reference. It looks really interesting, but I am afraid porting it to the IC would be a big task. I think it would require running PyTorch in a canister, and that is still far away.

It will be interesting to see, though, whether some of their ideas make it into the pure C/C++ LLMs we can run in a canister today.

1 Like

No worries, I just saw it on a blog and it looked interesting, so I thought I would pass it by the experts. Thanks for the feedback.

Btw, TheBloke is funded by a16z and bakes some of the best GGUF (GGML) quantizations out there.

1 Like

Hi @paulous , if you like you can give Mistral 7B a try here: https://x6occ-biaaa-aaaai-acqzq-cai.icp0.io
Note that the model is not running on-chain but is downloaded onto your device and runs there (you thus need a laptop with 16GB of RAM to run it smoothly). To use Mistral 7B, you can log in and select it under User Settings. Then, on the DeVinci AI Assistant tab, click on Initialize to download it (this will take a moment on the first download). Once downloaded, you can chat with it. Please let me know if you have any feedback :slight_smile: Cheers

2 Likes

@patnorris
This is very cool.

I am a big proponent of solutions like this, in the browser or in a canister, to avoid having to call out to Web2.

I am planning to buy a big machine now🙂

2 Likes

Any specific specs you're looking at? For the big machine, I mean.

@icpp thank you and agreed, there should be good alternatives to the common Web2 players. And I’m looking forward to the ones we’ll build :muscle:

Yes, running the LLM in the browser is currently limited by the device the user has. Having it in a canister would be more accessible. I could see great hybrid solutions as well :slight_smile:

Same here. The high-end machines can even run Llama2 70B. So if you go for one of those, let me know and I’ll add the 70B version for you to try (same if anyone else wants to give that a shot, just let me know).

1 Like

@patnorris ,
How much RAM is needed for the 70B model?
Any other specs to keep in mind?

1 Like

@icpp I’m reading that the 70B model requires 64GB of RAM. I haven’t found exact GPU specs, but as far as I understand a GPU is needed to run the model smoothly (even though CPU-only might work), so best to have one.
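As a rough back-of-the-envelope check (my own arithmetic, not an official spec), the raw weights alone already explain most of that figure:

```python
# Back-of-the-envelope memory estimate for the Llama2 70B weights
# (my own arithmetic, not an official requirement).
PARAMS = 70e9  # 70 billion parameters

def weight_size_gb(bits_per_param: float) -> float:
    """Approximate size of the raw weights in GB at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"fp16 : ~{weight_size_gb(16):.0f} GB")  # ~140 GB
print(f"8-bit: ~{weight_size_gb(8):.0f} GB")   # ~70 GB
print(f"4-bit: ~{weight_size_gb(4):.0f} GB")   # ~35 GB, leaves headroom in 64GB RAM
# On top of the weights you still need room for the KV cache, the runtime
# and the OS, which is why ~64GB of system RAM is a comfortable target.
```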

1 Like

@icpp At https://icgpt.icpp.world/ Llama2 isn’t available, only TinyStories.

Is it already possible to compile (and self-host) Llama2 using GitHub - icppWorld/icpp_llm: on-chain LLMs, or is it not ready yet either?

1 Like

Hi @qwertytrewq ,
you can indeed self-host using the icpp_llm repo.

That is actually the code for Llama2:

  • It is the backend of ICGPT, but it is loaded with models trained on the TinyStories dataset.
  • You can deploy it yourself and load it with other trained models.
  • The pre-trained Llama2 models from Meta can be loaded, but do not yet fit. I am working on it and also keeping a very close eye on the underlying core code at GitHub - karpathy/llama2.c: Inference Llama 2 in one file of pure C

1 Like

What do you mean by “fit”?

If it can be loaded, then it can be used, can’t it?

1 Like

When installing an LLM in a canister, there are two steps:

  1. Deploy the wasm
  2. Upload the trained model (the weights)

Step 1 is always the same, and that fits without any problem in a canister.

Step 2, however, is a different story. Uploading the model weights is done by running a Python script that is provided in the icpp_llm repo. This script must be pointed at a model.bin file on your disk. It will read the file in chunks and send them to an endpoint of the canister where the LLM is running. When the canister endpoint receives those bytes, they are stored in an std::vector, which is orthogonally persisted. That vector grows dynamically, and that is where the fit comes into play.
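As a rough illustration of that upload loop (not the actual script from the icpp_llm repo), here is a minimal sketch using the ic-py agent library; the endpoint name upload_model_bytes_chunk, the canister id, and the chunk size are assumptions:

```python
# Minimal sketch of a chunked model upload, assuming the ic-py agent library.
# The endpoint name "upload_model_bytes_chunk" is hypothetical; the icpp_llm
# repo ships its own upload script, this only illustrates the idea.
from ic.agent import Agent
from ic.client import Client
from ic.identity import Identity
from ic.candid import Types, encode

CANISTER_ID = "aaaaa-aa"   # placeholder: your deployed LLM canister id
CHUNK_SIZE = 1024 * 1024   # 1 MiB per update call (assumed chunk size)

agent = Agent(Identity(), Client(url="https://ic0.app"))

with open("model.bin", "rb") as f:
    while chunk := f.read(CHUNK_SIZE):
        # Each chunk goes out as a vec nat8; the canister appends the bytes
        # to its orthogonally persisted std::vector.
        args = encode([{"type": Types.Vec(Types.Nat8), "value": list(chunk)}])
        agent.update_raw(CANISTER_ID, "upload_model_bytes_chunk", args)
```

The chunking is needed because a single ingress message to the IC is capped at roughly 2 MiB, so a large model has to arrive over many update calls.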

Note that this upload is done only once. If the upload succeeds, then the model fits!

Small models with, e.g., millions of parameters fit just fine, but models with several billion parameters will hit the canister limit.
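To make that concrete with some rough arithmetic of my own (the exact numbers depend on how the weights are stored), compare the fp32 weight sizes against the ~4 GiB heap of a 32-bit wasm canister:

```python
# Rough arithmetic (mine, not from the icpp_llm docs) on why small models fit
# and multi-billion-parameter models do not, assuming fp32 weights and the
# ~4 GiB heap of a 32-bit wasm canister.
BYTES_PER_PARAM = 4   # fp32
WASM32_HEAP_GIB = 4   # approximate heap available to a 32-bit wasm canister

def model_size_gib(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 2**30

for name, n in [("TinyStories 15M", 15e6), ("Llama2 7B", 7e9)]:
    size = model_size_gib(n)
    verdict = "fits" if size < WASM32_HEAP_GIB else "exceeds the heap"
    print(f"{name}: ~{size:.2f} GiB -> {verdict}")
```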

If you want to dig deeper to see where the memory goes once everything is fully loaded, this is a study I did. I find it very interesting and believe there is still a lot of room for improvement.

1 Like

Hi All,

I would like to announce that the frontend is now also open source under the MIT License.

You can find it at GitHub - icppWorld/icgpt: ICGPT

And it is deployed on mainnet as ICGPT

I hope some of you will try deploying this as we continue our DeAI efforts. Let me know if you run into any issues.

11 Likes

(First, excuse me for cross-posting.) I would like to introduce MotokoLearn, a Motoko package meant to facilitate on-chain training and inference of machine learning models where having a GPU is not a requirement: Introduction of MotokoLearn V0.1: on-chain training and inference of machine learning models

4 Likes

@icpp any updates?

Starting to get really excited about on-chain deAI as I learn more

1 Like

@Samer ,

Thanks for your interest!
I share your enthusiasm about AI on chain, and we’re working on several items.

  • Experimental demonstration apps working within the current limits of the IC. Target release during April and May.
  • Received a grant to port llama.cpp. Target release is in September.
  • A pipeline to easily deploy and configure your AI components, and be ready to scale once IC capabilities like query charging, 64-bit Wasm, and ultimately GPUs become available.

I have teamed up with @patnorris on some of these items.

3 Likes

This is so cool! I can’t believe you can run such a model on IC.

I wonder what the cycle cost per 1M tokens is?
Also, maybe someone knows when we will be able to deploy beefier canisters?

1 Like

The costs right now are pretty high. A story that generates about 100 tokens consumes 0.076 TCycles ≈ $0.10.
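Extrapolating linearly from that data point (my own rough estimate; actual costs vary with prompt length and model size):

```python
# Rough linear extrapolation from the ~100-token story above to 1M tokens.
tcycles_per_100_tokens = 0.076
usd_per_100_tokens = 0.10

scale = 1_000_000 / 100
print(f"~{tcycles_per_100_tokens * scale:,.0f} TCycles per 1M tokens")  # ~760 TCycles
print(f"~${usd_per_100_tokens * scale:,.0f} per 1M tokens")             # ~$1,000
```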

That will come down drastically, though, once we can do the inference as a query call and once the compute/AI improvements on the DFINITY roadmap become available.

I expect to deploy a real chat LLM in a few months already, but latency will be poor. I am implementing it now, though, so we are ready when the IC scales up.

4 Likes