Llama2.c LLM running in a canister!

Would you be willing to pay for such a service?
Yes, but only if the price is low.

If no, would you be interested in an ad supported offering?
Usage is through an API; I don’t see where ads could be placed in that relationship.

If yes, would you prefer fixed subscription fee or usage based?
Usage-based is more fair for both parties.

I could also deploy my own canister, but I doubt there is value in duplicating the functionality.

BTW, what are the obstacles that keep Llama 2 from being deployed, so that only the TinyStories models are deployed so far?

Thank you for those answers!

Sending ads through an API is indeed not very easy to do. It was a bit of a dumb question :slightly_smiling_face:

Usage-based would indeed make the most sense.

The main obstacles to deploying the meta-llama2 models are the resources available to a canister and the limit on the number of instructions per message.

I am working on the llama2-7B model, and hope to get that deployed reasonably soon, so we can run some more experiments.

Definitely interested in digging deeper into enabling your example use case.

FYI.

The repo is public again, although renamed: icpp_llm

2 Likes

Great work on this!

Llama2-7B is definitely a milestone goal for canister LLMs. But in terms of “usable” LLMs for applications (e.g., NPCs), there are more compact LLMs that are useful enough for generative text/conversational tasks.

The TinyLlama 1.1B project follows Llama 2’s architecture and tokenizer format (which makes fine-tuning/Q-LoRA easier) and is being trained on 3 trillion tokens with all the speed optimizations and performance hacks for inference. Using the llama.cpp framework, a Mac M2 with 16 GB RAM generates 71.8 tokens/second. That’s really fast for non-GPU inference.

Given its performance for its parameter size, I’d imagine this will be the model to test canister LLMs with.

2 Likes

Hi @Gamris ,

Thank you for pointing out the TinyLlama reference. I will definitely research its potential and perhaps port it into the dApp as an alternative.

How do you think the karpathy/llama2.c LLM that I am using in the canister compares to that one?

It is also based on the same Llama 2 architecture & tokenizer.

1 Like

The 110M llama2 model trained on the TinyStories data set is up & running on main-net!!

This is quite a jump from the previous 15M model.

The performance is still remarkably good. Each call comes back after ~10 seconds.
Try it out:

dfx canister call --network ic  "obk3p-xiaaa-aaaag-ab2oa-cai" new_chat '()'
dfx canister call --network ic  "obk3p-xiaaa-aaaag-ab2oa-cai" inference '(record {prompt = "" : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
(repeat the inference call, about 24 times for this story, until it returns an empty string)
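
Instead of pasting the inference call over and over, you can loop it from the shell. A minimal bash sketch, assuming the same canister interface as above; treating an empty ok string as the end-of-story signal is an assumption based on the output shown below:

#!/usr/bin/env bash
# Generate a full story from the 110M model, 10 tokens per call.
# Small `steps` values keep each update call under the per-message instruction limit.
CANISTER="obk3p-xiaaa-aaaag-ab2oa-cai"

dfx canister call --network ic "$CANISTER" new_chat '()'

while true; do
  out=$(dfx canister call --network ic "$CANISTER" inference \
    '(record {prompt = "" : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})')
  echo "$out"
  # Stop when the canister returns an empty string (story finished).
  if [[ "$out" == *'ok = ""'* ]]; then
    break
  fi
done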

This size LLM generates a very coherent story and doesn’t suffer from some of the deficiencies of the 15M size model:

--------------------------------------------------
Generate a new story using llama2_110M, 10 tokens at a time, starting with an empty prompt.
(variant { ok = 200 : nat16 })
(variant { ok = "Once upon a time, there was a little girl" })
(variant { ok = " named Lily. She loved to play outside in the" })
(variant { ok = " sunshine. One day, she saw a big" })
(variant { ok = ", red apple on a tree. She wanted to eat" })
(variant { ok = " it, but it was too high up.\nL" })
(variant { ok = "ily asked her friend, a little bird, \"Can" })
(variant { ok = " you help me get the apple?\"\nThe bird said" })
(variant { ok = ", \"Sure, I can fly up and get" })
(variant { ok = " it for you.\"\nThe bird flew up to" })
(variant { ok = " the apple and pecked it off the tree." })
(variant { ok = " Lily was so happy and took a big bite" })
(variant { ok = ". But then, she saw a bug on the apple" })
(variant { ok = ". She didn\'t like bugs, so she threw" })
(variant { ok = " the apple away.\nLater that day, L" })
(variant { ok = "ily\'s mom asked her to help with the la" })
(variant { ok = "undry. Lily saw a shirt that was" })
(variant { ok = " too big for her. She asked her mom, \"" })
(variant { ok = "Can you make it fit me?\"\nHer mom said" })
(variant { ok = ", \"Yes, I can make it fit you.\"" })
(variant { ok = "\nLily was happy that her shirt would fit" })
(variant { ok = " her. She learned that sometimes things don\'t fit" })
(variant { ok = ", but there is always a way to make them fit" })
(variant { ok = "." })
(variant { ok = "" })

Will expose it in the front end shortly, after more testing. Something is not working yet when you use a non-empty prompt.

7 Likes

You are making good progress.

The frontend at https://icgpt.icpp.world now uses the 42M model as the default:

[screenshot of the ICGPT frontend]

This size model is a lot better than the 15M model, and it determines by itself when it has completed a story. For example, the prompt Bobby wanted to catch a big fish results in a generated story that ends on its own:

[screenshot of the generated story]

The response time of this model is almost as good as the 15M model’s, and the words stream onto the screen very naturally.

However, even though the stories are a lot better than the 15M model’s, this 42M model still throws in some nonsensical stuff, as in this example. The prompt Bobby was playing a card game with his friend results in a pretty good story, but some of the comprehension is off and there is a repetition in the middle. This should go away with the 110M model :crossed_fingers:
[screenshot of the generated story]

2 Likes

A real fun one to use is the prompt:
Billy had a goat.

1 Like

Did you just put an LLM inside a canister? :exploding_head:

@Artemi5 ,
yes, the canister is very capable of running an LLM!
The goal now is to scale it up more & more.

4 Likes

This is really impressive stuff, especially the streaming effect. I had no idea that was possible. Thanks for open-sourcing this!

Have you done any cost analysis, e.g., cost/token in-canister?
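
For a rough estimate once cycle consumption per call can be measured, here is a back-of-envelope sketch (the cycle count and exchange rate are placeholders, not measurements; 1 XDR buys 10^12 cycles on the Internet Computer):

# Back-of-envelope cost/token estimate with placeholder inputs.
CYCLES_PER_CALL=20000000000   # hypothetical: e.g. compare the canister's cycle balance before/after a call
TOKENS_PER_CALL=10            # steps=10 as in the examples above
XDR_TO_USD=1.3                # approximate exchange rate
awk -v c="$CYCLES_PER_CALL" -v t="$TOKENS_PER_CALL" -v x="$XDR_TO_USD" \
  'BEGIN { printf "%.8f USD per token (placeholder inputs)\n", c / t / 1e12 * x }'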

Is there a simple way to donate cycles to the canister? Thank you!

1 Like

Curious to know if anyone has used the new Mistral 7B open-source offering?

thanks!

1 Like

Here are a few ways to do it: Topping up canisters - Internet Computer Wiki
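
For example, from the command line with dfx (the amounts are placeholders, and the exact commands may differ per dfx version; see the wiki page above):

# Convert ICP from your dfx ledger account into cycles for the canister.
dfx ledger top-up --network ic --amount 0.5 obk3p-xiaaa-aaaag-ab2oa-cai

# Or deposit cycles directly from your cycles wallet (newer dfx versions).
dfx canister deposit-cycles --network ic 1000000000000 obk3p-xiaaa-aaaag-ab2oa-cai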

1 Like

thank you very much :kissing_heart:

1 Like

Is obk3p-xiaaa-aaaag-ab2oa-cai the canister ID?


There was an error :sweat_smile:

No, no, it’s all fine. The error means that you are not a controller, but you can still use ‘Add Cycles’. I added a note to the guide that this is fine.

1 Like

You are right, after closing the pop-up window, the Add Cycles button is still there. Thanks for your help!

1 Like