Llama2.c LLM running in a canister!

Those of you who are into conversational AI are probably aware of this model that has taken the open source world by storm: karpathy/llama2.c

Using icpp-pro, I am able to run this LLM in a canister and do inference:

dfx canister call llama2 inference '(record {"prompt" = "" : text; "steps" = 20 : nat64; "temperature" = 0.9 : float32;})'
(
  variant {
    ok = "Once upon a time, there was a little boat named Bob. Bob loved to float on the water"
  },
)

I created this video that shows the full process of build/deploy/upload/test.

You can find the code in icppWorld/icpp-llm

I believe this is a foundational step in bringing Conversational AI to the IC. Once the infrastructure scales and some limitations are removed, we will be able to scale this up to larger & larger AI models running directly on the IC, without the need for doing the inference on another cloud.

48 Likes

What is the average inference time? If it takes longer than the consensus period, what happens?

@ildefons ,
Right now, the inference call is a query call, so it does not need to go through consensus.

It is very fast to get a response back from this tiny LLM. I plan to activate the timer, and then I can give you the tokens/sec.

Good question though about what would happen if we switch to an update call. I can see this being needed in case you want to save the inference results and also chat histories similar to what ChatGPT does. For this tiny LLM there will be no issue, but I wonder what will happen once the Orthogonal Persistence memory becomes larger and we can deploy really big models.

7 Likes

Very nice! Can’t wait to see more of your progress :+1: :+1:

1 Like

This is VERY cool. Can we chat about it? You can DM me @BobBodily on Twitter, Discord, or Telegram. Would love to learn more.

4 Likes

Try it

Deployment to main-net went smoothly.

icpp_llama2 with the stories15M.bin is now running on-chain in canister 4c4bn-daaaa-aaaag-abvcq-cai.

You can call its inference endpoint with:

dfx canister call --network ic 4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = "" : text; steps = 20 : nat64; temperature = 0.8 : float32; topp = 1.0 : float32;})'
(
  variant {
    ok = "Once upon a time, there was a little boat named Bob. Bob loved to float on the water"
  },
)

If you play with the parameters, you will quickly run into the instructions limit. That is an area of investigation right now.

7 Likes

I’d like to use this thread to share the high-level roadmap I have in mind for icpp-llm, and to keep you posted on the progress each time I reach a milestone or encounter a blocker.

Milestone 1: remove the limitations of the current tinystories canister, so it can generate stories longer than 20 words.
This means I need to find a way to work around the max instructions per message limit.

Milestone 2: run inference with memory and matrix calculations distributed across multiple canisters.
For this, I plan to use an HPC-type approach, kind of treating the IC as a massively parallel compute cluster.

Milestone 3: Run inference with the llama2_7b_chat model. Not worrying about speed, just the ability to load it and talk to the LLM.

Milestone 4: Optimize and scale.

This is going to be a fun challenge.

7 Likes

COMPLETED Milestone 1: remove the limitations of the current tinystories canister, so it can generate stories longer than 20 words.

I fixed it by:

  • Switching from a single query call to a sequence of update calls
  • Saving the LLM’s runState in orthogonal persistence (see the sketch below)
  • Implementing a new endpoint, new_chat, to reset the runState when you start a new chat
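
Below is a minimal sketch, in plain C++, of how this state handling can work. It is illustrative only and not the actual icpp_llm code: the RunState fields, the endpoint signatures, and the omission of the icpp-pro Candid plumbing are all simplifications. The key point is that a global object in the canister heap survives between update calls thanks to orthogonal persistence.

#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for llama2.c's RunState: everything the model needs
// to continue a story where the previous update call left off.
struct RunState {
  std::vector<float> kv_cache;  // attention key/value cache built so far
  std::vector<int> tokens;      // prompt + generated tokens so far
  int pos = 0;                  // current position in the sequence
};

// Lives in the canister heap, so it persists across update calls.
static std::unique_ptr<RunState> run_state;

// Endpoint sketch: start a fresh chat by discarding the previous RunState.
void new_chat() { run_state = std::make_unique<RunState>(); }

// Endpoint sketch: ingest a prompt chunk and/or generate `steps` more tokens,
// continuing from wherever the persisted RunState left off.
std::string inference(const std::string &prompt, int steps) {
  if (!run_state) run_state = std::make_unique<RunState>();
  std::string generated;
  // ... tokenize `prompt`, run the forward pass token by token, then sample
  // ... `steps` new tokens, advancing run_state->pos as we go ...
  return generated;
}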

This screenshot shows how the call sequence works with dfx:

  • You start with a call to the new_chat endpoint
  • You then provide a starting prompt with a sequence of update calls.
  • You then ask the LLM to write the rest of the story, using more update calls.

So, for this small LLM:

  • there is NO limit anymore on the length of the prompt or the chat.
    (It is only limited by the seq_length used during the model training, but that value is typically huge.)
  • The 10-token increments come back in ~3 seconds on my local network.
  • The remaining limits are the size of the transformer we can store in a canister, and as we use bigger transformers, the inference time.

NOTE: I did not yet update the canister on the main-net, because I first need to build in a multi-user approach for the inference endpoint. That is to be done.

9 Likes

Thanks for sharing the progress! Looking forward to trying on mainnet!

2 Likes

Multi-user support is now included in icpp-llm, and it is again deployed to main-net canister 4c4bn-daaaa-aaaag-abvcq-cai!

To see it in action, check out this short Loom video, where I run two concurrent shell scripts and the LLM generates two stories for two different principals.
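
Under the hood, the natural way to support this (and what the per-principal behavior suggests) is to key the persisted runState by the caller’s principal, so concurrent callers each build their own story. Below is a minimal, illustrative C++ sketch; it is not the actual icpp_llm code, and get_caller_principal() is just a stand-in for however the CDK exposes the message caller.

#include <string>
#include <unordered_map>

struct RunState {
  // kv cache, tokens, position, ... (as in the earlier sketch)
};

// One persisted RunState per caller principal (textual form), kept in the
// canister heap across update calls.
static std::unordered_map<std::string, RunState> run_states;

// Stand-in: in the real canister this would come from the System API / CDK.
// "2vxsx-fae" is just a dummy value (the anonymous principal).
std::string get_caller_principal() { return "2vxsx-fae"; }

// Reset only the calling user's chat; other users' stories are untouched.
void new_chat() { run_states[get_caller_principal()] = RunState{}; }

std::string inference(const std::string &prompt, int steps) {
  RunState &state = run_states[get_caller_principal()];
  // ... run the model against this caller's private, persisted state ...
  return "";
}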

If you want to try it out yourself, you still have to use dfx, because I have not built a frontend yet:
(Note, on Windows, use wsl --% dfx ....)

dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai new_chat '()'
# You can build the prompt in multiple calls
dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = "Lilly went to"           : text; steps = 0  : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = "the beach this morning." : text; steps = 0  : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = "She saw a little boat"   : text; steps = 0  : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = "with her friend Billy"   : text; steps = 0  : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
# Followed by building out the story
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call  --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
dfx canister call --network ic  4c4bn-daaaa-aaaag-abvcq-cai inference '(record {prompt = ""                        : text; steps = 10 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'

@domwoe has pointed out that doing client-side orchestration like this is good & flexible, but there are benefits to moving the orchestration calls on-chain as well, for speed & cost reasons. We’re going to look into that, but I wanted to push this new capability out ASAP.

4 Likes

Hey @icpp,

how do you handle randomness at the moment? Does the same prompt give the same answer, or does icpp take care of this and use the randomness API of the Internet Computer?

2 Likes

That is a great question! The way randomness is introduced right now (this is just the original code from llama2.c) is by starting with a random seed based on the time in ns. For that I am using the canister’s System API for time.

Then a pseudo-random number is generated from this using some bit twiddling. I have not studied that aspect much yet.

I will check out the IC randomness API. That sounds like a really interesting improvement over the current approach.

…btw… the LLM has two settings. If you set the temperature parameter to zero, you will get the same story each time. If you set it to a non-zero value, randomness is introduced as explained above and used by the sampler, which picks the next token from the list of most probable tokens predicted by the LLM.
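
For the curious, the “bit twiddling” in llama2.c’s sampler is an xorshift-style generator, roughly as sketched below in plain C++. canister_time_ns() is a placeholder for the canister time System API mentioned above; the host-clock fallback is only there so the sketch compiles outside a canister.

#include <chrono>
#include <cstdint>

// Placeholder for the IC time System API (nanoseconds); the host clock is
// used here only so this sketch compiles outside a canister.
static uint64_t canister_time_ns() {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::system_clock::now().time_since_epoch())
      .count();
}

static uint64_t rng_state = 0;

// Seed once, e.g. when a new chat starts. With temperature == 0 the sampler
// always takes the most probable token, so the seed never comes into play.
static void seed_rng() { rng_state = canister_time_ns(); }

// xorshift64*-style generator, as used by llama2.c's sampler.
static uint32_t random_u32() {
  rng_state ^= rng_state >> 12;
  rng_state ^= rng_state << 25;
  rng_state ^= rng_state >> 27;
  return (uint32_t)((rng_state * 0x2545F4914F6CDD1DULL) >> 32);
}

// Random float in [0, 1), used to sample the next token when temperature > 0.
static float random_f32() { return (random_u32() >> 8) / 16777216.0f; }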

7 Likes

Unfortunately I had to make the icpp-llm repository private for now.

The hosted version of the on-chain LLM is still active and will continue to improve rapidly.

UPDATE on Sep 25, 2023:

  • The repo is public again, although renamed: icpp_llm
2 Likes

Hi All,

Some of you reached out asking why I made the icpp-llm repository private.

The main reason is that I wanted to bring some sanity to the topic of LLMs running in canisters of the IC.

One has to understand that this is a very interesting and promising field of research, and that is exactly what the state of the art is: R&D. A lot of hard work still has to be done to make it a reality.

To enable sharing of my research in LLMs running in canisters, I created a dApp that allows collaborators and partners to exercise the available & deployed LLMs.

The name of my dApp is ICGPT Labs.

  • ICGPT = Internet Computer Generative Pre-trained Transformers
  • Labs = it is research! We are making amazing strides, but the models today are still small

The Beta Preview of the frontend has now been deployed. It is still a little rough around the edges, but I think it is already very cool to see a live token stream coming back from the LLM canister.

Feel free to try it out and give me feedback. You must log in with your Internet Identity, and then you can build stories. All real, all live on main-net!

You can find it here: https://icgpt.icpp.world/

A few screenshots to whet your appetite:

The login screen:



The New Chat page, which lets you select the LLM you want to try and provides an input area for the prompt. Right now we only have one LLM deployed: the Tiny Stories model with 15M parameters, which has been the workhorse of my R&D so far:


Once you click Submit, the frontend connects to the selected LLM and starts the inference.


The tokens stream in progress!

Once the initial part of the story has been built, you can continue to build it out, or start over.

7 Likes

As I mentioned, it is a Beta Preview:

  • I will update the dApp without notice. If it is not active, please try again after some time

If you want to provide feedback, please do it in this thread, or in the C++ community on OpenChat.

6 Likes


:heart_eyes:

2 Likes

I updated the styling for ICGPT Labs and it has a custom domain.

It now works a lot more smoothly on mobile, and the logic to create longer stories in multiple steps is also fixed.

Unless I get requests for further styling improvements, I am going to focus next on building out the LLMs, in this order:

  1. Tiny Stories, 15M, fine-tuned for Chat. That will allow you to describe to the LLM what kind of story you want, which is a more natural experience.
  2. Tiny Stories, 42M & 110M, fine-tuned for Chat. See if I can get these larger models into a canister. The 15M model already produces pretty decent stories, but these larger models do a much better job at producing comprehensive stories with proper grammar and a sensible plot.
  3. Llama2, 7B, fine-tuned for Chat. This is the milestone we need to reach for real applications. It will be a huge challenge, but I am very encouraged by the results so far.

I had a great Zoom call today with @LowFreeKey to brainstorm how a canister-based LLM could fit into kawak. We feel that once we reach the Llama2-7B level and the response time is reasonable, it would have real potential. The kawak dApp is focused on privacy, and having to send data out to an off-chain LLM at e.g. OpenAI would break that.

9 Likes

Who will pay for the traffic?

If you generously offer LLMs for free, where do you obtain the funds?

I started to think about using Llama 2 to check my upcoming social network posts for spam, like:

Is the following spam (answer "yes" or "no")?

XXX

and then check response.lowerCase().startsWith('yes') (pseudocode).

So, I am going to call Llama 2 on every one of my posts. That may add up to a significant amount of funds to be paid for cycles. Is that OK with you?

@qwertytrewq ,
thank you for sharing that use case.

Let me answer it with a couple of questions :slightly_smiling_face:

  • Would you be willing to pay for such a service?
  • If no, would you be interested in an ad-supported offering?
  • If yes, would you prefer a fixed subscription fee or usage-based pricing?

Would you be willing to pay for such a service?
Yes, but only if the price is low.

If no, would you be interested in an ad-supported offering?
It is usage through an API. I don’t understand where ads could fit into this relationship.

If yes, would you prefer a fixed subscription fee or usage-based pricing?
Usage-based is fairer for both parties.

I could also deploy my own canister, but I doubt there is value in duplicating the functionality.

BTW, what are the obstacles that prevent Llama 2 from being deployed yet, so that only TinyStories is deployed?