Llama.cpp on the Internet Computer

This thread discusses llama.cpp on the Internet Computer.

A project funded by the DFINITY Grant: ICGPT V2

The first functioning version is now MIT-licensed open source:

Current status:

  • on a Mac, you can build, deploy, and upload the LLM, and then call an endpoint to have it generate tokens.
  • it only works for a small model, because it does not yet make use of the recent advancements of the IC (SIMD, float handling, etc.).
  • the README of the GitHub repo contains a list of TODOs
It’s been a journey, but a pre-release of ICGPT with a llama.cpp backend is now live on the IC.

  • The deployed llama.cpp model is Qwen 2.5 - 0.5B - Q8_0 - Instruct
  • You can watch a small video here
  • You can try it out at https://icgpt.icpp.world/
  • A 0.5B model with q8_0 quantization fits fine in a 32-bit canister.
  • However, because of the instruction limit, generating an answer requires multiple update calls, so it takes about 2 minutes to get the answer to the question shown below.
  • We did not do any load testing, so it will be interesting to see how it holds up when multiple users try it out at the same time.
  • The UI is still primitive; it is the same one that was developed for the tiny story teller LLM. Improving it is on the to-do list.
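To get a feel for where the ~2 minutes comes from, here is a back-of-the-envelope estimate in shell arithmetic. All numbers here are illustrative assumptions, not measured values from the deployment:

```shell
#!/bin/sh
# Rough latency estimate for multi-update-call token generation.
# Assumed (illustrative) numbers: ~10 tokens generated per update
# call, ~4 s per update call round trip, ~300 token answer.
TOKENS_PER_CALL=10
SECONDS_PER_CALL=4
ANSWER_TOKENS=300

# ceiling division: number of update calls needed for the answer
CALLS=$(( (ANSWER_TOKENS + TOKENS_PER_CALL - 1) / TOKENS_PER_CALL ))
TOTAL_SECONDS=$(( CALLS * SECONDS_PER_CALL ))
echo "update calls: $CALLS, total: ~$TOTAL_SECONDS s"
```

With these assumptions the answer takes 30 update calls and roughly 2 minutes, which matches the observed behavior; the real per-call numbers depend on the model and subnet load.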

:tada: ICGPT V2 - final milestone reached :tada:

(I also posted this on X)

The grant work is now completed and here is a video summarizing what I created. I want to thank @dfinity for the opportunity & the support.

I also want to thank the #ICP community for the enthusiasm as I shared progress along the way over the past months, and for the testing some of you did with the early releases.

Some of you even donated cycles that will keep the Qwen2.5 canister up & running for several months. You are the best. :slightly_smiling_face:

You can try it out at: https://icgpt.icpp.world

I am very happy with the outcome of this project, and there are big plans to build on top of this foundation. More on this later. But first some time to celebrate this milestone :champagne::champagne:

YouTube Video

Worked better this time :muscle:

congrats, super cool! i hope this gets the use and attention it deserves

btw i’m playing with a solana project https://github.com/ai16z/eliza/ which can connect to remote or local LLMs

does this have API endpoints so that we could add it as one of the remote models? i can only write pidgin code, so it’s hard for me to eval how it works

@superduper ,
thanks for your feedback and for pointing out the eliza project. I will check it out.

About the API:

  • There are two endpoints, new_chat & run_update. These are Candid-based canister endpoints, and the links bring you to the Candid service definition. The README describes how to call them using dfx. This API mirrors the llama.cpp command line interface, which makes it really easy to test things locally and then use the same arguments when calling the canister.

  • We’re looking at creating another API that follows the OpenAI completions standard.
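As a concrete sketch of what driving those two endpoints with dfx might look like: the endpoint names (new_chat, run_update) come from this thread, but the exact argument records below are assumptions based on the repo README, so check the Candid service definition before relying on them. The `run` helper only prints the commands; remove the echo to actually execute them against a deployed canister.

```shell
#!/bin/sh
# Sketch: calling the llama_cpp_canister endpoints with dfx.
# `run` prints each command instead of executing it, so this
# script is safe to run without dfx installed.
run() { echo "$@"; }

# 1) start a fresh chat (assumed to reset the prompt cache)
run dfx canister call llama_cpp new_chat \
  '(record { args = vec {"--prompt-cache"; "prompt.cache"} })'

# 2) generate tokens, using the same flags as the llama.cpp CLI
run dfx canister call llama_cpp run_update \
  '(record { args = vec {"-p"; "Hello"; "-n"; "32"} })'
```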

Congratulations on the final milestone! Must have been some journey. :grinning:

We just got done with our first milestone of LLM Marketplace and will be getting into an interesting phase. In this phase I was planning to explore TinyLlama (1.1B parameters) and train it for a niche task.

But looking at your post I’m having second thoughts, primarily because your model is smaller than TinyLlama and quantized, yet it still hits the instruction limit. I will experiment and share my learnings.

I’m trying to brainstorm ideas for a tiny task it could be trained on while staying within the instruction limit.

Would appreciate your thoughts around it.

Cheers!

Hi @roger ,

The 1.1B TinyLlama will not fit.

I recommend you select a 0.5B parameter model, like the Qwen 2.5 model I am using, and try to fine-tune that one.

Yes, I have been contemplating using Qwen and also exploring a few other lightweight models like SmolLM and DistilGPT2.

While researching these, I am also trying to close in on a fun use case.

Thank you for the suggestion.

We completed the update to the latest llama.cpp version (sha 615212).

Please start fresh by following the instructions at GitHub - onicai/llama_cpp_canister: llama.cpp for the Internet Computer

This update allows you to run many new LLM architectures, including the 1.5 billion parameter DeepSeek model that attracted a lot of attention with this X post.

The main limiting factor for running the larger LLMs is the instruction limit. If a model can generate at least 1 token per update call, you can use it, because we generate tokens via multiple update calls. (See the README in the repo for details.)
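The multiple-update-call scheme can be sketched as a driver loop. Here `call_canister` is a hypothetical stand-in for the real dfx call to run_update with a prompt cache; it just emits two fake tokens per call so the loop structure is runnable as-is:

```shell
#!/bin/sh
# Driver-loop sketch: generate a long answer via repeated update
# calls, each of which returns only a few tokens.
# call_canister is a placeholder for something like:
#   dfx canister call llama_cpp run_update '(record { args = ... })'
TOKENS_WANTED=8
generated=0
answer=""
call_canister() { echo "tok tok"; }   # stand-in: 2 tokens per call

while [ "$generated" -lt "$TOKENS_WANTED" ]; do
  chunk=$(call_canister)
  answer="$answer $chunk"
  generated=$(( generated + 2 ))
done
echo "update calls used: $(( generated / 2 ))"
```

A real loop would also need a stop condition from the canister (end-of-sequence reached) rather than a fixed token budget.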

Latency is of course high, which hopefully will improve with further ICP protocol and hardware updates, but we believe it is already possible to build useful, targeted AI agents with their LLM running on-chain. It requires some smart prompt engineering, and this is an area where we are focusing our efforts.

To assist with prompt engineering, a Python notebook, prompt-design.ipynb, is included in the repository, which you can run against the original llama.cpp compiled for your native system.

A few notes on the testing we did with DeepSeek.

We tested this DeepSeek model, available on HuggingFace:

The model card on Huggingface shows this llama.cpp command:

./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
    --cache-type-k q8_0 \
    --threads 16 \
    --prompt '<|User|>What is 1+1?<|Assistant|>' \
    -no-cnv

Our initial tests confirmed that the parameter --cache-type-k q8_0 is important to get a good answer from the Q2_K quantized model.

To call the canister, you would use something like this:

dfx canister call llama_cpp run_update '(record { args = vec {"--cache-type-k"; "q8_0"; "--prompt-cache"; "prompt.cache"; "--prompt-cache-all"; "-sp"; "-p"; "<|User|>What is 1+1?<|Assistant|>"; "-n"; "512" } })'

(No need to pass -no-cnv, because that is a default for llama_cpp_canister)

You can generate 2 tokens per update call, so you configure the LLM with this call (details are in the README of the repo):

dfx canister call llama_cpp set_max_tokens '(record { max_tokens_query = 2 : nat64; max_tokens_update = 2 : nat64 })'

And that’s really it to get going with this model.
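One implication of that setting, worth spelling out: with max_tokens_update = 2 and "-n" 512, the full answer takes on the order of 512 / 2 = 256 update calls (plus the calls spent ingesting the prompt into the cache). A quick check:

```shell
#!/bin/sh
# Update calls needed to generate the full "-n" token budget when
# each update call produces at most MAX_TOKENS_UPDATE tokens.
N=512
MAX_TOKENS_UPDATE=2
CALLS=$(( N / MAX_TOKENS_UPDATE ))
echo "update calls for generation: $CALLS"
```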

Hi all,
I have picked up work again on llama_cpp_canister and the icpp-pro C++ CDK that is an enabler of it.

Since the last post in this thread (13 months ago), we successfully used llama_cpp_canister with the Qwen2.5-0.5b-instruct model inside the funnAI application, and it is running smoothly.

Just released llama_cpp_canister v0.9.0, in which we upgraded the model upload scripts from ic-py to the latest icp-py-core 2.3.0.

I will post updates in this thread.

If you have feature requests for either the C++ CDK or llama_cpp_canister, let me know.

This is great!

I have recently patched your code to handle upstream llama.cpp and BitNet i2_s.

We can run some 0.5B and 0.8B models really efficiently in chat formats; anything larger gets caught up in the instruction limit.

The best we have running right now in production are Falcon H1 0.5B and Qwen3.

In our test canister we have Falcon E 1B i2_s instruct and Qwen3 0.6B running. BitNet 1.58 large also runs, but it’s not tuned yet and is a base model, so it’s no good for ChatML. We are basically working on some edge cases before we move Falcon E 1B to production in our frontend for users to play with.

Our main goal is to deprecate the DFINITY LLM canister, which I believe is off chain? And hook our agent toolchain into Falcon E or some other comparable model.

I’ll be pushing the v5 wasm and the BitNet i2_s wasm to this repo shortly

@Henry_Suso

Very cool. Are you planning to create a PR back into llama_cpp_canister or forking off?

btw… I am fine either way – great that you’ve picked this up, and I will be more than happy to support from C++ side.

There are some things I want to add to the C++ CDK in support of running AI workloads and other high compute applications.

Yes, that is indeed off chain. I do not know where the off chain LLM is running.

Yes, that was also my experience. I did some detailed studies and summarized them in the table below (Appendix A). The biggest one we could fit was DeepSeek R1 with 1.78B parameters, but it can only produce 2-4 tokens per update call before hitting the instruction limit, depending on the level of quantization.


Appendix A: max_tokens

The size and settings of a model determine the number of tokens that can be generated in one update call before hitting the instruction limit of the Internet Computer, which is 40 billion instructions per update call.

We tested several LLM models available on HuggingFace:

| Model | # weights | file size | quantization | --cache-type-k | max_tokens (ingestion) | max_tokens (generation) |
|---|---|---|---|---|---|---|
| SmolLM2-135M-Instruct-Q8_0.gguf | 135 M | 0.15 GB | q8_0 | f16 | - | 40 |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | 630 M | 0.49 GB | q4_k_m | f16 | - | 14 |
| qwen2.5-0.5b-instruct-q8_0.gguf | 630 M | 0.68 GB | q8_0 | q8_0 | - | 12 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 1.24 B | 0.81 GB | q4_k_m | q5_0 | 5 | 4 |
| qwen2.5-1.5b-instruct-q4_k_m.gguf | 1.78 B | 1.10 GB | q4_k_m | q8_0 | - | 3 |
| DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant.gguf | 1.78 B | 1.34 GB | NexaQuant-4Bit | f16 | 4 | 3 |
| DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf | 1.78 B | 1.46 GB | q6_k | q8_0 | 4 | 3 |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf | 1.78 B | 1.12 GB | q4_k_m | q8_0 | 4 | 3 |
| DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf | 1.78 B | 0.75 GB | q2_k | q8_0 | 2 | 2 |
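From these numbers and the 40 billion instruction limit, one can back out a rough instructions-per-token figure for each model. This is a coarse estimate that ignores fixed per-call overhead; the model labels below are shorthand for the table rows:

```shell
#!/bin/sh
# Implied instructions per generated token:
# instruction limit / max_tokens (generation) per update call.
LIMIT=40000000000
for entry in "qwen2.5-0.5b-q8_0:12" "DeepSeek-R1-1.5B-Q4_K_M:3"; do
  model=${entry%:*}     # text before the colon
  tokens=${entry#*:}    # text after the colon
  echo "$model: ~$(( LIMIT / tokens )) instructions per token"
done
```

So the 0.5B q8_0 model lands at roughly 3.3 billion instructions per token, and the 1.5B Q4_K_M model at roughly 13.3 billion, which is why the larger models leave so little headroom per call.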

I can do that. If I’m honest, I am not very smart and don’t really know how to do it, but I will try to make a PR!

This is very cool! In my current v5 wasm I have Qwen3 0.6B running at 22 tokens and SmolLM2 360M running at 50 tokens.
