Introducing the LLM Canister: Deploy AI agents with a few lines of code

How large could the token output be in the future? 200 tokens now is very limited.

I agree it is very limited. We plan to increase this to 500 within days, and we’ll work towards increasing it further as we gain more confidence in the stability of the system.

How large is the token input now, and any idea of the future?

The input is currently bounded at 10KiB. In principle we can raise it to 2MiB, which is the maximum request size; going beyond 2MiB would require some chunking. Are you already hitting the 10KiB limit?
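If you do approach the bound, splitting on the client side is straightforward. A minimal sketch in Python (the helper is hypothetical; only the 10KiB figure comes from the limit above):

```python
LIMIT_BYTES = 10 * 1024  # current per-request input bound (10 KiB)

def split_prompt(text: str, limit: int = LIMIT_BYTES) -> list[str]:
    """Split text into chunks that each fit within `limit` UTF-8 bytes,
    never cutting a multi-byte character in half."""
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        # back off while the split point lands inside a multi-byte character
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks
```

Each chunk can then be sent as its own request (e.g. with a running summary carried between calls) until larger inputs are supported natively.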

Do you have plans to incorporate Llama 4? Any news or plans about it?

Great question. We’re doing some research to see how well we can support it. More on that soon :slight_smile:

Quick update: the limit on the number of tokens that can be output per request has been increased from 200 to 1000. The former limit was, as already suggested, quite constraining, and now that the system has proven stable, we don’t see a problem with handling the additional load.

Thanks for the answers.

This reality confirms for me that we must use HTTPS calls to external open source LLMs.

I do hope that OpenAI, Anthropic, and Google support IPv6 addresses. Given what I’ve seen from CaffeineAI, that seems to be the case.

The use case for this LLM Canister seems quite limited for now. I’m trying to think of what a useful application could be, but I’m not sure so far. Still, it’s great to see the research effort, and I hope we can eventually run full small-size LLM models.

Thanks for the feedback, @josephgranata. Regarding closed-source LLMs, would you prefer to use them because of their quality or is there another reason? We’re looking into adding support for Llama 4, so hopefully in that case the LLM canister can prove to be more useful.

Mr. El-Ashi, I do in fact prefer open-source models, the best we can possibly use, like DeepSeek and Llama 4. However, from an ease-of-building perspective, the APIs from Claude, Google, and OpenAI are worth considering, and the HTTPS route could be a good way to integrate that power into innovative IC AI applications.

That is why I asked whether we can use all of them via IPv6 or not. For open-source models it does not matter, because we can host them wherever we want and make sure they run on an IPv6-compatible server.

Thanks for the input @josephgranata.

I don’t know personally, but perhaps you can consult their documentation? There’s another challenge: making these calls return a deterministic output. Some of these providers offer a seed parameter for determinism, but that determinism isn’t guaranteed.

@ielashi: Any chance of adding a model for doing embedding? Maybe one of these: A Guide to Open-Source Embedding Models

With that, we could create RAG solutions on-chain (using a vector DB in Rust). Even with a 10KiB input limit, that would considerably expand the range of applications that could be built with the LLM Canister.

@ielashi The LLM Canister is really good!

You can see it generating captions for memes using our OC bot at the Open Chat Botathon

A ‘scary’ example:

That’s awesome, thanks for sharing! :slight_smile:

Any chance of adding a model for doing embedding? Maybe one of these: A Guide to Open-Source Embedding Models

Thanks for the feedback. Which embedding models would you like to use? We can look into it.

Hey Islam,

When I try to run dfx deps pull for the llm canister:

"llm": {
  "type": "pull",
  "id": "w36hm-eqaaa-aaaal-qr76a-cai"
},

On dfx version 0.26.1 and higher, it shows the following error message:

Pulling canister w36hm-eqaaa-aaaal-qr76a-cai…
Error: Failed to create and install canister w36hm-eqaaa-aaaal-qr76a-cai
Caused by: Failed to create canister w36hm-eqaaa-aaaal-qr76a-cai
Caused by: Failed to read controllers of canister w36hm-eqaaa-aaaal-qr76a-cai.
Caused by: The replica returned an HTTP Error: Http Error: status 400 Bad Request, content type “text/plain”, content: Canister w36hm-eqaaa-aaaal-qr76a-cai does not belong to any subnet.

But the same command works fine when using dfx version 0.25.0

Do you know why? And could you also make the canister’s deps pull work with the latest dfx version, 0.27.0?

Hey aespieux, this seems to be an issue with the switch from the replica to PocketIC (which applies to dfx versions above 0.25.0). The topology of PocketIC differs from the replica’s. This is a bug and we’re working on a fix.
So for now, it’s not possible to pull dependencies with dfx versions above 0.25.0.

LLMs on the Internet Computer - Major Update

Hey everyone, I’m Dave and I recently joined the AI team at DFINITY. I’ve been working alongside Islam on some updates to the LLM canister:

tl;dr
We’ve added tool calling support and two new models (Qwen3 32B and Llama4 Scout) to make building AI agents on the IC even more powerful and accessible.

What’s New

Tool Calling Support

LLMs can now execute functions and interact with external systems! This enables building sophisticated agents that can:

  • Process real-time data
  • Interact with databases
  • Call external APIs
  • And much more!

New Models

We’ve expanded our model selection. Here is the currently supported model list:

  • Qwen3 32B (qwen3:32b) - Powerful multilingual model with excellent reasoning capabilities
  • Llama4 Scout (llama4-scout) - Latest from Meta with improved performance
  • Llama 3.1 8B (llama3.1:8b) - Still available and reliable

Tool Usage Examples And More Details

How Tool Calling Works, an Overview

  1. Define your tools - Specify what functions the LLM can call and their parameters
  2. Send request - Include tools in your chat request using the v1_chat endpoint
  3. LLM decides - The model determines if/when to use tools based on the conversation
  4. Execute function - Your application executes the requested function
  5. Return result - Send the tool result back to continue the conversation

The LLM response includes tool_calls when it wants to execute a function:

{
  "message": {
    "content": null,
    "tool_calls": [{
      "id": "call_123", 
      "function": {
        "name": "get_current_weather",
        "arguments": [
          {"name": "location", "value": "Paris"},
          {"name": "format", "value": "celsius"}
        ]
      }
    }]
  }
}
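Handling such a response (steps 4 and 5 above) amounts to looking up the named function, calling it with the supplied arguments, and returning the result as a tool message. A minimal dispatch sketch, where the tool registry, the weather function, and the "role": "tool" result shape (borrowed from common chat APIs) are all hypothetical:

```python
def get_current_weather(location: str, format: str) -> str:
    # Hypothetical tool implementation; a real agent would fetch live data.
    return f"18 degrees {format} in {location}"

TOOLS = {"get_current_weather": get_current_weather}

def execute_tool_calls(message: dict) -> list[dict]:
    """Run each requested tool and build the tool-result messages
    to send back in the follow-up chat request (step 5)."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        # arguments arrive as a list of {"name": ..., "value": ...} pairs
        kwargs = {arg["name"]: arg["value"] for arg in fn["arguments"]}
        output = TOOLS[fn["name"]](**kwargs)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": output,
        })
    return results
```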

We’d love to hear your feedback on this LLM integration approach so we can continue refining it. And we’re always curious to learn about the projects you’re building with LLMs, feel free to share what you’re working on!

Tested it in chat but got “Sorry, we failed to exucte the bot command”.

This is a great upgrade, congrats!!

And right in time for WCHL :wink:

I actually have some important feedback from the Bootcamps / Hackathons. Many teams “stumble” the moment they also want to use OpenAI or any other API and face:

  • the extra calls required for HTTPS outcalls;
  • the difficulty of handling the different replies from each API model.

I tell them to use the same AI Worker pattern, but for all of them it’s a “hassle” they can’t overcome within the duration of a Hackathon. So they are always limited to the models and tools you provide.

But for the future, and to solve this once and for all, I thought it would be great if there were an easy way to “generalize” / “open up” the LLM Canister and the AI Worker that fulfills its requests.

The LLM Canister is open sourced, but not the AI Worker, right? Could a version be developed where it’s distributed as a Docker image, so teams only need to add an OpenAI API key and deploy it to some off-chain cloud (like AWS, DigitalOcean, Vercel, Supabase, etc.)?

It would only need to expose the same methods/tools that the DFINITY models already support. If a team wants other tooling, they can fork the source code, extend it, and build a new Docker image.

Could this be considered internally? Having the models is already great and maybe serves 50% of the cases. But for competitive startups that want to offer the best AI model out there, an API-friendly version would be great :+1: That way, I think we would satisfy 99% of the needs.

Just some food for thought :folded_hands:

PS.: tagging @ielashi and @aespieux for visibility.

Hey @ielashi, @ddave and team,

Thanks a lot for the recent updates to the LLM canister, it’s really exciting to see this taking shape!

One suggestion: it would be super useful to support an embedding model for RAG workflows. Something like all-MiniLM-L6-v2 from Sentence Transformers would be a great fit. It’s lightweight, fast, and works really well for semantic search and retrieval.

Would love to see this added, happy to test or help if needed!

Thanks for the feedback and suggestions, @tiago89 and @aespieux!

Open Source and Deployment Options

“The LLM Canister is open sourced, but not the AI Worker, right? Could a version be developed where it’s distributed as a Docker image, so teams only need to add an OpenAI API key and deploy it to some off-chain cloud (like AWS, DigitalOcean, Vercel, Supabase, etc.)?”

To clarify: neither the canister nor the worker is currently open source. However, you can download the canister binary to run it locally, which connects directly to OpenRouter or a local Ollama instance (see our documentation for setup details).

We’re not opposed to open sourcing both components. The main consideration is that integrating with specific LLM APIs (like OpenAI’s) would still require users to do more than just add an API key; they’d need to implement the integration themselves.

@tiago89, would it be helpful if we created an example project as a starting template? This would show how to recreate the same architecture described in the original post. Keep in mind that anyone wanting to use their own setup would still need to deploy the off-chain worker outside of the IC.

Embedding Models for RAG

“One suggestion: it would be super useful to support an embedding model for RAG workflows. Something like all-MiniLM-L6-v2 from Sentence Transformers would be a great fit. It’s lightweight, fast, and works really well for semantic search and retrieval.”

We’d be happy to add more models! However, we’re currently limited to OpenRouter’s available models.

@aespieux, do you know if OpenRouter offers any embedding models that would work for this use case? If not, we could explore other integration approaches for embedding support.

Hi @ddave, thanks for the update and openness to expanding support!

Currently, OpenRouter only supports LLMs; there’s no embedding model available via its API, as discussed on this forum. To enable embedding/RAG workflows, we could download a model like all‑MiniLM‑L6‑v2 from Hugging Face and run it on an AI runner machine adjacent to the node. That runner could expose a simple HTTP service so the node can request embeddings without external API calls.

Proposed enhancement:

Add a default embedding model (e.g. all‑MiniLM‑L6‑v2) to the canister, similar to the chat/completion models, with a method like:
embed(text: string) → vector

This would simplify RAG pipelines, keep data within the IC ecosystem, and remove the friction of HTTPS outcalls.
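Given such an embed method, the retrieval side of a RAG pipeline reduces to a nearest-neighbour search over the stored vectors. A pure-Python cosine-similarity sketch (no vector DB; the vectors are assumed to have been produced by the proposed embed method):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 3):
    """Return the k most similar (doc_id, score) pairs from the index,
    where each index entry is (doc_id, embed(doc_text))."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

The retrieved documents would then be concatenated into the chat prompt, which is where the current input limit becomes the main constraint.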

@aespieux I understand your use case, but unfortunately we cannot support more use cases at this time. We’ve decided to maintain the OpenRouter inference service as-is rather than expanding to additional use cases, since the possibilities are endless and we’d end up trying to support too many different scenarios.

However, I think we can help address the concerns raised by both @aespieux and @tiago89 with an easily reproducible approach. I’ve created a reference implementation of an off-chain worker that can be extended with any kind of model, API, or LLM environment with minimal effort. You would still need to host the canister as well as the off-chain worker (and potentially the model inference) as DFINITY does for the LLM service above. This provides an easier path to what you’re looking for.

Here’s the reference implementation for your use: GitHub - davencyw/ic-off-chain-worker: Internet Computer Off-Chain Worker Example

If you encounter any issues, have questions, or want to propose improvements, please feel free to contribute to the repo or open an issue.

I realize this doesn’t fully solve your specific problems, but I hope it helps to some degree while allowing us to maintain focus on our core offering.

Yes!!

This example is already a lot of what I was looking for :folded_hands:

Antoine and I will experiment with it and maybe create a guide on it for the WCHL, but looking at the README, it looks pretty complete.

It even has analytics / a dashboard showing the current state of the workers. Very interesting indeed, thanks!

Have a great week :folded_hands:

So technically, I can deploy a self-hosted off-chain worker that runs my own LLM. Then I deploy two canisters: the on-chain LLM canister and my chatbot canister. Then I can follow the same flow you do:

  1. The chatbot canister calls the LLM canister.
  2. The LLM canister receives the message via ingress and queues it.
  3. The off-chain worker polls the queued message from the LLM canister.
  4. The worker processes it with a local LLM (or any other backend).
  5. The off-chain worker calls the LLM canister to mark the message as processed and to insert the newly generated response.
  6. The LLM canister returns the response to the chatbot canister.

Am I right? If so, what should I watch out for? I saw that you’ve been working on improvements such as extending the input/output limits. Can you share some of the difficulties and obstacles you’ve encountered?