Introducing the LLM Canister: Deploy AI agents with a few lines of code

Hey everyone!

Based on the discussions we’ve been having at the DeAI working group, I’m happy to share an announcement regarding building AI agents on the IC.

tl;dr

You can now access LLMs from within your canisters with just a few lines of code.

Demo Agent

Prompting

Rust:

use ic_llm::Model;

ic_llm::prompt(Model::Llama3_1_8B, "What's the speed of light?").await;

Motoko:

import LLM "mo:llm";

await LLM.prompt(#Llama3_1_8B, "What's the speed of light?")

Chatting with multiple messages

Rust:

use ic_llm::{Model, ChatMessage, Role};

ic_llm::chat(
    Model::Llama3_1_8B,
    vec![
        ChatMessage {
            role: Role::System,
            content: "You are a helpful assistant".to_string(),
        },
        ChatMessage {
            role: Role::User,
            content: "How big is the sun?".to_string(),
        },
    ],
)
.await;

Motoko:

import LLM "mo:llm";

await LLM.chat(#Llama3_1_8B, [
  {
    role = #system_;
    content = "You are a helpful assistant.";
  },
  {
    role = #user;
    content = "How big is the sun?";
  }
]) 

How does it work?

The processing of LLM prompts is done by what we call “AI workers”: stateless nodes whose sole purpose is to process these prompts.

Here’s how the flow works:

  1. Canisters send prompts to the LLM canister.
  2. The LLM canister stores these prompts in a queue.
  3. AI workers continuously poll the LLM canister for prompts.
  4. AI workers execute the prompts and return the response to the LLM canister, which returns it to the calling canister.
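
For canisters written in languages that don’t have a library yet, the LLM canister can also be called directly. Below is a rough Rust sketch of such a call. The v0_chat method name, the model string, and the overall record shape are taken from the Azle example later in this thread, but the exact Candid field and variant names are assumptions, so verify them against the canister’s Candid interface before relying on them.

use candid::{CandidType, Deserialize, Principal};

// Field and variant names below are modeled on the Azle example further down
// this thread; check them against the canister's Candid file.
#[derive(CandidType, Deserialize)]
enum Role {
    #[serde(rename = "user")]
    User,
    // Other role variants (e.g. system, assistant) exist as well.
}

#[derive(CandidType, Deserialize)]
struct ChatMessage {
    role: Role,
    content: String,
}

#[derive(CandidType, Deserialize)]
struct ChatRequest {
    model: String,
    messages: Vec<ChatMessage>,
}

async fn ask(question: String) -> String {
    let llm_canister = Principal::from_text("w36hm-eqaaa-aaaal-qr76a-cai").unwrap();
    let request = ChatRequest {
        model: "llama3.1:8b".to_string(),
        messages: vec![ChatMessage {
            role: Role::User,
            content: question,
        }],
    };
    // v0_chat takes a single chat_request and returns the generated text.
    let (response,): (String,) = ic_cdk::call(llm_canister, "v0_chat", (request,))
        .await
        .expect("call to the LLM canister failed");
    response
}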

Note that this release is an MVP. We’ll be iterating quickly based on feedback, and potentially making breaking changes along the way. Rather than trying to figure out the perfect API and design before enabling you to build AI use cases on the IC, we’re taking an iterative approach. To that end:

  • The LLM canister is, for now, controlled by the DFINITY team. Once it’s stable, control will be given to the protocol.
  • The AI workers are, also for now, managed by the DFINITY team. We’ll work towards making them decentralized, and there are various ways to achieve this. Which way to go will depend primarily on your feedback and preferences. There will be more to share about decentralization designs in the future.

Resources

Libraries

We’ve created libraries to facilitate using the LLM canister: the ic_llm crate for Rust and the mo:llm package for Motoko.

Introducing similar libraries in other languages, such as TypeScript or Python, would be a very welcome contribution :slight_smile:

Examples

Quickstart Example

A very simple example that showcases how to get started building an agent on the IC.

FAQ

Q: What models are supported?
A: For now, only Llama 3.1 8B is supported. More models will be made available based on your feedback.

Q: What is the cost of using the LLM canister?
A: It’s free for now. As the system and use-cases mature we can evaluate and set the costs accordingly.

Q: Are there any limitations on the prompts I can make?
A: Yes. We’ve added a few limitations on how big prompts and answers can be, and we’ll gradually relax these over time:

  • A chat request can have a maximum of 10 messages.
  • The prompt length, across all the messages, must not exceed 10KiB.
  • The output is limited to 200 tokens.
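
As a rough illustration of the first two limits, here is a hypothetical Rust helper a calling canister could use to validate a chat request before sending it. The constants simply mirror the numbers above; the output-token limit is enforced on the server side, so it isn’t checked here.

// Hypothetical client-side check mirroring the limits listed above.
const MAX_MESSAGES: usize = 10;
const MAX_PROMPT_BYTES: usize = 10 * 1024; // 10 KiB across all messages

fn validate_chat_request(messages: &[ic_llm::ChatMessage]) -> Result<(), String> {
    if messages.len() > MAX_MESSAGES {
        return Err(format!("at most {MAX_MESSAGES} messages are allowed"));
    }
    let total_bytes: usize = messages.iter().map(|m| m.content.len()).sum();
    if total_bytes > MAX_PROMPT_BYTES {
        return Err(format!(
            "total prompt size of {total_bytes} bytes exceeds the {MAX_PROMPT_BYTES} byte limit"
        ));
    }
    Ok(())
}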

Q: Are my prompts private?
A: Yes and no. The Internet Computer as a whole doesn’t yet guarantee confidentiality, and the same is true for the AI workers. Someone who has an AI worker running can in theory see the prompts, but cannot identify who made the prompt. For DFINITY specifically, we do not log these prompts, but do log aggregated metrics like the number of requests, number of input/output tokens, etc.

One area we’re exploring is using TEEs for the AI workers, but these do come at a cost, and determining whether or not it would make sense would depend on your use-cases and feedback.

Q: What is the principal of the LLM canister?
A: w36hm-eqaaa-aaaal-qr76a-cai

Q: Where is the source-code of the LLM canister and the AI worker?
A: These are not yet open-source, as the current implementations are mostly throw-away prototypes, but they will be open-sourced eventually as this work matures.

Next Steps

  • Improving the latency
    We’ll be looking into ways to improve the end-to-end latency of these requests, as well as supporting non-replicated calls for snappier experiences that don’t require the same level of trust.

  • Decentralize the AI workers powering the LLMs
    We have a few options, ranging from deploying these workers across node providers to exploring the idea of “badlands”, where anyone with the right hardware can permissionlessly run an AI worker from their home or data center of choice.

Final thoughts

This work deviates from the approach we’ve previously been exploring at DFINITY, which was to run the LLMs inside a canister. We did our research and identified key improvements needed to make these models performant, including improving the way canisters handle I/O and optimizing matrix multiplication. However, even with these optimizations, it would only be possible to run models in the ~1B parameter range. For the larger and more useful models (8B, 70B, and beyond), we needed to rethink our approach.

In the long term, we’ll continue investing in the performance of canisters to enable running larger models, but in the meantime, we think the approach outlined here of using AI workers would unlock AI use cases on the IC much more quickly and reliably, and without sacrificing the IC’s trust model.

We’d love to get your feedback on this work. Try it out, build some agents, and let us know what use cases you have in mind. Let us know if there’s anything we can help you with.

52 Likes

On behalf of the Seers team, we love you guys.

5 Likes

Huge milestone! This marks the beginning of unlocking numerous use cases for developers. We at @RuBaRu will explore how we can leverage this and request the necessary models. Kudos 👏

4 Likes

This is a big achievement. Shoutout to all the folks who have worked hard in the DeAI technical working group and at DFINITY to bring LLMs to reality on ICP via a few lines of Motoko code :rocket:

6 Likes

This is huge, guys, thanks! We will start developing some features and come back with specific feedback. In principle, the obvious extensions include adding more models, like Grok 3 and the latest from Anthropic and OpenAI, and extending the output limit.

Creating a Badlands for AI compute could be awesome, something like Prime Intellect.

One feature we want to implement in Seers is a truly autonomous DAO. This means the DAO would be controlled by a set of AI models that learn from user feedback, implement changes, and deploy code through proposals. To me, this should be the ultimate goal of Caffeine—every DAO on the IC becoming Fully Autonomous.

Impressive work @ielashi. Curious about the AI worker (is it GPU-powered?) and your plans for the future regarding bigger/more powerful models. Especially curious about determinism and GPUs.

6 Likes

Great! I will test this and provide feedback.

1 Like

Once you have a did file available, it should be dead simple to add support for the LLM canister to Azle canisters.

I assume the LLM canister’s Candid isn’t that crazy to work with right? Azle canisters work off of auto-generated IDL type objects (like from @dfinity/candid).

Please make the Candid file very nice, giving all parameters and return types their own type names.

Looking at the Candid now: https://dashboard.internetcomputer.org/canister/w36hm-eqaaa-aaaal-qr76a-cai

It would be nice if all complex nested types had their own names, and if you could create enums for the models…if you made the types a bit better, I’m thinking a special library in Azle wouldn’t be necessary, as the generated IDL type objects and TypeScript types would be very declarative.

3 Likes

Thanks for the great feedback, everyone :slight_smile: Really excited to see what you build with this.

That’s interesting. We have been focusing on open-source models because those, with proper replication, can be decentralized across nodes. Closed models fundamentally cannot be decentralized, so accessing them via the AI workers wouldn’t really give you more trust than making HTTPS outcalls to these APIs directly. Do you see it differently?

Agreed. This would be a great topic to discuss with the DeAI working group. I plan to explore this in more depth over the coming weeks, and maybe then I can share a concrete proposal for what this could look like. The fact that AI workers are stateless makes them substantially simpler to decentralize in a permissionless way compared to a subnet.

Thanks @dralves, great to hear from you! :slight_smile: The current node is an MVP that’s using an external provider to process the prompts, but over the coming weeks we’ll change that so that the inference is done on the nodes themselves, and for that we’ll very likely be using GPUs.

Regarding determinism, you’ll notice that the LLM canister doesn’t give you the same response for the same prompt. That’s because it uses the random beacon as the seed when responding to a prompt. We can change that, however, based on feedback, or even expose the seed as a parameter if there are applications that need determinism for the same prompts.

As for determinism given a particular seed: the deep learning libraries I’ve tried were not deterministic when using GPUs, specifically in the sampling step once the logits are computed. I believe this non-determinism comes from the implementation of the sampler rather than the GPU hardware itself, so one of the things we’ll look into is re-implementing the sampler to get deterministic outputs for the same seed on GPUs.
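
To illustrate what a deterministic sampler means in practice, here is a small Rust sketch (purely illustrative, not the AI workers’ actual code, and the constants are arbitrary): given the same logits and the same seed, it always selects the same token, because both the pseudo-random number generator and the order of the floating-point reductions are fixed.

// Illustrative sketch only; not the AI workers' implementation.
// Given the same (logits, seed) pair, this always returns the same token index.
fn sample_token(logits: &[f32], temperature: f32, seed: u64) -> usize {
    // Softmax computed with a fixed, sequential reduction order, so the
    // probabilities are bit-for-bit reproducible across runs.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // xorshift64*: a tiny deterministic PRNG; the seed could come from the
    // random beacon or be supplied by the caller.
    let mut x = seed.max(1);
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    let r = (x.wrapping_mul(0x2545F4914F6CDD1D) >> 40) as f32 / (1u64 << 24) as f32;

    // Inverse-CDF sampling over the softmax distribution.
    let mut acc = 0.0f32;
    for (i, e) in exps.iter().enumerate() {
        acc += e / sum;
        if r < acc {
            return i;
        }
    }
    logits.len() - 1
}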

Sure, let me add names to all the candid types. Enums can be a pain to work with in Candid and can impose limitations on the interface, so that’s why I opted to use strings in the LLM canister API and introduce enums at the library level.

4 Likes

I see. It would be great if the API could support closed models as well. Otherwise, we’d need to handle deduplication of outputs and often rely on a Web2 gateway to interact with them, and people will need to trust each specific dev team. Everyone will end up implementing similar code. Also, even when models are open, predicting their responses to specific inputs can be quite challenging. So I don’t see them as totally ‘open’ until we can create/train them in a decentralised way. Grok seems to have a strong advantage over open models now, so having support for it would be cool.

4 Likes

Indeed great work @ielashi !

Chiming in on candid interface.

Would be great if v0_chat could return a Result type.

A couple of thoughts here:

  • I pretty much use only closed AIs, so those would be nice.
  • There has been a latent need for untrusted outcalls for a while… so maybe this kills two birds with one stone.
  • This brings up the API key problem.
  • Maybe a TEE can help with that.
  • What if the reply to the call came back in via a function call? This way we at least have attribution of who ran the code and what their answer is. This would open up the possibility of slashing or other kinds of tokenomic mechanisms to increase trust.

2 Likes

I created a Python notebook to interact with it:

@ielashi, will you introduce the assistant role soon and expand the number of tokens allowed?

I set up my notebook in anticipation of that, so we can have a back & forth conversation.

4 Likes

The idea of supporting closed models brings up interesting questions. @Mar @skilesare do you have specific use-cases in mind, and do these use-cases necessitate closed models?

That’s fantastic, thanks for the great work! :slight_smile: Yes, I just added the assistant variant to the LLM canister today and updated both the Rust and Motoko libraries accordingly. Maybe once this matures we can wrap it in a Python package?
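
For anyone who wants to try the back-and-forth flow right away, here is a rough Rust sketch of a multi-turn conversation. It assumes the newly added assistant variant is exposed as Role::Assistant in the ic_llm crate and that chat returns the generated text as a String; check the library for the exact names and signatures.

use ic_llm::{ChatMessage, Model, Role};

// Sketch of a two-turn conversation. The model's first answer is fed back
// as an assistant message so it can resolve the follow-up question.
async fn follow_up() -> String {
    let question = "How big is the sun?".to_string();

    // First turn.
    let answer = ic_llm::chat(
        Model::Llama3_1_8B,
        vec![ChatMessage {
            role: Role::User,
            content: question.clone(),
        }],
    )
    .await;

    // Second turn: include the earlier exchange in the message history.
    ic_llm::chat(
        Model::Llama3_1_8B,
        vec![
            ChatMessage { role: Role::User, content: question },
            ChatMessage { role: Role::Assistant, content: answer },
            ChatMessage {
                role: Role::User,
                content: "And how does that compare to the Earth?".to_string(),
            },
        ],
    )
    .await
}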

2 Likes

Yes, I was thinking the same. We should be returning the role from the LLM, in addition to whether the LLM has finished generating the output tokens or hit the limit (which is set to 200 for now). Do you see other data that would be helpful to return?
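
To make that concrete, the richer return value being discussed could look something like the following sketch. This is just one possible shape, not a committed interface; the type and field names are made up for illustration.

use candid::{CandidType, Deserialize};

// Hypothetical richer response for v0_chat; names are illustrative only.
#[derive(CandidType, Deserialize)]
enum FinishReason {
    // The model emitted an end-of-sequence token on its own.
    Stop,
    // Generation was cut off at the output-token limit (currently 200).
    MaxTokens,
}

#[derive(CandidType, Deserialize)]
struct ChatResponse {
    // Role of the returned message, e.g. "assistant".
    role: String,
    // The generated text.
    content: String,
    // Whether the model finished or hit the token limit.
    finish_reason: FinishReason,
}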

1 Like

Different models excel at different use cases. Some are better at simulating personas, while others are more suited for research. Currently, the best and least “lobotomized” model is Grok 3. Previously, we used Claude for its response quality, particularly in AI personas.

You can likely achieve similar results with open models, but at a lower quality, which may make your app less competitive. For example, some models refuse to answer certain questions, while others respond as truthfully as possible.

On the other hand, we are also interested in model diversity. Since we plan to use them to help govern the DAO, we might, for instance, present them with a proposal and have them vote. Different models may vote differently because they are trained on different sets of rules. Even if your dapp is focused on trading, you might still want to gather opinions from a diverse set of models.

For example, some models might have a favorable view of figures like Trump and Musk, while others might be more critical. Similarly, some might discuss Taiwan in a more neutral or diplomatic way, while others might take a strong stance. These differences in perspective can be valuable when designing systems that require nuanced decision-making and need to be robust.

Hopefully, over time, we can become less and less dependent on closed-source models.

2 Likes

It is just where my experience is and the easiest for me to optimize against, because I don’t have to spin up hardware/VMs to use it.

Azle 0.27.0 has just been released. It has support for the LLM Canister. See the demo code here.

import { call, IDL, update } from 'azle';
import { chat_message, chat_request, role } from 'azle/canisters/llm/idl';

export default class {
    @update([IDL.Text], IDL.Text)
    async chat(prompt: string): Promise<string> {
        const role: role = {
            user: null
        };

        const chatMessage: chat_message = {
            role,
            content: prompt
        };

        const chatRequest: chat_request = {
            model: 'llama3.1:8b',
            messages: [chatMessage]
        };

        const response = await call<[chat_request], string>(
            'w36hm-eqaaa-aaaal-qr76a-cai',
            'v0_chat',
            {
                paramIdlTypes: [chat_request],
                returnIdlType: IDL.Text,
                args: [chatRequest]
            }
        );

        return response;
    }
}
6 Likes

FWIW, we just switched back to a closed model (Claude 3.7) because this Llama was generating stuff like this: