Hey everyone!
Based on the discussions we’ve been having at the DeAI working group, I’m happy to share an announcement regarding building AI agents on the IC.
tl;dr
You can now access LLMs from within your canisters with just a few lines of code.
Prompting
Rust:
use ic_llm::Model;
ic_llm::prompt(Model::Llama3_1_8B, "What's the speed of light?").await;
Motoko:
import LLM "mo:llm";
await LLM.prompt(#Llama3_1_8B, "What's the speed of light?")
Chatting with multiple messages
Rust:
use ic_llm::{Model, ChatMessage, Role};
ic_llm::chat(
    Model::Llama3_1_8B,
    vec![
        ChatMessage {
            role: Role::System,
            content: "You are a helpful assistant".to_string(),
        },
        ChatMessage {
            role: Role::User,
            content: "How big is the sun?".to_string(),
        },
    ],
)
.await;
Motoko:
import LLM "mo:llm";
await LLM.chat(#Llama3_1_8B, [
  {
    role = #system_;
    content = "You are a helpful assistant.";
  },
  {
    role = #user;
    content = "How big is the sun?";
  }
])
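For agents that carry a conversation across multiple calls, one pattern is to keep the message history in canister state and resend it on every turn. Here is a minimal Rust sketch of that pattern; the HISTORY state and the converse method are illustrative names, and it assumes (as the example above suggests) that chat returns the assistant’s reply as text and that the library exposes an assistant role.

use ic_llm::{ChatMessage, Model, Role};
use std::cell::RefCell;

// Conversation history stored as (speaker, text) pairs, so this sketch only
// relies on the types shown above. Seeded with a system message.
thread_local! {
    static HISTORY: RefCell<Vec<(String, String)>> = RefCell::new(vec![
        ("system".to_string(), "You are a helpful assistant".to_string()),
    ]);
}

// Map a stored (speaker, text) pair back to an ic_llm ChatMessage.
fn to_message(speaker: &str, text: &str) -> ChatMessage {
    ChatMessage {
        role: match speaker {
            "system" => Role::System,
            "user" => Role::User,
            // Assumes the library also exposes a role for model replies.
            _ => Role::Assistant,
        },
        content: text.to_string(),
    }
}

#[ic_cdk::update]
async fn converse(question: String) -> String {
    // Record the user's turn, then send the whole history to the LLM canister.
    HISTORY.with(|h| h.borrow_mut().push(("user".to_string(), question)));
    let messages: Vec<ChatMessage> =
        HISTORY.with(|h| h.borrow().iter().map(|(s, t)| to_message(s, t)).collect());
    let reply = ic_llm::chat(Model::Llama3_1_8B, messages).await;

    // Record the model's reply so the next call has the full context.
    HISTORY.with(|h| h.borrow_mut().push(("assistant".to_string(), reply.clone())));
    reply
}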
How does it work?
LLM prompts are processed by what we call “AI workers”: stateless nodes set up for the sole purpose of processing these prompts.
At a high level, a request flows through the system as follows:
- Canisters send prompts to the LLM canister.
- The LLM canister stores these prompts in a queue.
- AI workers continuously poll the LLM canister for prompts.
- AI workers execute the prompts and return the response to the LLM canister, which returns it to the calling canister.
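To make this concrete, here is a conceptual Rust sketch of the queue-and-poll design described above. Every name in it is hypothetical: the LLM canister’s actual interface isn’t published yet (see the FAQ below), so this only illustrates the shape of the system, not its real API.

use std::collections::VecDeque;

// One prompt waiting in the LLM canister's queue.
struct PendingPrompt {
    id: u64,
    prompt: String,
}

// The LLM canister's conceptual role: queue prompts from calling canisters
// and hold their calls open until an AI worker posts a response.
#[derive(Default)]
struct LlmCanisterState {
    queue: VecDeque<PendingPrompt>,
    next_id: u64,
}

impl LlmCanisterState {
    // A canister sends a prompt: enqueue it and remember which call it belongs to.
    fn enqueue(&mut self, prompt: String) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.queue.push_back(PendingPrompt { id, prompt });
        id
    }

    // An AI worker polls for work: hand out the next pending prompt.
    fn poll(&mut self) -> Option<PendingPrompt> {
        self.queue.pop_front()
    }

    // An AI worker submits the model's output: the LLM canister would now
    // resolve the original inter-canister call with this response.
    fn submit_response(&mut self, _id: u64, _response: String) {}
}

// An AI worker is a stateless off-chain process that loops: poll the LLM
// canister, run the model on each prompt, and push the response back.
fn worker_loop(llm: &mut LlmCanisterState, run_model: impl Fn(&str) -> String) {
    while let Some(PendingPrompt { id, prompt }) = llm.poll() {
        let response = run_model(&prompt);
        llm.submit_response(id, response);
    }
}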
Note that this release is an MVP. We’ll be iterating quickly based on feedback, and potentially making breaking changes along the way. Rather than trying to figure out the perfect API and design before enabling you to build AI use cases on the IC, we’re taking an iterative approach. To that end:
- The LLM canister is, for now, controlled by the DFINITY team. Once it’s stable, control will be given to the protocol.
- The AI workers are, also for now, managed by the DFINITY team. We’ll work towards decentralizing them; there are various ways to achieve this, and which way to go will depend primarily on your feedback and preferences. We’ll share more about decentralization designs in the future.
Resources
Libraries
We’ve created libraries to facilitate using the LLM canister: the ic_llm Rust library and the mo:llm Motoko package used in the examples above. Introducing similar libraries in other languages, such as TypeScript or Python, would be a very welcome contribution.
Examples
Quickstart Example
A very simple example that showcases how to get started building an agent on the IC.
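To give a sense of the shape, an agent canister in Rust can be as small as a single update method that forwards the caller’s question to the LLM canister. This is a minimal sketch using the prompt call shown earlier; the actual quickstart example may be structured differently.

use ic_llm::Model;

// A single update endpoint: forward the caller's question to the LLM canister
// and return the model's answer.
#[ic_cdk::update]
async fn ask(question: String) -> String {
    ic_llm::prompt(Model::Llama3_1_8B, question.as_str()).await
}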
FAQ
Q: What models are supported?
A: For now, only Llama 3.1 8B is supported. More models will be made available based on your feedback.
Q: What is the cost of using the LLM canister?
A: It’s free for now. As the system and its use cases mature, we’ll evaluate and set costs accordingly.
Q: Are there any limitations on the prompts I can make?
A: Yes. We’ve added a few limitations on how big the prompts and answers can be, and will gradually improve these over time:
- A chat request can have a maximum of 10 messages.
- The prompt length, across all messages, must not exceed 10KiB.
- The output is limited to 200 tokens.
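If you’d rather fail fast in your own canister than hit these limits, a small pre-flight check is easy to add. Here is a minimal sketch with the limits hard-coded from the list above, using byte length as an approximation of prompt size; both the constants and the accounting are assumptions that may not match the LLM canister exactly.

use ic_llm::ChatMessage;

// Current MVP limits, hard-coded from the list above; expect them to change.
const MAX_MESSAGES: usize = 10;
const MAX_PROMPT_BYTES: usize = 10 * 1024; // 10 KiB across all messages

// Returns an error if a chat request would exceed the current limits.
fn check_chat_limits(messages: &[ChatMessage]) -> Result<(), String> {
    if messages.len() > MAX_MESSAGES {
        return Err(format!(
            "too many messages: {} (max {MAX_MESSAGES})",
            messages.len()
        ));
    }
    let total_bytes: usize = messages.iter().map(|m| m.content.len()).sum();
    if total_bytes > MAX_PROMPT_BYTES {
        return Err(format!(
            "prompt too large: {total_bytes} bytes (max {MAX_PROMPT_BYTES})"
        ));
    }
    Ok(())
}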
Q: Are my prompts private?
A: Yes and no. The Internet Computer as a whole doesn’t yet guarantee confidentiality, and the same is true for the AI workers. Someone running an AI worker can, in theory, see the prompts, but cannot identify who made them. DFINITY itself does not log these prompts, but does log aggregated metrics like the number of requests and the number of input/output tokens.
One area we’re exploring is using TEEs for the AI workers, but these come at a cost, and whether they make sense will depend on your use cases and feedback.
Q: What is the principal of the LLM canister?
A: w36hm-eqaaa-aaaal-qr76a-cai
Q: Where is the source-code of the LLM canister and the AI worker?
A: These are not yet open-source, as the current implementations are mostly throw-away prototypes, but they will be open-sourced eventually as this work matures.
Next Steps
- Improving the latency: We’ll be looking into ways to improve the end-to-end latency of these requests, as well as supporting them in non-replicated mode for snappier experiences that don’t require the same level of trust.
- Decentralizing the AI workers powering the LLMs: We have a few options, ranging from deploying these workers across node providers to exploring the idea of “badlands”, where anyone with the right hardware can permissionlessly run an AI worker from their home or data center of choice.
Final thoughts
This work deviates from the approach we’ve been exploring previously at DFINITY, which was to try to run the LLMs inside a canister. We did our research and identified key bottlenecks to address in order to improve the performance of these models, including improving the way canisters handle I/O and optimizing matrix multiplication. However, even with these optimizations, it would only be possible to run models in the ~1B parameter range. For the larger and more useful models (8B, 70B, and beyond), we needed to rethink our approach.
In the long term, we’ll continue investing in the performance of canisters to enable running larger models, but in the meantime, we think the approach outlined here of using AI workers would unlock AI use cases on the IC much more quickly and reliably, and without sacrificing the IC’s trust model.
We’d love to get your feedback on this work. Try it out, build some agents, and let us know what use cases you have in mind. Let us know if there’s anything we can help you with.