Hi @icpp,
We wanted to time the publication with Dom recording a new demo. Tentatively that is planned for Friday, but I don’t know for sure.
In any case, I can edit the blog post and add your number after publication.
I’d like to try enabling SIMD on yllama.oc. Is all I need to do to start using dfx 0.22.0-beta.0,
and the rest will be taken care of? If not, are there docs on how to enable SIMD?
@gip: you also need to enable it on the compiler side using compiler-specific flags. For example, here is how it is done in Rust: examples/rust/simd/.cargo/config.toml at master · dfinity/examples · GitHub
Put
[target.wasm32-wasi]
rustflags = ["-Ctarget-feature=+simd128"]
in .cargo/config.toml
or
[target.wasm32-unknown-unknown]
rustflags = ["-Ctarget-feature=+simd128"]
depending on your Wasm target.
You can also set the environment variable like this before building:
export RUSTFLAGS=$RUSTFLAGS' -C target-feature=+simd128'
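If you go the flag route, a small compile-time guard can confirm the flag actually took effect (just a sketch, not required; assumes a Rust crate built for wasm32):

// Fails the build early if simd128 was not enabled, instead of
// discovering it only when the canister hits the slow scalar path.
#[cfg(all(target_arch = "wasm32", not(target_feature = "simd128")))]
compile_error!("simd128 is not enabled; pass -Ctarget-feature=+simd128 via rustflags or RUSTFLAGS");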
If you’re lucky, then the compiler will do the magic and autovectorize the hot path. Otherwise, you might need to use the Wasm SIMD instructions manually. For example, here is what I did for Sonos Tract: feat: Wasm SIMD implementation of `MatMatMulKer<f32>` by ulan · Pull Request #1420 · sonos/tract · GitHub
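For readers who end up on the manual path, here is a minimal sketch of what hand-written Wasm SIMD looks like in Rust using the core::arch::wasm32 intrinsics (a toy dot product, not the Tract code; assumes a wasm32 build with +simd128 and slice lengths divisible by 4):

use core::arch::wasm32::{f32x4, f32x4_add, f32x4_extract_lane, f32x4_mul, f32x4_splat};

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = f32x4_splat(0.0); // 4 partial sums
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        let va = f32x4(ca[0], ca[1], ca[2], ca[3]);
        let vb = f32x4(cb[0], cb[1], cb[2], cb[3]);
        acc = f32x4_add(acc, f32x4_mul(va, vb)); // acc += a * b, lane-wise
    }
    // Horizontal sum of the 4 lanes.
    f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc)
}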
Hi everyone, thank you for today’s call (2024.07.11). Special thanks to Vincent and @icarus for sharing their work and insights! This is the generated summary (short version, please find the long version here:
In today’s DeAI Working Group call, Vincent presented his master thesis work, focusing on encryption and machine learning models deployed on the Internet Computer, sharing a prototype link (https://zvrui-xqaaa-aaaao-a3pbq-cai.icp0.io/) and the code repository (GitHub - vince2git/ic-secure-ml). Key discussions covered encryption key exchanges, secure data handling, and using the linfa Rust library for benchmarking machine learning models. Vincent detailed the architecture involving t-SNE for clustering and neural network training on the MNIST dataset.

@icarus shared insights from the Node provider ICP lab event in Zurich, discussing hardware options, including AMD CPUs with SIMD extensions and integrated CPU-GPU architectures for AI on the IC blockchain (Technical Working Group: Node Providers - #7 by louisevelayo). Challenges of accessing GPU functionality from the WASM runtime and potential solutions through host functions were explored. Future directions emphasized collaboration between node providers and Dfinity engineers, forming a working group to focus on accelerated hardware integration.

Open discussions addressed hardware setups, high-level host functions, ONNX for AI model deployment, and the importance of supporting frameworks like PyTorch. Community involvement was encouraged to share benchmarks, explore parallelism, and contribute to ongoing discussions for collective problem-solving.
would this be relevant to DeAI on IC? x.com
@ulan Tried to use SIMD but I get the error below. Any idea?
Caused by: The replica returned a rejection error: reject code CanisterError, reject message Error from Canister bd3sg-teaaa-aaaaa-qaaba-cai: Canister's Wasm module is not valid: Wasmtime failed to validate wasm module wasmtime::Module::validate() failed with SIMD support is not enabled (at offset 0x1e819f).
This is likely an error with the compiler/CDK toolchain being used to build the canister. Please report the error to IC devs on the forum: https://forum.dfinity.org and include which language/CDK was used to create the canister., error code None
dfx --version
dfx 0.22.0-beta.0
Note: the error I reported above about SIMD not enabled is transient and appears / disappears depending on the build. I finally got a build that works without changing anything.
I wasn’t lucky, and I had to implement a specialized function to make it work: yllama.rs/ymath/src/lib.rs at main · gip/yllama.rs · GitHub
That worked: I was able to see the SIMD instructions, and computation was definitely faster. For posterity, this is how the inner loop looks with SIMD now:
v128.const i32x4 0x00000000 0x00000000 0x00000000 0x00000000 ;; zero the f32x4 accumulator
local.set 8                 ;; acc (local 8) = 0
i32.const 0
local.set 4                 ;; i (local 4) = 0
loop ;; label = @5
  local.get 8
  local.get 1               ;; pointer to first operand (local 1)
  v128.load align=4         ;; load 4 f32 from operand 1
  local.get 2               ;; pointer to second operand (local 2)
  v128.load align=4         ;; load 4 f32 from operand 2
  f32x4.mul
  f32x4.add                 ;; acc += a[0..4] * b[0..4]
  local.get 1
  i32.const 16
  i32.add
  v128.load align=4         ;; load next 4 f32 from operand 1 (+16 bytes)
  local.get 2
  i32.const 16
  i32.add
  v128.load align=4         ;; load next 4 f32 from operand 2 (+16 bytes)
  f32x4.mul
  f32x4.add
  local.set 8               ;; acc += a[4..8] * b[4..8]
  local.get 1
  i32.const 32
  i32.add
  local.set 1               ;; advance operand 1 by 8 f32 (32 bytes)
  local.get 2
  i32.const 32
  i32.add
  local.set 2               ;; advance operand 2 by 8 f32 (32 bytes)
  local.get 4
  i32.const 8
  i32.add
  local.tee 4               ;; i += 8
  i32.const 4096
  i32.ne
  br_if 0 (;@5;)            ;; loop until 4096 elements processed
end
Speed improved dramatically thanks to SIMD and inference is now 22 s / token. A detailed look at the timing below shows that the overhead introduced by the protocol (consensus, inter-canister calls, wasm, …) is now what’s dominating.
It seems unlikely that ICP can become a high-performance target for AI workloads without architectural changes and a focus on tooling at this point. However, there will still be applications where consensus is necessary. Regardless, my investigation is now complete, and it has been interesting.
My guess would be that you probably had an old dfx started and running in the background. When changing dfx versions, I always do: 1) dfx stop, 2) change the version, 3) dfx start.
Speed improved dramatically thanks to SIMD and inference is now 22 s / token.
Nice! What was the number before? I see potential for more improvement in
f32x4(v0.get(j + 0), v0.get(j + 1), v0.get(j + 2), v0.get(j + 3))
If you can re-arrange the data layout to load all four values with a single instruction:
v128_load(v0 as *const v128);
Then it should become even faster. However, that’s a non-trivial change.
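To make the suggestion concrete, here is a rough sketch of the difference, assuming the values can be stored contiguously in memory (hypothetical functions, not the yllama.rs API):

use core::arch::wasm32::{f32x4, v128, v128_load};

// Gathered layout: four scalar loads plus lane inserts to build one vector.
fn load_gathered(v0: &[f32], j: usize) -> v128 {
    f32x4(v0[j], v0[j + 1], v0[j + 2], v0[j + 3])
}

// Contiguous layout: a single 128-bit load replaces the four scalar loads
// (the align immediate on v128.load is only a hint, so no special alignment
// is required, but the four floats must be adjacent in memory).
fn load_contiguous(v0: &[f32], j: usize) -> v128 {
    assert!(j + 4 <= v0.len());
    unsafe { v128_load(v0.as_ptr().add(j) as *const v128) }
}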
+1, that’s what DFINITY is planning to explore with the “Gyrotron” milestone on the roadmap: Roadmap | Internet Computer
Would you like a reference to your project in the upcoming blogpost about Wasm performance improvements?
If so, then a linkable GitHub document (either a separate file or a linkable section of the main README) with the following info would help a lot: use ic_cdk::api::performance_counter(0) to get the number of executed Wasm instructions at the end of a single step (execution). Here is an example: examples/rust/image-classification/src/backend/src/lib.rs at master · dfinity/examples · GitHub

The generated code (shared above) is using v128.load.
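For reference, a minimal sketch of wiring ic_cdk::api::performance_counter(0) into a canister method (run_inference_step is a hypothetical placeholder for the actual work):

use ic_cdk::update;

fn run_inference_step() {
    // Placeholder for the real inference / matmul code.
}

#[update]
fn inference_step_instrumented() -> u64 {
    run_inference_step();
    // Counter 0 = Wasm instructions executed in the current message execution.
    ic_cdk::api::performance_counter(0)
}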
I’ve looked at that roadmap, but compute is only part of it. I would hate to see the Dfinity team build a system that is only used for toy projects, so here are my thoughts; I would really encourage thinking about data first, not compute.
I know it’s a lot of investment, but having a realistic view of what is needed to win is also important.
I’d like to, not sure when I’ll have time to do these benchmarks though.
I developed a general file service called ic-oss for uploading large files. It uses concurrent uploading and takes about 20 minutes to upload 1GB of data.
The key code and the ic-oss-cli upload tool were used in the ic_panda_ai project.
I will soon extract the core large file handling logic from ic-oss into a standalone library for easy integration into other projects.
Thank you. That’s great feedback!
I agree that data is important. Improving data bandwidth is tracked in another milestone: Roadmap | Internet Computer
Improved consensus throughput and latency
Improved consensus throughput and latency by better, and less bursty, node bandwidth use. Achieved through not including full messages, but only their hashes and other metadata, in blocks.
The milestone is not AI-specific because it will help almost all applications.
- How will the system ensure high data bandwidth? Performance these days is in the hundreds to thousands of GB/s of data crunched. You probably want to target 10_000 GB/s (aggregated) to make sure you’re building for the future.
Do you mean GPU-memory bandwidth or network bandwidth? What kind of applications and scale do you have in mind?
Initially, we would be quite happy if we could support inference/tuning of LLMs and then think about training of mid-size models. Training of large models is very far in the future.
What is also important is finding use cases that benefit from onchain AI.
- Tools & frontend
100% agree.
I developed a general file service called ic-oss for uploading large files.
Nice!
I published a crate similar to what you are describing, ic-file-uploader, so I wanted to share.
@gip
In the ic-oss project, I implemented a library named ic-oss-can, which contains a Rust macro. It can quickly add the large file functionality to your canister, so that large files can be uploaded at high speed using ic-oss-cli. Here is a reference example for usage:
Thanks - how much more efficient, faster, and/or less costly is it than the naive upload that I implemented in a few lines of code? yllama.oc/src/yblock/yblock.did at main · gip/yllama.oc · GitHub
The real question, though, is being able to make some data available to more than one canister fast. On Web2 (where most servers are stateless and data lives in data nodes and caching nodes) this pattern is everywhere, and the programming model of the ICP is clearly behind in that respect imo.
I’m talking about GPU bandwidth. Basically, GPUs on ICP will probably be ready in 1 to 2 years. The next generation of GPUs will land soon with vastly improved bandwidth, and I would want to build for the future. Of course, on-chain access to data is a key point.
The ic-file-uploader crate does not increase upload bandwidth, but it simplifies the process of uploading large files into canisters. The crate breaks the file into 2MB chunks and uploads them sequentially.
Several teams have implemented similar code, so I decided to create a formal crate to streamline this functionality. I also plan to add automatic resume functionality for interrupted uploads in future updates.
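For readers new to the pattern, here is a minimal sketch of the receiving (canister) side of such a chunked upload, with a hypothetical method name and an in-memory buffer for brevity; this is not the ic-file-uploader or ic-oss API:

use std::cell::RefCell;

use ic_cdk::update;

thread_local! {
    // In-memory buffer for illustration; a real canister would typically
    // append chunks into stable memory instead.
    static FILE: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

// The client splits the file into ~2 MB pieces (to stay under the ingress
// message size limit) and calls this once per chunk, in order.
#[update]
fn upload_chunk(chunk: Vec<u8>) -> u64 {
    FILE.with(|f| {
        let mut buf = f.borrow_mut();
        buf.extend_from_slice(&chunk);
        buf.len() as u64 // bytes received so far; lets the client resume
    })
}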
I was sharing this with @zensh and for the benefit of future readers. If you have any suggestions, feel free to reach out.
I made an uploader to upload 70GiB to a canister. Uploading the 70GiB took 12 hours and cost 200T cycles.
https://github.com/ClankPan/ic-vectune/blob/67a3881bc317ca7ee5b36b053b72fc0200100927/bin/tool/src/main.rs#L92
It also has an interrupt feature.
I thought the max storage per canister was something around 4 GB. Were the 70GB you are talking about consumed and not stored? Curious what the use case was.