Technical Working Group DeAI

Hi @icpp,

We wanted to time the publication with Dom recording a new demo. Tentatively, that is planned for Friday, but I don’t know for sure.

In any case, I can edit the blog post and add your number after publication.

2 Likes

I’d like to try to enable SIMD on yllama.oc. All I need to do is start using dfx 0.22.0-beta.0 and the rest will be taken care of? If not, is there some documentation on how to enable SIMD?

@gip: you also need to enable it on the compiler side using compiler-specific flags. For example, here is how it is done in Rust: examples/rust/simd/.cargo/config.toml at master · dfinity/examples · GitHub
Put

[target.wasm32-wasi]
rustflags = ["-Ctarget-feature=+simd128"]

in .cargo/config.toml

or

[target.wasm32-unknown-unknown]
rustflags = ["-Ctarget-feature=+simd128"]

depending on your Wasm target.

You can also set the environment variable like this before building:

export RUSTFLAGS=$RUSTFLAGS' -C target-feature=+simd128'

If you’re lucky, then the compiler will do the magic and autovectorize the hot path. Otherwise, you might need to use the Wasm SIMD instructions manually. For example, here is what I did for Sonos Tract: feat: Wasm SIMD implementation of `MatMatMulKer<f32>` by ulan · Pull Request #1420 · sonos/tract · GitHub
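
For reference, here is a minimal sketch of what a manually vectorized dot product can look like in Rust with the core::arch::wasm32 intrinsics. This is an illustration, not code from yllama.oc; it assumes the build enables -Ctarget-feature=+simd128 and that both slices have the same length, a multiple of 4.

#[cfg(target_arch = "wasm32")]
fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 4, 0); // illustrative assumption
    let mut acc = f32x4_splat(0.0);
    for i in (0..a.len()).step_by(4) {
        // One 16-byte load per operand, then a 4-lane multiply-accumulate.
        let va = unsafe { v128_load(a.as_ptr().add(i) as *const v128) };
        let vb = unsafe { v128_load(b.as_ptr().add(i) as *const v128) };
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    // Horizontal sum of the four accumulator lanes.
    f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc)
}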

1 Like

Hi everyone, thank you for today’s call (2024.07.11). Special thanks to Vincent and @icarus for sharing their work and insights! This is the generated summary (short version); please find the long version here:

In today’s DeAI Working Group call, Vincent presented his master’s thesis work, focusing on encryption and machine learning models deployed on the Internet Computer, and shared a prototype link (https://zvrui-xqaaa-aaaao-a3pbq-cai.icp0.io/) and the code repository (GitHub - vince2git/ic-secure-ml). Key discussions covered encryption key exchanges, secure data handling, and using the linfa Rust library for benchmarking machine learning models. Vincent detailed the architecture, which involves t-SNE for clustering and neural network training on the MNIST dataset.

@icarus shared insights from the Node Provider ICP Lab event in Zurich, discussing hardware options, including AMD CPUs with SIMD extensions and integrated CPU-GPU architectures for AI on the IC blockchain (Technical Working Group: Node Providers - #7 by louisevelayo). Challenges of accessing GPU functionality from the Wasm runtime and potential solutions through host functions were explored. Future directions emphasized collaboration between node providers and DFINITY engineers, forming a working group focused on accelerated hardware integration.

Open discussions addressed hardware setups, high-level host functions, ONNX for AI model deployment, and the importance of supporting frameworks like PyTorch. Community involvement was encouraged: share benchmarks, explore parallelism, and contribute to ongoing discussions for collective problem-solving.

Links shared during the call:

2 Likes

Would this be relevant to DeAI on IC? x.com

3 Likes

@ulan Tried to use SIMD but I get the error below. Any idea?

Caused by: The replica returned a rejection error: reject code CanisterError, reject message Error from Canister bd3sg-teaaa-aaaaa-qaaba-cai: Canister's Wasm module is not valid: Wasmtime failed to validate wasm module wasmtime::Module::validate() failed with SIMD support is not enabled (at offset 0x1e819f).
This is likely an error with the compiler/CDK toolchain being used to build the canister. Please report the error to IC devs on the forum: https://forum.dfinity.org and include which language/CDK was used to create the canister., error code None
dfx --version
dfx 0.22.0-beta.0

Note: the error I reported above about SIMD not being enabled is transient; it appears and disappears depending on the build. I finally got a working build without changing anything.

I wasn’t lucky :) and had to implement a specialized function to make it work: yllama.rs/ymath/src/lib.rs at main · gip/yllama.rs · GitHub

That worked: I was able to see the SIMD instructions, and computation was definitely faster. For posterity, this is how the inner loop looks with SIMD:

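            ;; local 8: f32x4 accumulator; locals 1 and 2: input pointers;
            ;; local 4: element counter. Each iteration multiply-accumulates
            ;; two f32x4 vectors (8 floats) and advances both pointers by 32 bytes.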
            v128.const i32x4 0x00000000 0x00000000 0x00000000 0x00000000
            local.set 8
            i32.const 0
            local.set 4
            loop  ;; label = @5
              local.get 8
              local.get 1
              v128.load align=4
              local.get 2
              v128.load align=4
              f32x4.mul
              f32x4.add
              local.get 1
              i32.const 16
              i32.add
              v128.load align=4
              local.get 2
              i32.const 16
              i32.add
              v128.load align=4
              f32x4.mul
              f32x4.add
              local.set 8
              local.get 1
              i32.const 32
              i32.add
              local.set 1
              local.get 2
              i32.const 32
              i32.add
              local.set 2
              local.get 4
              i32.const 8
              i32.add
              local.tee 4
              i32.const 4096
              i32.ne
              br_if 0 (;@5;)
            end

Speed improved dramatically thanks to SIMD; inference is now 22 s/token. A detailed look at the timing shows that the overhead introduced by the protocol (consensus, inter-canister calls, Wasm, …) is now what dominates.

It seems unlikely that ICP can become a high-performance target for AI workloads without architectural changes and a focus on tooling. However, there will still be applications where consensus is necessary. Regardless, my investigation is now complete, and it has been interesting.

3 Likes

My guess would be that you had an old dfx instance still running in the background. When changing dfx versions, I always do: 1) dfx stop, 2) change the version, 3) dfx start.

Speed improved dramatically thanks to SIMD; inference is now 22 s/token.

Nice! What was the number before? I see potential for more improvement in

f32x4(v0.get(j + 0), v0.get(j + 1), v0.get(j + 2), v0.get(j + 3))

If you can re-arrange the data layout to load all four values with a single instruction:

v128_load(v0 as *const v128);

Then it should become even faster. However, that’s a non-trivial change.
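
To make the contrast concrete, here is a small sketch using the names from the snippets above (it assumes v0 is a contiguous &[f32] and that v0[j..j + 4] is in bounds):

use core::arch::wasm32::{f32x4, v128, v128_load};

// Before: four scalar reads assembled into a vector (a slow gather).
fn gather(v0: &[f32], j: usize) -> v128 {
    f32x4(v0[j], v0[j + 1], v0[j + 2], v0[j + 3])
}

// After: a single 16-byte load, possible once the values are contiguous.
unsafe fn load_contiguous(v0: &[f32], j: usize) -> v128 {
    v128_load(v0.as_ptr().add(j) as *const v128)
}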

+1, that’s what DFINITY is planning to explore with the “Gyrotron” milestone on the roadmap: Roadmap | Internet Computer

Would you like a reference to your project in the upcoming blogpost about Wasm performance improvements?

If so, then a linkable GitHub document (either a separate file or a linkable section in the main README) with the following info would help a lot:

  • A brief description of your project and the benchmark you did.
  • Measurement numbers with the old dfx and the new dfx. Ideally, these measurements cover only Wasm execution (i.e. they don’t include consensus and inter-canister calls). The simplest way to do this is to call ic_cdk::api::performance_counter(0) to get the number of executed Wasm instructions at the end of a single step (execution); see the sketch after this list. Here is an example: examples/rust/image-classification/src/backend/src/lib.rs at master · dfinity/examples · GitHub
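
A minimal sketch of such a measurement endpoint (benchmark_step and run_inference_step are hypothetical names for illustration):

// Counter 0 returns the number of Wasm instructions executed so far
// in the current message execution.
#[ic_cdk::update]
fn benchmark_step() -> u64 {
    run_inference_step(); // hypothetical workload under test
    ic_cdk::api::performance_counter(0)
}

fn run_inference_step() {
    // ... the work being benchmarked ...
}
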
3 Likes

The generated code (shared above) is already using v128.load.

I’ve looked at that roadmap, but compute is only part of it. I would hate to see the DFINITY team build a system that is only used for toy projects, so here are my thoughts; I would really encourage thinking about data first, not compute.

  • How will people get their data into the system? You probably want a data lane / data nodes that also make economic sense (I spent 50 TC uploading data, and it was slow). Gating data loading behind consensus makes little sense; there are other solutions to ensure data integrity. Basically, every system online (including a simple web app) has a highly optimized data layer, and I believe ICP needs one too.
  • How will the system ensure high data bandwidth? Performance these days is measured in hundreds or thousands of GB/s of data crunched. You probably want to target 10,000 GB/s (aggregated) to make sure you’re building for the future.
  • Tools & frontend
  • Compute is probably the easy part, even though GPU reliability will be an issue at scale.

I know it’s a lot of investment, but having a realistic view of what is needed to win is also important.

I’d like to, though I’m not sure when I’ll have time to do these benchmarks.

2 Likes

I developed a general file service called ic-oss for uploading large files. It uses concurrent uploading and takes about 20 minutes to upload 1GB of data.

The key code and the ic-oss-cli upload tool were used in the ic_panda_ai project.

I will soon extract the core large file handling logic from ic-oss into a standalone library for easy integration into other projects.
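
For readers curious about the pattern, here is a hypothetical sketch of how a canister can accept a large file as independently uploadable chunks so that a client can send many chunks concurrently (this is not the actual ic-oss API, just an illustration):

use std::cell::RefCell;
use std::collections::BTreeMap;

thread_local! {
    // (file_id, chunk_index) -> chunk bytes
    static CHUNKS: RefCell<BTreeMap<(u64, u32), Vec<u8>>> = RefCell::new(BTreeMap::new());
}

// Chunks may arrive in any order; each update call is independent,
// which is what makes concurrent uploads possible.
#[ic_cdk::update]
fn upload_chunk(file_id: u64, index: u32, data: Vec<u8>) {
    CHUNKS.with(|c| c.borrow_mut().insert((file_id, index), data));
}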

3 Likes

Thank you. That’s great feedback!

I agree that data is important. Improving data bandwidth is tracked in another milestone: Roadmap | Internet Computer

Improved consensus throughput and latency

Improved consensus throughput and latency by better, and less bursty, node bandwidth use. Achieved through not including full messages, but only their hashes and other metadata, in blocks.

The milestone is not AI-specific because it will help almost all applications.

  • How will the system ensure high data bandwidth? Performance these days is measured in hundreds or thousands of GB/s of data crunched. You probably want to target 10,000 GB/s (aggregated) to make sure you’re building for the future.

Do you mean GPU-memory bandwidth or network bandwidth? What kind of applications and scale do you have in mind?

Initially, we would be quite happy if we could support inference/tuning of LLMs and then think about training of mid-size models. Training of large models is very far in the future.

What is also important is finding use cases that benefit from onchain AI.

  • Tools & frontend

100% agree.

I developed a general file service called ic-oss for uploading large files.

Nice!

2 Likes

I published a crate similar to what you are describing, ic-file-uploader, so I wanted to share.

2 Likes

@gip
In the ic-oss project, I implemented a library named ic-oss-can, which contains a Rust macro. It can quickly add large-file functionality to your canister, so that large files can be uploaded at high speed using ic-oss-cli. Here is a reference example for usage: