Hi @icpp,
We wanted to time the publication with Dom recording a new demo. Tentatively that is planned for Friday, but I don’t know for sure.
In any case, I can edit the blog post and add your number after publication.
I’d like to try enabling SIMD on yllama.oc. Is all I need to do to start using dfx 0.22.0-beta.0,
and the rest will be taken care of? If not, are there docs on how to enable SIMD?
@gip: you also need to enable it on the compiler side using compiler-specific flags. For example, here is how it is done in Rust: examples/rust/simd/.cargo/config.toml at master · dfinity/examples · GitHub
Put
[target.wasm32-wasi]
rustflags = ["-Ctarget-feature=+simd128"]
in .cargo/config.toml
or
[target.wasm32-unknown-unknown]
rustflags = ["-Ctarget-feature=+simd128"]
depending on your Wasm target.
You can also set the environment variable like this before building:
export RUSTFLAGS=$RUSTFLAGS' -C target-feature=+simd128'
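If you go the flag route, a small compile-time guard can confirm the flag actually took effect (just a sketch, not required; assumes a Rust crate built for wasm32):

// Fails the build early if simd128 was not enabled, instead of
// discovering it only when the canister hits the slow scalar path.
#[cfg(all(target_arch = "wasm32", not(target_feature = "simd128")))]
compile_error!("simd128 is not enabled; pass -Ctarget-feature=+simd128 via rustflags or RUSTFLAGS");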
If you’re lucky, then the compiler will do the magic and autovectorize the hot path. Otherwise, you might need to use the Wasm SIMD instructions manually. For example, here is what I did for Sonos Tract: feat: Wasm SIMD implementation of `MatMatMulKer<f32>` by ulan · Pull Request #1420 · sonos/tract · GitHub
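For readers who end up on the manual path, here is a minimal sketch of what hand-written Wasm SIMD looks like in Rust using the core::arch::wasm32 intrinsics (a toy dot product, not the Tract code; assumes a wasm32 build with +simd128 and slice lengths divisible by 4):

use core::arch::wasm32::{f32x4, f32x4_add, f32x4_extract_lane, f32x4_mul, f32x4_splat};

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = f32x4_splat(0.0); // 4 partial sums
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        let va = f32x4(ca[0], ca[1], ca[2], ca[3]);
        let vb = f32x4(cb[0], cb[1], cb[2], cb[3]);
        acc = f32x4_add(acc, f32x4_mul(va, vb)); // acc += a * b, lane-wise
    }
    // Horizontal sum of the 4 lanes.
    f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc)
}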
Hi everyone, thank you for today’s call (2024.07.11). Special thanks to Vincent and @icarus for sharing their work and insights! This is the generated summary (short version, please find the long version here:
In today’s DeAI Working Group call, Vincent presented his master thesis work, focusing on encryption and machine learning models deployed on the Internet Computer, sharing a prototype link (https://zvrui-xqaaa-aaaao-a3pbq-cai.icp0.io/) and the code repository (GitHub - vince2git/ic-secure-ml). Key discussions covered encryption key exchanges, secure data handling, and using the linfa Rust library for benchmarking machine learning models. Vincent detailed the architecture involving t-SNE for clustering and neural network training on the MNIST dataset.

@icarus shared insights from the Node provider ICP lab event in Zurich, discussing hardware options, including AMD CPUs with SIMD extensions and integrated CPU-GPU architectures for AI on the IC blockchain (Technical Working Group: Node Providers - #7 by louisevelayo). Challenges of accessing GPU functionality from the WASM runtime and potential solutions through host functions were explored. Future directions emphasized collaboration between node providers and Dfinity engineers, forming a working group to focus on accelerated hardware integration.

Open discussions addressed hardware setups, high-level host functions, ONNX for AI model deployment, and the importance of supporting frameworks like PyTorch. Community involvement was encouraged to share benchmarks, explore parallelism, and contribute to ongoing discussions for collective problem-solving.
would this be relevant to DeAI on IC? x.com
@ulan Tried to use SIMD but I get the error below. Any idea?
Caused by: The replica returned a rejection error: reject code CanisterError, reject message Error from Canister bd3sg-teaaa-aaaaa-qaaba-cai: Canister's Wasm module is not valid: Wasmtime failed to validate wasm module wasmtime::Module::validate() failed with SIMD support is not enabled (at offset 0x1e819f).
This is likely an error with the compiler/CDK toolchain being used to build the canister. Please report the error to IC devs on the forum: https://forum.dfinity.org and include which language/CDK was used to create the canister., error code None
dfx --version
dfx 0.22.0-beta.0
Note: the error I reported above about SIMD not enabled is transient and appears / disappears depending on the build. I finally got a build that works without changing anything.
I wasn’t lucky, and I had to implement a specialized function to make it work: yllama.rs/ymath/src/lib.rs at main · gip/yllama.rs · GitHub
That worked: I was able to see the SIMD instructions, and computation was definitely faster. For posterity, this is how the inner loop looks with SIMD now:
v128.const i32x4 0x00000000 0x00000000 0x00000000 0x00000000 ;; zero the f32x4 accumulator
local.set 8                 ;; acc (local 8) = 0
i32.const 0
local.set 4                 ;; i (local 4) = 0
loop ;; label = @5
  local.get 8
  local.get 1               ;; pointer to first operand (local 1)
  v128.load align=4         ;; load 4 f32 from operand 1
  local.get 2               ;; pointer to second operand (local 2)
  v128.load align=4         ;; load 4 f32 from operand 2
  f32x4.mul
  f32x4.add                 ;; acc += a[0..4] * b[0..4]
  local.get 1
  i32.const 16
  i32.add
  v128.load align=4         ;; load next 4 f32 from operand 1 (+16 bytes)
  local.get 2
  i32.const 16
  i32.add
  v128.load align=4         ;; load next 4 f32 from operand 2 (+16 bytes)
  f32x4.mul
  f32x4.add
  local.set 8               ;; acc += a[4..8] * b[4..8]
  local.get 1
  i32.const 32
  i32.add
  local.set 1               ;; advance operand 1 by 8 f32 (32 bytes)
  local.get 2
  i32.const 32
  i32.add
  local.set 2               ;; advance operand 2 by 8 f32 (32 bytes)
  local.get 4
  i32.const 8
  i32.add
  local.tee 4               ;; i += 8
  i32.const 4096
  i32.ne
  br_if 0 (;@5;)            ;; loop until 4096 elements processed
end
Speed improved dramatically thanks to SIMD and inference is now 22 s / token. A detailed look at the timing below shows that the overhead introduced by the protocol (consensus, inter-canister calls, wasm, …) is now what’s dominating.
It seems unlikely that ICP can become a high-performance target for AI workloads without architectural changes and a focus on tooling at this point. However, there will still be applications where consensus is necessary. Regardless, my investigation is now complete, and it has been interesting.
My guess would be that you probably had an old dfx started and running in the background. When changing dfx versions, I always do: 1) dfx stop, 2) change the version, 3) dfx start.
Speed improved dramatically thanks to SIMD and inference is now 22 s / token.
Nice! What was the number before? I see potential for more improvement in
f32x4(v0.get(j + 0), v0.get(j + 1), v0.get(j + 2), v0.get(j + 3))
If you can re-arrange the data layout to load all four values with a single instruction:
v128_load(v0 as *const v128);
Then it should become even faster. However, that’s a non-trivial change.
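To make the suggestion concrete, here is a rough sketch of the difference, assuming the values can be stored contiguously in memory (hypothetical functions, not the yllama.rs API):

use core::arch::wasm32::{f32x4, v128, v128_load};

// Gathered layout: four scalar loads plus lane inserts to build one vector.
fn load_gathered(v0: &[f32], j: usize) -> v128 {
    f32x4(v0[j], v0[j + 1], v0[j + 2], v0[j + 3])
}

// Contiguous layout: a single 128-bit load replaces the four scalar loads
// (the align immediate on v128.load is only a hint, so no special alignment
// is required, but the four floats must be adjacent in memory).
fn load_contiguous(v0: &[f32], j: usize) -> v128 {
    assert!(j + 4 <= v0.len());
    unsafe { v128_load(v0.as_ptr().add(j) as *const v128) }
}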
+1, that’s what DFINITY is planning to explore with the “Gyrotron” milestone on the roadmap: Roadmap | Internet Computer
Would you like a reference to your project in the upcoming blogpost about Wasm performance improvements?
If so, then a linkable GitHub document (either a separate file or a linkable section of the main README) with the following info would help a lot: use ic_cdk::api::performance_counter(0) to get the number of executed Wasm instructions at the end of a single step (execution). Here is an example: examples/rust/image-classification/src/backend/src/lib.rs at master · dfinity/examples · GitHub

The generated code (shared above) is using v128.load.
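For reference, a minimal sketch of wiring ic_cdk::api::performance_counter(0) into a canister method (run_inference_step is a hypothetical placeholder for the actual work):

use ic_cdk::update;

fn run_inference_step() {
    // Placeholder for the real inference / matmul code.
}

#[update]
fn inference_step_instrumented() -> u64 {
    run_inference_step();
    // Counter 0 = Wasm instructions executed in the current message execution.
    ic_cdk::api::performance_counter(0)
}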
I’ve looked at that roadmap, but compute is only part of it. I would hate to see the Dfinity team build a system that is only used for toy projects, so here are my thoughts; I would really encourage thinking about data first, not compute.
I know it’s a lot of investment, but having a realistic view of what is needed to win is also important.
I’d like to, not sure when I’ll have time to do these benchmarks though.
I developed a general file service called ic-oss for uploading large files. It uses concurrent uploading and takes about 20 minutes to upload 1GB of data.
The key code and the ic-oss-cli upload tool were used in the ic_panda_ai project.
I will soon extract the core large file handling logic from ic-oss into a standalone library for easy integration into other projects.
Thank you. That’s great feedback!
I agree that data is important. Improving data bandwidth is tracked in another milestone: Roadmap | Internet Computer
Improved consensus throughput and latency
Improved consensus throughput and latency by better, and less bursty, node bandwidth use. Achieved through not including full messages, but only their hashes and other metadata, in blocks.
The milestone is not AI-specific because it will help almost all applications.
- How will the system ensure high data bandwidth? Performance these days is in the hundreds to thousands of GB/s of data crunched. You probably want to target 10_000 GB/s (aggregated) to make sure you’re building for the future.
Do you mean GPU-memory bandwidth or network bandwidth? What kind of applications and scale do you have in mind?
Initially, we would be quite happy if we could support inference/tuning of LLMs and then think about training of mid-size models. Training of large models is very far in the future.
What is also important is finding use cases that benefit from onchain AI.
- Tools & frontend
100% agree.
I developed a general file service called ic-oss for uploading large files.
Nice!
I published a crate similar to what you are describing, ic-file-uploader, so I wanted to share.
@gip
In the ic-oss project, I implemented a library named ic-oss-can, which contains a Rust macro. It can quickly add the large file functionality to your canister, so that large files can be uploaded at high speed using ic-oss-cli. Here is a reference example for usage:
Thanks - how much more efficient, faster, and/or less costly is it than the naive upload that I implemented in a few lines of code? yllama.oc/src/yblock/yblock.did at main · gip/yllama.oc · GitHub
The real question, though, is being able to make some data available to more than one canister fast. On Web2 (where most servers are stateless and data lives in data nodes and caching nodes) this pattern is everywhere, and the programming model of the ICP is clearly behind in that respect imo.
I’m talking about GPU bandwidth. Basically, GPUs on ICP will probably be ready in 1 to 2 years. The next generation of GPUs will land soon with vastly improved bandwidth, and I would want to build for the future. Of course, on-chain access to data is a key point.
The ic-file-uploader crate does not increase upload bandwidth, but it simplifies the process of uploading large files into canisters. The crate breaks the file into 2MB chunks and uploads them sequentially.
Several teams have implemented similar code, so I decided to create a formal crate to streamline this functionality. I also plan to add automatic resume functionality for interrupted uploads in future updates.
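For readers new to the pattern, here is a minimal sketch of the receiving (canister) side of such a chunked upload, with a hypothetical method name and an in-memory buffer for brevity; this is not the ic-file-uploader or ic-oss API:

use std::cell::RefCell;

use ic_cdk::update;

thread_local! {
    // In-memory buffer for illustration; a real canister would typically
    // append chunks into stable memory instead.
    static FILE: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

// The client splits the file into ~2 MB pieces (to stay under the ingress
// message size limit) and calls this once per chunk, in order.
#[update]
fn upload_chunk(chunk: Vec<u8>) -> u64 {
    FILE.with(|f| {
        let mut buf = f.borrow_mut();
        buf.extend_from_slice(&chunk);
        buf.len() as u64 // bytes received so far; lets the client resume
    })
}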
I was sharing this with @zensh and for the benefit of future readers. If you have any suggestions, feel free to reach out.
I made an uploader to upload 70GiB to a canister. Uploading the 70GiB took 12 hours and cost 200T cycles.
https://github.com/ClankPan/ic-vectune/blob/67a3881bc317ca7ee5b36b053b72fc0200100927/bin/tool/src/main.rs#L92
It also has an interrupt feature.
I thought the max storage per canister was something around 4 GB. Were the 70GB you are talking about consumed and not stored? Curious what the use case was.