Hi everyone, thank you for today’s call (2025.03.13). Special thanks to @icarus for leading the call! This is the generated summary (short version; please find the long version here):

In today’s DeAI Working Group call for the Internet Computer, ETH students introduced an inference engine project built around a 1B-parameter Llama 3 model, exploring optimizations via the mistral.rs library and considering alternatives such as Candle and llama.cpp. The group then discussed upgrading node hardware to Gen-3 machines with AMD EPYC Zen 5 CPUs and integrating GPUs, highlighting NVIDIA’s H100/H200, AMD Instinct, and emerging accelerators such as Tenstorrent to meet ICP’s future AI workload requirements.
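As a rough illustration of the students’ setup, here is a minimal sketch of driving a small Llama 3 model through the mistral.rs Rust API, loosely following the examples in that project’s README. The model id (meta-llama/Llama-3.2-1B-Instruct) and the exact builder methods are assumptions on my part and may differ from what the team actually uses or from newer crate versions:

```rust
use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Assumed 1B-class Llama 3 checkpoint; swap in whichever model the project targets.
    let model = TextModelBuilder::new("meta-llama/Llama-3.2-1B-Instruct")
        .with_logging()
        .build()
        .await?;

    // Build a simple single-turn chat request.
    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Explain the Internet Computer in one sentence.",
    );

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```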
Links shared during the call:
- GitHub - EricLBuehler/mistral.rs: Blazingly fast LLM inference.
- original llama.cpp: GitHub - ggml-org/llama.cpp: LLM inference in C/C++
- llama.cpp running in a canister of the IC: GitHub - onicai/llama_cpp_canister: llama.cpp for the Internet Computer
- summary of the max tokens per update call investigation: GitHub - onicai/llama_cpp_canister: llama.cpp for the Internet Computer (see the sketch after this list)
- summary of the last hardware-focused call as a reference: DeAIWorkingGroupInternetComputer/WorkingGroupMeetings/2025.02.13 at main · DeAIWorkingGroupInternetComputer/DeAIWorkingGroupInternetComputer · GitHub
- NVIDIA Data Center GPU Resource Center
- example server: MiTAC TN85B8261 B8261T85E8HR-2T-N Overview
- another example server: https://www.gigabyte.com/Enterprise/Rack-Server/XV23-ZX0-AAJ1-rev-3x#Overview
- NVIDIA H100 Tensor Core GPU
- AMD Instinct MI300X accelerator: https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html
- Tenstorrent Wormhole™
- https://www.modular.com/
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog
- Decreasing HTTP Outcall Latency and Cost - #7 by lastmjs
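Related to the max-tokens-per-update-call investigation linked above, here is a minimal sketch (assuming the ic-cdk and candid Rust crates as dependencies) of how an on-chain inference endpoint might cap the work done in a single update call by watching the instruction counter and letting the caller resume generation in a follow-up call. The instruction budget and the next_token helper are hypothetical placeholders, not llama_cpp_canister’s actual API:

```rust
use std::cell::RefCell;

// Hypothetical per-call budget, kept well below the subnet's update-call instruction limit.
const INSTRUCTION_BUDGET: u64 = 15_000_000_000;

thread_local! {
    // Output accumulated so far for a single illustrative session.
    static OUTPUT: RefCell<String> = RefCell::new(String::new());
}

/// Generate up to `max_tokens` tokens for `prompt`, stopping early if this
/// update call is close to its instruction budget; the caller re-invokes the
/// method until generation is complete.
#[ic_cdk::update]
fn generate(prompt: String, max_tokens: u64) -> String {
    let mut produced: u64 = 0;
    while produced < max_tokens {
        if ic_cdk::api::instruction_counter() > INSTRUCTION_BUDGET {
            break; // leave the remaining tokens for the next update call
        }
        // Hypothetical stand-in for one llama.cpp decoding step.
        let token = next_token(&prompt);
        OUTPUT.with(|out| out.borrow_mut().push_str(&token));
        produced += 1;
    }
    OUTPUT.with(|out| out.borrow().clone())
}

// Placeholder token generator so the sketch is self-contained.
fn next_token(_prompt: &str) -> String {
    " token".to_string()
}
```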