Hi everyone,
I wanted to share a research project I’ve been working on called Walsh - an experimental LLM architecture designed specifically to fit within the constraints of the Internet Computer while exploring hyper-complex algebras.
The “Why”
Running LLMs on-chain is hard. You have strict instruction limits (20B instructions per round) and limited memory bandwidth. Standard FP16 models are too heavy.
Walsh approaches this with two aggressive techniques:
1. 1.58-bit Ternary Weights: Following the BitNet b1.58 paper, weights are constrained to `{-1, 0, 1}`, forcing the model to learn efficiently and cutting memory-bandwidth requirements (see the first sketch after this list).
2. Octonion (8D) Algebra: Instead of treating latent dimensions as independent, we group them into 8-component hypercomplex numbers via the Cayley-Dickson construction. This enforces structured geometric relationships in the latent space, potentially allowing us to compress more “intelligence” into fewer parameters (see the second sketch after this list).
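For context, here is a minimal PyTorch sketch of the kind of absmean ternarization described in the BitNet b1.58 paper. This is an illustration under my own assumptions, not code from the Walsh repo, and the `TernaryLinear` name is hypothetical:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternarization in the BitNet b1.58 style: scale by the mean
    absolute weight, then round-and-clip each weight to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

class TernaryLinear(torch.nn.Linear):
    """Linear layer that ternarizes its weights on the fly (hypothetical example)."""
    def forward(self, x):
        w_q, scale = ternary_quantize(self.weight)
        # Straight-through estimator: ternary values in the forward pass,
        # full-precision gradients flowing to self.weight in the backward pass.
        w_ste = self.weight + (w_q * scale - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)
```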
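And a small NumPy sketch of Cayley-Dickson multiplication (one common sign convention): with 8-component inputs it is exactly octonion multiplication, and unrolling the recursion gives the 64 real sub-multiplications mentioned below. Again, an illustrative sketch, not the repo's implementation:

```python
import numpy as np

def cd_conj(x: np.ndarray) -> np.ndarray:
    # Hypercomplex conjugate: negate every component except the real part.
    out = -x.copy()
    out[0] = x[0]
    return out

def cd_mul(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cayley-Dickson product (a, b)(c, d) = (a*c - conj(d)*b, d*a + b*conj(c)).
    For length-8 inputs this is octonion multiplication and bottoms out
    in 8 x 8 = 64 real multiplications."""
    n = len(x)
    if n == 1:
        return x * y
    a, b = x[:n // 2], x[n // 2:]
    c, d = y[:n // 2], y[n // 2:]
    return np.concatenate([
        cd_mul(a, c) - cd_mul(cd_conj(d), b),
        cd_mul(d, a) + cd_mul(b, cd_conj(c)),
    ])
```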
The Training Stack (Python + CUDA)
Before reaching the IC, we train these models in PyTorch. The octonion algebra expands every multiply into 64 separate real sub-multiplications, which makes a naive implementation expensive. To make this viable, we wrote **custom fused CUDA kernels** using Triton.
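Walsh's actual kernels fuse the full octonion product; as a much-simplified illustration of the fused pattern, and of accumulating partial sums in FP32 before the final scaled write (see the precision note below), here is a minimal Triton ternary matvec sketch. The kernel name, block size, and launch wrapper are all assumptions for the example:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def ternary_matvec_kernel(w_ptr, x_ptr, out_ptr, scale, K, BLOCK_K: tl.constexpr):
    # One program instance per output row; partial products are accumulated
    # in fp32 and written out only once, after scaling.
    row = tl.program_id(0)
    acc = tl.zeros((BLOCK_K,), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        offs = k0 + tl.arange(0, BLOCK_K)
        mask = offs < K
        w = tl.load(w_ptr + row * K + offs, mask=mask, other=0).to(tl.float32)  # ternary {-1, 0, 1}
        x = tl.load(x_ptr + offs, mask=mask, other=0.0)
        acc += w * x
    tl.store(out_ptr + row, tl.sum(acc, axis=0) * scale)

def ternary_matvec(w_q: torch.Tensor, x: torch.Tensor, scale: float) -> torch.Tensor:
    """w_q: (N, K) ternary weights on the GPU, x: (K,) activations."""
    N, K = w_q.shape
    out = torch.empty(N, device=x.device, dtype=torch.float32)
    ternary_matvec_kernel[(N,)](w_q, x.float(), out, scale, K, BLOCK_K=256)
    return out
```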
Key Metrics:
- Inference Latency: 0.35ms (Fused) vs 2.10ms (Standard) → 6x Speedup
- Training Speed: 106ms/iter vs 646ms/iter on the TinyShakespeare dataset → 6x Speedup
- Precision: The fused kernel accumulates intermediate results in FP32 before quantization, leading to significantly cleaner gradients than the standard PyTorch autograd implementation.
The repository includes a complete training suite (`train.py`, `prepare.py`) that can train everything from TinyShakespeare up to 124M-parameter models (GPT-2 Small scale) on a local GPU with >= 12GB of VRAM. The system automatically detects CUDA availability and swaps in the optimized kernels.
The IC Inference Engine
We then built a custom Rust inference engine from scratch (no tch-rs/candle) to run the trained models strictly in Wasm.
Performance
- Batched Inference: We achieved 1.51 tokens/sec aggregate throughput by running concurrent user sessions.
- Optimization: We implemented a custom fused MatMul loop that inverts the standard matrix-multiplication loop order to minimize weight-loading overhead, giving a 3x speedup on the IC (see the first sketch after this list).
- KV Cache: Incremental attention with a fully implemented KV cache, so each new token reuses cached keys and values instead of recomputing the whole prefix, and generation speed doesn’t degrade as the context grows (see the second sketch after this list).
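The post doesn't show the inverted loop itself, so the following is only one plausible reading: make the weights the outer loop so each weight row is streamed once and reused across every concurrent session in the batch. A plain-Python sketch standing in for the Rust engine's logic, with hypothetical names:

```python
def batched_matvec_weight_stationary(w_rows, xs):
    """Weight-stationary loop order: stream each weight row exactly once and
    apply it to every active session's activation vector, instead of
    re-reading the whole weight matrix once per session.
    w_rows: N rows of length K (ternary values); xs: B activation vectors of length K."""
    outs = [[0.0] * len(w_rows) for _ in xs]
    for i, row in enumerate(w_rows):      # outer loop: each weight row is loaded once
        for b, x in enumerate(xs):        # inner loop: the row is reused across the batch
            outs[b][i] = sum(w * xk for w, xk in zip(row, x))
    return outs
```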
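And a NumPy sketch of the incremental-attention step behind a KV cache: each decode step appends one key/value row and scores the new query only against the cache, so nothing from the prefix is recomputed. Single-head and illustrative only, not the engine's Rust code:

```python
import numpy as np

def attend_incremental(q, k_new, v_new, cache):
    """One single-head decode step. cache holds 'K' and 'V' arrays of shape (t, d);
    we append the new token's key/value and attend over the cached keys."""
    cache["K"] = np.vstack([cache["K"], k_new[None, :]])
    cache["V"] = np.vstack([cache["V"], v_new[None, :]])
    K, V = cache["K"], cache["V"]
    scores = K @ q / np.sqrt(q.shape[-1])   # (t,) similarity to every cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                            # (d,) attention output for the new token

# Usage: start with empty (0, d) arrays and grow the cache one row per generated token.
d = 64
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
```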
Architecture
- Chunked Execution: The forward pass is split into chunks (e.g., 8 layers at a time) so each call stays within the block instruction limit (sketched after this list).
- Session Management: A custom session manager tracks generation states for multiple users, allowing the canister to serve parallel requests without blocking.
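As a rough illustration of chunked execution (Python standing in for the Rust canister code; the chunk size and names are assumed): run a fixed slice of layers per call, persist the hidden state, and resume from the returned index on the next call.

```python
CHUNK_SIZE = 8  # layers per call, chosen to stay under the per-call instruction budget

def forward_chunk(layers, hidden, start):
    """Run layers[start:start+CHUNK_SIZE] and report where to resume.
    The caller persists `hidden` and the returned index between canister calls,
    so a full forward pass is spread over ceil(len(layers) / CHUNK_SIZE) calls."""
    end = min(start + CHUNK_SIZE, len(layers))
    for layer in layers[start:end]:
        hidden = layer(hidden)
    return hidden, end  # done once end == len(layers)
```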
Next Steps
Use cases like this show that the IC can do heavy compute if you optimize specifically for the Wasm environment. We are currently working on porting the inference engine directly to the browser (Client-side Wasm) to eliminate network latency entirely, using the IC for model distribution and coordination.
The code is open source and available here: https://github.com/pulseofthemachine/Walsh-Research/