Walsh: Hypercomplex LLM Inference on the IC (1.58-bit Quantized)

Hi everyone,

I wanted to share a research project I’ve been working on called Walsh - an experimental LLM architecture designed specifically to fit within the constraints of the Internet Computer while exploring hypercomplex algebras.

The “Why”

Running LLMs on-chain is hard. You have strict instruction limits (20B per round) and limited memory bandwidth. Standard FP16 models are too heavy.

Walsh approaches this with two aggressive techniques:

1. 1.58-bit Ternary Weights: Following the BitNet b1.58 paper, weights are constrained to `{-1, 0, 1}`, forcing the model to learn efficiently and reducing memory bandwidth (a minimal quantizer sketch follows this list).

2. Octonion (8D) Algebra: Instead of independent dimensions, we use Cayley-Dickson algebras. This enforces structured geometric relationships in the latent space, potentially allowing us to compress more “intelligence” into fewer parameters.
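
To make item 1 concrete, here is a minimal sketch of BitNet-b1.58-style absmean quantization with a straight-through estimator. It is illustrative only; the exact quantizer in the training code may differ in details:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Map a float weight tensor to {-1, 0, +1} plus a single absmean scale."""
    scale = w.abs().mean().clamp(min=eps)      # gamma = mean(|W|)
    w_q = (w / scale).round().clamp(-1, 1)     # ternary values in {-1, 0, +1}
    return w_q, scale                          # dequantize as w_q * scale

def ste_ternary(w: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: ternary forward pass, identity gradient."""
    w_q, scale = ternary_quantize(w)
    return w + (w_q * scale - w).detach()
```

The straight-through estimator lets full-precision master weights keep receiving gradients while the forward pass only ever sees ternary values.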

The Training Stack (Python + CUDA)

Before reaching the IC, we train these models in PyTorch. The octonion algebra involves 64 separate sub-multiplications for what would otherwise be a single dot product. To make this viable, we wrote **custom fused CUDA kernels** in Triton.
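
For context, here is a minimal NumPy sketch of the plain Cayley-Dickson product (not the fused Triton kernel): for 8-dimensional inputs the recursion bottoms out in exactly 8 × 8 = 64 scalar multiplications, which is the cost the fused kernels are built to hide.

```python
import numpy as np

def cd_conj(x):
    """Cayley-Dickson conjugate: keep the real part, negate the imaginary parts."""
    out = -np.asarray(x, dtype=float)
    out[0] = x[0]
    return out

def cd_mul(x, y):
    """Recursive Cayley-Dickson product of two 2^k-dimensional coefficient vectors.
    For k=3 (octonions) it performs exactly 8 * 8 = 64 scalar multiplications at the leaves."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    if n == 1:
        return x * y
    h = n // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    # (a, b)(c, d) = (ac - conj(d) b,  d a + b conj(c))
    return np.concatenate([cd_mul(a, c) - cd_mul(cd_conj(d), b),
                           cd_mul(d, a) + cd_mul(b, cd_conj(c))])

# Two random octonions multiply into another 8D octonion:
z = cd_mul(np.random.randn(8), np.random.randn(8))
```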

Key Metrics:

- Inference Latency: 0.35ms (Fused) vs 2.10ms (Standard) → 6x Speedup

- Training Speed: 106ms/iter vs 646ms/iter on TinyShakespeare dataset → 6x Speedup

- Precision: The fused kernel accumulates intermediate results in FP32 before quantization, leading to significantly cleaner gradients than the standard PyTorch autograd implementation.

The repository includes a complete training suite (`train.py`, `prepare.py`) that can train everything from TinyShakespeare up to 124M parameter models (GPT-2 Small scale) on a local GPU with >= 12GB VRAM. The system automatically detects CUDA availability and swaps in the optimized kernels.

The IC Inference Engine

We then built a custom Rust inference engine from scratch (no tch-rs/candle) to run the trained models strictly in Wasm.

Performance

- Batched Inference: We achieved 1.51 tokens/sec aggregate throughput by running concurrent user sessions.

- Optimization: We implemented a custom fused MatMul loop that inverts the loop order of a standard matrix multiplication to minimize weight-loading overhead, giving a 3x speedup on the IC.

- KV Cache: Fully implemented incremental attention with a key/value cache, so each step only processes the newest token instead of re-running the whole prefix, keeping generation speed roughly flat as the context grows.

Architecture

- Chunked Execution: The forward pass is split into chunks (e.g., 8 layers at a time) to respect the block instruction limit.

- Session Management: A custom session manager tracks generation states for multiple users, allowing the canister to serve parallel requests without blocking.

Next Steps

Use cases like this show that the IC can do heavy compute if you optimize specifically for the Wasm environment. We are currently working on porting the inference engine directly to the browser (Client-side Wasm) to eliminate network latency entirely, using the IC for model distribution and coordination.

The code is open source and available here: https://github.com/pulseofthemachine/Walsh-Research/

UPDATE:

We’ve removed session management, instead focusing on single-user optimization. On a local replica (running on consumer-grade hardware) we were able to achieve a maximum of 4.76 tok/sec on the 860k parameter TinyShakespeare model, generating 50 tokens in a single update call, without breaking the instruction limit. Results of scaling to larger models, and performance on mainnet, are still unknown.

UPDATE 2: Adaptive Instruction Monitoring + Octonion Attention

Performance Improvements

We’ve shipped several optimizations to handle any model size within ICP’s 40B instruction limit:

Adaptive Chunking

  • The inference engine now monitors instruction usage in real-time using performance_counter(0)

  • Automatically pauses at 60% budget and returns partial results

  • Caller loops until done - works for 8-layer or 124-layer models without code changes
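
To make the control flow concrete, here is a hedged Python sketch of the pattern. The real engine is Rust inside the canister; `instructions_used` stands in for the IC's performance_counter(0) system API, and `ForwardState`/`run_layer` are illustrative names rather than identifiers from the repo:

```python
from dataclasses import dataclass

BUDGET = 40_000_000_000            # per-call instruction limit
PAUSE_AT = int(0.60 * BUDGET)      # pause threshold described above

@dataclass
class ForwardState:
    num_layers: int
    next_layer: int = 0            # where to resume on the next call

def run_layer(state: ForwardState) -> None:
    """Stand-in for executing one transformer layer of the forward pass."""
    state.next_layer += 1

def forward_chunk(state: ForwardState, instructions_used) -> bool:
    """Run layers until ~60% of the budget is spent; return True when finished."""
    while state.next_layer < state.num_layers:
        if instructions_used() >= PAUSE_AT:
            return False           # partial result: the caller must call again
        run_layer(state)
    return True

# Caller side: on the IC each iteration would be a separate update call.
state = ForwardState(num_layers=124)
while not forward_chunk(state, instructions_used=lambda: 0):
    pass
```

Because the resume point lives in the returned state rather than in the code, the same loop handles 8-layer and 124-layer models, which is the property described above.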

Optimized Bitmask MatMul

  • Eliminated expensive division operations in the hot loop (critical for WASM)

  • Counter-based iteration instead of index recalculation per weight

  • 8-position skip when bitmask byte is zero
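
Roughly, the hot loop now looks like the following Python sketch, assuming a layout in which one bitmask byte covers 8 weights (row-major) and the signs of the nonzero weights are stored separately as +1/-1 values; the actual .spinnet layout and the Rust loop may differ:

```python
import numpy as np

def bitmask_ternary_matvec(bitmask, nz_signs, x, rows, cols):
    """y = W @ x for a ternary W stored as (bitmask of nonzeros, signs of nonzeros).

    bitmask  - one byte per 8 weights, row-major; bit set => weight != 0
    nz_signs - int8 array of +1/-1, one entry per nonzero weight, in order
    """
    y = np.zeros(rows, dtype=np.float32)
    r, c = 0, 0          # running row/col counters: no div/mod per weight
    nz = 0               # index into nz_signs
    for byte in bitmask:
        if byte == 0 and c + 8 < cols:
            c += 8                       # skip 8 zero weights at once
            continue
        for bit in range(8):
            if byte & (1 << bit):
                y[r] += nz_signs[nz] * x[c]
                nz += 1
            c += 1
            if c == cols:                # row rollover without recomputing indices
                c = 0
                r += 1
                if r == rows:            # any trailing padding bits are zero
                    return y
    return y
```

The `byte == 0` fast path and the running `r`/`c` counters correspond to the two optimizations listed above: they replace a per-weight `index / cols` division, which is what was expensive under Wasm.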

Results (3.2M param model):

  • 1.73 tok/s sustained over 100 tokens on local replica

  • Adaptive chunking worked correctly, pausing at ~60% budget per call


New: Octonion Attention (Experimental)

We’re testing a new Octonion Head Mixing mechanism:

Standard Attention:

q, k, v → scaled_dot_product → concat heads → output

SpinNet + Octonion Attention:

q, k, v → scaled_dot_product → OCTONION MIX HEADS → output

After attention computes 8 heads, we mix them using Cayley-Dickson algebra (same structure as our linear layers). This introduces non-commutativity at the head interaction level.

Early A/B Results:

| Model | Val Loss @ 200 |
|---|---|
| Baseline | 2.380 |
| + Octonion Attention | 2.367 |

That’s a 0.5% improvement just from structured head mixing. Training continues on TinyStories (token-level, GPT-2 tokenizer) for a proper evaluation.


Code

All changes are on GitHub: pulseofthemachine/SpinNet-Research

Next: Training 29M TinyStories model with octonion attention to see if the improvement scales.

UPDATE 3: TinyStories Training Complete + Compression Analysis

Training Results

Trained a 28.9M parameter SpinNet model on TinyStories (GPT-2 tokenizer):

| Metric | Value |
|---|---|
| Final Train Loss | 2.68 |
| Final Val Loss | 2.68 |
| Parameters | 28.9M |
| Training | 10k iterations (~0.7 epochs) |

Comparison to TinyStories paper baseline (33M, 4 layers):

| Model | Epochs | Val Loss |
|---|---|---|
| Baseline (FP32) | 20 | 1.20 |
| SpinNet (Ternary) | 0.7 | 2.68 |

We trained for 29x fewer epochs and achieved coherent story output. The model produces grammatical English with named characters, dialogue, and plot - despite seeing less than one pass through the data.

No overfitting! Train and val loss converged together. During early training we observed val loss lower than train loss - ternary quantization acts as strong regularization, preventing memorization.


Compression Results

| Format | Size | Compression |
|---|---|---|
| TinyStories-33M (HuggingFace) | 291 MB | Baseline |
| SpinNet PyTorch (FP32) | 331 MB | - |
| SpinNet Ternary (.spinnet) | 50 MB | 6x smaller than baseline |

Despite having similar parameter counts, SpinNet’s bitmask-sparse ternary format is 6x smaller than the published baseline model.

Breakdown:

  • Embeddings: FP16 (required for vocab lookup)

  • Transformer weights: 2-bit ternary + bitmask (70% sparse)

  • Average layer compression: 34-37% smaller than packed ternary

This 50 MB model fits comfortably within ICP canister memory limits.


Octonion Dimension Specialization (Novel Finding!)

We built an analysis tool to probe which octonion dimensions encode which linguistic concepts:

| Category | Most Active Dimensions |
|---|---|
| Nouns | e₀, e₁, e₇ |
| Verbs | e₀, e₇, e₁ |
| Pronouns | e₀, e₇, e₂ |
| Emotions | e₀, e₁, e₃ |
| Dialogue | e₀, e₂, e₁ |
| Story Structure | e₀, e₁, e₆ |

Key findings:

  • e₀ (real): Base representation - highest activation for all categories

  • e₇: Tracks specificity (strong for nouns, verbs)

  • e₃: Encodes semantic/emotional content

  • e₂: Dialogue and pronoun structure

This specialization emerged without explicit supervision. The Cayley-Dickson algebra forces dimensions to interact in structured ways, and the model learns to exploit this for different linguistic features.


Implications

  1. On-Chain Viability: 50 MB model fits in canister memory. Combined with adaptive instruction chunking, we can run real transformer inference on ICP.

  2. Inductive Bias Works: Octonion dimensions specialize for different linguistic concepts. This suggests hypercomplex algebras provide useful structure for language models.

  3. Compression + Quality: Ternary quantization + 70% sparsity shrinks the checkpoint by ~85% (331 MB FP32 → 50 MB .spinnet) with no quality loss, since the model is trained with ternary weights from the start.

  4. Scalability Question: Does this specialization pattern hold at larger scales? Next test: 124M Scholar model.


Sample Output

PROMPT: Once upon a time, there was a
OUTPUT: Once upon a time, there was a little girl named Lily. 

Lily was so excited to have a picnic on the beach...

When they got to the beach, they saw a dolphin swimming in the sea.

Coherent stories with named characters, dialogue, and plot structure.

Code

GitHub: pulseofthemachine/SpinNet-Research

New tools:

  • tools/analyze_octonion.py - Dimension specialization analysis

Today’s Milestone: Coherent On-Chain Generation

We finally got the full stack working end-to-end:

Python (CUDA)

Once upon a time and there was a little girl. She loved to play with her toys and make music...

Rust/WASM (Internet Computer)

Once upon a time, there was a little girl named Lily. She liked to play outside and play with her friends...

Both produce coherent children’s stories from the same 29M parameter model trained on TinyStories.


What We Shipped

1. Octonion Head Mixer in Rust

The key innovation is mixing attention heads using Cayley-Dickson algebra:

y[i] = Σ_j sign[i,j] × x[j] @ W[idx[i,j]]

Where sign and idx are the 8×8 octonion multiplication tables. This gives us structured cross-head communication that reduces loss by ~8% compared to standard attention.

Implementing this correctly in Rust was tricky—matrix indexing had to match PyTorch’s convention exactly.
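
For readers who want to experiment with the mixing rule outside Rust, here is a self-contained NumPy sketch of the formula above. The sign/idx tables are rebuilt from the Cayley-Dickson product (the same recursion sketched earlier in the thread); the Rust and PyTorch code may use a different but equivalent convention:

```python
import numpy as np

def _cd_mul(x, y):
    """Cayley-Dickson product of two 2^k-dimensional coefficient vectors."""
    n = len(x)
    if n == 1:
        return x * y
    h = n // 2
    conj = lambda v: np.concatenate(([v[0]], -v[1:]))   # negate imaginary parts
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    return np.concatenate([_cd_mul(a, c) - _cd_mul(conj(d), b),
                           _cd_mul(d, a) + _cd_mul(b, conj(c))])

def octonion_tables(dim=8):
    """Tables such that e_i * e_j = sign[i, j] * e_{idx[i, j]} for the 8 basis units."""
    eye = np.eye(dim)
    sign = np.zeros((dim, dim), dtype=np.int64)
    idx = np.zeros((dim, dim), dtype=np.int64)
    for i in range(dim):
        for j in range(dim):
            prod = _cd_mul(eye[i], eye[j])      # exactly one nonzero entry, +1 or -1
            idx[i, j] = int(np.abs(prod).argmax())
            sign[i, j] = int(np.sign(prod[idx[i, j]]))
    return sign, idx

def octonion_head_mix(heads, W):
    """heads: list of 8 arrays [T, d_head]; W: list of 8 mixing matrices [d_head, d_head].
    Implements y[i] = sum_j sign[i, j] * heads[j] @ W[idx[i, j]]."""
    sign, idx = octonion_tables()
    return [sum(sign[i, j] * heads[j] @ W[idx[i, j]] for j in range(8))
            for i in range(8)]

# Toy usage: 8 heads of shape [T=4, d_head=16], 8 shared mixing matrices.
heads = [np.random.randn(4, 16) for _ in range(8)]
W = [np.random.randn(16, 16) / 4 for _ in range(8)]
mixed = octonion_head_mix(heads, W)             # list of 8 arrays, [4, 16] each
```

Swapping i and j changes both the sign and the weight index, which is the non-commutativity at the head-interaction level mentioned above.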

2. GPT-2 Tokenizer in Rust

Added an auto-detecting tokenizer that switches between:

  • Char-level (for vocab ≤ 256)

  • GPT-2 BPE (for larger vocab)

The GPT-2 vocab is embedded as base64 in the binary—no external files needed.

3. Temperature Sampling

Replaced greedy argmax with softmax + multinomial sampling (T=0.8). This eliminated repetition loops.
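
For reference, the sampling step amounts to the following (a minimal NumPy sketch, not the Rust implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    """Softmax at temperature T, then draw one token id (multinomial sampling)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                              # numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

At T=0.8 the distribution is sharpened but still stochastic, which is what breaks the repetition loops that greedy argmax fell into.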


Performance

| Metric | Value |
|---|---|
| Model Size | 29M params |
| Compressed Size | 50 MB (.spinnet) |
| Sparsity | 58% zeros |
| ICP Speed (Local Replica) | 0.7 tok/s |
| CUDA Speed | ~14 tok/s |

Try It

Clone the repo and deploy to local replica:

git clone https://github.com/pulseofthemachine/SpinNet-Research
cd SpinNet-Research/inference
dfx start --background
dfx deploy
./verify_single_user.sh "Once upon a time" 50

UPDATE 4: Hadamard 32D + Hash Embeddings — Extreme Compression

Big progress over the holidays! We’ve pushed SpinNet’s compression even further with two new techniques.

New: Hadamard 32D Algebra

We’ve added a second algebra option alongside Octonion 8D:

| Algebra | Dimension | Compression | Mixing Complexity |
|---|---|---|---|
| Octonion | 8D | 1/8th params | O(n²) |
| Hadamard | 32D | 1/32nd params | O(n log n) |

Hadamard uses the Fast Hadamard Transform (FHT) for structured mixing — 32 dimensions interact in log₂(32)=5 butterfly stages. This means:

  • 32x parameter compression on linear layers (vs 8x for Octonion)

  • Faster mixing via FHT instead of Cayley-Dickson multiplication

  • Same variance-preserving initialization

    β = sqrt(3/(2×fan_in))
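
As a reference for the FHT mixing described above, here is a minimal NumPy version of the butterfly (the repository's `src/model/fht_cuda.py` is the actual implementation; this is just the textbook recursion):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform over the last axis (length must be a power of 2).
    For length 32 this is log2(32) = 5 butterfly stages instead of a 32x32 matmul."""
    x = np.array(x, dtype=np.float32)        # copy so the input is left untouched
    n = x.shape[-1]
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            a = x[..., start:start + h].copy()
            b = x[..., start + h:start + 2 * h].copy()
            x[..., start:start + h] = a + b
            x[..., start + h:start + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)                    # orthonormal scaling preserves variance

mixed = fwht(np.random.randn(4, 32))         # e.g. mixing 32 sub-channels per position
```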
    

NEW: Hash Embeddings (Experimental)

The embedding layer dominated our parameter count (97% of a 26M model). We implemented composite hash embeddings:

Standard:  embedding[token_id]  → 25.6M params
Hash:      emb_1[hash1(id)] + emb_2[hash2(id)] + emb_3[hash3(id)]  → 1.5M params

Key innovation: Learned Importance Weights

Each token learns which hash table matters most, allowing the model to disambiguate collisions:

```python
weights = importance[token_id]        # [num_tables], learned per token
embedding = sum(emb_tables[i][hash_fns[i](token_id)] * weights[i]
                for i in range(num_tables))
```
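
Fleshed out a little, an illustrative PyTorch module for this idea might look like the sketch below; the table size, the multiplicative hash, and the initialization are placeholder assumptions, not the values from the actual configs:

```python
import torch
import torch.nn as nn

class HashEmbedding(nn.Module):
    """Composite hash embedding: each token id indexes several small tables via
    different hash functions; the rows are summed with per-token learned weights."""

    def __init__(self, vocab_size, dim, num_tables=3, table_size=4096):
        super().__init__()
        self.table_size = table_size
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, dim) for _ in range(num_tables))
        # Learned importance weights: one small vector per real token id,
        # letting colliding tokens weight the tables differently.
        self.importance = nn.Embedding(vocab_size, num_tables)
        nn.init.ones_(self.importance.weight)
        # Fixed odd multipliers act as cheap, distinct hash functions.
        self.register_buffer(
            "mult", torch.randint(1, 2**31 - 1, (num_tables,)) * 2 + 1)

    def forward(self, token_ids):                       # [batch, seq] of int64
        weights = self.importance(token_ids)            # [batch, seq, num_tables]
        out = 0.0
        for i, table in enumerate(self.tables):
            idx = (token_ids * self.mult[i]) % self.table_size   # hash_i(id)
            out = out + weights[..., i : i + 1] * table(idx)
        return out                                      # [batch, seq, dim]
```

The parameter count then comes from a few small tables plus a `vocab_size × num_tables` importance matrix, which is the kind of reduction behind the 25.6M → 1.5M figure above.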

Results: 12L × 1024D TinyStories

| Model | Total Params | Val Loss @ 1400 | VRAM |
|---|---|---|---|
| Standard Hadamard | 26.7M | ~2.2 | 6 GB |
| Hash + Hadamard | 8.5M | 2.38 | 3.6 GB |

Sample output (loss 2.38):

Once upon a time, there was a little girl named Jenny. She loved to explore the world around her. One day, she was walking along when she saw something very interesting. It was a big, yellow duck! She wanted to pet it so badly. Jenny ran over to it and said, "Hello, little duck. Do you want to be my friend?"

Coherent stories, dialogue, named characters — at 68% fewer parameters!

Memory-Efficient Training: Chunked Cross Entropy

Hash embeddings require reconstructing the full vocab embedding for loss computation. We implemented chunked cross entropy that processes vocab in 4K chunks:

  • Never materializes full [batch×seq, 50K] logits tensor

  • Reduces peak VRAM by ~60%

  • Enables training larger models on consumer GPUs
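
A minimal PyTorch sketch of the idea, assuming tied output weights `emb_weight` of shape [vocab, dim] and flattened hidden states (the repository's implementation will differ in detail):

```python
import torch

def chunked_cross_entropy(hidden, emb_weight, targets, chunk_size=4096):
    """Mean NLL of `targets` under logits = hidden @ emb_weight.T, computed one vocab
    chunk at a time so the full [N, vocab] logits tensor is never built at once.
    hidden: [N, dim], emb_weight: [vocab, dim], targets: [N] int64."""
    N, V = hidden.size(0), emb_weight.size(0)
    chunk_lse = []                                   # per-chunk logsumexp, each [N]
    target_logit = hidden.new_zeros(N)
    for start in range(0, V, chunk_size):
        end = min(start + chunk_size, V)
        logits = hidden @ emb_weight[start:end].t()  # [N, end - start]
        chunk_lse.append(torch.logsumexp(logits, dim=-1))
        in_chunk = (targets >= start) & (targets < end)
        if in_chunk.any():
            rows = in_chunk.nonzero(as_tuple=True)[0]
            target_logit[rows] = logits[rows, targets[rows] - start]
    log_z = torch.logsumexp(torch.stack(chunk_lse, dim=-1), dim=-1)   # [N]
    return (log_z - target_logit).mean()             # logsumexp(logits) - logit[target]
```

Note that to fully realize the training-time VRAM savings, the per-chunk logits also need to be recomputed in the backward pass (e.g., via a custom autograd Function or `torch.utils.checkpoint`); the sketch above only shows the chunked forward computation.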

Bonus: Image Generation Pipeline

We’re now exploring whether hash embeddings work for image generation (where there are no “facts” to memorize — only patterns). Created:

  • data/mnist/prepare.py — Tokenizes MNIST to 49 patch tokens per image

  • config/train_mnist_image.py — 6L×256D Hadamard + hash model

  • generate_image.py — Generates images from checkpoint

If hash embeddings work for vision, we could build tiny generative models that are architecturally incapable of “memorizing” training data.

Implications for IC Deployment

Hash embeddings reduce the model size that needs to be stored in canister memory. An 8.5M param model with hash embeddings could potentially fit in a smaller memory footprint than our previous 26M models, making on-chain inference more viable.

Code

Updated on GitHub: SpinNet-Research

New files:

  • src/model/fht_cuda.py — Hadamard 32D implementation

  • config/train_tinystories_hadamard_hash.py — Hash embedding config

  • data/mnist/ — Image generation experiment

WOW. Awesome stuff!

Isn’t the per round instruction limit 40B? Is it really 20?

What is the reason for this limit and is DFINITY planning to raise it in the future or is this not feasible and will slow down the network too much or cause instability?

I believe the guy who was working on the WASP project (porting PHP, MySQL Lite, and WordPress over to run in a canister) stated he needed something like a 200B-per-round limit in order for all the WordPress includes to complete, and therefore he had to end his project due to this limit.

Ideally what would you like to see the instruction limit per round be?

Correct, it is 40B. I was working with outdated info when I originally posted this thread and later learned I could increase batch sizes.

I assume the limitations are because of the hardware running the nodes, and will improve once that improves?

Ideally the instruction limit would be whatever amount is needed to generate text at a speed matching typical human reading speed (~5 tok/sec with the GPT-2 vocab).

Our best benchmark is 0.7 tokens/sec on the old model, but I haven’t implemented the Hadamard algebra in Rust yet. The batched generation + KV cache makes it POSSIBLE to output a significant number of tokens on much larger models, but larger models = smaller batches = much more latency. The additional compression and new architecture should allow for larger batches in theory, but I still need to test this.

200B would be great for 128M param models, maybe overkill for ~30M parameter models.

UPDATE 5: Rebranding to Walsh & 32D Hadamard Inference :rocket:

We are rebranding this project to Walsh, reflecting our shift towards Fast Walsh-Hadamard Transforms as the core computational primitive. We’ve also renamed our repository: Walsh-Research.

Key Technical Milestones:

  1. 32D Hadamard + Hash Embeddings: The Rust inference engine now fully supports 32D Hadamard algebra and learned Hash Embeddings. This allows us to maintain high model expressivity while significantly reducing parameter counts.

  2. Massive Compression: We’ve achieved a ~12.5x reduction in model size. Our TinyShakespeare model was compressed from 11.2 MB down to just 890 KB, making it exceptionally light for on-chain storage and execution.

  3. High-Performance Inference: On the local IC replica, we are now hitting ~34 tokens per second. This is made possible by optimized 1.58-bit (ternary) matmuls and a robust KV-cached, burst-mode generation flow.

Sample Output (TinyShakespeare):

PROMPT: "ROMEO:"
═══════════════════════════════════════════════════════════════

OUTPUT (~256 tokens):

═══════════════════════════════════════════════════════════════

Then in if metter, and fight theest; is hou mostrrow it's with templany,

But yours that be that the doves the sauch forford:

That be bold the been in your's liffence

For heartivender mine will whose him thee,

But my commoself to the be them seet and

Sout 

═══════════════════════════════════════════════════════════════

STATS:

  Estimated tokens: 256

  Duration: 7.57s

  Speed: 33.65 tok/s

═══════════════════════════════════════════════════════════════

The next phase of research will focus on scaling these techniques to the TinyStories dataset to evaluate narrative coherence with 32D Hypercomplex algebras.
