Parallel execution for canisters

Hello everyone,

I would like to know if there are any plans to implement parallel execution for canisters?

Especially for hosting Liquid AI models on chain (as opposed to in a TEE), e.g. parallel processing within a single canister.

Looking forward to hearing from you,
Kurt

1 Like

Hey Kurt,
Yes, parallel execution is probably the main way to speed up computations these days. The big downside, though, is non-determinism: you might easily get slightly different results across replicas.
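
To illustrate the non-determinism point (a minimal Rust sketch of the underlying issue, not of how the runtime works): floating-point addition isn't associative, so any parallel reduction whose combination order can vary may disagree in the low-order bits across replicas.

// Summing the same f32 values in two different orders.
fn main() {
    let xs: Vec<f32> = (0..1_000_000).map(|i| (i as f32) * 1e-3).collect();

    // Sequential left-to-right sum.
    let forward: f32 = xs.iter().sum();

    // Same values combined in a different order (as a parallel reduction tree might).
    let backward: f32 = xs.iter().rev().sum();

    // The two results can differ in the low-order bits, which is enough to
    // break byte-for-byte agreement between replicas.
    println!("{forward} vs {backward}, equal = {}", forward == backward);
}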

Wasmtime (the engine we’re using) has pretty good support for threads, but direct multi-threading in a canister isn’t on the roadmap. We’ve been looking into other options, @ielashi can give you more info.

2 Likes

Thanks for getting back to me on this …

Parallel execution’s the speed king, but non-determinism’s a pain: different results every time, ugh.
Wasmtime handles threads well, but canister multi-threading? Not happening. What about scheduling? Any smart tricks there to boost performance without the parallel mess? @ielashi, thoughts?

1 Like

Doesn’t multithreading within a canister defeat The Actor Model on which they are based?

Canisters as actors

Canisters are much like actors when thinking about them abstractly. Following the actor model of concurrent computation, canisters respond to messages they receive by performing one or more of the following actions:

  • Modifying their private state.
  • Sending messages to other canisters (actors).
  • Creating more canisters (actors).

Although canisters have a single thread of execution, multiple can be executed concurrently. This is a key feature of ICP that overcomes the scaling challenges of other smart contract platforms.

No, single-threading is not a critical feature of an actor model. A multi-threaded canister can still modify its private state, send messages, and create new canisters. The execution mode is an implementation detail of an actor.

2 Likes

I thought the whole idea of actors is that they’re supposed to be single threaded and that you get concurrency and parallelism via systems of actors? At least that was my impression a few years ago when I played around with the likes of Akka.NET.

Update: I’m wrong, see my follow-up.

Asked my Chinese friend and he came up with the following:

The AI Challenge on ICP

Training and inference require heavy computation while demanding:

  1. Numerical Stability (identical results across 100+ nodes)
  2. Model Accuracy Preservation (no floating point divergence)
  3. Deterministic Data Flows (for consensus-critical operations)

AI-Specific Optimization Strategies

  1. Deterministic Batch Processing
    For inference/prediction workloads
  • Technique:
    • Split input data into fixed-size batches using hash-based ordering
    • Process batches sequentially with async checkpointing
  • Accuracy Safeguard:

Code …

// Stable batch ordering using a Blake3 hash of each batch
// (assumes `inputs: &[u8]`; `sorted_by_key` comes from the itertools crate)
let batches: Vec<&[u8]> = inputs
    .chunks(100)
    .sorted_by_key(|b| *blake3::hash(b).as_bytes())
    .collect();
let results: Vec<Prediction> = batches
    .iter()
    .map(|batch| model.predict(batch)) // deterministic ops only
    .collect();
  2. Fixed-Precision Math
    Prevent floating point divergence (see the fixed-point sketch after this list)
  • Approach:
    • Use 16-bit/32-bit fixed point math libraries
    • Implement deterministic rounding modes
  • Tools:
    • fixed crate for Rust
    • Custom WASM SIMD kernels for tensor ops
  3. Model Sharding
    Distribute large models across canisters
  • Pattern:
    • Vertical split: Layers 1-3 on Canister A, Layers 4-6 on Canister B
    • Horizontal split: Head layers replicated, dense layers distributed
  • Accuracy Check:

Code …

// Cross-canister gradient validation
// (`assert_relative_eq!` comes from the `approx` crate)
let grad_a = canister_a.compute_gradients(batch).await;
let grad_b = canister_b.compute_gradients(batch).await;
assert_relative_eq!(grad_a, grad_b, epsilon = 1e-6);
  4. Deterministic Training
    Reproducible model updates (see the ordered-aggregation sketch after this list)
  • Key Methods:
    • Fixed random seeds with ic-random (network-provided entropy)
    • Gradient averaging via ordered aggregation
    • Consensus-based checkpointing
  • Workflow:
    1. Compute local gradients (sorted input order)
    2. Async aggregation with CRC32 checksums
    3. Update model weights using deterministic optimizer steps

  5. Hybrid Quantization
    Balance speed vs precision
  • Strategy:
    • FP32 for sensitive layers (attention heads in transformers)
    • INT8 for dense layers with calibration
  • Rust Example:

Code …

// Hypothetical helpers: `quantize_layer` / `dequantize` are sketches, not a real crate API
let scale = 0.0125; // calibrated per layer
let zero_point = 64;
let quantized = quantize_layer(&weights, scale, zero_point);
let output = dequantize(&quantized, scale, zero_point);
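
For item 2 above, here is a minimal fixed-point sketch (my own, using the fixed crate from the Tools list) showing why integer-backed arithmetic sidesteps floating point divergence:

Code …

use fixed::types::I16F16; // 16 integer bits, 16 fractional bits

fn main() {
    // Fixed-point values are plain integers under the hood, so the same
    // sequence of operations is bit-identical on every replica.
    let lr = I16F16::from_num(0.01);
    let grad = I16F16::from_num(1.25);
    let mut weight = I16F16::from_num(0.5);
    weight -= lr * grad; // deterministic update, no rounding-mode surprises
    println!("{weight}");
}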
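
And for item 4's workflow, a rough sketch (again my own, not an ICP API; crc32fast and bytemuck are the assumed crates) of gradient averaging via ordered aggregation with CRC32 checksums:

Code …

// Hypothetical aggregation step: sort contributions by sender id so every
// replica folds them in the same order, checksumming each one with CRC32.
fn aggregate(mut grads: Vec<(u64, Vec<f32>)>) -> Option<Vec<f32>> {
    grads.sort_by_key(|(sender, _)| *sender); // fixed, deterministic order
    let n = grads.len() as f32;
    let len = grads.first()?.1.len();
    let mut avg = vec![0.0f32; len];
    for (_, g) in &grads {
        // Integrity check on the raw bytes of this contribution.
        let _crc = crc32fast::hash(bytemuck::cast_slice(g));
        for (a, v) in avg.iter_mut().zip(g) {
            *a += v / n;
        }
    }
    Some(avg)
}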

Updates run sequentially (one at a time per canister), but queries can run in parallel.
So if you are able to make use of queries then you're all sorted.
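
For instance (a minimal Rust sketch with ic-cdk; the names are just illustrative):

use std::cell::RefCell;

thread_local! {
    static COUNTER: RefCell<u64> = RefCell::new(0);
}

// Update calls go through consensus and execute one at a time per canister.
#[ic_cdk::update]
fn increment() -> u64 {
    COUNTER.with(|c| {
        *c.borrow_mut() += 1;
        *c.borrow()
    })
}

// Query calls are read-only, skip consensus, and can be served in parallel.
#[ic_cdk::query]
fn get() -> u64 {
    COUNTER.with(|c| *c.borrow())
}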

1 Like

I did explore optimizing the execution of AI inference, and one optimization that would really make a difference would be to support multi-core matrix multiplication, as that’s where 99% of all inference computation lies. Having said that, even with that optimization, don’t expect to be able to run models bigger than ~1B parameters efficiently, as for those we’d really need beefier hardware than the current nodes.

Computation aside, there’s also the I/O problem, so even with beefier hardware that allows us to run bigger models, swapping in/out these models to/from VRAM could be very expensive (a 70B parameter model would need anywhere between 10-20 seconds just to be loaded in VRAM). That’s another challenge that we’d still have even with beefier hardware.
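
As a rough back-of-envelope (illustrative numbers only): a 70B-parameter model at 2 bytes per weight (fp16) is about 140 GB, so at an effective transfer rate of 7-14 GB/s you are looking at roughly 10-20 seconds just to move the weights.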

That’s why for now we’re exploring the approach of AI workers as outlined here. With this approach, we’d support a handful of foundational LLMs rather than support anyone running any model. It’s a limitation, but it does allow these workers to have these LLMs consistently loaded in RAM for faster inference.

Are you interested specifically in training?

4 Likes

Hi Islam,
Thanks for the detailed reply—I really appreciate the insight into the current constraints and the AI workers approach! I’m working on an AI replica startup targeting the bereavement market, where privacy is paramount. My goal is to run Liquid AI models fully on-chain (max 8GB) on ICP, leveraging the security and control of a canister-based setup as our unique selling point. Inference is the priority, but some on-chain fine-tuning is also key to adapt models to individual users, with roughly 60-80% of each model’s structure being replicated across instances for consistency.
Your response got me thinking about how to make this work within ICP’s current framework, so I’ve got a few follow-up questions:
Model Swapping Optimization
For your pre-loaded foundational LLMs in the AI workers setup:
  • Are you exploring model partitioning (e.g., sharding layers across workers) to reduce per-node memory pressure? This could help fit an 8GB model efficiently.
  • Could quantized layers or adaptors (like LoRA) keep base models resident in RAM while swapping smaller, fine-tuned components? That might align with my need for lightweight personalization.
AI Workers Architecture
With this approach, would it support:
  • Custom fine-tuning atop foundational models via consensus-verified diffs? I’m picturing a way to tweak models on-chain without full retraining.
  • ZK proofs for inference integrity instead of replicating the entire model across nodes? This could cut overhead while keeping things secure and verifiable.
Training Horizon
Since some on-chain training is in scope for us, we’re curious about deterministic SGD possibilities:
  • Is there research into gradient averaging with fixed-point rounding guarantees? This could ensure reproducible results across canisters.
  • Any exploration of federated learning patterns using canister-to-device coordination? It might fit our privacy-first model if we can offload some computation to user devices.
I’d love to hear your thoughts on how feasible this is with ICP’s current trajectory—or if there’s a better angle to explore. Thanks again for your time!
Cheers,
Kurt

Yes, you are right. I misunderstood this. I asked You AI. See especially the answer to my follow-up question.

1 Like

@ielashi

Good morning,

Your mention of the AI workers approach prompted me to explore parallel execution further. To that end, I’ve developed a conceptual Motoko implementation, inspired by PyPIM’s principles of deterministic, in-memory AI operations. Please see the code below; it focuses on parallel matrix multiplication via inter-canister parallelism and deterministic chunking. I believe its design, which distributes computation across child canisters and maintains data in memory, could be promising. Might this approach, splitting an 8GB model across canisters, operate effectively within the node constraints you outlined? Or would the I/O limitations still pose a significant challenge?
For your reference, here is the implementation I’ve created:

import Array "mo:base/Array";
import Iter "mo:base/Iter";
import Buffer "mo:base/Buffer";
import Result "mo:base/Result";
import Nat "mo:base/Nat"; // needed for Nat.min below

actor class ParentCanister() = this {
type Matrix = [[Float]];
type ChunkResult = { #ok : (Nat, Nat, [[Float]]); #err : Text };

// Step 1: Deterministic chunking
func chunkMatrix(matrix : Matrix, chunkSize : Nat) : [[Matrix]] {
let rows = matrix.size();
let cols = if (rows > 0) matrix[0].size() else 0;

Array.tabulate(
  (rows + chunkSize - 1) / chunkSize,
  func(i : Nat) : [Matrix] {
    Array.tabulate(
      (cols + chunkSize - 1) / chunkSize,
      func(j : Nat) : Matrix {
        let startRow = i * chunkSize;
        let endRow = Nat.min(startRow + chunkSize, rows);
        
        Array.tabulate(
          endRow - startRow,
          func(r : Nat) : [Float] {
            let startCol = j * chunkSize;
            let endCol = Nat.min(startCol + chunkSize, cols);
            Array.tabulate(
              endCol - startCol,
              func(c : Nat) : Float { matrix[startRow + r][startCol + c] }
            )
          }
        )
      })
  })

};

// Step 2: Distributed computation
public shared func distributedMatMul(
a : Matrix,
b : Matrix,
chunkSize : Nat,
children : [actor { computeChunk : (Matrix, Matrix) -> async ChunkResult }]
) : async Result.Result<Matrix, Text> {
let aChunks = chunkMatrix(a, chunkSize);
let bChunks = chunkMatrix(b, chunkSize);

if (aChunks.size() != bChunks.size()) {
  return #err("Matrix dimensions incompatible for multiplication");
};

let resultChunks = Buffer.Buffer<(Nat, Nat, [[Float]])>(aChunks.size() ** 2);

// Fan-out to child canisters
let futures = Buffer.Buffer<async ChunkResult>(aChunks.size());
for (i in Iter.range(0, aChunks.size()-1)) {
  for (j in Iter.range(0, aChunks[i].size()-1)) {
    let child = children[(i * aChunks.size() + j) % children.size()];
    futures.add(
      child.computeChunk(aChunks[i][j], bChunks[j][i]) // Transpose B
    );
  };
};

// Collect and validate results, awaiting each future in a fixed order
let collected = Buffer.Buffer<(Nat, Nat, [[Float]])>(futures.size());
for (f in futures.vals()) {
  switch (await f) {
    case (#ok res) { collected.add(res) };
    case (#err _) {};
  };
};
let results = Buffer.toArray(collected);

// Step 3: Deterministic aggregation
let sorted = Array.sort(
  results,
  func(a : (Nat, Nat, [[Float]]), b : (Nat, Nat, [[Float]])) : { #less; #equal; #greater } {
    if (a.0 < b.0) #less
    else if (a.0 > b.0) #greater
    else if (a.1 < b.1) #less
    else if (a.1 > b.1) #greater
    else #equal
  }
);

// Reconstruct final matrix
let totalRows = a.size();
let totalCols = if (b.size() > 0) b[0].size() else 0;
// Mutable inner arrays so chunk values can be written in place, then frozen
let result = Array.tabulate<[var Float]>(totalRows, func(_) = Array.init<Float>(totalCols, 0.0));

for ((i, j, chunk) in sorted.vals()) {
  for (row in Iter.range(0, chunk.size()-1)) {
    for (col in Iter.range(0, chunk[row].size()-1)) {
      let globalRow = i * chunkSize + row;
      let globalCol = j * chunkSize + col;
      result[globalRow][globalCol] := chunk[row][col];
    };
  };
};

#ok(Array.map<[var Float], [Float]>(result, func(r) = Array.freeze(r)))

};
};

actor class ChildCanister() = this {
// Step 4: SIMD-style batch processing
public func computeChunk(a : [[Float]], b : [[Float]]) : async {
#ok : (Nat, Nat, [[Float]]);
#err : Text
} {
if (a.size() == 0 or b.size() == 0 or a[0].size() != b.size()) {
return #err("Invalid chunk dimensions");
};

let result = Array.tabulate<[Float]>(a.size(),
  func(i : Nat) : [Float] {
    Array.tabulate<Float>(b[0].size(),
      func(j : Nat) : Float {
        var sum = 0.0;
        for (k in Iter.range(0, a[0].size()-1)) {
          sum += a[i][k] * b[k][j];
        };
        sum
      })
  });

// Return with original chunk coordinates
#ok(0, 0, result) // Actual coordinates would need tracking

};
};

This implementation incorporates deterministic chunking, inter-canister parallelism, and a sorted aggregation process. For production use, I anticipate enhancements such as memory management and cycle cost accounting would be necessary.

Building on your previous insights, I have a few additional questions:
Model Swapping Optimisation
Regarding your AI workers, are you considering sharding model layers across nodes to reduce memory demands? This could align with my chunking strategy.

Would quantised layers or LoRA adaptors enable base models to remain resident while swapping smaller components? This might suit my fine-tuning requirements.

AI Workers Architecture
Could your approach accommodate fine-tuning through consensus-verified differences? My chunked updates share some similarities with this concept.

Is there potential to use zero-knowledge proofs for inference integrity rather than full model replication? This could enhance efficiency while maintaining security.

Training Horizon
Given my interest in on-chain training, has there been exploration of deterministic stochastic gradient descent with fixed-point rounding for gradient averaging? My aggregation method seeks to ensure such consistency.

Are there plans to investigate federated learning via canister-to-device coordination? This could support privacy while distributing computational load.

Sorry for all the questions; I appreciate you are all busy.

Cheers
Kurt

Distributed DataFrame Processor:

import Array "mo:base/Array";
import Buffer "mo:base/Buffer";
import Result "mo:base/Result";
import Float "mo:base/Float";
import Nat "mo:base/Nat";
import Nat8 "mo:base/Nat8";
import Nat32 "mo:base/Nat32"; // needed for the byte/char conversion in parseUntil
import Char "mo:base/Char"; // needed for Char.toNat32 in parseUntil
import Cycles "mo:base/ExperimentalCycles";
import Time "mo:base/Time";
import Blob "mo:base/Blob";
import Text "mo:base/Text";
import BTree "mo:base/BTree"; // assumes a BTree library; mo:base itself has no BTree module
import Iter "mo:base/Iter";
import Hash "mo:base/Hash";
import Error "mo:base/Error";

actor class SmallPond() = this {
type Column = {
name : Text;
dtype : Text; // Added data type support
values : [Float]
};

type DataFrame = {
columns : [Column];
rowCount : Nat;
chunkIndex : Nat; // Track chunk position
};

type Metrics = {
executionTime : Int;
cyclesUsed : Nat;
memoryUsed : Nat;
chunksProcessed : Nat;
compressionRatio : Float; // New metric
};

// Stable storage
stable let CHUNK_SIZE = 1000; // Rows per chunk
stable let MAX_METRIC_ENTRIES = 100;
stable var dataStore = BTree.init<Text, Blob>(Text.compare);
// Buffers are not stable types, so the metrics log lives in ordinary heap memory
var metricsLog = Buffer.Buffer<Metrics>(MAX_METRIC_ENTRIES);

let CYCLE_BUDGET_PER_CHUNK : Nat = 10_000_000_000; // 10B cycles per chunk

// Improved serialization with header metadata
func serializeDataFrame(df : DataFrame) : Blob {
let header = Text.encodeUtf8(
"DFv1|" # Nat.toText(df.rowCount) # "|" # Nat.toText(df.chunkIndex) # "|"
);

let columnBuffer = Buffer.Buffer<Nat8>(1024);
for (col in df.columns.vals()) {
  let colHeader = Text.encodeUtf8(
    col.name # "|" # col.dtype # "|" # Nat.toText(col.values.size()) # "|"
  );
  columnBuffer.append(colHeader);
  
  // Pack floats as 4-byte little-endian
  // (Float.toBytes / Float.fromBytes are assumed helpers; mo:base/Float has no byte conversion)
  for (v in col.values.vals()) {
    let bytes = Float.toBytes(v);
    columnBuffer.append(Blob.toArray(bytes));
  }
};

Blob.fromArray(Array.append(Blob.toArray(header), columnBuffer.toArray()))

};

func deserializeDataFrame(blob : Blob) : Result.Result<DataFrame, Text> {
try {
let data = Blob.toArray(blob);
// Split off the 5-byte "DFv1|" magic header
let header = Array.subArray(data, 0, 5);
let rest = Array.subArray(data, 5, data.size() - 5);
if (Text.decodeUtf8(Blob.fromArray(header)) != ?"DFv1|") {
return #err("Invalid file format");
};

  // Parse header
  let (rowCountStr, rest) = parseUntil(rest, '|');
  let (chunkStr, rest) = parseUntil(rest, '|');
  
  let ?rowCount = Nat.fromText(rowCountStr) else return #err("Invalid row count");
  let ?chunkIndex = Nat.fromText(chunkStr) else return #err("Invalid chunk index");
  
  var remaining = rest;
  let columns = Buffer.Buffer<Column>(10);
  
  while (remaining.size() > 0) {
    let (name, rem1) = parseUntil(remaining, '|');
    let (dtype, rem2) = parseUntil(rem1, '|');
    let (valCountStr, rem3) = parseUntil(rem2, '|');
    
    let ?valCount = Nat.fromText(valCountStr) else return #err("Invalid value count");
    let floatBytes = 4 * valCount;
    if (rem3.size() < floatBytes) return #err("Incomplete data");
    
    let values = Buffer.Buffer<Float>(valCount);
    for (i in Iter.range(0, valCount-1)) {
      let start = i * 4;
      let bytes = Array.subArray(rem3, start, 4);
      values.add(Float.fromBytes(Blob.fromArray(bytes)));
    };
    
    columns.add({
      name = Text.decodeUtf8(Blob.fromArray(name)) 
        else return #err("Invalid column name");
      dtype = Text.decodeUtf8(Blob.fromArray(dtype)) 
        else return #err("Invalid dtype");
      values = values.toArray();
    });
    
    remaining := Array.subArray(rem3, floatBytes, rem3.size() - floatBytes);
  };
  
  #ok({
    columns = columns.toArray();
    rowCount;
    chunkIndex;
  })
} catch e {
  #err(Error.message(e))
}

};

// Chunk-aware write operation
public shared({ caller }) func writeParquet(
df : DataFrame,
fileKey : Text
) : async Result.Result<(), Text> {
let startTime = Time.now();
var totalCyclesUsed = 0;

let totalChunks = df.rowCount / CHUNK_SIZE + (if (df.rowCount % CHUNK_SIZE > 0) 1 else 0);
var cyclesBudget = CYCLE_BUDGET_PER_CHUNK * totalChunks;
Cycles.add(cyclesBudget);

try {
  // Split into chunks
  for (chunkIdx in Iter.range(0, totalChunks - 1)) {
    let chunkStart = chunkIdx * CHUNK_SIZE;
    let chunkEnd = Nat.min(chunkStart + CHUNK_SIZE, df.rowCount);
    
    let chunk = {
      columns = Array.map(df.columns, func (col) {
        let values = Array.subArray(col.values, chunkStart, chunkEnd - chunkStart);
        { col with values }
      });
      rowCount = chunkEnd - chunkStart;
      chunkIndex = chunkIdx;
    };
    
    // Check cycle budget
    if (Cycles.available() < CYCLE_BUDGET_PER_CHUNK) {
      throw Error.reject("Insufficient cycles for chunk processing");
    };
    
    let serialized = serializeDataFrame(chunk);
    dataStore := BTree.insert(dataStore, Text.compare, 
      makeChunkKey(fileKey, chunkIdx), serialized);
    
    totalCyclesUsed += (Cycles.available() - Cycles.balance());
    Cycles.refund(Cycles.balance()); // Return unused cycles
  };
  
  logMetrics(startTime, totalCyclesUsed, totalChunks, "write");
  #ok(())
} catch e {
  #err(Error.message(e))
}

};

// Parallel read with chunk aggregation
public shared query func readParquet(
fileKey : Text
) : async Result.Result<DataFrame, Text> {
let startTime = Time.now();
let chunks = BTree.scan(dataStore, Text.compare,
func(k, v) {
if (Text.startsWith(k, #text fileKey)) ?v else null
});

let dfBuffer = Buffer.Buffer<DataFrame>(10);
var totalRows = 0;

for (chunk in chunks) {
  switch (deserializeDataFrame(chunk)) {
    case (#ok(df)) {
      dfBuffer.add(df);
      totalRows += df.rowCount;
    };
    case (#err(msg)) return #err("Invalid chunk: " # msg);
  };
};

if (dfBuffer.size() == 0) return #err("File not found");

// Merge chunks
let first = dfBuffer.get(0);
let merged = {
  columns = Array.map(first.columns, func (col) {
    let values = Buffer.Buffer<Float>(totalRows);
    for (df in dfBuffer.vals()) {
      let found = Array.find(df.columns, func (c) = c.name == col.name);
      switch (found) {
        case (?c) values.append(c.values);
        case null return { col with values = [] }; // Error case
      };
    };
    { col with values = values.toArray() }
  });
  rowCount = totalRows;
  chunkIndex = 0;
};

logMetrics(startTime, 0, dfBuffer.size(), "read");
#ok(merged)

};

// Improved metrics with compression stats
func logMetrics(
startTime : Int,
cyclesUsed : Nat,
chunks : Nat,
operation : Text
) {
let entry = {
executionTime = Time.now() - startTime;
cyclesUsed;
memoryUsed = chunks * CHUNK_SIZE * 4; // Approx bytes
chunksProcessed = chunks;
compressionRatio = calculateCompressionRatio();
};

if (metricsLog.size() >= MAX_METRIC_ENTRIES) {
  metricsLog.remove(0);
};
metricsLog.add(entry);

};

func calculateCompressionRatio() : Float {
// Implement actual compression ratio calculation
0.5 // Placeholder
};

// Helper functions
func makeChunkKey(base : Text, idx : Nat) : Text {
base # "_chunk" # Nat.toText(idx)
};

func parseUntil(data : [Nat8], term : Char) : (Text, [Nat8]) {
// Scan for the terminator byte, then split the buffer around it
let termByte = Nat8.fromNat(Nat32.toNat(Char.toNat32(term)));
var idx = 0;
while (idx < data.size() and data[idx] != termByte) { idx += 1 };
let head = Array.subArray(data, 0, idx);
let text = switch (Text.decodeUtf8(Blob.fromArray(head))) {
case (?t) t;
case null "";
};
if (idx < data.size()) {
(text, Array.subArray(data, idx + 1, data.size() - idx - 1))
} else { (text, [] : [Nat8]) }
};
};

I see SIMD mentioned, maybe this is relevant: https://internetcomputer.org/docs/building-apps/network-features/simd
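
For a rough idea, here is a minimal sketch (my own, not taken from that page) of what the simd128 intrinsics look like in Rust when a canister is compiled for wasm32 with the simd128 target feature enabled:

// Four-lane f32 dot product using core::arch::wasm32 SIMD intrinsics.
#[cfg(target_arch = "wasm32")]
fn dot4(a: [f32; 4], b: [f32; 4]) -> f32 {
    use core::arch::wasm32::{f32x4, f32x4_extract_lane, f32x4_mul};

    let prod = f32x4_mul(f32x4(a[0], a[1], a[2], a[3]), f32x4(b[0], b[1], b[2], b[3]));
    f32x4_extract_lane::<0>(prod)
        + f32x4_extract_lane::<1>(prod)
        + f32x4_extract_lane::<2>(prod)
        + f32x4_extract_lane::<3>(prod)
}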

Given the breadth of the questions you ask, how about we have a call, learn about your use-case, and discuss the points you mentioned? Feel free to DM me.

1 Like

How does JAM, a new technology developed by Polkadot, fix all our limitations? This is something that a developer said.

So on ICP we can execute just 1B parameters, when ChatGPT models are 1.8T parameters; is this a joke? And we don’t even have a solution yet? How are we gonna be the world computer and solve AI security issues with 1B-parameter execution?

The real-world use is about finally making decentralized computing competitive with centralized cloud solutions, not just creating a decentralized cloud computer that can host regular things but can’t actually run them at global-adoption-level speeds.

JAM provides drastically more computing power than any other blockchain today, making it able to actually compete with web2 options in terms of computing power and costs.

While it will cost more for now to use Polkadot, the added bonuses for many will vastly outweigh the costs. You’re paying for added security and verification.

Why don’t I just list some things again.

General Enterprise Adoption: Companies can now run complex computations in a decentralized way, without depending on AWS, Google, or Microsoft.

AI & Machine Learning: Training and running decentralized AI models on-chain, with verifiable, tamper-proof execution. (This cannot be done on other chains without costing millions and requiring an ungodly amount of time to train.)

Decentralized Finance at Scale: No more gas limitations, allowing high-frequency trading, real-time risk analysis, and automated financial modeling directly on-chain.

Gaming & Metaverse: JAM can actually compute on-chain physics engines, complex game logic, and AI-driven NPCs, all running natively at near-hardware speeds. (No other chain can even remotely come close to running on-chain physics engines, complex game logic, and AI-driven NPCs all simultaneously, cheaply enough and at fast enough speeds.)

Sorry to say this, but that “solution”, those off-chain agents, is useless; it’s what the whole industry, web2 and web3, is already doing. So we enter a phase where ICP isn’t at the forefront; instead we have fallen many steps back.

I suggested a solution to @ielashi on Friday, but understandably he is busy.
Also it was off-topic for that thread, so I’ll post it again in full.

PyPIM

  1. Matrix Multiplication Bottleneck (99% of Inference Time)
    PyPIM’s Plan: Do it all in memory, parallel style.
    Bit-Parallel Element-Parallel Arithmetic
    Cuts 32-bit multiplication lag by 14× compared to slow bit-serial stuff (Sec. II-B).
    Brings the complexity down to O(N log N) for N-bit jobs with clever crossbar tricks (Fig. 4).
    Deterministic SIMD-Like Parallelism
    Runs 1,024x1,024 NOR gates at once per crossbar (Sec. III-D).
    Spot on for tensor stuff like GEMM that AI networks need.
    With PyPIM:
    You’re looking at 10-100x faster matrix work—CPUs and GPUs can’t touch that.
    No more waiting for data to move around for weight-activation multiplies.
    Without PyPIM:
    You’re stuck, bru. Matrix math takes forever—99% of inference time just sitting there. GPUs help a bit, but the data shuffle kills you.
  2. I/O Bottleneck (Model Loading)
    PyPIM’s Fix: Keep it in memory, no stress.
    Pre-Loaded Model Weights
    Weights stay put in PIM memory with resident tensor allocation (Sec. V-A):
    Python: model_weights = pim.ones((1024, 1024), dtype=pim.float32) # Stays there, no moving.
    No need to keep shifting stuff between VRAM and CPU.
    Quantization & Partitioning
    Uses 4-bit quantized KV caches on separate crossbars (Sec. VI-A).
    Splits layers across crossbars so each node isn’t sweating too much.
    With PyPIM:
    Loading models? Sorted. Weights chill in memory, ready to go.
    Quantization keeps it light—perfect for quick inference.
    Without PyPIM:
    You’re dragging weights around like a chop—VRAM bottlenecks and CPU swaps make it painful.
    No quantization means bigger memory headaches.
    Downside:
    You need memristive hardware, and ICP nodes don’t have it yet. No hardware, no party.
  3. Scalability to Large Models (~70B Parameters)
    PyPIM’s Shot: Scale up, if you can.
    Inter-Crossbar H-Tree Communication (Sec. III-F)
    Moves data between crossbar groups all hierarchical-like.
    Example: Spread transformer layers over 1,024 crossbars, easy peasy.
    Hybrid On/Off-Chain Workflows
    Push heavy ops (like attention) to PIM, let CPU handle embeddings.
    With PyPIM:
    Could handle big models like 70B params if you’ve got enough crossbars.
    Hybrid setup’s nice—mixes PIM speed with CPU backup.
    Without PyPIM:
    Scaling’s a mission. CPUs and GPUs battle with memory, and splitting things up gets messy.
    No in-memory vibes means you’re forcing it the hard way.
    Problem:
    Paper’s only got 8GB memristive memory (Table III)—no chance for 70B models needing 140GB+ (fp16).
    Want 100B+ params? You’ll need 3D stacked crossbars, and the paper’s quiet about that.
  4. Determinism for ICP Consensus
    PyPIM’s Strength: Keeps it predictable.
    Fixed-Order Parallelism
    Sorts batches with hashes for consistency (Sec. VI-A):
    Rust: let batches = inputs.chunks(100).sorted_by_key(|b| blake3(b)); // No funny business.
    Bitwise-Identical Results—same output every time across nodes.
    With PyPIM:
    ICP’s consensus loves it—no drift between replicas.
    Built for blockchain-level AI work.
    Without PyPIM:
    Multi-threading’s a gamble—randomness sneaks in, and ICP’s consensus takes strain.
    You’re fixing stuff with duct tape instead of doing it proper.
    Key Challenges for ICP Adoption
    Hardware Dependency
    With: PyPIM needs memristive crossbars—ICP’s nodes aren’t there yet.
    Without: DFinity’s AI workers keep things going, but it’s not the same.
    Model Size vs. Memory Limits
    With: 8GB fits small ~1B param models (fp32)—70B’s way out of reach.
    Without: Bigger models work on normal setups, but I/O and compute suffer.
    Software Integration
    With: PyPIM’s NumPy-style API is cool (Fig. 12):
    Python: z = pim.matmul(x, y) # In-memory, sorted.
    But rewriting PyTorch? Jislaaik, that’s work.
    Without: Stick with PyTorch—slow but no hassle.
    Wrap-Up
    With PyPIM (if ICP gets memristive hardware):
    Compute: 10-100x faster matrix ops—inference flies.
    I/O: Weights stay in memory—no more loading lag.
    Determinism: ICP’s consensus stays tight.
    Future: 3D stacks could handle >1B param models one day.
    Without PyPIM:
    Compute: Matrix bottlenecks keep you waiting—CPUs/GPUs can’t hack it.
    I/O: Model loading’s a drag—VRAM shuffle slows you down.
    Scalability: Big models give you grief without in-memory help.
    Now: DFinity’s AI workers and some tweaks keep it running, just.
    What to Do:
    With: Chat to the PyPIM okes—build a hybrid CPU-PIM setup, test it with Llama 2-7B, and take it from there.
    Without: Lean on AI workers, hope for no hiccups, and wait for something better.

Kurt

What We’re Upgrading To
PyPIM needs memristive crossbar hardware—think fancy memory chips that compute right where the data sits (Sec. III, PyPIM paper). Current ICP nodes don’t have this—they’re beefy servers with 64 CPU cores, 512GB RAM, and 30TB NVMe SSDs (per ICP node specs). To run PyPIM, you’d either:
Add memristive modules to existing nodes (like a GPU-style upgrade).

Replace the whole rig with a new design built around memristive tech.

No clue if ICP nodes have spare slots for this, so we’ll explore both options. Costs depend on hardware, installation, and whatever else these node providers have to deal with.
Option 1: Add Memristive Modules
What’s Needed:
Memristive crossbars (e.g., 1,024x1,024 arrays per Sec. III-D, PyPIM paper).

Interface hardware (controllers, wiring) to hook it into the current setup.

Maybe extra power or cooling—memristors are low-power, but still.

Cost Breakdown:
Memristive Chips: These aren’t mass-market yet. Research-grade memristor arrays (like from Knowm or Crossbar Inc.) can run $100-$1,000 per chip depending on size and yield. Let’s say each crossbar’s 1,024x1,024—call it $500 a pop. PyPIM sims use multiple crossbars (8GB total, Table III), so maybe 10-20 chips per node. That’s $5,000-$10,000 for the arrays.

Controllers: Think FPGA or ASIC to manage the crossbars—$500-$2,000 per node, depending on complexity. Say $1,000.

Installation: Labour, testing, downtime. Node providers pay data centre fees (OPEX), so tack on a few hours at $50-$100/hour—let’s call it $500.

Power/Cooling Upgrades: Memristors sip power (way less than GPUs), so maybe $200-$500 if you need a tweak.

Total Per Node:
Low end: $5,000 (chips) + $1,000 (controller) + $500 (install) + $200 (extras) = ~$6,700.

High end: $10,000 + $2,000 + $500 + $500 = ~$13,000.

Catch: This assumes the current ICP node (e.g., Dell PowerEdge 6525) can handle add-ons. If it’s maxed out on slots or power, you’re stuffed—this won’t fly.
Option 2: Full Node Replacement
What’s Needed:
A new server built from scratch with memristive tech as the core—CPUs, RAM, SSDs, plus the crossbars.

Same 8GB memristive capacity (PyPIM’s baseline) but scaled for ICP’s 512GB RAM and 30TB storage needs.

Cost Breakdown:
Current ICP Node Cost: Gen-1 nodes (e.g., PowerEdge 6525) are high-end—retail’s ~$20,000-$30,000 each (CPU, RAM, SSDs, etc.). Node providers might get bulk deals, say $15,000-$25,000.

Memristive Overhaul: Swap some RAM/CPU grunt for crossbars. Keep 30TB SSDs ($5,000-$7,000), drop to 256GB RAM ($2,000), smaller CPU (32 cores, $1,500), then add $5,000-$10,000 in crossbars and $1,000 controller. Base chassis and extras (power, cooling, networking) ~$5,000.

New Total: $5,000 (SSDs) + $2,000 (RAM) + $1,500 (CPU) + $7,500 (crossbars avg) + $1,000 (controller) + $5,000 (base) = ~$22,000-$27,000.

Delta: If a current node’s $20,000, the upgrade cost is the difference—$2,000-$7,000 extra per node.

Catch: This assumes memristive tech scales cheaply. If 8GB jumps to 512GB-equivalent compute capacity (PyPIM’s dream), crossbar costs could balloon—$50,000+ easy. No data on that yet.
Other Costs
Shipping: Nodes go to data centres worldwide. $500-$1,000 per node, depending on distance.

NNS Approval: Node providers submit proposals to ICP’s Network Nervous System—free, but time’s a factor.

OPEX: Data centre rent, power, bandwidth. Memristors might drop power costs (less heat than CPUs), but base OPEX is $1,000-$2,000/month per node—unchanged unless scaled up.

How Many Nodes?
ICP’s got hundreds of nodes across subnets (exact count’s fuzzy—DFinity’s tight-lipped, but think 300-500 based on subnet topology). Per node cost scales network-wide:
Option 1: $6,700-$13,000 × 400 nodes = $2.7M-$5.2M.

Option 2 Delta: $2,000-$7,000 × 400 = $0.8M-$2.8M.

Reality Check
Availability: Memristive hardware’s still niche—HP and Intel have prototypes, but mass production’s years off. Costs could drop (or spike) by 2025.

ICP Fit: PyPIM’s 8GB sim (Table III) is tiny vs. ICP’s 512GB RAM nodes. Scaling memristors to match might push costs way higher—$50,000-$100,000 if you need hundreds of crossbars.

Rewards: Node providers earn ~$1.5M/month network-wide in ICP tokens (pegged to USD/SDR). A $13,000 upgrade pays back quick if rewards hold—months, not years.

Final Guess
Per Node Cost:
Add-On: $6,700-$13,000 if it’s just modules. Lekker if it works.

Replacement Delta: $2,000-$7,000 on top of current cost—cheaper but riskier.

Total Network Cost: Millions, depending on scale and adoption.