Parallel execution for canisters

I did explore optimizing the execution of AI inference, and one optimization that would really make a difference is support for multi-core matrix multiplication, as that’s where ~99% of all inference computation lies. Having said that, even with that optimization, don’t expect to be able to run models bigger than ~1B parameters efficiently; for those we’d really need beefier hardware than the current nodes have.
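
Just to illustrate what “multi-core matrix multiplication” means in practice, here’s a minimal sketch in Rust using the rayon crate to parallelize the output rows across cores. The crate choice, row-major layout, and function name are my own assumptions for the example, not part of any existing canister API:

```rust
use rayon::prelude::*;

/// Multiply an (m x k) matrix `a` by a (k x n) matrix `b`, both stored
/// row-major as flat slices, returning the (m x n) result.
/// Each output row is computed independently, so rows are distributed
/// across all available cores.
fn matmul_parallel(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    out.par_chunks_mut(n) // one chunk per output row
        .enumerate()
        .for_each(|(i, row)| {
            for p in 0..k {
                let a_ip = a[i * k + p];
                for j in 0..n {
                    row[j] += a_ip * b[p * n + j];
                }
            }
        });
    out
}
```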

Computation aside, there’s also the I/O problem: even with beefier hardware that allows us to run bigger models, swapping these models in and out of VRAM could be very expensive (a 70B-parameter model would need anywhere between 10 and 20 seconds just to be loaded into VRAM). That’s a challenge we’d still face even with beefier hardware.
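
As a rough back-of-the-envelope for where that 10–20 seconds comes from (the byte width and bandwidth figures are assumptions for illustration, not measurements of our nodes):

```rust
// Rough load-time estimate for a 70B-parameter model.
fn main() {
    let params: f64 = 70e9;              // 70B parameters
    let bytes_per_param: f64 = 2.0;      // fp16 weights (assumed)
    let model_bytes = params * bytes_per_param;  // ~140 GB
    let bandwidth: f64 = 10e9;           // ~10 GB/s effective host-to-VRAM (assumed)
    let seconds = model_bytes / bandwidth;
    println!("~{:.0} s just to move the weights into VRAM", seconds); // ~14 s
}
```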

That’s why for now we’re exploring the approach of AI workers as outlined here. With this approach, we’d support a handful of foundational LLMs rather than letting anyone run any model. It’s a limitation, but it allows the workers to keep these LLMs consistently loaded in RAM for faster inference.
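
Purely as an illustration of the “load once, keep resident” idea (all names and types here are hypothetical, not the actual worker design):

```rust
use std::collections::HashMap;

/// Stand-in for model weights that stay resident in RAM for the worker's
/// lifetime, so no per-request load cost is paid.
struct LoadedModel {
    name: String,
    weights: Vec<u8>,
}

/// A worker that serves only a fixed set of foundation models, loaded once at startup.
struct AiWorker {
    models: HashMap<String, LoadedModel>,
}

impl AiWorker {
    fn new(supported: &[&str]) -> Self {
        let models = supported
            .iter()
            .map(|name| {
                // In reality this would read the weight files from disk once here.
                (name.to_string(), LoadedModel { name: name.to_string(), weights: Vec::new() })
            })
            .collect();
        AiWorker { models }
    }

    /// Inference requests only ever hit models that are already in memory.
    fn infer(&self, model: &str, _prompt: &str) -> Option<String> {
        self.models.get(model).map(|m| format!("response from {}", m.name))
    }
}
```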

Are you interested specifically in training?
