The future: Deterministic parallelism within a canister? (i.e. multi-core and many-core canisters)

I’ve spent a lot of time thinking about the absence of parallel computation within a single query or update call to a canister (and more generally, in programming languages). I thought I’d think aloud here in case someone has an enlightening contribution to make. This is just my individual reflection on the challenge of providing the simplest possible infrastructure for scalable programs. There are obviously bigger fish to fry for the Dfinity team right now.

So first of all, there is an obvious need for parallelism (N.B: not concurrency) within a single query/update call in order to maximise the transparent scalability of an app. This is particularly true of update calls, since they must be serialized, whereas multiple query calls can already run in parallel. There is a lot of untapped hardware potential – a CPU has many cores, and a GPU has even more – but unfortunately, Wasm code today runs single-threaded. It would be extremely useful to be able to accelerate our update calls by an order of magnitude or two, but the Internet Computer needs to expose the right constructs to make this possible. We need to offer developers a set of constructs that the infrastructure knows how to parallelize.

We don’t need concurrency constructs to express parallelism. Observable concurrency in a parallelized program is merely an exposed execution detail, and should be avoided at any cost. It’s particularly pernicious for the IC because observable concurrency is usually non-deterministic (esp. fine-grained CPU concurrency), and thus isn’t practical in the presence of canister replicas.

But isn’t observable concurrency a necessary foundation for expressing parallelism? No, it isn’t necessary at the software level, though it’s unfortunately a commonly-chosen foundation within traditional programming systems. Whilst many designs involve exposed concurrency-management constructs (e.g. threads, locks, messages, queues, transactional memory), this is an unnecessary delegation of work from the infrastructure to the developer. Parallelized programs have the same semantics as serial programs, i.e. they compute the same result. Thus, the aforementioned concurrency constructs can’t possibly be essential features of a parallel system, since they don’t add new capabilities. They’re optional features, for scenarios where developers want to obsess over the fine details of hardware execution.

What high-level primitives should we then be exposing that allow auto-parallelization by the infrastructure? Well, this is the part I can’t answer completely, but I have a rough idea. There are common idioms like MapReduce (and functional transformations like map/filter) that hint at what we need. We need the IC to expose constructs that express transformations over data sets, the key word being set. Operations can be parallelized when execution order is flexible, and this predominantly occurs when data is stored in a set (i.e. is unordered), or the operation being applied to the data treats the data as if it were a set (as a “map” function does).

Great! So let’s write a bunch of set transformations. This is where things get a little disappointing. Low-level imperative programming languages (such as Wasm) don’t know the meaning of a set. Every operation is conceptually performed in a total order, and operates over a totally-ordered data store (i.e. linear memory). You can’t even express the doubling of array elements without a sequential loop that says “start at 0 and go up”, or something similar. Modern CPUs actually do the best they can to find parallelism amongst this total ordering of instructions (this logic actually takes up most of the non-cache chip area of a CPU), but they only have a tiny window within which they can “see” opportunities to parallelize. To parallelize programs as much as is possible, we need language constructs that can be mapped intelligently to multi-core hardware. This would have the side benefit of actually allowing CPU designers to innovate on chip architecture, since they wouldn’t have to worry about “nobody knows how to use more cores effectively”.

The Dfinity team is going to have to solve the problem of deterministic parallelism eventually, and I think it will have to entail the exposure of high-level parallelizable constructs to developers. I would be very surprised if there’s an efficient hardware-level solution for synchronizing replicas, and either way, in the year 2020 I think developers are finally ready for high-level constructs to be the default means of boosting hardware performance, not low-level thread/lock/queue/memory-wrangling. I hope there will be a research project at Dfinity along these lines at some point!

Anyway, for those readers who know much about parallel programming, feel free to add your thoughts (or tell me I’m mistaken about something). I don’t expect to get much of a response here really, because the Dfinity forum is still a bit of a graveyard. But I just felt like brain-dumping, and maybe I can plant a seed in somebody’s mind.

(Note also: I’m not referencing the actor model of the IC at all here. Actors are useful and necessary for expressing the essential concurrency that is observable in a distributed system manipulated by billions of people. But that usage of the actor model isn’t (purely) about “boosting performance”. A usage of actors purely for parallelization is equivalent to threads with thread-local memory and message queues, and thus I consider it to be unnecessarily low-level.)

8 Likes

Thanks for the thoughtful notes! Not as an official answer, but SIMD is likely a low hanging fruit, and is already available for some Wasm implementations. Or run a multi-canisters setup, which is a bit heavy handed at the moment because that is not the same as instantiating a set of tasks, run and delete them afterwards.

2 Likes

Oh, yes I forgot to mention SIMD! It would definitely help, though it doesn’t scale up to the equivalent parallelism of a 32-core CPU or 1024-core GPU.

Multi-canister setups unfortunately make achieving parallelism a programmer’s problem :confused:, and a distributed programming problem at that. Not to mention that communication latencies go from nanoseconds to milliseconds (1 million times higher).

4 Likes