Performance motion proposal
Objective
The current design and implementation of the IC protocol has been focused on simplicity and low engineering effort. While this suffices to demonstrate the possibilities of canister smart contracts, unleashing the full capacity of the IC requires a high-performance design and implementation. The goal of this project is to address bottlenecks and thus increase the performance of individual IC nodes, so that more query and update requests can be executed in the same amount of time.
This will involve improvements to areas such as the orthogonal persistence mechanism, NIC virtualization, OS and canister scheduling, and caching, as well as the investigation of hardware accelerators for compute-intensive tasks.
Background
In order to handle growing demand for new and more elaborate applications, the IC needs to be able to scale to those new requirements.
Broadly speaking, there are two approaches to scalability: a system can scale out by adding more hardware (in our case, more subnetworks). Scaling out often goes hand in hand with some form of partitioning, as parts of the extra hardware are supposed to run independently of each other. This partitioning often implies a loss of locality, since application logic may depend on state held remotely.
In contrast, systems can also scale up by increasing the performance of individual machines by making them more powerful or achieving better resource utilization.
Both are important. This proposal focuses on scaling up; a separate proposal covers scaling out.
Motivation
One approach to scaling the IC is to scale out:
- By adding more subnetworks to increase update request rate
- By adding more nodes to increase query request rate
Scaling out, however, comes at a cost due to less locality and extra communication across subnetworks:
- Xnet messages to communicate across subnetwork boundaries
- Increased message complexity for Consensus with more nodes per subnetwork
Because of these costs, relying exclusively on scaling out to increase IC capacity is insufficient. It is also important to scale up the performance of individual nodes to achieve higher query/update request rates without adding nodes.
This also improves machine utilization and reduces hardware operating costs and energy consumption.
Topics under this project
In order to achieve our goals, the current bottlenecks must first be investigated, followed by the design and implementation of suitable mitigation strategies. This will involve research and development effort in many components and require a diverse set of engineers and researchers with expertise in Distributed Systems, Cryptography, Performance Analysis and Management, Operating Systems, and Networking, and is expected to take multiple years.
The following are some examples of areas for optimizations that promise to improve IC node performance:
- Orthogonal persistence (OP): the IC uses orthogonal persistence as a programming abstraction to simplify memory management for canister smart contract developers. The IC's OP implementation also determines resource accounting, since canisters must be charged for the memory they use.
- Locality of data and code: TLB misses and page faults are extremely expensive on modern hardware and can severely hurt overall application performance. Existing code has to be profiled to find the places where these events hurt performance.
- Synchronization primitives, lock contention: The IC is a platform that exhibits a large degree of parallelism, as requests to different canisters and query calls to the same canister can be executed in parallel. With a high degree of parallelism, synchronization and coordination across concurrent tasks can quickly become a bottleneck that prevents leveraging the available parallelism that the workload would permit.
- User-level networking: traditionally, a syscall is needed to send data across the network, since the OS has to ensure safe, concurrent access to shared resources for all processes it runs. These overheads can be minimized by user-level network stacks, where NIC virtualization technology exposes virtual send and receive queues that processes can access directly, without entering the kernel.
- Light-weight user-level scheduling: avoid expensive syscalls and kernel entries when switching from one task to another.
- Scheduler optimizations: since IC subnetworks run state machine replication, scheduling decisions on the order in which requests are executed have to be deterministic. Suboptimal schedules, however, limit resource utilization, because there is less flexibility locally; for example, determinism makes it hard to yield resources to other tasks while a request is waiting. Optimizing the IC scheduler will therefore likely require research into how to optimize schedulers in this niche.
- Caching: many operations on IC nodes are executed repeatedly. This temporal locality motivates exploring the use of caching to improve hardware utilization. An example of this is signature verification.
- HW acceleration: many hardware accelerator chips are optimized for specific tasks and typically achieve higher performance and lower energy consumption. Using such accelerators likely makes sense in the context of the IC. One example is SmartNICs, which offload network checksum calculation and help avoid entering the kernel for each packet transmitted to and from the network. Other examples are crypto accelerators optimized for cryptographic operations such as signature checking.
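To illustrate the kind of bookkeeping the orthogonal persistence bullet above implies, here is a minimal, hypothetical sketch (not the IC implementation) in which writes to a canister heap are tracked at page granularity, so that only dirty pages need to be persisted and charged for. `PAGE_SIZE` and the data structures are illustrative assumptions:

```python
PAGE_SIZE = 4096  # illustrative page size, not necessarily what the IC uses

class TrackedHeap:
    """Toy canister heap that records which pages have been written ("dirtied")."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.dirty_pages: set[int] = set()

    def write(self, offset: int, payload: bytes) -> None:
        self.data[offset:offset + len(payload)] = payload
        # Mark every page the write touches as dirty.
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        self.dirty_pages.update(range(first, last + 1))

    def pages_to_persist(self) -> int:
        """Only dirty pages need to be written back and charged for."""
        return len(self.dirty_pages)

heap = TrackedHeap(10 * PAGE_SIZE)
heap.write(100, b"hello")            # touches page 0
heap.write(PAGE_SIZE - 2, b"span")   # straddles pages 0 and 1
print(heap.pages_to_persist())       # → 2
```

A production implementation would typically rely on hardware page protection (e.g. write-protecting pages and catching the fault) rather than explicit calls, which is exactly where the cost of page faults discussed above comes in.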
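The determinism requirement in the scheduler bullet can be illustrated with a toy round-robin scheduler: given the same input queues, every replica must derive the identical execution order. This is a simplified sketch under assumed data structures, not the IC's actual scheduler:

```python
def deterministic_round_robin(queues: dict[str, list[str]], rounds: int) -> list[str]:
    """Pick one message per canister per round, iterating canisters in a
    fixed (sorted) order so every replica produces the same schedule."""
    queues = {c: list(msgs) for c, msgs in queues.items()}  # don't mutate input
    schedule = []
    for _ in range(rounds):
        for canister in sorted(queues):  # fixed iteration order is the key
            if queues[canister]:
                schedule.append(queues[canister].pop(0))
    return schedule

# Two replicas with the same input must agree on the order.
inbox = {"canister_b": ["b1", "b2"], "canister_a": ["a1"]}
replica1 = deterministic_round_robin(inbox, rounds=2)
replica2 = deterministic_round_robin(inbox, rounds=2)
print(replica1)  # → ['a1', 'b1', 'b2']
assert replica1 == replica2
```

The rigidity is also visible here: a replica cannot opportunistically reorder work around a slow request without every other replica doing the same, which is the flexibility loss the bullet refers to.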
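The caching bullet can be sketched as a memoized verification check: repeated verifications of the same (key, message, tag) triple are served from a cache instead of re-running the expensive computation. The HMAC comparison here is a stand-in for real signature verification, and the cache size is an arbitrary assumption:

```python
import hashlib
import hmac
from functools import lru_cache

@lru_cache(maxsize=4096)
def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Expensive check, memoized: identical triples are only verified once."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

key, msg = b"secret", b"block payload"
tag = hmac.new(key, msg, hashlib.sha256).digest()

print(verify(key, msg, tag))     # first call: computed
print(verify(key, msg, tag))     # second call: served from cache
print(verify.cache_info().hits)  # → 1
```

In a real node, cache keys would need to be chosen carefully (e.g. hashing large messages) and the cache bounded and invalidated appropriately, but the temporal-locality argument is the same.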
Key milestones
- Benchmarking suites to identify bottlenecks
- Model typical applications running on the IC to use as workload for bottleneck analysis
- Expected performance improvement: one order of magnitude in the number of requests processed per machine per second, without hurting request latency.
Based on those benchmarks, decisions on where optimizations are most likely to improve performance can be made and addressed in collaboration with experts from the respective components.
Discussion Lead
Why the DFINITY Foundation should make this a long-running R&D project
The success of the IC crucially depends on its usefulness for executing general purpose applications and workloads.
If the IC is to truly achieve blockchain singularity, performance (and cost of execution) has to be similar to centralized cloud provider platforms.
Skills and Expertise necessary to accomplish this
For this effort, experts across the entire stack are needed: engineers with a deep understanding of hardware characteristics, developers familiar with the Linux kernel, and application-level developers who can restructure code to achieve better utilization of available hardware resources.
Open Research questions
- How useful are hardware accelerators?
  - Crypto hardware
  - Smart NICs
  - Near-memory processing
  - NVM hardware
- Can performance competitive to centralized cloud solutions be achieved?
Examples where community can integrate into project
- Suggestions for meaningful workloads
- Provide more application benchmarks
- Suggestions for and implementation of performance improvements
What we are asking the community
- Review comments, ask questions, give feedback
- Vote accept or reject on NNS Motion