Performance motion proposal
Objective
The current design and implementation of the IC protocol has been focused on simplicity and low engineering effort. While this suffices to demonstrate the possibilities of canister smart contracts, unleashing the full capacity of the IC requires a high-performance design and implementation. The goal of this project is to address bottlenecks and thus increase the performance of individual IC nodes, so that more query and update requests can be executed in the same amount of time.
This will involve improvements to areas such as the orthogonal persistence mechanism, NIC virtualization, OS and canister scheduling, and caching, as well as the investigation of hardware accelerators for compute-intensive tasks.
Background
In order to handle growing demand for new and more elaborate applications, the IC needs to be able to scale to those new requirements.
Broadly speaking, there are two approaches to scalability: a system can scale out by adding more hardware (in our case, more subnetworks). Scaling out often goes hand in hand with some form of partitioning, as parts of the extra hardware are supposed to run independently of each other. This partitioning often implies a loss of locality, since application logic may depend on state held remotely.
In contrast, systems can also scale up by increasing the performance of individual machines by making them more powerful or achieving better resource utilization.
Both are important. This proposal focuses on scaling up; a separate proposal covers scaling out.
Motivation
One approach to scaling the IC is to scale out:
- By adding more subnetworks to increase update request rate
- By adding more nodes to increase query request rate
Scaling out, however, comes at a cost due to less locality and extra communication across subnetworks:
- Xnet messages to communicate across subnetwork boundaries
- Increased message complexity for Consensus with more nodes per subnetwork
Because of these costs, relying exclusively on scaling out to increase IC capacity is insufficient. It is also important to scale up the performance of individual nodes to achieve higher query/update request rates without adding nodes.
This also improves machine utilization and reduces hardware operating costs and energy consumption.
Topics under this project
In order to achieve our goals, the current bottlenecks must first be investigated, followed by the design and implementation of suitable mitigation strategies. This will involve research and development effort in many components and require a diverse set of engineers and researchers with expertise in Distributed Systems, Cryptography, Performance Analysis and Management, Operating Systems, and Networking, and is expected to take multiple years.
The following are some examples of areas for optimizations that promise to improve IC node performance:
- Orthogonal persistence (OP): the IC uses orthogonal persistence as a programming abstraction to simplify memory management for canister smart contract developers. The IC's OP implementation also determines resource accounting, since canisters must be charged for the memory they use.
- Locality of data and code: TLB misses and page faults are extremely expensive on modern hardware and can severely hurt overall application performance. Existing code has to be profiled to find the places where these events hurt performance.
- Synchronization primitives, lock contention: The IC is a platform that exhibits a large degree of parallelism, as requests to different canisters and query calls to the same canister can be executed in parallel. With a high degree of parallelism, synchronization and coordination across concurrent tasks can quickly become a bottleneck that prevents leveraging the available parallelism that the workload would permit.
- User-level networking: traditionally, a syscall is needed to send data across the network, since the OS has to ensure safe, concurrent access to shared resources for all processes it runs. These overheads can be minimized by user-level network stacks, where NIC virtualization technology exposes virtual send and receive queues that processes can access directly, without entering the kernel.
- Light-weight user-level scheduling: avoid expensive syscalls and kernel entries when switching from one task to another.
- Scheduler optimizations: since IC subnetworks run state machine replication, scheduling decisions on the order in which requests are executed have to be deterministic. Suboptimal schedules, however, limit resource utilization, because there is less flexibility locally; for example, determinism makes it hard to yield resources to other tasks while a request is waiting. Optimizing the IC scheduler will therefore likely require research into how to optimize schedulers in this niche.
- Caching: many operations on IC nodes are executed repeatedly. This temporal locality motivates exploring the use of caching to improve hardware utilization. An example of this is signature verification.
- HW acceleration: many hardware accelerator chips are optimized for specific tasks and typically achieve higher performance and lower energy consumption. Using such accelerators likely makes sense in the context of the IC. One example is SmartNICs, which offload network checksum calculation and help avoid entering the kernel for each packet transmitted to and from the network. Other examples are crypto accelerators optimized for cryptographic operations such as signature checking.
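To illustrate the kind of bookkeeping the orthogonal persistence bullet above implies, here is a minimal, hypothetical sketch (not the IC implementation) in which writes to a canister heap are tracked at page granularity, so that only dirty pages need to be persisted and charged for. `PAGE_SIZE` and the data structures are illustrative assumptions:

```python
PAGE_SIZE = 4096  # illustrative page size, not necessarily what the IC uses

class TrackedHeap:
    """Toy canister heap that records which pages have been written ("dirtied")."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.dirty_pages: set[int] = set()

    def write(self, offset: int, payload: bytes) -> None:
        self.data[offset:offset + len(payload)] = payload
        # Mark every page the write touches as dirty.
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        self.dirty_pages.update(range(first, last + 1))

    def pages_to_persist(self) -> int:
        """Only dirty pages need to be written back and charged for."""
        return len(self.dirty_pages)

heap = TrackedHeap(10 * PAGE_SIZE)
heap.write(100, b"hello")            # touches page 0
heap.write(PAGE_SIZE - 2, b"span")   # straddles pages 0 and 1
print(heap.pages_to_persist())       # → 2
```

A production implementation would typically rely on hardware page protection (e.g. write-protecting pages and catching the fault) rather than explicit calls, which is exactly where the cost of page faults discussed above comes in.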
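The determinism requirement in the scheduler bullet can be illustrated with a toy round-robin scheduler: given the same input queues, every replica must derive the identical execution order. This is a simplified sketch under assumed data structures, not the IC's actual scheduler:

```python
def deterministic_round_robin(queues: dict[str, list[str]], rounds: int) -> list[str]:
    """Pick one message per canister per round, iterating canisters in a
    fixed (sorted) order so every replica produces the same schedule."""
    queues = {c: list(msgs) for c, msgs in queues.items()}  # don't mutate input
    schedule = []
    for _ in range(rounds):
        for canister in sorted(queues):  # fixed iteration order is the key
            if queues[canister]:
                schedule.append(queues[canister].pop(0))
    return schedule

# Two replicas with the same input must agree on the order.
inbox = {"canister_b": ["b1", "b2"], "canister_a": ["a1"]}
replica1 = deterministic_round_robin(inbox, rounds=2)
replica2 = deterministic_round_robin(inbox, rounds=2)
print(replica1)  # → ['a1', 'b1', 'b2']
assert replica1 == replica2
```

The rigidity is also visible here: a replica cannot opportunistically reorder work around a slow request without every other replica doing the same, which is the flexibility loss the bullet refers to.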
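The caching bullet can be sketched as a memoized verification check: repeated verifications of the same (key, message, tag) triple are served from a cache instead of re-running the expensive computation. The HMAC comparison here is a stand-in for real signature verification, and the cache size is an arbitrary assumption:

```python
import hashlib
import hmac
from functools import lru_cache

@lru_cache(maxsize=4096)
def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Expensive check, memoized: identical triples are only verified once."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

key, msg = b"secret", b"block payload"
tag = hmac.new(key, msg, hashlib.sha256).digest()

print(verify(key, msg, tag))     # first call: computed
print(verify(key, msg, tag))     # second call: served from cache
print(verify.cache_info().hits)  # → 1
```

In a real node, cache keys would need to be chosen carefully (e.g. hashing large messages) and the cache bounded and invalidated appropriately, but the temporal-locality argument is the same.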
Key milestones
- Benchmarking suites to identify bottlenecks
- Model typical applications running on the IC to use as workload for bottleneck analysis
- Expected performance improvement: one order of magnitude in the number of requests processed per machine per second, without hurting request latency.
Based on those benchmarks, decisions on where optimizations are most likely to improve performance can be made and addressed in collaboration with experts from the respective components.
Discussion Lead
Why the DFINITY Foundation should make this a long-running R&D project
The success of the IC crucially depends on its usefulness for executing general purpose applications and workloads.
If the IC is to truly achieve blockchain singularity, performance (and cost of execution) has to be similar to centralized cloud provider platforms.
Skills and Expertise necessary to accomplish this
For this effort, experts across the entire stack are needed: engineers with a deep understanding of hardware characteristics, developers familiar with the Linux kernel, and application-level developers who can restructure code to achieve better utilization of available hardware resources.
Open Research questions
- How useful are hardware accelerators?
  - Crypto hardware
  - Smart NICs
  - Near-memory processing
  - NVM hardware
- Can performance competitive to centralized cloud solutions be achieved?
Examples where community can integrate into project
- Suggestions for meaningful workloads
- Provide more application benchmarks
- Suggestions for and implementation of performance improvements
What we are asking the community
- Review comments, ask questions, give feedback
- Vote accept or reject on NNS Motion