Long Term R&D: Node Performance (proposal)

diegop · December 7, 2021, 2:01am

1. Summary

This is a motion proposal for the long-term R&D of the DFINITY Foundation, as part of the follow-up to this post: Motion Proposals on Long Term R&D Plans (Please read this post for context).

This project’s objective

The current design and implementation of the IC protocol has been focused on simplicity and low engineering effort. While this suffices to demonstrate the possibilities of canister smart contracts, unleashing the full capacity of the IC requires a high-performance design and implementation. The goal of this project is to address bottlenecks and thus increase the performance of individual IC nodes, so that more query and update requests can be executed in the same amount of time.

This will involve improvements to the orthogonal persistence mechanism, NIC virtualization, OS and Canister scheduling, caching, and the investigation of HW accelerators for compute-intensive tasks.

2. Discussion lead

Stefan Kaestle

3. How this R&D proposal is different to previous types

Previous motion proposals have revolved around specific features and tended to have clear, finite goals that are delivered and completed. They tended to be measured in days, weeks, or months.

These motion proposals are different and are defining the long-term plan that the foundation will use, e.g., for hiring and organizational build-out. They have the following traits and patterns:

Their scope is years, not weeks or months as in previous NNS motions
They have a broad direction but are active areas of R&D so they do not have an obvious line of execution.
They involve deep research in cryptography, networking, distributed systems, language, virtual machines, operating systems.
They are meant to match the strengths of where the DFINITY foundation’s expertise is best suited.
Work on these proposals will not start immediately.
There will be many follow-up discussions and proposals on each topic when work is underway and smaller milestones and tasks get defined.

An example may be the R&D for “Scalability” where there will be a team investigating and improving the scalability of the IC at various stages. Different bottlenecks will surface and different goals will be met.

3. How this R&D proposal is similar to what we have seen

We want to double down on the behaviors we think have worked well. These include:

Publicly identifying owners of subject areas to engage and discuss their thinking with the community
Providing periodic updates to the community as things evolve, milestones reached, proposals are needed, etc…
Presenting more and more R&D thinking early and openly.

This has worked well for the last 6 months so we want to repeat this pattern.

4. Next Steps

Developer forum intro posted
1-pager from the discussion lead posted
NNS Motion proposal submitted

5. What we are asking the community

Ask questions
Read 1-pager
Give feedback
Vote on the motion proposal

Frankly, we do not expect many nitty-gritty details because these are meant to address projects that go on for long time horizons.

The DFINITY foundation’s only goal is to improve the adoption of the IC so we want to sanity-check the projects we see necessary for growing the IC by having you (the ICP community) tell us what you all think of these active R&D threads we have.

6. What this means for the existing Roadmap or Projects

In terms of the current roadmap and proposals executed, those are still being worked on and have priority.

An intellectually honest way to look at this long-term R&D project is to see them as the upstream or “primordial soup” from which more baked projects emerge from. With this lens, these proposals are akin to asking, “what kind of specialties or strengths do we want to make sure DFINITY foundation has built up?”

Most (if not all) projects that the DFINITY foundation has executed or is executing are borne from long-running R&D threads. Even when community feedback tells the foundation, “we need X” or “Y does not work”, it is typically the team with the most relevant R&D area that picks up the short-term feature or project.

diegop · December 7, 2021, 4:46am

Please note:

Some folks gave asked if they should vote to “reject” any of the Long Term R&D projects as a way to signal prioritization. The answer is simple: “No, please, ACCEPT”

These long-term R&D projects are the DFINITY’s foundation’s thesis at R&D threads it should have across years (3 years is the number we sometimes use internally). We are asking the community to ACCEPT (pending 1-pager and more community feedback of course). Prioritization can come at a separate step.

stefan-kaestle · December 10, 2021, 12:43pm

Hi all, I am Stefan, researcher at DFINITY. My interests are in operating systems, runtime systems and distributed systems. Before working on the Internet Computer, I worked on distributed graph processing and the Barrelfish research operating system.

I am happy to be leading discussions around this proposal on behalf of the many engineers and researchers that are helping to improve performance of IC nodes.

stefan-kaestle · December 10, 2021, 12:45pm

Performance motion proposal

Objective

The current design and implementation of the IC protocol has been focused on simplicity and low engineering effort. While this suffices to demonstrate the possibilities of canister smart contracts, unleashing the full capacity of the IC requires a high-performance design and implementation. The goal of this project is to address bottlenecks and thus increase the performance of individual IC nodes, so that more query and update requests can be executed in the same amount of time.

This will involve improvements such as to the orthogonal persistence mechanism, NIC virtualization, OS and Canister scheduling, caching and the investigation of HW accelerators for compute-intensive tasks.

Background

In order to handle growing demand for new and more elaborate applications, the IC needs to be able to scale to those new requirements.

Broadly speaking, there are two approaches to scalability: a system can scale out by adding more hardware (in our case, more subnetworks). This often happens alongside some form of partitioning, as parts of that extra hardware are supposed to run independently from each other. This partitioning often implies a loss of locality as application logic might have remote state.

In contrast, systems can also scale up by increasing the performance of individual machines by making them more powerful or achieving better resource utilization.

Both are important. This proposal focuses on scale-up, there is a separate proposal on the scale-out.

Motivation

One approach to scalability of the IC is to scale out, i.e.:

By adding more subnetworks to increase update request rate
By adding more nodes to increase query request rate

Scaling out, however, comes at a cost due to less locality and extra communication across subnetworks:

Xnet messages to communicate across subnetwork boundaries
Increased message complexity for Consensus with more nodes per subnetwork

Because of those costs, relying exclusively on scale out to increase the IC capacity is insufficient. It is also important to scale up performance of individual nodes to achieve higher query/update request rates without adding additional nodes.

This also improves machine utilization, cost of operating hardware, as well as energy consumption.

Topics under this project

In order to achieve our goals, the current bottlenecks must first be investigated, followed by the design and implementation of suitable mitigation strategies. This will involve research and development effort in many components and require a diverse set of engineers and researchers with expertise in Distributed Systems, Cryptography, Performance Analysis and Management, Operating Systems, and Networking and is expected to take multiple years.

The following are some examples of areas for optimizations that promise to improve IC node performance:

Orthogonal persistence (OP): the IC uses orthogonal persistence as a programming abstraction to simplify memory management for canister smart contract developers. The IC OP implementation also determines resource consumption as the canister needs to be charged for its use of memory.
Locality of data and code: TLB misses and page faults are extremely expensive on modern hardware and can severely hurt overall application performance. All existing code has to be measured to find occurrences of the ones that hurt performance.
Synchronization primitives, lock contention: The IC is a platform that exhibits a large degree of parallelism, as requests to different canisters and query calls to the same canister can be executed in parallel. With a high degree of parallelism, synchronization and coordination across concurrent tasks can quickly become a bottleneck that prevents leveraging the available parallelism that the workload would permit.
User-level networking: traditionally, a syscall is needed to send data across the network, since the OS needs to make sure concurrent and safe access to shared resources is provided to all processes running on the OS. Such overheads can be minimized by means of user-level network stacks, where NIC virtualization technology allows to expose virtual send and receive queues that can be directly accessed by processes without having to enter the kernel.
Light-weight user-level scheduling: avoid expensive syscalls and entering the OS kernel when switching from one task to another.
Scheduler optimizations: as IC subnetworks run state machine replication, scheduling decisions for the order in which requests are going to be executed have to be deterministic. However, suboptimal schedules limit resource utilization, since there is less flexibility locally. For example, this makes it hard to yield resources to other tasks while waiting for resources. Thus, optimizing the IC scheduler will likely require us to research how to optimize schedulers in this niche.
Caching: many operations on IC nodes have to be executed repeatedly. This temporal locality motivates exploring the use of caching to improve hardware utilization. An example for this is signature verification.
HW acceleration: many hardware accelerator chips exist that are optimized for certain tasks and typically achieve higher performance and lower energy consumption. The use of those likely makes sense in the context of the IC. An example are Smart NICs that offload network checksum calculation and help to avoid entering the kernel for each package that is transmitted to and from networks. Other examples are crypto accelerators optimized for cryptographic operations such as signature checking.

Key milestones

Benchmarking suites to identify bottlenecks
Model typical applications running on the IC to use as workload for bottleneck analysis
Expected performance improvements: one order of magnitude for the number of requests processed per machine per second without hurting request latency.

Based on those benchmarks, decisions on where optimizations are most likely to improve performance can be made and addressed in collaboration with experts from the respective components.

Discussion Lead

@stefan-kaestle

Why the DFINITY Foundation should make this a long-running R&D project

The success of the IC crucially depends on its usefulness for executing general purpose applications and workloads.

If the IC is to truly achieve blockchain singularity, performance (and cost of execution) has to be similar to centralized cloud provider platforms.

Skills and Expertise necessary to accomplish this

For this effort, experts across the entire stack are needed. Starting from engineers with a deep understanding of hardware characteristics to developers familiar with the Linux kernel up to the application level, where code will have to be restructured in order to achieve a better utilization of available hardware resources.

Open Research questions

How useful are hardware accelerators?
- Crypto hardware
- Smart NICs
- Near-memory processing
- NVM hardware
Can performance competitive to centralized cloud solutions be achieved?

Examples where community can integrate into project

Suggestions for meaningful workloads
Provide more application benchmarks
Suggestions for and implementation of performance improvements

What we are asking the community

Review comments, ask questions, give feedback
Vote accept or reject on NNS Motion

diegop · December 20, 2021, 7:33pm

Proposal is live! Internet Computer Network Status

CFWHISPERER · January 26, 2023, 10:56pm

Since 2001 I have been working on capacity planning/scalability projects for all sizes of National and International entities such a NASA, the FAA etc; these as an independent consultant. My knowledge is drawn from all levels of the technology stack, OS’s, Network, Database and Software Code impacts. My main load generating expertise exists with Apache JMeter and as a member of the community I would like to offer my assistance if and when needed.

yvonneanne · January 30, 2023, 4:34pm

Welcome to the forum, CFWHISPERER, and thanks for your offer to help us! Much appreciated!

What type of metrics are you most interested in? Where do you see the biggest needs/benefits from reporting on the ICs performance and improvements we’re working on?

Topic		Replies	Views
Motion Proposals on Long Term R&D Plans Roadmap	17	8122	June 10, 2022
Long Term R&D: Scalability (proposal) Roadmap	18	4797	April 27, 2022
Long Term R&D: Boundary Nodes (proposal) Roadmap	41	5261	May 24, 2022
Long Term R&D: Malicious Node Security (proposal) Roadmap	5	2210	August 16, 2023
Long Term R&D: Canister Migration (proposal) Roadmap	10	1521	July 24, 2024