Scalability Motion proposal
Summary
The current version of the Internet Computer protocol provides a solid foundation for the execution of canister smart contracts which can be enhanced and extended to realize the IC’s potential to serve millions of smart contracts. More precisely, the following shall be investigated and suitable mechanisms designed, analysed, implemented, and tested:
- Many more subnets with frequent inter-canister calls across different subnets. The communication overhead for two canisters running on different subnets should be kept to a minimum and new nodes should be able to join subnets fast even if the state of the canisters they host is huge.
- Large subnets with hundreds of nodes to tolerate a higher number of faulty nodes, yet exhibit low latency and high throughput for update calls.
- Developers and users sending large messages to canisters as well as inter-canister messages.
To achieve this, the current bottlenecks are investigated and then suitable mitigation strategies are designed and implemented. This will involve research and development efforts in many components and require a diverse set of engineers and researchers with expertise in Distributed Systems, Cryptography, Performance Analysis and Management, Operating Systems, and Networking and is expected to take multiple years.
1. Objective
Design, analyse, implement, and test a more scalable version of the IC protocol to achieve higher throughput, more nodes, efficient routing across subnetworks, faster state synchronization, and automatic distributed load balancing despite malicious entities participating in the protocol.
2. Background
The IC currently comprises a system subnet for the NNS canisters running on 37 nodes and a bit more than two dozen app subnets each running on 13 nodes each and maintaining a total of a little over 400 GB of state. Most of the current traffic consists of update and query calls from users to smart contract canisters. With the projected growth and the smart contract canisters becoming more sophisticated, not only will the load on the IC increase, it will also change to feature more inter-canister messages. Moreover, higher security requirements on system subnets (and potentially other subnet types) benefit from higher numbers of nodes per subnet…
3. Why this is important
This project will help developers and consumers of IC dapps in the following ways:
- Decreased latency combined with higher throughput & availability - Thanks to the work described in this motion proposal, future versions of the IC protocol and its implementation will be more scalable and robust. As a consequence, developers and users will benefit from decreased latency and higher throughput and availability.
- Lower the overhead experienced by users - An enhanced management of node and subnet configuration will simplify adding more nodes and subnets, which in turn alleviates the load per subnet and an improved routing for messages between subnets will lower the overhead experienced by users further.
- Orthogonal to these measures, a higher number of nodes in important subnets like the NNS subnet or other system subnets is desirable to make them tolerate more faulty nodes.
4. Topics under this project
In order to enhance the IC, an improved version supports more subnets with more nodes, larger messages and an efficient backup mechanism.
The design, implementation, testing, and analysis of these features will rely on research and development on the following non-exhaustive list of topics:
- New routing mechanisms with frequent inter-canister calls across different subnets, relying on deterministic or randomized overlays
- Traffic shaping and scheduling algorithms to provide quality of service guarantees despite Byzantine players
- Construction of overlays for efficient message dissemination within and between subnets and a robust mechanism to adapt the overlays over time.
- Efficient mechanisms to (re)join a subnet, peer discovery and state synchronization
- Chunking at different layers of the IC stack
- Speculative algorithms for the execution and verification of canister message processing
- Separating the dissemination of messages to canisters from ordering them
- Erasure coding for reliable broadcast
- Distributed monitoring and analysis
- Node and subnet configuration and key management in the NNS registry canister
5. Key milestones
The following list of potential milestones indicate the ambitions for this project and will be adapted to suitable values as the corresponding work packages are tackled.
Input on the prioritization of the different milestones is highly appreciated, the current numbering is not supposed to indicate priorities or a sequence in which they will be achieved.
- Many subnets: Latency of 90% of the inter-canister messages on 1000 subnets with a “typical” workload distribution is below 5s
- Large subnets: Subnets with 200 nodes achieve a block rate of 1/s
- Resilience: No bad peer can deteriorate the throughput and latency by more than 5% in and across subnets of at least 50 nodes
- Large messages: Update and query requests of up to 10GB are supported
6. Discussion leads
Yvonne-Anne Pignolet is driving the motion proposal, Yotam Harchol, David Derler Manu Drijvers Thomas Locher and other team members will be available for discussions.
7. Why the DFINITY Foundation should make this a long-running R&D project
To support the projected growth and adoption rate with an outstanding development and user experience, the Internet Computer must be enabled to scale out and adapt to its workload automatically while preserving its security guarantees. Therefore, the DFINITY Foundation is committed to investigating and designing the next generation of the IC protocol by solving the above-mentioned problems, so the IC can cope with very high loads and tolerate Byzantine nodes.
8. Skills and Expertise necessary to accomplish this
Tackling the challenges mentioned above relies on research and development efforts in many components and requires a diverse set of engineers and researchers with expertise in Distributed Systems, Cryptography, Performance Analysis and Management, Operating Systems, and Networking and is expected to take multiple years.
9. Open Research questions
- How can overlays for the communication between and within subnets of different sizes be constructed and used efficiently?
- How can the management of node and subnet configuration and keys be distributed among multiple canisters?
- How can a node (re)joining a subnet discover its current overlay peers and obtain the necessary information to participate in the protocol quickly?
- How do subnets discover which other subnets have canister messages for them and how do they schedule their nodes to exchange them?
- How can the traffic be shaped to utilize the available bandwidth between nodes in the same and different data centers efficiently despite malicious nodes?
- How can one design OS and networking scheduling algorithms to provide Quality of service guarantees despite Byzantine players?
- How can large messages be partitioned, stored and disseminated efficiently on the different layers of the IC stack?
- How can one design speculative algorithms for the execution and verification of canister message processing?
- How can the dissemination of messages be separated from ordering them with minimal overhead?
- How can the Internet Computer be monitored and analysed in a distributed fashion?
10. Examples where community can integrate into project
Due to the wide scope of required expertise for this motion proposal, it is expected that it will be carried out in tight interaction with the community. In particular, it is planned to organise workshops as the motion proposal evolves to discuss priorities, solution approaches, and implementation. Furthermore, a critical assessment and discussion regarding the security and growth and usage assumptions is of strong interest.
11. What we are asking the community
-
Review comments, ask questions, give feedback
-
Vote accept or reject on NNS Motion