For completeness, we decided to also publish the motion proposal that we plan to submit on Monday. It combines the first post in this thread with the extension posted earlier today.
Decentralized Node Management
The Internet Computer (IC) is formed by standardized “node machines”. These machines are owned by independent node providers and installed within independent data centers. This motion proposal describes the technical and organizational advances required to guarantee continued, sustainable growth of the IC network.
The objective is threefold:
Decentralization: empower node providers to set up, monitor and maintain nodes independently.
Scalability: remove technical and operational bottlenecks that prevent the network from growing to millions of nodes.
Automation: remove errors and speed up repetitive tasks to lower risks and reduce the resources required for the ever-growing IC network.
Compared with other blockchains, the IC has stronger requirements on the homogeneity of the node machines, because most of the machines’ resources are dedicated to performing useful work (e.g. executing smart contracts, participating in threshold cryptography) that must be carried out by all nodes of a given subnet blockchain.
The IC is designed to host mass market applications completely on chain. Thus the state of canister smart contracts on the IC can be much larger than the state of smart contracts on other blockchains. The nodes also need significant network capacity for managing and synchronizing that state. To meet the requirements of extremely high availability and high throughput, the IC nodes are hosted in data centers (DCs). To achieve decentralization, the nodes are owned and the DCs are operated by independent parties. Moreover, the DCs are spread across the globe. Due to the need for homogeneity and capacity, the Internet Computer Association (ICA) defines the hardware to be used for node machines. The network is expected to grow to millions of nodes running from thousands of DCs in the coming years. As decentralization is key to the mission of the IC, many new, independent node providers (NPs) will be needed to support that growth. The current NP onboarding process, node deployment and management procedures are not ready to sustain this network growth mid- and long-term.
3. Why is this important?
A successful IC attracts thousands of developers who are continuously deploying millions of canister smart contracts. In order to sustain this growth, the number of nodes and subnets must steadily increase. However, this is only possible if the operational processes are further automated and made ready to scale, all in a decentralized fashion.
4. Topics under this project
Specifically, this proposal includes the following research and development directions. The first topics are reasonably well scoped and work has started. The second set of topics is to be tackled in the mid and long term.
Autonomous node deployment and operation: Empower node providers to install and maintain nodes without any support from the ICA.
Independent node provider registration: New node providers shall be able to register directly in the NNS frontend dapp. They will be able to submit a proposal to become a new node provider, without the support of the ICA.
NNS-driven DC allocation: Node providers can participate in an NNS-controlled mechanism to elect new data centers. This process shall incentivise the addition of new DCs that further decentralize the IC’s network. Furthermore, it balances inflationary and deflationary forces.
Availability of node hardware: A new generation of ICA-specified node hardware is planned that shall guarantee global availability and provide a better choice between hardware providers. Furthermore, the next generation shall include the SEV-SNP technology to further improve the security of nodes.
Mid- and Long-term
Decentralized monitoring: Currently, the health of the nodes is measured by collecting and analysing logs and metrics on a third-party cluster. When indicators deviate from their expected values, the respective node providers and data centers are notified manually so that they can fix the situation. In the next generation of the Internet Computer protocol, the monitoring tasks will be carried out by the nodes of the network themselves, in a fully automated fashion. To this end, the architecture and the protocol of the nodes and their components require extensions. In particular, it should be possible for any party to collect information on the health and contributions of any node without additional trust assumptions.
Decentralized backup: To ensure that even in the presence of a deadly bug in the node software the content of subnets is not lost, a backup mechanism shall be implemented to collect the state and the messages sent to a subnet off-chain. Using disaster recovery, the state of the subnet can be re-generated. In order to do this in a decentralized manner, mechanisms trading off replay speed, memory complexity, and privacy concerns have to be devised. Furthermore, the backup mechanism must be integrated in the governance protocol, including the assignment of which nodes are responsible for the backup of which subnets and their monitoring.
Decentralized boundary nodes: As elaborated in a separate motion proposal, boundary nodes will become fully NNS-managed and use the same update mechanisms and operating system as regular nodes. This will allow the inclusion of boundary nodes in the new node provider deployment process with slight variations in the configuration process.
Extended node remuneration: The previous items will lead to refinements and improvements of the node remuneration process. Insights gained through the decentralized monitoring will allow automated adjustments to payments based on the performance of nodes, or rather the lack thereof. Backup services and boundary nodes will be new categories to be included in the remuneration process.
Automated node and data center allocation and rewards: In a future version of the Internet Computer protocol, many more tasks will be automated and executed by NNS canisters. In particular, the computation of the rewards granted to a node in a specific location will take the current local costs for electricity, wages, regulations and risks into account and combine them with the measurements obtained from the decentralized monitoring infrastructure. Future versions of the NNS will be designed to reflect the dynamics of compute and storage demand and supply, integrate decentralization metrics and balance the amount of tokens in circulation.
Automated subnet creation and healing: With a continuously increasing number of nodes and subnets, it will become untenable to manually propose removal of unhealthy nodes and addition of new ones. Instead, the NNS will be extended to propose and execute node replacements and subnet creation based on data collected through decentralized monitoring.
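To make the healing idea above more concrete, here is a minimal Python sketch of how replacement proposals could be derived from monitoring data. It is purely illustrative and not the NNS design: the `NodeHealth` record, the `uptime_ratio` metric, the `min_uptime` threshold and the `propose_replacements` function are all hypothetical names and assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NodeHealth:
    """Hypothetical health report for one node, as monitoring might produce it."""
    node_id: str
    subnet_id: str
    uptime_ratio: float  # assumed metric: fraction of intervals the node was responsive


def propose_replacements(
    reports: List[NodeHealth],
    spare_nodes: List[str],
    min_uptime: float = 0.95,  # assumed threshold, not a value from the proposal
) -> List[Dict[str, str]]:
    """Pair each unhealthy node with an unassigned spare node, worst node first."""
    spares = list(spare_nodes)  # copy so the caller's list is not modified
    proposals: List[Dict[str, str]] = []
    for report in sorted(reports, key=lambda r: r.uptime_ratio):
        if report.uptime_ratio >= min_uptime:
            break  # all remaining nodes are healthy
        if not spares:
            break  # no spare capacity left; remaining nodes need manual attention
        proposals.append(
            {"subnet_id": report.subnet_id, "remove": report.node_id, "add": spares.pop(0)}
        )
    return proposals
```

In this sketch the output is only a list of proposals; whether such proposals would be executed automatically or still be voted on is left open, as it is in the text above.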
Note that the work described in the motion proposals on Trusted Execution Environments, Secure OS, E2E Security, Malicious Parties, Scalability and Tokenomics must be integrated into the release and operational processes to achieve these goals. We will not describe or discuss them here in detail, though, and suggest that questions about them be asked in the corresponding forum threads.
5. Key milestones
The following milestones are indicative and their scope may change as the work proceeds.
(M1) Decentralized node deployments: This milestone subsumes the first short-term items and enables node providers to independently onboard and deploy their nodes.
(M2) Redeployment of existing nodes using new deployment mechanism
(M3) NNS-managed DC allocation: New DCs are selected by the NNS, possibly using an auction mechanism.
(M4) Operational maturity, including decentralized backup services and monitoring
(M5) Metrics-based remuneration: a refinement of the node provider remuneration process, taking the data collected through decentralized monitoring into account.
(M6) Automation: this is less a milestone than a continuous effort.
Note: Boundary node milestones are described in a separate motion proposal.
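As an illustration of what the metrics-based remuneration in (M5) could look like, the following Python sketch scales a node's reward linearly with its measured uptime. This is only one possible scheme under stated assumptions: the function name `adjusted_reward`, the `uptime_ratio` metric and both threshold values are hypothetical and do not come from the proposal.

```python
def adjusted_reward(
    base_reward: float,
    uptime_ratio: float,
    full_reward_uptime: float = 0.99,  # assumed: at or above this, the full reward is paid
    zero_reward_uptime: float = 0.80,  # assumed: at or below this, no reward is paid
) -> float:
    """Scale a node's base reward linearly between two uptime thresholds."""
    if not 0.0 <= uptime_ratio <= 1.0:
        raise ValueError("uptime_ratio must be between 0 and 1")
    if uptime_ratio >= full_reward_uptime:
        return base_reward
    if uptime_ratio <= zero_reward_uptime:
        return 0.0
    # Linear interpolation between the two thresholds.
    fraction = (uptime_ratio - zero_reward_uptime) / (full_reward_uptime - zero_reward_uptime)
    return base_reward * fraction
```

A real mechanism would also have to address the open questions listed further below, for example how the underlying measurements can be collected without additional trust assumptions.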
6. People involved
Discussion leads: @Luis @yvonneanne @samuelburri
7. Why the DFINITY Foundation should make this a long-running R&D project
The design, implementation and rollout of enhanced node management requires changes to many layers of the IC stack as well as close collaboration with various stakeholders, such as the node providers. While community members may have an excellent understanding of individual components or aspects, it’s unlikely that the community will drive this change end-to-end without a major contribution from the DFINITY Foundation. For that reason, and given the importance of the topic for sustainable network growth, the Foundation is determined to make significant resource investments in this project.
8. Skills and Expertise necessary to accomplish this
The problems described above require the cooperation of hardware, data center, and networking experts as well as software engineers, to design, review, and implement the prospective solutions. Specifically, experts from the following fields are necessary:
- Server design and configuration
- Networked systems
- Network management
- Network security
- Distributed systems
This project would require both researchers and software engineers with expertise in the above-mentioned fields.
9. Open research questions
- What are the relevant metrics to derive information about the health and correct operation of nodes? How can this information be collected reliably in a trustless fashion despite failures?
- How should subnet backup information be distributed, stored and retrieved from nodes while respecting privacy and guaranteeing efficient recovery despite Byzantine node faults?
- How should the rewards for boundary nodes and backup nodes be determined in relationship to replica nodes? Should there be a role shuffling scheme and if yes how should it be designed to balance fairness and efficiency while providing maximum security?
- How can the computation and implementation of penalties and rewards be automated in a fair and incentive-compatible way with minimal implementation effort?
- How to migrate from the current deployment of three different node types to a unified and more flexible deployment fully governed by the NNS?
- How to design allocation mechanisms that provide incentives for network growth maximizing decentralization along multiple dimensions? How to balance dynamism in local costs, risks and regulations with the need for long-term planning and the supply and demand of ICP overall?
- How to exploit the information about node health and subnet usage for the automatic derivation of suitable thresholds for node replacements and subnet creation?
10. Examples where the community can integrate into the project
Node providers are the key stakeholders for the plans presented in this proposal. We would greatly appreciate their opinion on how to prioritize the work items listed above and whether we may have missed some important aspects. We plan to roll out the first DCs with the new deployment mechanism in Q1 2022 and look forward to tangible feedback from these first trials. We plan to keep the community posted on this topic on a regular basis.
11. What we are asking the community
The forum discussion around the roadmap for node management has attracted a lot of interest and contributions from the community over the past few days. We have included the feedback that we have received so far in this summary. DFINITY engineers and researchers are looking forward to more input and discussion once this motion proposal is submitted.