Performance-Based Node Rewards

TL;DR

The Internet Computer Protocol uses Trustworthy Metrics to evaluate node performance based on block creation success. It is proposed to penalize underperforming nodes by linking their rewards to the Trustworthy Metrics, focusing initially on nodes with very high failure rates. It is suggested to roll out this feature in phases, beginning with a Proof of Concept to collect feedback from stakeholders, before implementing the reward reduction within the Network Nervous System.

Background & Goal

Earlier this year, the ICP protocol introduced “Trustworthy Metrics for Useful Work” to enhance transparency in node performance. These metrics expose information on which nodes have succeeded or failed in the role of block maker. For more technical details on the design and tooling, please see here.

Building on this foundational work, it is suggested to link node rewards with node performance, with the following aims:

  • Node rewards should be dependent on the node performance:
    • Healthy nodes, being available almost always, should get full rewards.
    • Under-performing nodes should be penalized.
  • Avoid side effects on incentives (misreporting to harm other nodes):
    • It should be difficult for other nodes to influence node rewards for a particular node. A node should not directly profit from downtime of other nodes.

Trustworthy Metrics

To measure node performance, the Trustworthy Metrics evaluate how often a node contributes to protocol operations. This works as follows: In each protocol round, a node is selected to be the block maker based on the random beacon. For a node to successfully make a block, it must be up-to-date and fast enough, necessitating robust network connectivity along with adequate computing and storage resources. The Trustworthy Metrics count the number of blocks a node was tasked to make versus how many it successfully made, providing a reliable indicator of a node’s contributions to the protocol.

The metrics are deemed trustworthy because they are generated and served directly by the Internet Computer, without any intermediaries. This design prevents any single node from misrepresenting its operational status.

The block maker success rate over a specific period (e.g. a month) is defined as the ratio of the number of blocks a node successfully made to the number it was supposed to make. Conversely, the block maker failure rate is calculated as 1 - block maker success rate.
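
To make this concrete, here is a minimal Python sketch of the failure rate calculation; the counter names are illustrative and not the exact metric identifiers exposed by the protocol.

```python
# Minimal sketch: block maker failure rate over a reward period.
# The counter names are illustrative, not the exact on-chain metric identifiers.

def block_maker_failure_rate(blocks_made: int, blocks_failed: int) -> float:
    """Failure rate = 1 - success rate, over the blocks a node was tasked to make."""
    total = blocks_made + blocks_failed    # blocks the node was supposed to make
    if total == 0:
        return 0.0                         # node was never selected as block maker
    return 1.0 - blocks_made / total

# Example: a node tasked with 200 blocks that failed to make 50 of them.
print(block_maker_failure_rate(blocks_made=150, blocks_failed=50))  # 0.25
```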

Initial Analysis of Node Performance Metrics

In this section, we conduct an initial analysis of the Trustworthy Metrics, which have been collected from February 20 to April 18, 2024. Our aim is to explore potential regional differences in node performance that could stem from variations in network connectivity.

The graph below represents the average failure rate of nodes grouped by their respective data centers, sorted in descending order of performance. Note that not all data centers with low failure rates are shown.

From the graph, it is evident that most data centers exhibit very low failure rates and thus high reliability. However, a few data centers have significantly higher failure rates. Interestingly, there appears to be no consistent pattern linking failure rates to geographic locations, as data centers within the same city can perform quite differently.

Suggested Approach for Node Reward Reduction

The overall idea is to define a simple and transparent approach, linking node reward reductions to the Trustworthy Metrics in a straightforward way. Considering that the vast majority of nodes exhibit failure rates in the 0-5% range, our focus will initially be on outlier nodes with failure rates exceeding 10%. This threshold may be tightened in the future.

In more detail, this would work as follows: For every reward period of one month, the NNS determines for every node the average block maker failure rate during that period. Based on this, a performance reduction score from 0 to 100% will be applied to the monthly rewards of that node. The reduction score could be determined by a piecewise linear function connecting configurable red points, as shown in the graph below.

Example calculation:

  • A node with a failure rate up to 10% will incur no penalty.
  • A node with a failure rate of 60% will receive an 80% reduction in rewards (a code sketch follows below).
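
To make the mechanics concrete, below is a minimal Python sketch of such a piecewise-linear reduction function. The red points used here, (10%, 0%) and (60%, 80%) with the reduction capped at 80% beyond that, are illustrative values consistent with the example above; the actual points are configurable and not finalized.

```python
# Minimal sketch of a piecewise-linear reward reduction function.
# The "red points" are illustrative and configurable, not final values.
RED_POINTS = [
    (0.10, 0.0),   # failure rate <= 10%  -> no reduction
    (0.60, 0.8),   # failure rate == 60%  -> 80% reduction
    (1.00, 0.8),   # reduction capped at 80% for higher failure rates
]

def reduction_score(failure_rate: float) -> float:
    """Reward reduction (0.0 .. 1.0) for a node's monthly block maker failure rate."""
    if failure_rate <= RED_POINTS[0][0]:
        return RED_POINTS[0][1]
    for (x0, y0), (x1, y1) in zip(RED_POINTS, RED_POINTS[1:]):
        if failure_rate <= x1:
            # Linear interpolation between consecutive red points.
            return y0 + (y1 - y0) * (failure_rate - x0) / (x1 - x0)
    return RED_POINTS[-1][1]

def adjusted_reward(base_reward_xdr: float, failure_rate: float) -> float:
    return base_reward_xdr * (1.0 - reduction_score(failure_rate))

print(round(reduction_score(0.05), 2))            # 0.0 -> full rewards
print(round(reduction_score(0.60), 2))            # 0.8 -> 80% reduction
print(round(adjusted_reward(1500.0, 0.60), 2))    # 300.0 XDR
```

Note that with these illustrative points, a node with a 100% failure rate would still retain 20% of its rewards; whether that remains the case depends on the final choice of red points.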

Trustworthy Metrics can only be collected for nodes that are assigned to subnets. As a result, it is only possible to calculate the reduction score for nodes when they are part of a subnet. In the next step (still as part of this proposal), it needs to be defined how to handle rewards of nodes which are not assigned to a subnet.

Next steps

It is suggested to follow a phased approach to integrate the new node reward reduction approach:

Phase 1: Proof of Concept, Refinement and Calibration

  • The objective of this initial phase is to make all stakeholders familiar with the methodology and data underpinning the proposed approach.
  • To facilitate this understanding, a Proof of Concept has been implemented by @pietrodimarco, accessible [here]. This tool allows stakeholders to analyze the Trustworthy Metrics of nodes over time and observe the resulting node reward reduction scores.
  • We encourage community members to actively engage with this tool and provide feedback on its functionality.
  • During this phase, node rewards will not yet be impacted.
  • Towards the end of this phase, we plan to submit a motion proposal to the Network Nervous System to secure agreement on the approach.

Phase 2: Reward Reduction Implementation

In the subsequent phase, the focus will shift to implementing the automatic adjustment of rewards based on node performance directly within the NNS. The specifics of this implementation will be detailed in an additional design step.

21 Likes

Excellent! If a node works as it should, it keeps its rewards; if not, they are diminished. That is fair to those that operate well. As a secondary positive impact, decreasing the rewards will also help lower inflation.

4 Likes

Brilliant step in the right direction, I am very excited for the increased robustness this will bring to ICP.

10 Likes

Is it really planned that nodes with 100% failure rate would still receive rewards, or was this just the example graph shown here?

2 Likes

The complexity of the IC is a true marvel. This is a much-needed evolution in our step to becoming a true world computer.

Bravo!!!

3 Likes

This is just the first iteration, not the final version. To finalize we need to:

  1. reach an agreement with the node providers
  2. get an approval from the community to proceed
  3. gain confidence in the code and the reward calculation

So I’d expect the final reward calculation to be at least somewhat different, based on the input from various sides.

4 Likes

This is great :+1:
A few questions:

  1. Will the performance-based metrics only affect nodes on a subnet? I’m assuming pending nodes will not be affected by this calculation?
  2. If there are nodes that are failing, shouldn’t there also be procedures for diagnosing them and making improvements? This is tricky because IC nodes are not accessible from the NP end but only via outside monitoring tools, so the metrics would need to be used to detect the anomalies that lead to missed blocks.
2 Likes

The current design only describes the handling of nodes assigned to subnets (because only for those we collect trustworthy metrics). As mentioned above, in a next step (but still as part of this proposal) we need to define the handling of unassigned nodes.

Just to be sure: how precisely do you define a pending node? Would that be a node that is currently being onboarded?

Yes totally, further tooling on metrics would be very useful to monitor nodes. I believe that @ritvick is currently looking into this.

1 Like

To clarify: when we refer to “pending nodes,” are we talking about nodes that have been onboarded but not yet assigned to a subnet? My main concern lies with the rewards system for these unassigned nodes. While nodes assigned to subnets earn rewards based on performance, we should also consider how we handle nodes that are standing by, waiting to take over in case of failure.

Currently, ICP can handle a 50% reduction in nodes without significant issues, but this is edging closer to a potential risk zone. It’s important to remember that the operational costs (OPEX) and capital costs (CAPEX) of nodes that are not active in a subnet, even when they’re unassigned, are essentially the same as those of actively utilized nodes.

My suggestion is to find an alternative reward model (reward reduction model) for these standby nodes.

ICP (Dfinity Team) has always stood out in how it supports its node providers (NPs) and investors, offering a level of assurance and contribution that surpasses other blockchains. This commitment is one of the things I admire most about the network. Are there any ongoing discussions or plans to adjust the metrics or reward models for these standby nodes to prevent potential destabilization?

2 Likes

Thank you for clarifying, @MalithHatananchchige! The next step in this proposal involves defining how to handle rewards for nodes that are not assigned to a subnet. We are about to start this discussion, and we will use both this forum thread and the node provider working group to facilitate the conversations.

I agree with your point that we need to find a somewhat different reward model (given that the above approach relies on trustworthy metrics, which are only available for assigned nodes).

1 Like

Hi all,

Following up on the topic of performance-based node rewards, I would like to share suggestions on the handling of nodes which are not assigned to a subnet.

Background & Goals

The overall aim is to link node rewards to performance on the Internet Computer Protocol. Trustworthy metrics (block maker success rates) can be used to measure performance for nodes assigned to subnets. Unassigned nodes lack trustworthy metrics as they do not participate in the protocol.

Goals:

  • Ensure node providers are incentivized to maintain operational standards for all nodes, assigned or not.
  • Nodes which are completely offline should face higher penalties than online nodes (or even receive no rewards at all).
  • Focus should be on getting incentives right, rather than reducing the overall inflation.

Analysis - Overview of Node Health Status

As of September 26, 2024, data from the IC dashboard show a total of 1402 nodes categorized as follows:

  • UP (assigned to a subnet),
  • UNASSIGNED (online but not assigned to a subnet),
  • DEGRADED (online with at least one monitoring alert), and
  • DOWN (offline).

Initial observations:

  • 40% of nodes are assigned to subnets, with the remaining 60% being unassigned, degraded, or offline. Please note that the assignment ratio would increase with the creation of additional subnets, as recently suggested in this post.
  • In total, 1465 nodes receive rewards, which is a delta of 63 compared to the nodes in the registry/dashboard. This means that 63 nodes which are not in the registry receive rewards.
  • Almost all node providers have at least one node being assigned.

The following four charts show a breakdown of node health status by node provider, sorted by node count.

Suggested Measures

Measure #1: Unregistered nodes

Proposal:

  • Nodes not listed in the current registry cannot be assigned to subnets, contributing no value to the protocol.
  • Hence, it is suggested that nodes which are not registered at all in a given reward period should not receive rewards.
  • This could be implemented via an automated check at the time of the reward calculation (see the sketch below).
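
As referenced above, here is a minimal sketch of such a check; the input data structures are hypothetical, and the real check would read the NNS registry at reward-calculation time.

```python
# Minimal sketch of Measure #1: drop rewards for nodes absent from the registry.
# `reward_eligible_nodes` and `registered_node_ids` are hypothetical inputs.

def filter_unregistered(reward_eligible_nodes: dict[str, float],
                        registered_node_ids: set[str]) -> dict[str, float]:
    """Keep only nodes that are present in the registry for the reward period."""
    return {
        node_id: reward_xdr
        for node_id, reward_xdr in reward_eligible_nodes.items()
        if node_id in registered_node_ids
    }

rewards = {"node-a": 1500.0, "node-b": 1500.0, "node-c": 1500.0}
registry = {"node-a", "node-b"}
print(filter_unregistered(rewards, registry))  # node-c receives no rewards
```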

Impact:

Currently at least 63 nodes receive rewards but are not registered (4.3% of 1465). Assuming an average reward rate of 1.5k per node, this would lead to an overall reduction in node rewards of 94K XDR.

Measure #2: Performance extrapolation

Proposal:

  • With the current node assignment ratio of 40%, not all nodes can be assigned, but at least one or a few nodes per node provider can be.
  • If a node is assigned for only part of a reward period, it receives a score derived from the assigned period.
  • The average performance score of a provider’s (partially) assigned nodes will be applied to all their completely unassigned nodes (see the sketch after this list).
  • For providers with four or more nodes, if none of their nodes are assigned, a flat penalty of 80% (reward multiplier of 0.2) will be applied. This penalty matches the maximum penalty for assigned nodes.
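
A minimal sketch of this extrapolation, as referenced in the list above. It reuses the illustrative red points from the earlier sketch; all node data below is made up, and the handling of small providers with no assigned nodes is intentionally left open, as in the proposal.

```python
# Minimal sketch of Measure #2: extrapolate performance to unassigned nodes.
# Red points and node data are illustrative only.
from statistics import mean

FLAT_PENALTY_MULTIPLIER = 0.2   # 80% penalty, matching the max penalty for assigned nodes

def reduction_score(failure_rate: float) -> float:
    # Same illustrative red points as before: (10%, 0%), (60%, 80%), capped at 80%.
    if failure_rate <= 0.10:
        return 0.0
    return min(0.8, 0.8 * (failure_rate - 0.10) / 0.50)

def provider_multipliers(assigned_failure_rates: list[float],
                         total_nodes: int) -> tuple[list[float], float]:
    """Per-assigned-node reward multipliers and the multiplier for unassigned nodes."""
    if assigned_failure_rates:
        assigned = [1.0 - reduction_score(fr) for fr in assigned_failure_rates]
        # Unassigned nodes inherit the provider's average assigned-node multiplier.
        return assigned, mean(assigned)
    if total_nodes >= 4:
        # No node assigned at all: flat 80% penalty.
        return [], FLAT_PENALTY_MULTIPLIER
    # Providers with fewer than four nodes and none assigned: not specified here.
    return [], 1.0

# Provider with 2 assigned nodes (5% and 60% failure rate) and 4 unassigned nodes:
assigned, unassigned = provider_multipliers([0.05, 0.60], total_nodes=6)
print([round(m, 2) for m in assigned])   # [1.0, 0.2]
print(round(unassigned, 2))              # 0.6, applied to each unassigned node
```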

Impact:

The precise impact is to be estimated. Given the leniency of the reward penalty function, it is anticipated that most node providers will not be impacted.

Example Calculation:

Alternative to Measure #2: Malus/Bonus System (not suggested at the current stage)

Proposal:

  • Currently, 40% of nodes are assigned to subnets, with the remaining 60% being unassigned, degraded, or offline.
  • As a starting point for a malus/bonus system, we make the assumption that the total reward pool size should remain unchanged. Based on this, one could apply a 60% bonus for assigned nodes and a 40% malus for others to maintain the reward balance (a quick check follows below). One could also use alternatives with the same ratio, e.g. 30% and 20% (or 15% and 10%).
  • The approach could be implemented through the already proposed linear reward function for assigned nodes, allowing for a reward multiplier greater than 1 for nodes with high availability.
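
A quick check, as referenced above, that the quoted bonus/malus pairs keep the total reward pool unchanged, assuming the stated 40%/60% split between assigned and other nodes.

```python
# Quick check: with 40% of nodes assigned, a bonus b for assigned nodes and a malus m
# for the rest leave the pool unchanged whenever 0.4 * b == 0.6 * m, i.e. b = 1.5 * m.
from fractions import Fraction

assigned_share = Fraction(40, 100)

for bonus, malus in [(Fraction(60, 100), Fraction(40, 100)),
                     (Fraction(30, 100), Fraction(20, 100)),
                     (Fraction(15, 100), Fraction(10, 100))]:
    net_change = assigned_share * bonus - (1 - assigned_share) * malus
    # Prints a net pool change of +0% for every pair.
    print(f"bonus={float(bonus):.0%}, malus={float(malus):.0%} "
          f"-> net pool change {float(net_change):+.0%}")
```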

Assessment

This approach represents a broader modification to the node reward system. While it positively integrates the benefits of decentralization into reward calculations, it may also trigger numerous competing node assignment proposals due to incentives for having many nodes assigned. Therefore, while it could be beneficial, such a change might be better introduced as part of a more comprehensive overhaul at a later stage. Nevertheless, it is worth sharing this idea at this point for completeness.

Next Steps

We invite the community to provide feedback on the suggested measures #1 and #2 (and comments on the mentioned alternative to measure #2). In addition, we suggest discussing this topic in the next node provider working group meeting.

3 Likes

Yes, I am also looking forward to the next call of the node provider working group for a detailed discussion!

Just to clarify, the proposal does not involve only paying assigned nodes. A node provider would receive full rewards for all their nodes if the following conditions are met:

  • All nodes are registered.
  • At least one node is assigned.
  • Assigned nodes display good performance, characterized by a block maker failure rate of less than 10% (see the sketch after this list).
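
A minimal sketch of these conditions, as referenced in the last bullet; the data structure and threshold constant are only meant to make the three conditions explicit, not to describe the actual NNS implementation.

```python
# Minimal sketch: when would a node provider receive full rewards under this proposal?
# NodeInfo and the threshold are illustrative.
from dataclasses import dataclass
from typing import Optional

FAILURE_RATE_THRESHOLD = 0.10   # block maker failure rate below which no penalty applies

@dataclass
class NodeInfo:
    registered: bool
    assigned: bool
    failure_rate: Optional[float] = None   # only available for assigned nodes

def receives_full_rewards(nodes: list[NodeInfo]) -> bool:
    all_registered = all(n.registered for n in nodes)
    assigned = [n for n in nodes if n.assigned]
    good_performance = all(n.failure_rate is not None
                           and n.failure_rate < FAILURE_RATE_THRESHOLD
                           for n in assigned)
    return all_registered and len(assigned) >= 1 and good_performance

nodes = [NodeInfo(True, True, 0.02), NodeInfo(True, False), NodeInfo(True, False)]
print(receives_full_rewards(nodes))  # True: all registered, one assigned node performing well
```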
1 Like

@bjoernek - Just a quick question regarding Measure#2.

You propose a different multiplier for each individual node, based on that specific node’s performance, which gets applied to that node’s rewards. The issue is that the underlying rewards calculation is different for every GEN2 node of a node provider in each country.

There is no individual mapping between a specific node in the registry and the rewards associated with that specific node.

Will the penalty be applied to the average reward per node for each NP in every country they operate in?

So in the example above 10374 / 6 = 1729 per node before penalties?

1 Like

@bjoernek, to clarify: is this NNS proposal only to monitor rewards for assigned nodes? There is no change in rewards for unassigned nodes, as per the goal. Later, as you mentioned, there are 3 methods to tackle unassigned nodes in the future, which will be a subject of discussion at the next NP meeting?

The unassigned nodes play an important role beyond just being on standby; they are a vital resource for scaling ICP to meet the growing demand for application canisters. With the current usage burst on the application subnet :confetti_ball: :tada: and a request to nearly double the capacity of the application subnet, unassigned nodes are essential to absorb this growth surge, ensuring that ICP can handle increased demand smoothly and continue its upward trajectory. It’s important to have standby/unassigned nodes so that ICP is ready to cater to higher usage demands whenever they arise.

@Lerak Thank you for picking up this aspect, which I had forgotten to mention in the post. Indeed, for generation 2 node providers, the reward calculation applies a portfolio view and hence it would make most sense to apply the performance reward multiplier to an average.

Yes, I fully agree with this statement. Unassigned nodes indeed play an important role as they serve as a buffer for the growing demand. And I would add to that: in order to make sure that unassigned nodes are up and running and ready to be used when needed, we need to create the corresponding incentives.

No, this proposal contains an incentive scheme covering both assigned and unassigned nodes. See the corresponding statement in the very first post of this thread:

“Trustworthy Metrics can only be collected for nodes that are assigned to subnets. As a result, it is only possible to calculate the reduction score for nodes when they are part of a subnet. In the next step (still as part of this proposal), it needs to be defined how to handle rewards of nodes which are not assigned to a subnet.”

To illustrate this further: if a node performance scheme were applied only to assigned nodes, while unassigned nodes continued receiving full rewards, this could incentivize node providers to have only unassigned nodes. This arrangement might even encourage deliberately poor performance on assigned nodes, as underperforming nodes would be swapped out of subnets, yet still receive full rewards thereafter.

1 Like

cc: @dfisher
Thanks for replying @bjoernek :handshake: I agree with Measure #1; however, I see two issues with Measure #2 and the Alternative to Measure #2 above.

:warning: Issue #1 Assigned and unassigned nodes serve different purposes in the ecosystem.

Assigned nodes provide computing power and unassigned nodes provide on-demand scalability.

The goal of this proposal is to link rewards with performance, encouraging node providers to improve the health of their nodes and contribute to the ecosystem effectively. But we can’t use the same metrics that measure the performance of an actively computing node to measure the contribution of an unassigned node that is waiting to handle work when needed.

:warning: Issue #2 The assignment of nodes is defined by the DRE tool

Both measures are defined assuming that an unassigned node has to be penalized. Yet nodes are assigned by the DRE tool, which is outside the node provider’s control. This makes it unfair to penalize the node provider for not being assigned: regardless of whether nodes are assigned, their CAPEX and OPEX are similar, and the nodes will be decommissioned within 4 years.

Suggested approach.

Consider two types of nodes separately and come up with two different approaches to penalise the rewards.

For unassigned nodes: we have to implement new metrics to make sure the nodes are active and ready to participate in the network at any time, and penalise nodes according to those metrics. There are multiple ways this can be achieved:

  1. Heartbeat and uptime monitoring,
  2. Test Block Production in a Simulated Environment
  3. Deploy Unassigned Nodes in Testnet/Devnet Roles.

For assigned nodes: :handshake: I totally agree with using the Trustworthy Metrics for Useful Work and penalising nodes accordingly, as you mentioned in the previous post.

1 Like

You are touching on an important point. It would probably be useful for node providers to have the ability to play a more active role in node assignments. For example, if some kind of repair/maintenance is scheduled for a node, it would be great if the node provider could submit a proposal to swap out that particular node. This sounds like a good topic for the node provider working group, where we could discuss it further.

I agree that directly measuring the performance of unassigned nodes would be more precise than the extrapolation approach suggested above. However, developing a new metric for this purpose would lead to significant challenges. Ensuring its fairness and security against tampering is critical to avoid disadvantaging certain node providers. This would necessitate a consensus mechanism, which essentially brings us back to needing some form of subnet assignment in order to have trustworthy metrics.

1 Like

I totally agree. We need both assigned and unassigned nodes since they serve different purposes.

The challenge is the following.
For the assigned nodes we have “trustworthy metrics” that we obtain from the consensus which makes them trustworthy, and we can base rewards on them.

OTOH, for the unassigned nodes (nodes that are not in a subnet), there is no consensus. So either we invent a new consensus, which is a hard problem to solve, or we base rewards on something else. Let’s say we base rewards on something else.
If these rewards are fixed in amount, and say 100% of the max amount, then everyone will simply want to have nodes unassigned.
If the rewards are less than 100% of the max amount, some people will necessarily complain. My original suggestion was to make this amount say 80% of the max reward amount. But then, as you say, it’s not easy for an NP to influence whether a node will be in a subnet or not. That depends on the number of subnets, decentralization, node health, and many other factors. So losing 20% (or more) of rewards would not be seen in a positive light by some NPs.

The proposal shared by @bjoernek is a compromise: give out 100% of rewards for unassigned nodes, but only if we have good indication that unassigned nodes are healthy and can be assigned to a subnet if needed.
For instance, if a node is removed from a subnet because it is unhealthy, that is an indication that it wouldn’t be valuable in another subnet either, so it shouldn’t get full rewards. If the node is later fixed in some way, joins another subnet, earns 100% of rewards, and is then removed for decentralization or some other reason, it would still get full rewards since the node is completely healthy.

We would primarily like to achieve the following: introduce a (financial) incentive to keep unassigned nodes healthy and ready to use if needed, without overpaying for unhealthy or otherwise unusable nodes.

Any suggestions on how to achieve that with the least amount of work possible would be more than welcome!

2 Likes
  1. we were thinking about this one as well and it might be a good approach. We just need a reliable (if possible, indisputable) way to show whether a node is healthy or not. Note that a node that is “up” may still have, e.g., slow disk/CPU/network, which makes it unsuitable for actual use. So this alone wouldn’t be sufficient. We need something more, plus a consensus across multiple independent data sources.
  2. not sure how that would work, but it doesn’t sound very easy to do; how to make this data trustworthy?
  3. we don’t have a testnet/devnet at the moment, so that would be work for someone, and I haven’t found sufficient support for doing this work. Also, rewarding unassigned nodes based on some other IC deployment doesn’t sound very straightforward (e.g., who controls the nodes in the testnet/devnet? If it’s fully decentralized, then we have two ICs to manage == clearly more work, and if it’s one controlling party, then we can’t trust the data coming out of it).
2 Likes