Performance Based Node Rewards

@bjoernek @sat Thank you for replying. I see that it’s not possible to have a comprehensive consensus ready for unassigned nodes, so we can drop that one. Your reply is enough for me to understand that Measure #2 is the best option so far, and I agree given the current situation. :handshake: Good job coming up with this strategy, @bjoernek :+1: I will explain the scenario here to check whether we are on the same page.

Example: rewards calculated based on Measure #2

As per the example, there are 3 assigned nodes and 3 unassigned nodes:

  1. Assigned Nodes Performance Score:
    The performance scores of the 3 assigned nodes will be averaged. Let’s assume each assigned node has a different performance multiplier
    (e.g., Node A: 0.8, Node B: 0.6, and Node C: 0.7).
    The average score would then be
    (0.8+0.6+0.7)/3=0.7

  2. Extrapolated Score for Unassigned Nodes:
    This average performance score (0.7 in the example) will be applied to each of the 3 unassigned nodes, meaning that they will each receive a reward multiplier of 0.7 for that period.

  3. Total Reward Calculation:
    Assuming a base reward, each assigned node would receive its specific score multiplier applied to the base reward, while each unassigned node would receive the average multiplier of 0.7 (see the sketch below).
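
To make the walked-through numbers concrete, here is a minimal Python sketch of this calculation (the base reward of 1000 XDR is just an illustrative placeholder, not a real rate):

```python
# A minimal sketch of the Measure #2 calculation walked through above.
# The base reward of 1000 XDR is an illustrative placeholder.

def extrapolated_rewards(assigned_multipliers, n_unassigned, base_reward):
    """Each assigned node gets its own multiplier; every unassigned node
    gets the average multiplier of the assigned nodes."""
    avg = sum(assigned_multipliers) / len(assigned_multipliers)
    assigned = [m * base_reward for m in assigned_multipliers]
    unassigned = [avg * base_reward] * n_unassigned
    return assigned, unassigned, avg

assigned, unassigned, avg = extrapolated_rewards([0.8, 0.6, 0.7], 3, 1000.0)
print(f"{avg:.2f}")   # 0.70 -> multiplier applied to each unassigned node
print(assigned)       # [800.0, 600.0, 700.0] (approximately, up to float rounding)
print(unassigned)     # [700.0, 700.0, 700.0] (approximately)
```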

4 Likes

Thank you for reviewing & challenging the proposed approach in detail @MalithHatananchchige, very much appreciated! (And credit to @sat who actually came up with the extrapolation idea).

Regarding your worked example: The calculation is spot on. :+1:

3 Likes

@bjoernek @sat

I don’t understand what is preventing you from creating one big standby subnet where all “unassigned” nodes (today) are assigned so they can perform some consensus-verified computing task that enables collection of trustworthy metrics. This doesn’t even need to be for a short duration each month. It could be a continuous assignment until a node is needed in any of the other subnets. The purpose of this subnet would be to continuously validate the performance capability of each node and to ensure that every single node in the Internet Computer has trustworthy metrics.

If I were a node provider (perhaps some day), I would not like the risk of a penalty based on a few of my nodes being applied to all of my nodes that are unassigned in standby. I don’t see how we can argue that the trustworthy metrics of those assigned nodes represent all the unassigned nodes. There are so many things that can go wrong with any one node. When you apply performance statistics from a small data set to a larger population, you run a high risk of amplifying the error beyond what may be reality. Node providers don’t get to choose which nodes are deployed. There are some node providers with 20 - 70 nodes that may only have 2 - 4 assigned nodes. A problem that arises with one of those assigned nodes would have devastating consequences. Or perhaps there is a smaller node provider who only has 5 - 10 nodes, but obtaining those nodes required a loan that is still being paid off. If only 1 - 2 nodes are assigned and then 1 goes down, the penalty applied to all of their nodes could put that node provider out of business even though they have 4 - 9 nodes that are working perfectly fine.

It seems like a much more appropriate approach would be to create a standby subnet where the currently unassigned nodes are assigned to perform some sort of consensus task that is representative of how they would perform if they were assigned to a real subnet. I don’t really understand why we have unassigned nodes at all, especially when 60% of all nodes are unassigned yet still receiving rewards. I understand the desire to have optionality for scaling, but can’t that be accomplished by taking nodes from the standby subnet when any subnet size is increased or new subnets are created? Is there anything preventing the standby subnet from scaling consensus based on the number of nodes assigned to it? I would argue that nobody cares what the latency is for the standby subnet, so it could get very long since the subnet would have so many nodes. Regardless, all nodes would have trustworthy metrics and would be getting paid for performing work full time.

It seems there would be much less risk for node providers if you created a standby subnet and then applied a very straightforward penalty to the trustworthy metrics that are generated for every node individually. DFINITY could then think more aggressively about how to adjust penalties so underperforming nodes are penalized heavily…perhaps a node reward reduction should be 100% if its failure rate is above 25%, decreasing linearly to a 0% reward reduction at a 0% failure rate. Surely that would still incentivize a node provider to address issues quickly without unfairly penalizing them for poor performance based on unvalidated, extrapolated data from a small data set.
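
To illustrate the kind of curve I have in mind, here is a minimal sketch (the 25% cutoff is just my example figure above, not a proposed constant):

```python
# Straightforward per-node penalty: 0% reward reduction at a 0% failure
# rate, rising linearly to a 100% reduction at a 25% failure rate.
# The 25% cutoff is illustrative, not a proposed constant.

def reward_multiplier(failure_rate: float, cutoff: float = 0.25) -> float:
    """Fraction of the base reward a node keeps, given its failure rate."""
    if failure_rate >= cutoff:
        return 0.0                       # fully penalized at or above the cutoff
    return 1.0 - failure_rate / cutoff   # linear reduction in between

print(reward_multiplier(0.00))  # 1.0 -> full rewards
print(reward_multiplier(0.10))  # 0.6 -> 40% reduction
print(reward_multiplier(0.30))  # 0.0 -> no rewards
```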

3 Likes

Hi @wpb, the idea of creating an additional subnet to evaluate the performance of unassigned nodes is valid & interesting and has been discussed before. However, several concerns were raised during those discussions:

  1. Artificial Load: This subnet would require artificial load to generate useful metrics, which raises the question of how this load should be created and managed.
  2. Increased Proposal Management: Introducing an additional subnet could increase the frequency of node swapping proposals (both in and out), requiring enhancements of the tooling used for this purpose.
  3. Recent Network Expansion: The recent NNS decision to add 20 additional subnets has increased the number of nodes that can be assigned, reducing the urgency to manage unassigned nodes in a more sophisticated way.

In summary, while the idea has merits, it would introduce significant additional effort. As the need for such a system diminishes over time, the question is whether the benefits outweigh the costs.

cc: @sat & @pietrodimarco in case there are additional arguments that I might have missed.

4 Likes

Hi all, here are brief minutes from the discussion on performance based node rewards in yesterday’s node provider working group:

The overall approach for the handling of performance based node rewards was deemed adequate. However, there was a discussion on a couple of edge cases, which require further fine-tuning:

For the onboarding of new node providers, it was proposed to implement a grace period of one or two months, during which no potential performance based reductions in rewards would be applied.

It was clarified that for Generation 2 node providers, a reduction coefficient is applied to the average reward rate of nodes of a node provider within the given geographic region. For more details, refer to the tool source code linked [here].

We discussed potential enhancements to the methodology used when multiple nodes within a subnet show degraded performance, potentially due to issues beyond the control of node providers. One suggestion was to analyze the difference between the average (or median) performance of all nodes in a subnet and that of an individual node. This topic requires further investigation.

It was also discussed that within any given reward period, node performance metrics should only be considered valid if trustworthy data for those nodes have been collected for at least a few days (e.g., five days).

It was discussed whether an artificial subnet could be created specifically for measuring the performance of previously unassigned nodes. However, this approach would entail significant operational work. It was noted that the recent approval to onboard 20 additional subnets will help to reduce the number of unassigned nodes.

1 Like

I agree, but DFINITY is full of smart engineers who I’m sure can come up with a reasonable artificial load to run a standby subnet. This doesn’t strike me as a reasonable argument against creating a standby subnet given that the proposal is to severely penalize a potentially large population of unassigned nodes for each node provider based on performance of a small data set of their nodes that the DRE tooling has selected.

I agree, but there are now people and organizations such as CodeGov (@ZackDS @timk11 @LaCosta) and @Lorimer who are receiving grants to review these proposals. They are helping with this workload. Unfortunately, the grant size seems too small already for the workload and this would make it worse, but the grants for voting neurons are still experimental and will also need to be right sized in the future (IMO). I would argue that more people should be allowed to perform these proposal reviews as well. Indeed the tooling would also need to be enhanced. So while I agree all of these things are true, it still doesn’t strike me as a reasonable argument against creating a standby subnet. I would rather see penalties for unassigned nodes delayed until these things can be addressed.

Currently 60% of all nodes are unassigned. Once these additional 20 subnets are created, what percentage of the nodes will still be unassigned? More importantly, is it likely that individual node providers will end up with more than 25% of their nodes still unassigned? If so, then I still think it would be unfair to penalize unassigned nodes based on performance of the assigned nodes. It would be applying performance statistics to a large population of nodes that cannot be verifiably proven to be underperformers.

In my opinion, yes, the benefits outweigh the costs. I fully agree with the idea of performance based node rewards. If you can verifiably prove that a node is underperforming, then it should be heavily penalized. I just don’t agree with extrapolating that penalty to nodes that you can’t prove are underperformers.

This is important to me today because I hope to become a node provider some day. In fact, I hope that anyone who is actively participating in governance activities on technical proposals, building critical infrastructure for the protocol, or developing significant apps for the ecosystem that have made it through an SNS will some day have an opportunity to become a node provider, since it can be a source of sustainable income in a small growing ecosystem. In all cases, I can imagine these being smaller node providers who are required to launch their nodes in locations remote from their home (due to future decentralized geographic topology targets), which strikes me as carrying a higher risk that issues beyond their immediate control could really impact remuneration. Hence, even though I’m not currently a node provider, the decisions made on this topic could have significant consequences for future node providers. I just think it’s worth slowing down on the application of penalties to unassigned nodes until the infrastructure can be put in place to base those penalties on actual performance metrics for those nodes.

I’m not sure if you are saying that there was agreement on the overall approach inclusive of the application of penalties to unassigned nodes, but it didn’t seem to me like there was agreement on this detail. I’m not sure the issues brought up are really just edge cases. The node providers present seemed quite concerned to me, which is why they presented so many examples. It seems this issue will affect many node providers, and in many cases it could randomly have a big impact even on historically good performers.

Is it possible to simulate the proposed penalty (inclusive of the extrapolation of the assigned node penalty to unassigned nodes) for all node providers for the last 3 months to see how each individual node provider would have been impacted each month? Perhaps going through that exercise would reveal in hard dollar terms how each node provider would be impacted. I agree that node providers did not put up any resistance to penalties for nodes where trustworthy metrics exist. The issue is just whether or not it is fair to extrapolate those penalties to unassigned nodes where trustworthy metrics don’t exist.

Anyway, I’m not a node provider today, so I definitely yield to the opinions expressed by other node providers. I very much agree with performance based node rewards. I would just rather see performance metrics actually collected for all nodes that will be penalized instead of making unverified assumptions about performance of unassigned nodes. If that means significant operational work to make it happen, then so be it. DFINITY doesn’t have to take on all the burden when node providers are already incentivized to perform some of the required tasks and community reviewers can be properly incentivized to help as well.

1 Like

Oh, this is surely technically possible. However, the cost-benefit is unclear and I’m not convinced it’s worthwhile.
Here is what needs to be done, in more detail:

  • Create such a big evaluation subnet and create artificial long running workloads. Short-running workloads would not create sufficient impact on the trustworthy metrics since we only keep track of daily stats.
  • Creating artificial long-running workloads that exercise a) computation, b) network, c) number of canisters, and d) storage is not a trivial task. Surely can be done, but is not trivial.
  • If we don’t exercise all dimensions, then node performance in this evaluation subnet will not be representative of the real performance the nodes would have in a real subnet.
  • Next, assuming that we have this big evaluation subnet, we also need to think about the decentralization of the subnet. We currently don’t have a way to make cycles untrusted - cycles are cycles in the current implementation, regardless of where they come from. So if this subnet is not sufficiently decentralized then it could theoretically mint an arbitrary number of cycles and send them to other subnets. This can also be prevented with work/development, but we don’t have support for “untrusted subnets”.
  • Next, we need to move nodes between the big evaluation subnet and the useful subnets. We currently only have tooling for changing subnet membership of a single subnet. So we would have to submit 2 proposals: 1) remove the nodes that we want to add to a real subnet, and 2) add the nodes. This means 2x the proposals, 2x the work for everyone, and 2x the delay in topology management. Unless we want to change the tooling. Which would be … you guessed it: development
  • Finally, although perhaps a bit less important but still… somewhat important: servers under load consume more electricity than servers without load. Roughly 2x, observed on some of the existing nodes. Without a good reason to consume that, it’s a bit wasteful.

So yes, it could be done. It’s cost vs benefit.
I don’t see a fundamental problem with extrapolation, although I admit it’s not a perfect representation of the performance of all nodes in a particular DC.

This actually goes both ways:

  1. DC with “bad” (underperforming) nodes, where some of these bad nodes go into idle subnets and show up as “good” nodes
  2. DC with “good” nodes, where some of these good nodes go into extremely busy subnets and show up as “bad” nodes

(other combinations of “bad nodes to busy subnets” and “good nodes to idle subnets” can be seen as non-concerning, so do not need to be discussed)

So, arguably, an NP can be unlucky and land in the 2nd group, or lucky and land in the 1st group. But statistically these two should cancel each other out, so in three of the four combinations (75% of cases) an NP gets appropriate or better rewards. Not ideal, but not too bad either.
Additionally, this is also something that the NP can improve upon. Every NP can submit proposals to add their nodes to subnets (the tooling is open source), and if that improves decentralization, the community should accept these proposals. So by having nodes in preferable locations or with other preferable characteristics (such as stability), NPs can increase the number of nodes in subnets. And with this, reduce the risk of getting incorrect performance extrapolation.

Sorry for this long post, but it seems like you care about the topic so I wanted to share my view in detail. Thanks for reading this far and hope to get great feedback (as usual from you).

3 Likes

Also, I do expect that in the medium term the number of unassigned nodes will go down to under 20% of all nodes. I see little benefit in having a higher percentage of unassigned/unused nodes.
Can we get there today? Yes - we can create more subnets. There is actually an adopted motion proposal to create up to 20 more app subnets. That’s 20 * 13 nodes = 260 nodes. Not that much, but why do we need to stop at 20? Unless the IC stops growing or even starts shrinking. However, in that case we have bigger problems - right?

Also FWIW, the reason we didn’t create more subnets earlier is because we have no sensible way ATM to delete subnets (canisters would have to be moved between subnets or we need subnet merging and we don’t have support for either of these yet). So we have been conservative about creating new subnets if there is no strong reason to. We will still proceed slowly with subnet creation, but there is active planning for work on canister migration. So I’d expect things to look quite different in the future.

3 Likes

I personally think @wpb has some very good points. I very much liked the sound of a standby subnet, and also very much appreciated @sat’s detailed breakdown of the challenges.

I wonder if two birds can be killed with one stone here. Another problem that exists is the rigidity of the IC Target Topology. I believe it would be much more useful to have a target topology expressed as a set of tolerances (optimal configuration on one end of the tolerance, and the worst allowed on the other). Technically, if there are nodes sitting around doing nothing, they could be contributing to better decentralisation metrics for subnets (perhaps prioritising the most critical subnets, such as the NNS). Whenever a node is down in one subnet, a healthy node could be swapped out of another active subnet to replace it. This would reduce the number of nodes in the donor subnet, but that’s fine if it’s operating on the preferable end of the target topology tolerances.

Updating the tooling to make more efficient use of nodes would help the IC get more out of the nodes available. This will surely provide more options for mitigating growing pains as IC popularity continues to increase.

My understanding is that subnet membership mostly comes down to a registry entry. Do you really need to swap a node out of one subnet first before you can swap it into another?

2 Likes

Hey @sat. Thank you for your responses. They are very helpful. I started to respond to all your points. However, I changed my mind by the time I got around to running a calculation on how the penalty would apply to one of the existing node providers. That exercise made me realize that the extrapolation may not be so bad in practice, due to the reduction factor averaging that is proposed and the fact that all node providers already have enough deployments with trustworthy metrics to average out one or two bad performing nodes. The fact that 20 new subnets are already planned completely eliminates my concern about the extrapolation proposal. Since you said you were hoping to hear my feedback to your comments, I went ahead and preserved them in the summary below. If you read them, please note that I’m not as concerned as originally stated.

Original feedback summary

From my perspective, a standby subnet should be designed to produce the trustworthy metrics data. So whatever that takes would be fine.

Why would a standby subnet with 300 (20%) to 900 (60%) nodes not be sufficiently decentralized? Would this subnet really need to be decentralized geographically? Otherwise, aren’t there plenty of node providers, data centers, etc. represented among the nodes operating in this subnet to offer reasonable decentralization? It was good enough for the first several years, so why wouldn’t it be good enough for the next several years? Also, wouldn’t the only computation being performed come from DFINITY as part of the artificial workload? I trust DFINITY not to mint cycles and send them to other subnets.

Node providers would be motivated to submit these proposals themselves. If a standby subnet exists, then a new rule could be created that any node that is not in a subnet will no longer receive remuneration. Hence, they would be self motivated to learn how to submit these proposals. On the review side, people like @Lorimer, the CodeGov team (@ZackDS @timk11 @LaCosta), and many others who applied to become reviewers for the Subnet Management proposal topic for the Grants for Voting Neurons would likely be happy to review all these proposals with proper incentives. It would be an awesome way to get more people involved.

Ironically, I consider this to be the most valid reason for not creating a standby subnet. I highly appreciate the focus that DFINITY has always put on energy efficiency. However, we are talking about major hard dollar penalties to node providers for things they cannot easily control. What should be a relatively small penalty for one node will become very big penalties due to the amplification effect of extrapolating a small data set to a large population based on the assumption that the same percentage of nodes will perform poorly when one or two perform poorly.

I’ll try to apply the penalty proposed here to an example from the Node Status by node_provider_name charts provided by @bjoernek earlier in this forum thread. Looking at 87m Neuron, LLC, it appears that they had one node DOWN. According to the dashboard, they own 48 nodes and only 6 of them are deployed in subnets today. Let’s assume these 6 nodes were UP at the time the data in the chart was collected. If the node that was DOWN was out for 2 weeks, then it would receive roughly a 65% reward reduction. Since the nodes owned by this node provider are all located in data centers in the US (not California), the rewards they will receive when these penalties go into effect will be 1004 XDR per node per month. Hence, the penalty for that one node is approx 653 XDR. However, extrapolating the penalty to the 41 unassigned nodes means that the reduction factor applied to unassigned nodes is r = ((1 - 0.65) + 1.0 * 6) / 7 ≈ 0.907. This means instead of losing just 653 XDR, they will actually lose 653 + (1 - 0.907) * 1004 * 41 ≈ 4482 XDR, which is approx a $5827 USD penalty. So in this case, they would only receive 90.7% of their max remuneration because of the poor performance of 1 node. Having 7 nodes in subnets to average out the poor performer definitely helps. I see that this averaging effect greatly diminishes the impact of one or two bad performing nodes for all node providers. So perhaps this extrapolation is not too bad. I’m convinced. I stand corrected.
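
For anyone who wants to check the arithmetic, here is the same calculation as a small Python sketch (all inputs are the assumptions stated above, read off the charts, not authoritative figures):

```python
# Reproducing the 87m Neuron, LLC example above. All figures are as assumed
# in this post: 48 nodes total, 6 assigned and UP, 1 assigned node DOWN for
# ~2 weeks (a 65% reduction), 41 unassigned, 1004 XDR per node per month.

base_xdr     = 1004
n_up         = 6        # assigned nodes at full rewards
down_mult    = 0.35     # the DOWN node keeps 35% after its 65% reduction
n_unassigned = 41
n_total      = 48

# Average multiplier over the 7 nodes with trustworthy metrics, which is
# then extrapolated to every unassigned node.
r = (n_up * 1.0 + down_mult) / (n_up + 1)                    # ~0.907

direct_penalty       = (1 - down_mult) * base_xdr            # ~653 XDR
extrapolated_penalty = (1 - r) * base_xdr * n_unassigned     # ~3822 XDR
total_penalty        = direct_penalty + extrapolated_penalty # ~4475 XDR
# (the ~4482 XDR in the text comes from using the rounded r = 0.907)

share_received = 1 - total_penalty / (n_total * base_xdr)
print(f"r = {r:.3f}")                        # r = 0.907
print(f"penalty = {total_penalty:.0f} XDR")  # penalty = 4475 XDR
print(f"received = {share_received:.1%}")    # received = 90.7%
```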

Personally, I would like to see us maximize the use of nodes that are getting paid. Hence, I would advocate for creating more than 20 new subnets.

If DFINITY expresses an interest in this idea, I would love to see you expand on it further. I think of the target topology as a goal, not necessarily a rigid requirement. Hence, anything that enables flexibility such as your tolerances idea seem very reasonable to me. It would be cool to see you have an opportunity to further scope and/or implement tooling along these lines.

1 Like

Fantastic ideas in this topic - I love it!
For the standby subnet, I think that in tests we only went as far as having 150 nodes in a subnet. Beyond that there were some technical limitations due to storing the subnet configuration in the registry. I’ll check with the engineers whether that’s still the case. So we’d have to create a few such subnets. But it could still work.

@Lorimer for swapping nodes across subnets, there is a registry configuration that would have to be changed in a different way (I’d estimate that at 2-4 weeks of work, including testing etc.), and there is old subnet state on the node that will still be there when the node joins the new subnet. I would expect this to “just work” with the existing code, but it’s hard to say if we would hit some weird edge case with large subnets, threshold keys, or who knows what. Right now there is a long delay between a node leaving a subnet and joining a new one. With the other proposal and the new type of registry changes, the node would change subnets immediately, while it still has the state of the old subnet.

Anyway, I’ll ask around to double check the viability. I wouldn’t say it’s impossible, but it seems to me like a lot more work than crunching some numbers for the extrapolation approach. In fact that work was completed in a few days. We can also add some additional tolerance if we see that it helps to make results more acceptable. And if we can’t get it to work, we can always go the artificial subnet way as a fallback. The nice thing is that the approaches are compatible.

2 Likes

Following up on the discussion in the November Node Provider Working Group, we have been analyzing how we can address the impact of high subnet load and protocol changes on node performance. Specifically, based on the discussion in the meeting, we have explored how to differentiate systematic failure rates (affecting all nodes in a subnet) from idiosyncratic failure rates (specific to individual nodes).

Subnet Failure Rate Analysis

Subnet failure rates (FR) have been analyzed since April 2024:

  • We compare the median and 75th percentile of the daily failure rate of nodes within the subnets w4rem and fuqsr for the time period April to November 2024.
  • Most of the time these measures are at very low levels, i.e., below 2%.
  • In October, we observed a systematic increase in the subnet failure rate for all investigated subnets. This is reflected in an increased median and 75th percentile.


Suggested Methodology for the Determination of the Node Failure Rate

Following the discussions in the Node Provider Working Group and the analysis presented, we recommend distinguishing between systematic and idiosyncratic node failure rates. We propose that only the idiosyncratic component of node failure rates should influence reward multipliers. This means that a node would be penalized only if its performance significantly deviates from the performance of its peer nodes within the same subnet. This approach can be detailed as follows:

Systematic Failure Rate

  • Calculate the 75th percentile failure rate daily for each subnet to account for systematic factors such as protocol changes or high subnet load.
  • This provides a mapping: DAY -> SUBNET_ID -> SYSTEMATIC_FR

Idiosyncratic Failure Rate

  • To isolate the idiosyncratic failure rate for a node, compute the difference between a node’s daily failure rate and the subnet’s systematic failure rate.
  • Apply flooring to avoid negative values: Idiosyncratic Failure Rate = max(0, Node Failure Rate - Systematic Failure Rate)
  • Only the idiosyncratic failure rate is then used as an input for the calculation of the reward multiplier.
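
As a rough sketch of this computation (the data layout and helper names here are illustrative, assuming daily per-node failure rates expressed as fractions):

```python
# Sketch of the systematic/idiosyncratic split described above.
import statistics

def systematic_fr(daily_subnet_frs: list[float]) -> float:
    """Per-subnet, per-day systematic failure rate: the 75th percentile
    of the daily failure rates of all nodes in the subnet."""
    return statistics.quantiles(daily_subnet_frs, n=4)[2]

def idiosyncratic_fr(node_fr: float, subnet_fr: float) -> float:
    """Node-specific component, floored at zero so nodes beating the
    subnet-wide rate are not credited with a negative failure rate."""
    return max(0.0, node_fr - subnet_fr)

# A day on which the whole subnet struggled (e.g. protocol change or load),
# except one node that did markedly worse than its peers:
daily_frs = [0.10, 0.11, 0.11, 0.12, 0.12, 0.13, 0.36]
sys_fr = systematic_fr(daily_frs)                     # 0.13
print(round(idiosyncratic_fr(0.12, sys_fr), 2))       # 0.0  -> in line with peers
print(round(idiosyncratic_fr(0.36, sys_fr), 2))       # 0.23 -> node-specific issue
```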

Example Calculation:

Example 1:

This example shows a node on the k44fs subnet whose rewards would have been adjusted under the prior approach, which considered absolute node failure rates. However, since the failure rate is systematic, no adjustment is applied.

Considering both systematic and idiosyncratic components, the failure rate is 12.89%, which corresponds to a rewards multiplier of 96.6%.

Rewards XDR calculation:

  • Base monthly rewards XDR: 1584
  • The idiosyncratic failure rate for a node in a reward period is computed by averaging the daily idiosyncratic failure rates.
    In this example the idiosyncratic failure rate is 3.02%. The node will be rewarded fully without adjustments.

Example 2:

This example shows a node on the w4rem subnet whose rewards are adjusted significantly since the failure rate is idiosyncratic, i.e., it exceeds the 75th percentile of the failure rates of the subnet.

Considering both systematic and idiosyncratic components, the failure rate is 36.02%, which corresponds to a rewards multiplier of 70.2%.

Rewards XDR calculation:

  • Base monthly rewards XDR: 2157.25
  • The idiosyncratic failure rate is 35.14% which corresponds to a rewards multiplier of 71.2%.
  • In this example, the systematic component is minimal, as most of the other nodes in the subnet have performed well during the period.
  • The node is rewarded 2157.25 * 71.2% = 1536 XDR.
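
For concreteness, a tiny sketch of the Example 2 arithmetic (the reward curve that maps the 35.14% idiosyncratic failure rate to the 71.2% multiplier is part of the proposal and not reproduced here; the stated multiplier is simply plugged in, and the daily values are hypothetical):

```python
def period_idiosyncratic_fr(daily_frs):
    """Average the daily idiosyncratic failure rates over the reward period."""
    return sum(daily_frs) / len(daily_frs)

# Hypothetical daily values averaging to ~35%, for illustration only:
print(f"{period_idiosyncratic_fr([0.40, 0.35, 0.30]):.2%}")  # 35.00%

# Example 2 figures from the post: a 35.14% period idiosyncratic failure
# rate corresponds, via the reward curve, to a 71.2% multiplier.
base_xdr, multiplier = 2157.25, 0.712
print(round(base_xdr * multiplier))  # 1536 XDR
```
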
8 Likes

Performance Extrapolation for Unassigned and Partially Assigned Nodes

Another topic presented during the Nov. meeting was the methodology for calculating rewards for nodes that are entirely unassigned and those that are partially unassigned during a given reward period. The current proposed algorithm for partially assigned nodes extrapolates their performance during active periods to estimate their performance during unassigned periods. For fully unassigned nodes, the algorithm instead extrapolates the performance of assigned nodes (fully and partially assigned) to estimate the performance of those that are entirely unassigned. One artifact of this methodology is the example node provided at the following link: Node Example by @tina23 .

In this case, the node was assigned to a subnet for two consecutive days before being removed on the third day due to a high failure rate. This resulted in a low reward multiplier, as the failure rate during the assigned period was extrapolated to the unassigned period under the current methodology. Consequently, the node’s low performance during the assigned period adversely impacted its rewards.

One concern raised about this methodology is that it may unfairly penalize nodes that didn’t have sufficient time to recover. For instance, a node might have been healthy during the unassigned period, yet its extrapolated performance suggests otherwise, resulting in it being assigned a high failure rate and a low reward multiplier.

To address this issue, one proposed solution was to establish a minimum number of assigned days required for a node to be eligible for evaluation. For example, if a threshold of five assigned days was implemented, nodes assigned for less than five days would not be evaluated individually. Instead, their performance would be extrapolated using the performance data of other nodes from the same node provider that met the threshold. While this approach appears more equitable, it introduces potential risks. A node provider could focus on maintaining only a small subset of nodes in optimal condition while neglecting others. Poorly maintained nodes could then be promptly unassigned, allowing their rewards to be calculated based on the well-performing subset.

This strategy could undermine the broader goal of incentivizing overall node health and ensuring the blockchain’s stability, where all nodes are expected to operate at a high standard.

Extrapolate based on all nodes’ performance

A compromise is to avoid extrapolating the performance of a node during its “assigned period” to its “unassigned period.” Instead, the node’s unassigned period performance could be calculated using the average performance of all active nodes during that period.

Applying this to the example above:

The node provider had 16 nodes assigned during the relevant period, 15 of which exhibited very good performance, while the node in question performed poorly. The average failure rate across these nodes can be calculated as:

Avg. FR = ((15 × ~0) + 0.41) / 16 ≈ 0.025

Using this average, the node’s overall failure rate for the rewarding period would be:

FR(ognrk) = ((0.41 × 3) + (0.025 × 28)) / 31 ≈ 6.2%

With a failure rate of 6.2%, the node would have achieved a 100% reward multiplier.
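
Sketched in Python with the same figures (keeping the post’s ~0 failure rate for the 15 healthy nodes and its rounding of the average to 0.025):

```python
# Compromise extrapolation from the example: the node's unassigned days are
# filled with the average failure rate of all of the provider's active nodes.

bad_fr = 0.41                      # the poorly performing node, 3 assigned days
avg_fr = (15 * 0.0 + bad_fr) / 16  # ≈ 0.0256; the post rounds this to 0.025
avg_fr = 0.025                     # use the post's rounded figure

assigned_days, period_days = 3, 31
overall_fr = (bad_fr * assigned_days
              + avg_fr * (period_days - assigned_days)) / period_days
print(f"{overall_fr:.1%}")         # 6.2% -> 100% reward multiplier
```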

5 Likes

Personally, I think there are numerous reasons to avoid extrapolating. There are good reasons to know the characteristics of each node you’re aiming to add to a subnet (not just the current state, but the recent history and performance details, which are only recorded for nodes that belong to a subnet). See proposals 134486 and 134491 as example cases.

Hey @sat, it’s great to hear that this is being considered. Has there been much movement on this?

1 Like

As explained in one of the earlier posts in this thread, going with a “big subnet” approach, or a few big subnets to be more precise, would likely end up with more work than doing the extrapolation. So my take would be to try out the extrapolation first, and if that doesn’t work well enough we can always switch from extrapolation to the “big subnet(s)” approach. Would that be reasonable @Lorimer ?

1 Like

Hi Pietro, thank you for this and apologies for the late reply. I agree in principle with this approach. However, I would still think that at least a minimum number of assigned days should be required for a node to be evaluated at all. Personally I think it would be very hard to “game” the system in such a way as to purposely not maintain certain nodes. There could also be a “watchlist” of nodes where such instances occur (bad performance followed by quick removal from a subnet), to see if there is a pattern that would suggest intent.

I think extrapolation is intended just for calculating rewards, isn’t it? I’m just thinking that there are other good reasons to know how unassigned nodes will perform when you add them to a subnet.

Take the proposals I referenced above. If you try to look at how the unassigned nodes have been performing, this is what you get:

Does extrapolating tell you much about whether the node is fit to add to a subnet? Presumably the node above wasn’t offline when a proposal was raised to add it to a subnet; however, it probably would have been considered degraded (but as I understand it, the node would need to be part of a subnet to measure this).

1 Like

I hope you don’t mind all the questions @sat. This is an interesting topic, :upside_down_face: and obviously an important one.

Further to my comments above, how are scenarios like this currently accounted for? →

:point_up_2: I think extrapolating would only make sense if there were no selection pressure on the nodes that belong to subnets. But evidently there is (making the selected nodes unrepresentative of those that haven’t been selected).

Am I missing something?

Not at all!

Not necessarily. We’re basically hoping, or counting on, the spatial autocorrelation / Moran’s I holding, where nodes in the same DC should behave the same way. But I agree - it’s certainly not something that is absolutely true in all cases. For instance, if someone buys nodes in several batches, they will likely have different failure rates. However, most people will buy nodes for the same DC from the same supplier, and in a single batch. And the nodes would share the same power supply, cooling, internet uplink, etc. So I would expect the nodes in a single DC (from a single operator) to have similar reliability. But you’re right, it’s not strong proof.

However, if we added some nodes into an idle subnet, with almost no compute/storage/network requirements, that would be a very bad indicator of how the same node would behave if added into a very busy subnet with heavy compute & storage & network pressure.

I’m not following that question, sorry. Extrapolation wouldn’t work if someone has 1-2 nodes, as this NP currently has. I don’t see how this (spatial autocorrelation) could be done if there is nothing to correlate to. We could do temporal autocorrelation, possibly, by assuming that the nodes would behave in the future similar to how they performed in the past. But that has its downsides as well.

Note that this particular problem we’ve seen here wouldn’t be solved if we used the “big evaluation subnet” approach, since the node experienced a hardware issue, and that could happen at any time regardless of whether a node is in a subnet or not. So I’d rather concentrate on cases like the Seoul vs Hong Kong DCs, both managed by the same NP and behaving drastically differently.

1 Like

I agree. This is why I’d prefer to see this done (for more reasons than one) →

… in other words, add the idle nodes to live subnets that could benefit from the added decentralisation (such as system subnets). The NPs are getting paid either way, so the costs are there regardless.

This would also make it much easier for the IC network as a whole to grow (due to flexible tolerances that voters can refer to).


My point regarding extrapolating based on a selected sample is simply that the sample could be cherry-picked, given that the NP can influence which nodes are selected for inclusion in subnets. Does that make sense? In my opinion, that makes extrapolating inappropriate.

1 Like