@LaCosta @Lorimer on Oct 10 I had to pause ingestion of new metrics in the node provider rewards dashboard for debugging, which is why you were not able to see the most recent data.
Now it is up and running again.
@Lorimer Here is the Germany node you mentioned. Note the performance degradation right before the proposal went out.
That explains why checking the 2 new faulty nodes with NPR returned no values at all. The dashboard now actually reports 3.
Right, since the dashboard seems to be useful for these analyses, I’ll ensure it remains more stable going forward, even though it’s still in development.
A replacement proposal is underway. Please stand by.
Update:
Decentralization Nakamoto coefficient changes for subnet `lhg73-sax6z-2zank-6oer2-575lz-zgbxx-ptudx-5korm-fy7we-kh4hl-pqe`:
node_provider: 5.00 -> 5.00 (+0%)
data_center: 5.00 -> 5.00 (+0%)
data_center_owner: 5.00 -> 5.00 (+0%)
area: 5.00 -> 5.00 (+0%)
country: 5.00 -> 5.00 (+0%)
**Mean Nakamoto comparison:** 5.00 -> 5.00 (+0%)
Overall replacement impact: equal decentralization across all features
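For readers wondering where the 5.00 values and the "2.0000 to 2.3219" step come from, here is a minimal sketch. It assumes the Nakamoto coefficient is the smallest number of distinct actors that together hold more than one third of the subnet's nodes; that threshold and the function name `nakamoto_coefficient` are assumptions of this sketch, not confirmed by the tool output above.

```python
from collections import Counter
from math import log2

def nakamoto_coefficient(values):
    """Smallest number of distinct actors that together hold more than
    one third of the nodes (assumed threshold, not confirmed by the tool)."""
    counts = sorted(Counter(values).values(), reverse=True)
    total = sum(counts)
    held = 0
    for actors, count in enumerate(counts, start=1):
        held += count
        if 3 * held > total:  # strictly more than one third
            return actors
    return len(counts)

# Hypothetical labels: 13 nodes, each with a distinct node provider.
providers = [f"np-{i}" for i in range(13)]
nc = nakamoto_coefficient(providers)
print(nc)                  # 5
print(round(log2(nc), 4))  # 2.3219
```

Under the same assumption, 11 nodes with all-distinct values give a coefficient of 4 (log2 = 2.0) and 12 give 5, which is consistent with the step reported for the second added node.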
# Details
Nodes removed:
- `ddbl6-37efl-b75e4-jpfsb-zioa6-ilvzo-tldwy-fnbhm-nbuoy-66cza-uqe` [health: degraded, impact on decentralization: health: degraded]
- `ffsue-5rmb7-frfqk-gvfpg-gu2bo-udoa3-zputb-kexzk-gd667-fit5k-rae` [health: degraded, impact on decentralization: health: degraded]
- `rs26k-ldffa-z7lqu-rhjps-tumxa-arwvb-mdyye-mhevv-4dpjm-xo72t-mae` [health: dead, impact on decentralization: health: dead]
Nodes added:
- `x3woi-nzeuf-w5pan-yzwf6-7fgdn-hxhuk-st6rv-6lzam-akknw-ipvze-4ae` [health: healthy, impact on decentralization: (gets better) the number of different NP actors increases from 10 to 11]
- `uzy2p-nvzre-vqaov-qj7vj-bznwn-mllrr-t25mo-lbody-dgyf4-o5rj7-sqe` [health: healthy, impact on decentralization: (gets better) the average log2 of Nakamoto Coefficients across all features increases from 2.0000 to 2.3219]
- `um4so-2axl2-74745-lhyhn-afhgh-yog7e-pcsso-mxxho-7x5ev-x65uz-eqe` [health: healthy, impact on decentralization: (gets better) the number of different NP actors increases from 12 to 13]
node_provider | data_center | data_center_owner | area | country |
---|---|---|---|---|
3oqw6-vmpk2-mlwlx-52z5x-e3p7u-fjlcw-yxc34-lf2zq-6ub2f-v63hk-lae 1 | an1 1 | Africa Data Centres 1 | Bogota 1 | AU 1 |
3siog-htc6j-ed3wz-sguhu-2objz-g5qct-npoma-t3wwt-bd6wy-chwsi-4ae 1 -> 0 | bc1 1 -> 0 | Cyxtera 1 -> 0 | British Columbia 1 -> 0 | BE 1 |
4dibr-2alzr-h6kva-bvwn2-yqgsl-o577t-od46o-v275p-a2zov-tcw4f-eae 1 | bg1 1 | Data Inn 0 -> 1 | California 0 -> 1 | CA 1 -> 0 |
6nbcy-kprg6-ax3db-kh3cz-7jllk-oceyh-jznhs-riguq-fvk6z-6tsds-rqe 1 | hk1 1 | Datacenter United 1 | Flanders 1 | CO 1 |
7at4h-nhtvt-a4s55-jigss-wr2ha-ysxkn-e6w7x-7ggnm-qd3d5-ry66r-cae 1 -> 0 | jb2 1 | Digital Realty 1 | Florida 1 -> 0 | EE 1 |
bvcsg-3od6r-jnydw-eysln-aql7w-td5zn-ay5m6-sibd2-jzojt-anwag-mqe 1 | lj1 0 -> 1 | EdgeUno 1 | Gauteng 1 | FR 1 |
fwnmn-zn7yt-5jaia-fkxlr-dzwyu-keguq-npfxq-mc72w-exeae-n5thj-oae 1 | mr1 1 | Equinix 1 -> 0 | HongKong 1 | HK 1 |
nmdd6-rouxw-55leh-wcbkn-kejit-njvje-p4s6e-v64d3-nlbjb-vipul-mae 1 | nm1 1 | Flexential 1 -> 0 | Ljubljana 0 -> 1 | IN 1 |
otzuu-dldzs-avvu2-qwowd-hdj73-aocy7-lacgi-carzj-m6f2r-ffluy-fae 1 | sc1 1 | INAP 0 -> 1 | Navi Mumbai 1 | JP 1 -> 0 |
r3yjn-kthmg-pfgmb-2fngg-5c7d7-t6kqg-wi37r-j7gy6-iee64-kjdja-jae 1 | sg2 1 | Megazone Cloud 1 | Provence-Alpes-Cote d'Azur 1 | KR 1 |
rbn2y-6vfsb-gv35j-4cyvy-pzbdu-e5aum-jzjg6-5b4n5-vuguf-ycubq-zae 1 | sj1 0 -> 1 | NEXTDC 1 | Queensland 1 | LT 0 -> 1 |
sixix-2nyqd-t2k2v-vlsyz-dssko-ls4hl-hyij4-y7mdp-ja6cj-nsmpf-yae 1 -> 0 | sl1 1 | Posita.si 0 -> 1 | Seoul 1 | SG 1 |
ulyfm-vkxtj-o42dg-e4nam-l4tzf-37wci-ggntw-4ma7y-d267g-ywxi6-iae 1 | ta2 1 | Rivram 1 | Singapore 1 | SI 0 -> 1 |
vdzyg-amckj-thvl5-bsn52-2elzd-drgii-ryh4c-izba3-xaehb-sohtd-aae 0 -> 1 | tp1 1 -> 0 | Telia DC 1 | Tallinn 1 | US 1 |
vegae-c4chr-aetfj-7gzuh-c23sx-u2paz-vmvbn-bcage-pu7lu-mptnn-eqe 0 -> 1 | ty1 1 -> 0 | Telin 1 | Tokyo 1 -> 0 | ZA 1 |
wdjjk-blh44-lxm74-ojj43-rvgf4-j5rie-nm6xs-xvnuv-j3ptn-25t4v-6ae 0 -> 1 | vl2 0 -> 1 | Unicom 1 | Vilnius 0 -> 1 | |
Forum post link: https://forum.dfinity.org/t/subnet-management-lhg73-application/34055/45
Payload: ChangeSubnetMembershipPayload {
subnet_id: lhg73-sax6z-2zank-6oer2-575lz-zgbxx-ptudx-5korm-fy7we-kh4hl-pqe,
node_ids_add: [
x3woi-nzeuf-w5pan-yzwf6-7fgdn-hxhuk-st6rv-6lzam-akknw-ipvze-4ae,
uzy2p-nvzre-vqaov-qj7vj-bznwn-mllrr-t25mo-lbody-dgyf4-o5rj7-sqe,
um4so-2axl2-74745-lhyhn-afhgh-yog7e-pcsso-mxxho-7x5ev-x65uz-eqe,
],
node_ids_remove: [
ddbl6-37efl-b75e4-jpfsb-zioa6-ilvzo-tldwy-fnbhm-nbuoy-66cza-uqe,
ffsue-5rmb7-frfqk-gvfpg-gu2bo-udoa3-zputb-kexzk-gd667-fit5k-rae,
rs26k-ldffa-z7lqu-rhjps-tumxa-arwvb-mdyye-mhevv-4dpjm-xo72t-mae,
],
}
Hi @ZackDS @Lorimer @timk11 @LaCosta, we have another 3 degraded nodes on the lhg73 subnet. We’re looking into the cause and why it does not occur on other subnets. For now, we think it might be related to a kernel bug that is triggered by the large replica state of this subnet. Could I ask you to review the new proposal 133436? We aim to get the nodes replaced before the weekend.
Proposal 133436
3 degraded/down nodes replaced with unassigned nodes in the US, Slovenia and Lithuania. The only obvious impact on decentralisation is a slight reduction in the average geographic distance between nodes, but that’s a tiny downside. Looks good, I’ve voted to adopt.
Decentralisation Stats
Subnet node distance stats (distance between any 2 nodes in the subnet) →
| | Smallest Distance | Average Distance | Largest Distance |
---|---|---|---|
EXISTING | 317.676 km | 9071.275 km | 18505.029 km |
PROPOSED | 317.676 km | 8547.967 km (-5.8%) | 18505.029 km |
This proposal slightly reduces decentralisation, considered purely in terms of geographic distance (and therefore there’s a slight theoretical reduction in localised disaster resilience).
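As a rough illustration of how distance stats like these can be produced, the sketch below computes pairwise great-circle distances with the haversine formula and the percentage change between two averages. The function names, the Earth radius constant and the idea that node coordinates are (latitude, longitude) pairs are assumptions of this sketch; the actual tooling behind the table may differ.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))

def distance_stats(coords):
    """Smallest, average and largest distance over all pairs of nodes."""
    d = [haversine_km(a, b) for a, b in combinations(coords, 2)]
    return min(d), sum(d) / len(d), max(d)

def percent_change(old, new):
    return (new - old) / old * 100

# Average distances taken from the table above.
print(round(percent_change(9071.275, 8547.967), 1))  # -5.8
```

`distance_stats` expects at least two coordinates; with real node locations it would return the three figures shown per row.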
Subnet characteristic counts →
| | Continents | Countries | Data Centers | Owners | Node Providers |
---|---|---|---|---|---|
EXISTING | 5 | 13 | 13 | 13 | 13 |
PROPOSED | 5 | 13 | 13 | 13 | 13 |
Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.) →
| | Continent | Country | Data Center | Owner | Node Provider |
---|---|---|---|---|---|
EXISTING | 5 | 1 | 1 | 1 | 1 |
PROPOSED | 5 | 1 | 1 | 1 | 1 |
See here for acceptable limits → Motion 132136
The above subnet information is illustrated below, followed by a node reference table:
Map Description
- Red marker represents a removed node (transparent center for overlap visibility)
- Green marker represents an added node
- Blue marker represents an unchanged node
- Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
- Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)
Table
Known Neurons to follow if you're too busy to keep on top of things like this
If you found this analysis helpful and would like to follow the vote of the LORIMER known neuron in the future, consider configuring LORIMER as a followee for the Subnet Management topic.
Other good neurons to follow:
- Synapse (follows the LORIMER and CodeGov known neurons for Subnet Management, and is a generally well informed known neuron to follow on numerous other topics)
- CodeGov (actively reviews and votes on Subnet Management proposals, and is well informed on numerous other technical topics)
- WaterNeuron (the WaterNeuron DAO frequently discuss proposals like this in order to vote responsibly based on DAO consensus)
Voted to adopt proposal 133436.
This proposal replaces 3 nodes identified as dead or degraded in subnet lhg73 and does so without any negative impacts on Nakamoto coefficients or target topology.
The decentralization tool shows the current and to-be-added nodes in this subnet like so:
The nodes to be replaced are:
- ddbl6 - shown on the IC Dashboard as “Status: Offline”. In the Node Provider Rewards tool it appears to be working reasonably well, with a block failure rate of about 2% on the last day shown and a lower rate for several days prior.
- ffsue - shown on the IC Dashboard as “Status: Degraded / Status Details: IC_Replica_Behind”. It appears to be functioning poorly with a block failure rate ranging from 5% to 16% over the last 5 days.
- rs26k - shown on the IC Dashboard as “Status: Offline”. This was one of the nodes added a few days ago in proposal 133404. It appears only to have been assigned for one day (or less) with a block failure rate just under 4%.
Additionally, node ixo23 appears in the decentralization tool as “DOWN” and in the dashboard as “Status: Offline” but is not mentioned in this proposal. It appears to have been functioning well until yesterday (17 Oct), when it experienced a block failure rate of 38%.
I’ve voted to adopt this proposal as it replaces at least 2 poorly functioning nodes while maintaining decentralisation targets, and there is clearly an issue with this subnet that needs ongoing and prompt maintenance.
Questions: What rate of block failure is considered sufficiently bad to warrant replacing a node? From my understanding so far, I would probably have replaced ffsue, rs26k and ixo23 and left ddbl6 alone. Am I missing anything? What tools is Dfinity using to make this decision and which are the key metrics? In the event that there are only 9 functioning nodes in a 13-node subnet, does consensus now only need 7 nodes to agree instead of 9?
Voted to adopt proposal 133436.
The proposal is correct and everything matches: the description, the motivation, the degraded/offline nodes and the healthy replacement node IDs.
Meanwhile, while the proposal is live, it looks like another node has failed.
So the question now is how fast nodes can be replaced, and will they all fail eventually?
At least until a hotfix is pushed if the above is the case.
@ZackDS yes, the node provider in Korea contacted us, as there was a problem with one of their node machines in the Seoul DC. Coincidentally this is also on the lhg73 subnet, as all other nodes in this DC seem to be running okay. We will replace this node as part of the regular dre-heal proposals that are submitted later today (Friday).
So thanks for adopting the previous proposal. As you can see, a subnet can handle 4 degraded nodes, but that is not a state that should persist for long.
The team is still looking into whether it was just a coincidence or whether it is related to a bug in the kernel/Ubuntu. Will follow up when we know more.
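The figure of 4 tolerable degraded nodes on a 13-node subnet is what the usual Byzantine fault tolerance bound n ≥ 3f + 1 would give. The sketch below simply works through that arithmetic; whether the protocol applies the bound to the subnet's membership size in exactly this way is an assumption here, and the function names are illustrative only.

```python
def max_faulty(n):
    """Largest f that satisfies n >= 3f + 1."""
    return (n - 1) // 3

def agreement_threshold(n):
    """Nodes that must agree if up to max_faulty(n) nodes may be faulty."""
    return n - max_faulty(n)

n = 13  # subnet size discussed in this thread
print(max_faulty(n))           # 4
print(agreement_threshold(n))  # 9
```

Under this assumption the threshold is derived from the subnet size of 13, not from the number of nodes that happen to be healthy at a given moment.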
Thank you for the update, much appreciated.
Regarding @timk11's question, correct me if my assumption is wrong:
Given that, to quote from NPR (Node Provider Rewards), “For failure rates ≤ 10%, there is no reduction in rewards.”, I assume that ~10% is acceptable and doesn’t require a node replacement. Beyond that, is there any other factor that is used when choosing to replace a node? One that I would think is important is the availability of another healthy one on standby that keeps decentralization at least the same. Thanks.
I think you refer to this forum thread for the 10% boundary. The 10% is chosen mainly from the perspective of node rewards, and how much average downtime during the reward period, e.g. a month, would be reasonable. It is currently not (yet) related to the actual replacement of nodes.
For replacement of nodes, DRE normally runs the so-called dre-heal check every Friday to replace degraded nodes, and these nodes would then be swapped on Monday. So in the worst case, if a node “would die” on a Saturday, it would be replaced by the Monday of the following week, and so be offline for about 10 days. On a monthly basis this would be approximately 10/30, or 33%, unavailability, which is technically allowed, but for the part above 10% there would be a reward reduction (if that eventually gets implemented).
It would be good to have some criteria for when to replace unhealthy nodes. We now have:
- running the dre-heal every week on Friday, resulting in dead nodes being swapped on Monday
- if the situation requires it, e.g. when there are 3 or more unhealthy nodes in a subnet, an urgent NNS proposal is submitted.
But we definitely could add some other criteria.
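The worst-case downtime arithmetic above can be sketched as follows. The Friday check and Monday swap come from the post; the helper names, the 30-day reward period default and the example date are assumptions made only for illustration. Calendar arithmetic for a Saturday failure gives about nine days, close to the rough figure of 10 days quoted above.

```python
from datetime import date, timedelta

MONDAY, FRIDAY = 0, 4  # datetime.date.weekday() numbering

def next_weekday(day, weekday):
    """First date strictly after `day` that falls on `weekday`."""
    ahead = (weekday - day.weekday() - 1) % 7 + 1
    return day + timedelta(days=ahead)

def swap_date(failure_day):
    """Monday swap that follows the first Friday heal check after a failure."""
    return next_weekday(next_weekday(failure_day, FRIDAY), MONDAY)

def unavailability(days_offline, period_days=30):
    """Fraction of the reward period a node was offline."""
    return days_offline / period_days

failure = date(2024, 10, 19)  # a Saturday, chosen only as an example
print((swap_date(failure) - failure).days)   # 9
print(round(unavailability(10), 3))          # 0.333, the 10/30 figure quoted above
print(round(unavailability(10) - 0.10, 3))   # 0.233 above the 10% boundary
```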
Thanks @SvenF for giving some further detail on the process that’s being followed. Would you, @dmanu or other team members be able to answer these questions from my earlier post?
Sorry for the delay. I have voted to adopt proposal 133436 that replaces three nodes on subnet lhg73:
- node ddbl6-37efl-b75e4-jpfsb-zioa6-ilvzo-tldwy-fnbhm-nbuoy-66cza-uqe (Dashboard Status: Offline),
- node ffsue-5rmb7-frfqk-gvfpg-gu2bo-udoa3-zputb-kexzk-gd667-fit5k-rae (Dashboard Status: Awaiting as of this moment but was previously degraded; the Node Provider Rewards tool shows an AFR for this node of 21% over the past 6 days, so it’s fair to say it was performing poorly),
- node rs26k-ldffa-z7lqu-rhjps-tumxa-arwvb-mdyye-mhevv-4dpjm-xo72t-mae (Dashboard Status: Offline).
The Nakamoto Coefficient remains the same, as verified by the Dre tool.
Proposal 133446
1 down node in South Korea replaced with a node in Switzerland. Decentralisation is slightly reduced in some respects, but not on any important metrics. I’ve voted to adopt.
Decentralisation Stats
Subnet node distance stats (distance between any 2 nodes in the subnet) →
| | Smallest Distance | Average Distance | Largest Distance |
---|---|---|---|
EXISTING | 317.676 km | 8547.967 km | 18505.029 km |
PROPOSED | 317.676 km | 8280.444 km (-3.1%) | 18505.029 km |
This proposal slightly reduces decentralisation, considered purely in terms of geographic distance (and therefore there’s a slight theoretical reduction in localised disaster resilience).
Subnet characteristic counts →
| | Continents | Countries | Data Centers | Owners | Node Providers |
---|---|---|---|---|---|
EXISTING | 5 | 13 | 13 | 13 | 13 |
PROPOSED | 5 | 13 | 13 | 13 | 13 |
Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.) →
| | Continent | Country | Data Center | Owner | Node Provider |
---|---|---|---|---|---|
EXISTING | 5 | 1 | 1 | 1 | 1 |
PROPOSED | 6 (+20%) | 1 | 1 | 1 | 1 |
Slightly worse situation regarding clustering in a single continent.
See here for acceptable limits → Motion 132136
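The two count tables above (distinct values per characteristic, and the largest group sharing one value) can be derived from per-node attributes roughly as follows. The continent assignment used here is hypothetical and was chosen only so that the numbers line up with the 5 -> 6 change in the table; it is not taken from the proposal, and `characteristic_summary` is an illustrative name.

```python
from collections import Counter

def characteristic_summary(nodes, key):
    """Return (number of distinct values, largest group size) for one attribute."""
    counts = Counter(node[key] for node in nodes)
    return len(counts), max(counts.values())

# Hypothetical continent assignment for 13 nodes (not taken from the proposal).
existing = (
    [{"continent": "Europe"}] * 5
    + [{"continent": "Asia"}] * 4
    + [{"continent": "North America"}] * 2
    + [{"continent": "Africa"}, {"continent": "Oceania"}]
)
# Swap one Asian node for a European one, as in a Seoul -> Geneva replacement.
proposed = existing[:5] + [{"continent": "Europe"}] + existing[6:]

print(characteristic_summary(existing, "continent"))  # (5, 5)
print(characteristic_summary(proposed, "continent"))  # (5, 6)
```

The same function applied to a country, data center, owner or node provider key would give the other columns.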
The above subnet information is illustrated below, followed by a node reference table:
Map Description
- Red marker represents a removed node (transparent center for overlap visibility)
- Green marker represents an added node
- Blue marker represents an unchanged node
- Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
- Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)
Table
Known Neurons to follow if you're too busy to keep on top of things like this
If you found this analysis helpful and would like to follow the vote of the LORIMER known neuron in the future, consider configuring LORIMER as a followee for the Subnet Management topic.
Other good neurons to follow:
- Synapse (follows the LORIMER and CodeGov known neurons for Subnet Management, and is a generally well informed known neuron to follow on numerous other topics)
- CodeGov (actively reviews and votes on Subnet Management proposals, and is well informed on numerous other technical topics)
- WaterNeuron (the WaterNeuron DAO frequently discuss proposals like this in order to vote responsibly based on DAO consensus)
Voted to adopt proposal 133446.
Archery Blockchain SCSp's c3xxv node from Geneva is set to replace the offline one from Seoul1, thus adding to the number of different node providers.
With this underway, there seems to be a new issue, this time with 56ovz.
Voted to adopt proposal 133446.
This proposal replaces node ixo23, which was noted as a dead node in my earlier post. Decentralisation parameters are unchanged and remain within the requirements of the target topology.
Voted to adopt proposal 133446. The proposal replaces the dead node ixo23 (Dashboard Status: Offline) with the node c3xxv on subnet lhg73. There is no impact on the Nakamoto Coefficients, as verified through the Dre tool.