Subnet Management - lhg73 (Application)

@LaCosta @Lorimer On Oct 10 I had to pause ingestion of new metrics into the node provider rewards dashboard for debugging, which is why you were not able to see the most recent data.
Now it is up and running again.
@Lorimer Here is the Germany node you mentioned. Note the performance degradation right before the proposal went out.

4 Likes

That explains why checking the 2 new faulty nodes with NPR returned no value at all. Actually, the dashboard now reports 3.

2 Likes

Right, since the dashboard seems to be useful for these analyses, I’ll ensure it remains more stable going forward, even though it’s still in development.

3 Likes

Cross posting this message on subnet-node-issues-performance-degradation-on-subnet-lhg73

3 Likes

A replacement proposal is underway. Please stand by.

Update:

Proposal has been submitted:

Decentralization Nakamoto coefficient changes for subnet `lhg73-sax6z-2zank-6oer2-575lz-zgbxx-ptudx-5korm-fy7we-kh4hl-pqe`:
- node_provider: 5.00 -> 5.00 (+0%)
- data_center: 5.00 -> 5.00 (+0%)
- data_center_owner: 5.00 -> 5.00 (+0%)
- area: 5.00 -> 5.00 (+0%)
- country: 5.00 -> 5.00 (+0%)


**Mean Nakamoto comparison:** 5.00 -> 5.00  (+0%)

Overall replacement impact: equal decentralization across all features
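As a rough illustration of how a coefficient of 5.00 can arise for a 13-node subnet in which every node has a distinct value for a feature, here is a minimal sketch. It assumes the Nakamoto coefficient is the smallest number of distinct entities that together control strictly more than one third of the nodes. That threshold, the function name `nakamoto_coefficient` and the placeholder provider names are assumptions made for illustration and are not taken from the tool that produced the output above.

```python
from collections import Counter
from math import log2


def nakamoto_coefficient(values):
    """Smallest number of entities jointly holding more than 1/3 of the nodes.

    The one-third threshold is an assumption made for this sketch.
    """
    # Largest holders first, so the fewest entities are needed.
    counts = sorted(Counter(values).values(), reverse=True)
    total = sum(counts)
    controlled = 0
    for k, count in enumerate(counts, start=1):
        controlled += count
        if 3 * controlled > total:  # strictly more than one third
            return k
    return len(counts)


# Hypothetical subnet: 13 nodes, each with a different node provider.
thirteen_distinct = [f"provider-{i}" for i in range(13)]
coeff = nakamoto_coefficient(thirteen_distinct)
print(coeff, round(log2(coeff), 4))  # 5 2.3219
```

Under this assumption, log2(5) ≈ 2.3219 and log2(4) = 2.0000, which is consistent with the "average log2 of Nakamoto Coefficients" figures quoted in the node details of this proposal.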

# Details


Nodes removed:
- `ddbl6-37efl-b75e4-jpfsb-zioa6-ilvzo-tldwy-fnbhm-nbuoy-66cza-uqe` [health: degraded, impact on decentralization: health: degraded]
- `ffsue-5rmb7-frfqk-gvfpg-gu2bo-udoa3-zputb-kexzk-gd667-fit5k-rae` [health: degraded, impact on decentralization: health: degraded]
- `rs26k-ldffa-z7lqu-rhjps-tumxa-arwvb-mdyye-mhevv-4dpjm-xo72t-mae` [health: dead, impact on decentralization: health: dead]

Nodes added:
- `x3woi-nzeuf-w5pan-yzwf6-7fgdn-hxhuk-st6rv-6lzam-akknw-ipvze-4ae` [health: healthy, impact on decentralization: (gets better) the number of different NP actors increases from 10 to 11]
- `uzy2p-nvzre-vqaov-qj7vj-bznwn-mllrr-t25mo-lbody-dgyf4-o5rj7-sqe` [health: healthy, impact on decentralization: (gets better) the average log2 of Nakamoto Coefficients across all features increases from 2.0000 to 2.3219]
- `um4so-2axl2-74745-lhyhn-afhgh-yog7e-pcsso-mxxho-7x5ev-x65uz-eqe` [health: healthy, impact on decentralization: (gets better) the number of different NP actors increases from 12 to 13]


node_provider                                                              data_center            data_center_owner              area                                  country        
-------------                                                              -----------            -----------------              ----                                  -------        
3oqw6-vmpk2-mlwlx-52z5x-e3p7u-fjlcw-yxc34-lf2zq-6ub2f-v63hk-lae       1    an1               1    Africa Data Centres       1    Bogota                           1    AU            1
3siog-htc6j-ed3wz-sguhu-2objz-g5qct-npoma-t3wwt-bd6wy-chwsi-4ae  1 -> 0    bc1          1 -> 0    Cyxtera              1 -> 0    British Columbia            1 -> 0    BE            1
4dibr-2alzr-h6kva-bvwn2-yqgsl-o577t-od46o-v275p-a2zov-tcw4f-eae       1    bg1               1    Data Inn             0 -> 1    California                  0 -> 1    CA       1 -> 0
6nbcy-kprg6-ax3db-kh3cz-7jllk-oceyh-jznhs-riguq-fvk6z-6tsds-rqe       1    hk1               1    Datacenter United         1    Flanders                         1    CO            1
7at4h-nhtvt-a4s55-jigss-wr2ha-ysxkn-e6w7x-7ggnm-qd3d5-ry66r-cae  1 -> 0    jb2               1    Digital Realty            1    Florida                     1 -> 0    EE            1
bvcsg-3od6r-jnydw-eysln-aql7w-td5zn-ay5m6-sibd2-jzojt-anwag-mqe       1    lj1          0 -> 1    EdgeUno                   1    Gauteng                          1    FR            1
fwnmn-zn7yt-5jaia-fkxlr-dzwyu-keguq-npfxq-mc72w-exeae-n5thj-oae       1    mr1               1    Equinix              1 -> 0    HongKong                         1    HK            1
nmdd6-rouxw-55leh-wcbkn-kejit-njvje-p4s6e-v64d3-nlbjb-vipul-mae       1    nm1               1    Flexential           1 -> 0    Ljubljana                   0 -> 1    IN            1
otzuu-dldzs-avvu2-qwowd-hdj73-aocy7-lacgi-carzj-m6f2r-ffluy-fae       1    sc1               1    INAP                 0 -> 1    Navi Mumbai                      1    JP       1 -> 0
r3yjn-kthmg-pfgmb-2fngg-5c7d7-t6kqg-wi37r-j7gy6-iee64-kjdja-jae       1    sg2               1    Megazone Cloud            1    Provence-Alpes-Cote d'Azur       1    KR            1
rbn2y-6vfsb-gv35j-4cyvy-pzbdu-e5aum-jzjg6-5b4n5-vuguf-ycubq-zae       1    sj1          0 -> 1    NEXTDC                    1    Queensland                       1    LT       0 -> 1
sixix-2nyqd-t2k2v-vlsyz-dssko-ls4hl-hyij4-y7mdp-ja6cj-nsmpf-yae  1 -> 0    sl1               1    Posita.si            0 -> 1    Seoul                            1    SG            1
ulyfm-vkxtj-o42dg-e4nam-l4tzf-37wci-ggntw-4ma7y-d267g-ywxi6-iae       1    ta2               1    Rivram                    1    Singapore                        1    SI       0 -> 1
vdzyg-amckj-thvl5-bsn52-2elzd-drgii-ryh4c-izba3-xaehb-sohtd-aae  0 -> 1    tp1          1 -> 0    Telia DC                  1    Tallinn                          1    US            1
vegae-c4chr-aetfj-7gzuh-c23sx-u2paz-vmvbn-bcage-pu7lu-mptnn-eqe  0 -> 1    ty1          1 -> 0    Telin                     1    Tokyo                       1 -> 0    ZA            1
wdjjk-blh44-lxm74-ojj43-rvgf4-j5rie-nm6xs-xvnuv-j3ptn-25t4v-6ae  0 -> 1    vl2          0 -> 1    Unicom                    1    Vilnius                     0 -> 1                   



Forum post link: https://forum.dfinity.org/t/subnet-management-lhg73-application/34055/45


Payload: ChangeSubnetMembershipPayload {
    subnet_id: lhg73-sax6z-2zank-6oer2-575lz-zgbxx-ptudx-5korm-fy7we-kh4hl-pqe,
    node_ids_add: [
        x3woi-nzeuf-w5pan-yzwf6-7fgdn-hxhuk-st6rv-6lzam-akknw-ipvze-4ae,
        uzy2p-nvzre-vqaov-qj7vj-bznwn-mllrr-t25mo-lbody-dgyf4-o5rj7-sqe,
        um4so-2axl2-74745-lhyhn-afhgh-yog7e-pcsso-mxxho-7x5ev-x65uz-eqe,
    ],
    node_ids_remove: [
        ddbl6-37efl-b75e4-jpfsb-zioa6-ilvzo-tldwy-fnbhm-nbuoy-66cza-uqe,
        ffsue-5rmb7-frfqk-gvfpg-gu2bo-udoa3-zputb-kexzk-gd667-fit5k-rae,
        rs26k-ldffa-z7lqu-rhjps-tumxa-arwvb-mdyye-mhevv-4dpjm-xo72t-mae,
    ],
}
2 Likes

Hi @ZackDS @Lorimer @timk11 @LaCosta, another 3 nodes have degraded on the lhg73 subnet. We’re looking into the cause, and into why it does not occur on other subnets. For now, we think it might be related to a kernel bug that is triggered by the large replica state of this subnet. Could I ask you to review the new proposal 133436? We aim to get the nodes replaced before the weekend.

4 Likes

Proposal 133436

3 degraded/down nodes replaced with unassigned nodes in the US, Slovenia and Lithuania. The only obvious impact on decentralisation is a slight reduction in the average geographic distance between nodes, which is a tiny downside. Looks good; I’ve voted to adopt.

Decentralisation Stats

Subnet node distance stats (distance between any 2 nodes in the subnet) →

| | Smallest Distance | Average Distance | Largest Distance |
|---|---|---|---|
| EXISTING | 317.676 km | 9071.275 km | 18505.029 km |
| PROPOSED | 317.676 km | 8547.967 km (-5.8%) | 18505.029 km |

This proposal slightly reduces decentralisation, considered purely in terms of geographic distance (and therefore there’s a slight theoretical reduction in localised disaster resilience).
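For readers who want to reproduce this kind of figure, here is a minimal sketch of pairwise distance statistics. It assumes great-circle (haversine) distances on a sphere with the mean Earth radius; the method actually used for the numbers above is not stated in this thread, and the coordinates in the usage example are hypothetical placeholders rather than real node locations.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))


def distance_stats(coords):
    """Return (smallest, average, largest) distance over all node pairs."""
    d = [haversine_km(a, b) for a, b in combinations(coords, 2)]
    return min(d), sum(d) / len(d), max(d)


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Hypothetical coordinates, for illustration only.
smallest, average, largest = distance_stats([(0.0, 0.0), (10.0, 20.0), (-30.0, 150.0)])

# The percentage quoted in the table can be checked directly.
print(round(pct_change(9071.275, 8547.967), 1))  # -5.8
```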

Subnet characteristic counts →

| | Continents | Countries | Data Centers | Owners | Node Providers |
|---|---|---|---|---|---|
| EXISTING | 5 | 13 | 13 | 13 | 13 |
| PROPOSED | 5 | 13 | 13 | 13 | 13 |

Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.) →

| | Continent | Country | Data Center | Owner | Node Provider |
|---|---|---|---|---|---|
| EXISTING | 5 | 1 | 1 | 1 | 1 |
| PROPOSED | 5 | 1 | 1 | 1 | 1 |
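The two kinds of count above (number of distinct values, and size of the largest group sharing one value) can be sketched as follows. The continent lists are tallied by hand from the node reference table in this post, and the helper name `distinct_and_largest` is made up for this example.

```python
from collections import Counter


def distinct_and_largest(values):
    """Return (number of distinct values, size of the largest group)."""
    counts = Counter(values)
    return len(counts), max(counts.values())


# Continents of the 13 nodes before and after the proposal,
# tallied from the node reference table in this post.
existing = ["Asia"] * 5 + ["Americas"] * 3 + ["Europe"] * 3 + ["Oceania", "Africa"]
proposed = ["Europe"] * 5 + ["Asia"] * 4 + ["Americas"] * 2 + ["Oceania", "Africa"]

print(distinct_and_largest(existing))  # (5, 5)
print(distinct_and_largest(proposed))  # (5, 5)
```

The same helper applies to any other characteristic (country, data center, owner, node provider) given one value per node.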

See here for acceptable limits → Motion 132136

The above subnet information is illustrated below, followed by a node reference table:

Map Description
  • Red marker represents a removed node (transparent center for overlap visibility)
  • Green marker represents an added node
  • Blue marker represents an unchanged node
  • Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
  • Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)

Table
| | Continent | Country | Data Center | Owner | Node Provider | Node | Status |
|---|---|---|---|---|---|---|---|
| --- | Americas | Canada | Vancouver (bc1) | Cyxtera | Blockchain Development Labs | ddbl6-37efl-b75e4-jpfsb-zioa6-ilvzo-tldwy-fnbhm-nbuoy-66cza-uqe | DEGRADED |
| --- | Asia | Japan | Tokyo (ty1) | Equinix | Starbase | rs26k-ldffa-z7lqu-rhjps-tumxa-arwvb-mdyye-mhevv-4dpjm-xo72t-mae | DOWN |
| --- | Americas | United States of America (the) | Tampa (tp1) | Flexential | Mika Properties, LLC | ffsue-5rmb7-frfqk-gvfpg-gu2bo-udoa3-zputb-kexzk-gd667-fit5k-rae | DEGRADED |
| +++ | Europe | Lithuania | Vilnius 2 (vl2) | Data Inn | George Bassadone | um4so-2axl2-74745-lhyhn-afhgh-yog7e-pcsso-mxxho-7x5ev-x65uz-eqe | UNASSIGNED |
| +++ | Europe | Slovenia | Ljubljana (lj1) | Posita.si | Fractal Labs AG | uzy2p-nvzre-vqaov-qj7vj-bznwn-mllrr-t25mo-lbody-dgyf4-o5rj7-sqe | UNASSIGNED |
| +++ | Americas | United States of America (the) | San Jose (sj1) | INAP | Mary Ren | x3woi-nzeuf-w5pan-yzwf6-7fgdn-hxhuk-st6rv-6lzam-akknw-ipvze-4ae | UNASSIGNED |
| | Oceania | Australia | Queensland 1 (sc1) | NEXTDC | ANYPOINT PTY LTD | 56ovz-lrvyd-gggsl-qtenl-uuokx-p7t3t-rg6mc-6lc5l-usfqb-fygiv-aqe | UP |
| | Europe | Belgium | Antwerp (an1) | Datacenter United | Allusion | mihvd-umv3j-cjsl2-bfsdu-td7aw-2y6if-aw4fn-cghkm-v2oxd-kj75q-cae | UP |
| | Asia | China | HongKong 1 (hk1) | Unicom | Pindar Technology Limited | r4642-fjh73-ltnlt-qhdcr-kfujl-v3up4-5pnci-o7zqi-set7a-pvi3p-5qe | UP |
| | Americas | Costa Rica | Bogota 1 (bg1) | EdgeUno | Geeta Kalwani | ihttm-45oz5-an5mg-i2jtb-fayst-s47j6-vmuwr-fqotf-mp2il-n5s5x-cae | UP |
| | Europe | Germany | Marseille (mr1) | Digital Realty | DFINITY Operations SA | vlbpb-szwgu-jqul3-kpfje-ozgkp-jdwz4-f2hhq-6w6tf-e2lrl-4bmad-xae | UP |
| | Europe | Estonia | Tallinn 2 (ta2) | Telia DC | Vladyslav Popov | bptaj-nejw4-osqqa-zwrej-ysl2o-5ffgj-hkjr6-2w6fi-jczex-vjutw-iae | UP |
| | Asia | India | Navi Mumbai 1 (nm1) | Rivram | Rivram Inc | pdo46-iehoo-x2gfu-t5qu5-y3e64-cdymo-eioop-h6f4a-zebwa-fenb4-xae | UP |
| | Asia | Korea (the Republic of) | Seoul 1 (sl1) | Megazone Cloud | Neptune Partners | ixo23-jxvux-ktqca-bje7d-py56s-yvjy5-zpxrk-fmlxt-zhuhg-wu5bc-wqe | UP |
| | Asia | Singapore | Singapore 2 (sg2) | Telin | OneSixtyTwo Digital Capital | cpywp-n4j5f-ja44p-oykxm-umz7h-fk6v2-rowix-bkwc4-ly4fw-tvu6c-mae | UP |
| | Africa | South Africa | Gauteng 2 (jb2) | Africa Data Centres | Honeycomb Capital (Pty) Ltd | 5v4on-bsceg-rdgxe-zcqqf-l5wnq-fpxw7-x3ktj-3x4fs-o2cny-uzhor-vqe | UP |

Known Neurons to follow if you're too busy to keep on top of things like this

If you found this analysis helpful and would like to follow the vote of the LORIMER known neuron in the future, consider configuring LORIMER as a followee for the Subnet Management topic.

Other good neurons to follow:

  • Synapse (follows the LORIMER and CodeGov known neurons for Subnet Management, and is a generally well informed known neuron to follow on numerous other topics)

  • CodeGov (actively reviews and votes on Subnet Management proposals, and is well informed on numerous other technical topics)

  • WaterNeuron (the WaterNeuron DAO frequently discuss proposals like this in order to vote responsibly based on DAO consensus)

2 Likes

Voted to adopt proposal 133436.

This proposal replaces 3 nodes identified as dead or degraded in subnet lhg73 and does so without any negative impacts on Nakamoto coefficients or target topology.

The decentralization tool shows the current and to-be-added nodes in this subnet like so:

The nodes to be replaced are:

  • ddbl6 - shown on the IC Dashboard as “Status: Offline”. In the Node Provider Rewards tool it appears to be working reasonably well, with a block failure rate of about 2% on the last day shown and a lower rate for several days prior.

  • ffsue - shown on the IC Dashboard as “Status: Degraded / Status Details: IC_Replica_Behind”. It appears to be functioning poorly with a block failure rate ranging from 5% to 16% over the last 5 days.

  • rs26k - shown on the IC Dashboard as “Status: Offline”. This was one of the nodes added a few days ago in proposal 133404. It appears only to have been assigned for one day (or less) with a block failure rate just under 4%.

Additionally, node ixo23 appears in the decentralization tool as “DOWN” and in the dashboard as “Status: Offline” but is not mentioned in this proposal. It appears to have been functioning well until yesterday (17 Oct), when it experienced a block failure rate of 38%.

I’ve voted to adopt this proposal as it replaces at least 2 poorly functioning nodes while maintaining decentralisation targets, and there is clearly an issue with this subnet that needs ongoing and prompt maintenance.

Questions: What rate of block failure is considered sufficiently bad to warrant replacing a node? From my understanding so far, I would probably have replaced ffsue, rs26k and ixo23 and left ddbl6 alone. Am I missing anything? What tools is Dfinity using to make this decision and which are the key metrics? In the event that there are only 9 functioning nodes in a 13-node subnet, does consensus now only need 7 nodes to agree instead of 9?

5 Likes

Voted to adopt proposal 133436.
Everything is correct and matches the description and motivation: the degraded/offline nodes and the healthy replacement node IDs.

Meanwhile, while the proposal is live, it looks like another node has failed.

So the question now is how fast nodes can be replaced, and whether they will all fail eventually.

At least until a hotfix is pushed if the above is the case.

3 Likes

@ZackDS yes, the node provider in Korea contacted us as there was a problem with one of their node machines in the Seoul DC. Coincidentally, this node is also on the lhg73 subnet; all other nodes in this DC seem to be running okay. We will replace this node as part of the regular dre-heal proposals that are submitted later today (Friday).

So thanks for adopting the previous proposal. As you can see, a subnet can handle 4 degraded nodes, but that is not a state that should persist for long.
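As a back-of-the-envelope check of the "4 degraded nodes" figure, here is a minimal sketch that assumes the classical Byzantine fault tolerance bound n ≥ 3f + 1 applies to a subnet of n nodes. Whether the protocol's exact thresholds map onto this arithmetic is an assumption, and the function names are made up for illustration.

```python
def max_faulty(n):
    """Largest f such that n >= 3f + 1 (classical BFT bound, assumed here)."""
    return (n - 1) // 3


def quorum(n):
    """Nodes that must agree if up to max_faulty(n) nodes may misbehave."""
    return n - max_faulty(n)


n = 13
print(max_faulty(n), quorum(n))  # 4 9
```

Under that assumption a 13-node subnet tolerates up to 4 faulty nodes and needs 9 to agree, which matches the figures mentioned earlier in the thread.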

2 Likes

The team is still looking into whether it was just a coincidence or whether it is related to a bug in the kernel/Ubuntu. Will follow up when we know more.

2 Likes

Thank you for the update, much appreciated.
Regarding @timk11’s question, correct me if my assumption is wrong:

Given that NPR (Node Provider Rewards) currently states “For failure rates ≤ 10%, there is no reduction in rewards”, I assume that ~10% is acceptable and doesn’t require a node replacement. So beyond that, is there any other factor that is used when choosing to replace a node? One that I would think is important is the availability of another healthy node on standby that keeps decentralization at least the same. Thanks.

1 Like

I think you are referring to this forum thread for the 10% boundary. The 10% is chosen mainly from the perspective of node rewards, and how much average downtime during the reward period (e.g. a month) would be reasonable. It is currently not (yet) related to the actual replacement of nodes.

For replacement of nodes, DRE normally runs the so-called dre-heal check every Friday to replace degraded nodes, and these nodes would then be swapped on Monday. So in the worst case, if a node “would die” on Saturday, it would be replaced by the Monday of the week following, and so be offline for 10 days. On a monthly basis this would be approximately 10/30, i.e. 33% unavailability, which is technically allowed, but for the part above 10% there would be a reward reduction (if that eventually gets implemented).
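The worst-case arithmetic above can be sketched as follows. The sketch assumes the check that catches a failure is the first Friday on or after the failure day, that the swap happens on the Monday three days later, and that offline days are counted inclusively. These assumptions, the function names and the example date are illustrative only.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbering: Monday == 0


def swap_date(failure_day):
    """Monday following the first Friday check on or after the failure."""
    days_to_check = (FRIDAY - failure_day.weekday()) % 7
    check_day = failure_day + timedelta(days=days_to_check)
    return check_day + timedelta(days=3)


def offline_days(failure_day):
    """Calendar days offline, counting both the failure and the swap day."""
    return (swap_date(failure_day) - failure_day).days + 1


saturday = date(2024, 10, 19)  # example Saturday, chosen for illustration
days = offline_days(saturday)
print(days, f"{days / 30:.0%}")  # 10 33%
```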

It would be good to have some criteria for when to replace unhealthy nodes. We now have:

  • running the dre-heal every week on Friday, resulting in dead nodes being swapped on Monday
  • if the situation requires it, e.g. when there are 3 or more unhealthy nodes in a subnet, an urgent NNS proposal is submitted.

But we definitely could add some other criteria.

2 Likes

Thanks @SvenF for giving some further detail on the process that’s being followed. Would you, @dmanu or other team members be able to answer these questions from my earlier post?

1 Like

Sorry for the delay. I have voted to adopt proposal 133436 that replaces three nodes on subnet lhg73:

Proposal 133446

1 down node in South Korea replaced with a node in Switzerland. Decentralisation is slightly reduced in some respects, but not on any important metrics. I’ve voted to adopt.

Decentralisation Stats

Subnet node distance stats (distance between any 2 nodes in the subnet) →

| | Smallest Distance | Average Distance | Largest Distance |
|---|---|---|---|
| EXISTING | 317.676 km | 8547.967 km | 18505.029 km |
| PROPOSED | 317.676 km | 8280.444 km (-3.1%) | 18505.029 km |

This proposal slightly reduces decentralisation, considered purely in terms of geographic distance (and therefore there’s a slight theoretical reduction in localised disaster resilience).

Subnet characteristic counts →

| | Continents | Countries | Data Centers | Owners | Node Providers |
|---|---|---|---|---|---|
| EXISTING | 5 | 13 | 13 | 13 | 13 |
| PROPOSED | 5 | 13 | 13 | 13 | 13 |

Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.) →

| | Continent | Country | Data Center | Owner | Node Provider |
|---|---|---|---|---|---|
| EXISTING | 5 | 1 | 1 | 1 | 1 |
| PROPOSED | 6 (+20%) | 1 | 1 | 1 | 1 |

Slightly worse situation regarding clustering in a single continent.

See here for acceptable limits → Motion 132136

The above subnet information is illustrated below, followed by a node reference table:

Map Description
  • Red marker represents a removed node (transparent center for overlap visibility)
  • Green marker represents an added node
  • Blue marker represents an unchanged node
  • Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
  • Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)

Table
| | Continent | Country | Data Center | Owner | Node Provider | Node | Status |
|---|---|---|---|---|---|---|---|
| --- | Asia | Korea (the Republic of) | Seoul 1 (sl1) | Megazone Cloud | Neptune Partners | ixo23-jxvux-ktqca-bje7d-py56s-yvjy5-zpxrk-fmlxt-zhuhg-wu5bc-wqe | DOWN |
| +++ | Europe | Switzerland | Geneva (ge1) | HighDC | Archery Blockchain SCSp | c3xxv-clwfo-3pfn3-srt75-3wxn7-twfzm-7o5g6-pfvhd-rvsuj-47d3c-2qe | UNASSIGNED |
| | Oceania | Australia | Queensland 1 (sc1) | NEXTDC | ANYPOINT PTY LTD | 56ovz-lrvyd-gggsl-qtenl-uuokx-p7t3t-rg6mc-6lc5l-usfqb-fygiv-aqe | UP |
| | Europe | Belgium | Antwerp (an1) | Datacenter United | Allusion | mihvd-umv3j-cjsl2-bfsdu-td7aw-2y6if-aw4fn-cghkm-v2oxd-kj75q-cae | UP |
| | Asia | China | HongKong 1 (hk1) | Unicom | Pindar Technology Limited | r4642-fjh73-ltnlt-qhdcr-kfujl-v3up4-5pnci-o7zqi-set7a-pvi3p-5qe | UP |
| | Americas | Costa Rica | Bogota 1 (bg1) | EdgeUno | Geeta Kalwani | ihttm-45oz5-an5mg-i2jtb-fayst-s47j6-vmuwr-fqotf-mp2il-n5s5x-cae | UP |
| | Europe | Germany | Marseille (mr1) | Digital Realty | DFINITY Operations SA | vlbpb-szwgu-jqul3-kpfje-ozgkp-jdwz4-f2hhq-6w6tf-e2lrl-4bmad-xae | UP |
| | Europe | Estonia | Tallinn 2 (ta2) | Telia DC | Vladyslav Popov | bptaj-nejw4-osqqa-zwrej-ysl2o-5ffgj-hkjr6-2w6fi-jczex-vjutw-iae | UP |
| | Asia | India | Navi Mumbai 1 (nm1) | Rivram | Rivram Inc | pdo46-iehoo-x2gfu-t5qu5-y3e64-cdymo-eioop-h6f4a-zebwa-fenb4-xae | UP |
| | Europe | Lithuania | Vilnius 2 (vl2) | Data Inn | George Bassadone | um4so-2axl2-74745-lhyhn-afhgh-yog7e-pcsso-mxxho-7x5ev-x65uz-eqe | UP |
| | Asia | Singapore | Singapore 2 (sg2) | Telin | OneSixtyTwo Digital Capital | cpywp-n4j5f-ja44p-oykxm-umz7h-fk6v2-rowix-bkwc4-ly4fw-tvu6c-mae | UP |
| | Europe | Slovenia | Ljubljana (lj1) | Posita.si | Fractal Labs AG | uzy2p-nvzre-vqaov-qj7vj-bznwn-mllrr-t25mo-lbody-dgyf4-o5rj7-sqe | UP |
| | Americas | United States of America (the) | San Jose (sj1) | INAP | Mary Ren | x3woi-nzeuf-w5pan-yzwf6-7fgdn-hxhuk-st6rv-6lzam-akknw-ipvze-4ae | UP |
| | Africa | South Africa | Gauteng 2 (jb2) | Africa Data Centres | Honeycomb Capital (Pty) Ltd | 5v4on-bsceg-rdgxe-zcqqf-l5wnq-fpxw7-x3ktj-3x4fs-o2cny-uzhor-vqe | UP |

2 Likes

Voted to adopt proposal 133446.

Archery Blockchain SCSp’s c3xxv node from Geneva is set to replace the offline one from Seoul 1, thereby adding to the number of different node providers.

With this underway, there seems to be a new issue, this time with 56ovz.

1 Like

Voted to adopt proposal 133446.

This proposal replaces node ixo23, which was noted as a dead node in my earlier post. Decentralisation parameters are unchanged and remain within the requirements of the target topology.

1 Like

Voted to adopt proposal 133446. The proposal replaces the dead node ixo23 (Dashboard Status: Offline) with the node c3xxv on subnet lhg73. There is no impact on the Nakamoto coefficients, as verified through the DRE tool.

1 Like

Hi @SvenF, in less than a week the lhg73 subnet has again experienced degraded and offline nodes. I request that the DFINITY team look into it. Thank you.

1 Like