This topic is intended to capture Subnet Management activities over time for the lhg73 subnet, providing a place to ask questions and make observations about the management of this subnet.
At the time of creating this topic the current subnet configuration is as follows:
The other node removed by this proposal (bptaj-nejw4-osqqa-zwrej-ysl2o-5ffgj-hkjr6-2w6fi-jczex-vjutw-iae) appears healthy, is located in Estonia, and is owned by Telia DG (which doesn’t own any other node in this subnet). It’s not clear to me why this node is being removed…
The 2 nodes proposed to replace these far-flung nodes are both located in central Europe (Germany and Switzerland), practically on each other’s doorstep by comparison.
I’ve rejected this proposal because the outcome appears to contradict the proposal summary. The summary could have been clearer about what it’s trying to achieve and why.
Cycle burn and transactions have shot up, and blocks are steadily following (albeit at a slower pace). It would be interesting to know why this latter metric is taking longer to bounce back.
DFINITY will submit an NNS proposal today to reduce the notarization delay on the lhg73 subnet, similar to what has happened on other subnets in recent weeks (you can find all the details in this forum thread).
This proposal replaces a dead node (pfmqh, which appears as “Status: Offline” on the dashboard) and additionally replaces another node for the purpose of improving network decentralisation. As seen in the proposal (which I verified using the DRE tool), the overall effect of these additions is to improve decentralisation with respect to continents only. The status with respect to the target topology is unchanged and remains within the requirements.
Thanks @Sat! 2 removed nodes, in China (up) and Japan (offline), replaced with nodes in South Africa (unassigned) and Australia (unassigned). This proposal also increases decentralisation in terms of average geographic distance between nodes and continent diversity. I’ve voted to adopt.
Decentralisation Stats
Subnet node distance stats (distance between any 2 nodes in the subnet) →

| | Smallest Distance | Average Distance | Largest Distance |
| --- | --- | --- | --- |
| EXISTING | 317.676 km | 7583.196 km | 18505.029 km |
| PROPOSED | 317.676 km | 8678.125 km (+14.4%) | 18505.029 km |
This proposal increases decentralisation, considered purely in terms of geographic distance (and therefore there’s a slight theoretical increase in localised disaster resilience).
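For anyone who wants to reproduce figures like these, the distance stats are just pairwise great-circle distances over the subnet’s node locations. A minimal Python sketch, using placeholder coordinates rather than the actual lhg73 node data:

```python
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def distance_stats(coords):
    """Smallest / average / largest distance over all pairs of nodes."""
    dists = [haversine_km(a, b) for a, b in combinations(coords, 2)]
    return min(dists), sum(dists) / len(dists), max(dists)

# Placeholder data-centre coordinates (lat, lon); a real analysis would use the
# registry's data-centre records for each node in the subnet.
existing = [(52.52, 13.40), (47.37, 8.54), (35.68, 139.69), (1.35, 103.82)]
proposed = [(52.52, 13.40), (47.37, 8.54), (-33.92, 18.42), (-33.87, 151.21)]

for label, coords in (("EXISTING", existing), ("PROPOSED", proposed)):
    lo, avg, hi = distance_stats(coords)
    print(f"{label}: smallest {lo:.3f} km, average {avg:.3f} km, largest {hi:.3f} km")
```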
Subnet characteristic counts →

| | Continents | Countries | Data Centers | Owners | Node Providers |
| --- | --- | --- | --- | --- | --- |
| EXISTING | 3 | 13 | 13 | 13 | 13 |
| PROPOSED | 5 (+40%) | 13 | 13 | 13 | 13 |
This proposal slightly improves decentralisation in terms of continent diversity.
Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.) →
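As a rough illustration (not the actual data or tooling), both the characteristic counts above and the “largest number of nodes with the same characteristic” figures reduce to simple frequency counts over the node records. The node records below are placeholders:

```python
from collections import Counter

# Placeholder node records; a real analysis would pull these from the NNS registry
# or the public dashboard for the 13 nodes in the subnet.
nodes = [
    {"continent": "Europe", "country": "CH", "data_center": "zh2", "owner": "A", "node_provider": "P1"},
    {"continent": "Europe", "country": "DE", "data_center": "fr1", "owner": "B", "node_provider": "P2"},
    {"continent": "Asia",   "country": "SG", "data_center": "sg1", "owner": "C", "node_provider": "P3"},
    {"continent": "Africa", "country": "ZA", "data_center": "jb1", "owner": "D", "node_provider": "P4"},
]

for characteristic in ("continent", "country", "data_center", "owner", "node_provider"):
    counts = Counter(node[characteristic] for node in nodes)
    # len(counts) is the number of distinct values (the "characteristic counts" table);
    # max(counts.values()) is the size of the largest group sharing one value
    # (the "largest number of nodes with the same characteristic" figure).
    print(f"{characteristic}: {len(counts)} distinct, largest group of {max(counts.values())}")
```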
The above subnet information is illustrated below, followed by a node reference table:
Map Description

- Red marker represents a removed node (transparent center for overlap visibility)
- Green marker represents an added node
- Blue marker represents an unchanged node
- Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
- Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)
Known Neurons to follow if you're too busy to keep on top of things like this
If you found this analysis helpful and would like to follow the vote of the LORIMER known neuron in the future, consider configuring LORIMER as a followee for the Subnet Management topic.
Other good neurons to follow:

- Synapse (follows the LORIMER and CodeGov known neurons for Subnet Management, and is a generally well-informed known neuron to follow on numerous other topics)
- CodeGov (actively reviews and votes on Subnet Management proposals, and is well informed on numerous other technical topics)
- WaterNeuron (the WaterNeuron DAO frequently discusses proposals like this in order to vote responsibly based on DAO consensus)
The proposal replaces the dead node pfmqh (which appears with an “Offline” status on the dashboard) in the lhg73 subnet.
The proposal also takes the opportunity to replace another node in the subnet in order to increase the Nakamoto coefficient for the country metric, as verified with the DRE tool.
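For context on what the country-metric Nakamoto coefficient measures: it is the minimum number of distinct countries whose nodes, taken together, would control a consensus-critical share of the subnet. A minimal sketch follows; the one-third threshold and the country distribution are assumptions of this illustration, not the real lhg73 data:

```python
from collections import Counter

def nakamoto_coefficient(values, threshold=1 / 3):
    """Minimum number of distinct values whose combined node count exceeds
    `threshold` of the subnet (the threshold is an assumption of this sketch)."""
    counts = sorted(Counter(values).values(), reverse=True)
    total, cumulative, coefficient = sum(counts), 0, 0
    for count in counts:
        cumulative += count
        coefficient += 1
        if cumulative > total * threshold:
            return coefficient
    return coefficient

# Hypothetical country assignment for a 13-node subnet (not the real lhg73 membership).
countries = ["CH", "CH", "DE", "US", "US", "JP", "SG", "ZA", "AU", "EE", "CA", "IN", "RO"]
print(nakamoto_coefficient(countries))  # -> 3 for this placeholder distribution
```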
Decentralisation stats are unaffected by this proposal, which replaces a ‘degraded’ node with another node in Singapore. The node in question is actually currently ‘up’ rather than ‘degraded’; however, it is consistently failing a very small fraction of blocks.
@sat, are you able to provide more information about what it actually means for a node to be considered degraded? i.e. What’s the threshold that a node needs to cross to go from being considered ‘up’ to ‘degraded’?
Decentralisation Stats
Subnet node distance stats (distance between any 2 nodes in the subnet) →

| | Smallest Distance | Average Distance | Largest Distance |
| --- | --- | --- | --- |
| EXISTING | 317.676 km | 8678.125 km | 18505.029 km |
| PROPOSED | 317.676 km | 8678.125 km | 18505.029 km |
Subnet characteristic counts →

| | Continents | Countries | Data Centers | Owners | Node Providers |
| --- | --- | --- | --- | --- | --- |
| EXISTING | 5 | 13 | 13 | 13 | 13 |
| PROPOSED | 5 | 13 | 13 | 13 | 13 |
Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.) →
The above subnet information is illustrated below, followed by a node reference table:
Map Description

- Red marker represents a removed node (transparent center for overlap visibility)
- Green marker represents an added node
- Blue marker represents an unchanged node
- Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
- Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)
Known Neurons to follow if you're too busy to keep on top of things like this
If you found this analysis helpful and would like to follow the vote of the LORIMER known neuron in the future, consider configuring LORIMER as a followee for the Subnet Management topic.
Other good neurons to follow:

- Synapse (follows the LORIMER and CodeGov known neurons for Subnet Management, and is a generally well-informed known neuron to follow on numerous other topics)
- CodeGov (actively reviews and votes on Subnet Management proposals, and is well informed on numerous other technical topics)
- WaterNeuron (the WaterNeuron DAO frequently discusses proposals like this in order to vote responsibly based on DAO consensus)
This proposal replaces a node in subnet lhg73: bz73b, which appears as “Status: Active” on the IC dashboard but shows a small but consistent rate of failed blocks in the Node Provider Rewards tool. The replacement node belongs to the same node provider and sits in a different data centre in the same city, so the proposal has no impact on Nakamoto coefficients or target topology parameters.
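For reference, a “rate of failed blocks” reading of this kind is essentially the ratio of failed block proposals to total proposal attempts for the node over the observed window. A trivial sketch with made-up figures (the parameter names are illustrative, not the rewards tool’s exact fields):

```python
def block_failure_rate(blocks_proposed: int, blocks_failed: int) -> float:
    """Fraction of a node's block proposal attempts that failed."""
    attempts = blocks_proposed + blocks_failed
    return blocks_failed / attempts if attempts else 0.0

# Made-up figures for a single node over one reward period.
rate = block_failure_rate(blocks_proposed=41_500, blocks_failed=120)
print(f"failure rate: {rate:.3%}")  # small but consistent, as described above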
That is a difficult question. There are two types of “transactions” on the IC: updates (potentially mutating canister state) and queries (strictly read-only). Queries are faster and do not go through consensus. Updates are slower and go through consensus. Obviously, both are important.
The issue is that “trustworthy node metrics” do not accurately track query handling on the node. It’s a known issue, and we’re looking for an additional “trustworthy” metric we could use for this purpose. But it’s a fairly difficult problem since any such data needs to go through consensus to be trustworthy – and queries don’t.
In the meantime, we rely on our observability stack to catch some of these issues.
We have a lot of internal alerts configured in the monitoring stack for the mainnet nodes, and these internal alerts are mirrored in the public dashboard. That’s why you can see some nodes as Degraded in the public dashboard: it indicates that some alerts are firing for those nodes.
Our DRE tooling automatically takes node health into account when replacing nodes and will only pick nodes without alerts as “healthy” replacements. I hope that over time we’ll have to rely less on internal observability and more on the data provided by the IC itself.
Thanks for explaining, @sat; this is interesting. What does the observability stack do to detect query issues (does it periodically query a sample of nodes and compare results, or something like that)?
Out of interest, is there a reasonable explanation for why a node may be performing well for update calls but badly for queries?
The observability stack has access to raw replica metrics, so it sees behavior when serving both updates and queries.
Although, to be completely open, in this case the replica on this node was rejecting update calls as well, with a message that the ingress message has an invalid expiry. You can look it up on the forum; there are many user complaints. However, this still doesn’t show up in trustworthy metrics, since the node was correctly creating blocks. It was just rejecting ingress.
In the future this should obviously be reflected in node rewards, to incentivise node providers to resolve such issues quickly. We’ll get there, step by step.