This topic is intended to capture Subnet Management activities over time for the 6pbhf subnet, providing a place to ask questions and make observations about the management of this subnet.
At the time of creating this topic the current subnet configuration is as follows:
TLDR: Offline node in Israel replaced with one in Georgia. This looks good; however, there are a few points I'd like some clarity on before voting:
I've noticed that the unassigned nodes are currently on GuestOS version 3d0b3f10417fc6708e8b5d844a0bac5e86f3e17d while the subnet is running GuestOS version 6968299131311c836917f0d16d0b1b963526c9b1. I'm unclear how this is handled. Is the GuestOS version automatically updated for the unassigned node as part of joining the subnet? If so, what's the point of deploying GuestOS versions to unassigned nodes in the first place (e.g. Proposal: 131712)?
I'm aware of cases where other types of proposals have failed due to the GuestOS version on unassigned nodes, such as when unelecting versions from the registry (e.g. below).
@Luka do you know if GuestOS version inconsistencies can be an issue during "Change Subnet Membership" proposals (or is the unassigned node's GuestOS version updated automatically to reflect the subnet)?
Decentralisation Stats
Subnet node distance stats (distance between any 2 nodes in the subnet):

| | Smallest Distance | Average Distance | Largest Distance |
|---|---|---|---|
| EXISTING | 117.442 km | 6831.051 km | 17277.995 km |
| PROPOSED | 117.442 km | 6777.392 km (-0.8%) | 17277.995 km |
This proposal slightly reduces decentralisation, considered purely in terms of geographic distance (and therefore there's a slight theoretical reduction in localised disaster resilience).
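For reference, the -0.8% figure is simply the relative change in the average pairwise distance taken from the table above:

$$\frac{6777.392 - 6831.051}{6831.051} \approx -0.0079 \approx -0.8\%$$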
Subnet characteristic counts:

| | Continents | Countries | Data Centers | Owners | Node Providers |
|---|---|---|---|---|---|
| EXISTING | 4 | 13 | 13 | 13 | 13 |
| PROPOSED | 4 | 13 | 13 | 13 | 13 |

Largest number of nodes with the same characteristic (e.g. continent, country, data center, etc.):
The above subnet information is illustrated below, followed by a node reference table:
Map Description
- Red marker represents a removed node (transparent center for overlap visibility)
- Green marker represents an added node
- Blue marker represents an unchanged node
- Highlighted patches represent the country the above nodes sit within (red if the country is removed, green if added, otherwise grey)
- Light grey markers with yellow borders are examples of unassigned nodes that would be viable candidates for joining the subnet according to formal decentralisation coefficients (so this proposal can be viewed in the context of alternative solutions that are not being used)
Known Neurons to follow if you're too busy to keep on top of things like this
If you found this analysis helpful and would like to follow the vote of the LORIMER known neuron in the future, consider configuring LORIMER as a followee for the Subnet Management topic.
Other good neurons to follow:
- CodeGov (will soon be committed to actively reviewing and voting on Subnet Management proposals based on those reviews)
- WaterNeuron (the WaterNeuron DAO frequently discusses proposals like this in order to vote responsibly based on DAO consensus)
We had some issues with this a long time ago, when unassigned nodes were falling behind the nodes in subnets by a lot of versions. Since we started keeping them one version behind the nodes in subnets, we haven't had any issues. And yes, an unassigned node is upgraded as soon as it joins the subnet.
Thanks @Luka! Good to know the upgrade happens automatically. I'm still a little unclear on why replica versions get explicitly deployed to unassigned nodes.
If I understand correctly, you're saying that the upgrade path from one replica version to another becomes potentially dangerous if too many intermediate versions are skipped between upgrades? I'm unclear why this would be (could you clarify?). Is there a defined number of versions that is regarded as safe to skip?
The protocol changes over time, and eventually two versions that are very far apart cannot be upgraded/downgraded between each other. The upgrade paths we test are strictly +/-1 version.
Thanks @Luka, this info is really helpful. My thinking was that GuestOS images contain the state and logic of the protocol, but of course there's also the API for interacting with the registry etc., which may have changed between versions. Now I understand why there can be breaking changes between versions (unless the +1/-1 version steps are adhered to). Makes sense. Thanks again!
+/- 1 version is not a hard requirement for unassigned nodes. They're upgraded to the version of the subnet before they join it anyway. In any case, it's better to try to replace the node, even if it potentially fails, than to keep an already failed node in the subnet.
Hi @Luka, I've been pondering this some more. I'd like to get a better feel for the sorts of things that can go wrong. Are you able to point to some examples?
I think you're mostly right about how this looks, with the exception that the GuestOS image does not contain any protocol data.
I think we only once had a case where some unassigned nodes were too old to update once they joined the subnet. I remember the orchestrator broke, most likely because of some registry API incompatibility. It could be that it couldn't read the subnet's config from the registry, for example because some new field had been added.
Thanks @Luka, do you happen to know the proposal? I'd be interested in learning more about this situation. You mentioned previously that the unassigned nodes are updated to the latest version before joining the subnet (rather than once they've joined the subnet, as mentioned above). I'm still confused about how a failure can occur unless the old replica version somehow takes on some responsibility during the node swap.
I'm only asking because I'm keen to understand this.
The case I'm mentioning was at least two years ago, so I surely won't be able to find it.
Let's take, for example, unassigned nodes that haven't been updated for some time. For a node to know that it needs to join a subnet, it needs to read the latest state from the registry. If it cannot parse the registry because the registry now has fields that cannot be parsed by that older version, it will never learn that it has joined the subnet and will never be able to upgrade to the subnet's version.
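To make that failure mode concrete, here's a minimal, self-contained sketch of the idea. This is not the actual orchestrator code: the record type, the string encoding, and the `some_new_field` field are all made up for illustration (the real registry uses structured records and a far more involved update loop). The point is simply that a node that cannot decode the registry record never discovers its own subnet assignment, so it never triggers its own upgrade.

```rust
// Illustrative sketch only (not IC orchestrator code): an old node polls the
// registry, fails to decode a record containing a field introduced after its
// own version, and therefore never learns it has been assigned to a subnet.
use std::{thread, time::Duration};

struct SubnetRecord {
    subnet_id: String,
    replica_version: String,
}

/// Hypothetical decoder for an old node: it only understands the fields that
/// existed at its own version and treats anything unknown as a hard error.
fn decode_subnet_record(raw: &str) -> Result<SubnetRecord, String> {
    for field in raw.split(';') {
        let name = field.split('=').next().unwrap_or("");
        if !matches!(name, "subnet_id" | "replica_version") {
            return Err(format!("unknown registry field: {name}"));
        }
    }
    let get = |key: &str| {
        raw.split(';')
            .find_map(|f| f.strip_prefix(key).and_then(|rest| rest.strip_prefix('=')))
            .map(str::to_string)
            .ok_or_else(|| format!("missing field: {key}"))
    };
    Ok(SubnetRecord {
        subnet_id: get("subnet_id")?,
        replica_version: get("replica_version")?,
    })
}

fn main() {
    // A newer registry payload; "some_new_field" stands in for any field added
    // after the old node's version (the name is invented for illustration).
    let latest_registry_payload =
        "subnet_id=6pbhf;replica_version=6968299131311c836917f0d16d0b1b963526c9b1;some_new_field=1";

    // The node's poll loop: it never gets past the decode error, so it never
    // discovers its subnet assignment and never initiates its own upgrade.
    for _ in 0..3 {
        match decode_subnet_record(latest_registry_payload) {
            Ok(record) => println!(
                "assigned to subnet {}, upgrading to {}",
                record.subnet_id, record.replica_version
            ),
            Err(e) => eprintln!("cannot parse registry record, staying unassigned: {e}"),
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```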
Okay, thanks @Luka, so just to confirm: the node is responsible for initiating its own upgrade to the appropriate subnet version (by communicating with the registry while running its current replica version), rather than the appropriate replica version being pushed to that node as part of a "Change Subnet Membership" proposal?
If the node joins the subnet but fails to upgrade the replica version (for the reason you've described), presumably the proposal would still be considered executed? Or is the replica upgrade synchronous and/or somehow propagated to inform the proposal result (executed/failed)?
You're correct. Each replica polls the state of the network from the registry to determine which version it should run and which subnet it should join. Once the node is upgraded, there are still consensus mechanisms for the node to replicate the state from the rest of the subnet.
Regarding proposals, the successful execution of a proposal merely indicates that the governance canister was able to update the registry successfully.
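Putting those two answers together, here's a conceptual sketch of that decoupling. It is not the actual NNS governance or registry code, and every type and name below is invented for illustration: "executing" a Change Subnet Membership proposal only writes the new membership into the registry, while whether the affected nodes later notice the change and upgrade is a separate, asynchronous process that never feeds back into the proposal's executed/failed status.

```rust
// Conceptual sketch only (not NNS governance/registry code) of the decoupling
// between proposal execution (a registry write) and node-side upgrades.
use std::collections::HashMap;

/// Hypothetical registry: a versioned store of subnet membership.
#[derive(Default)]
struct Registry {
    version: u64,
    subnet_members: HashMap<String, Vec<String>>, // subnet_id -> node_ids
}

enum ProposalStatus {
    Executed,
    Failed(String),
}

/// Hypothetical governance-side execution of a membership change.
fn execute_change_subnet_membership(
    registry: &mut Registry,
    subnet_id: &str,
    remove: &str,
    add: &str,
) -> ProposalStatus {
    let Some(members) = registry.subnet_members.get_mut(subnet_id) else {
        return ProposalStatus::Failed(format!("unknown subnet {subnet_id}"));
    };
    members.retain(|n| n.as_str() != remove);
    members.push(add.to_string());
    registry.version += 1;
    // The registry write succeeded, so the proposal is "executed". Nodes only
    // find out about the change when they next poll the registry (see the poll
    // loop sketch above); any upgrade problem surfaces as subnet health issues,
    // not as a failed proposal.
    ProposalStatus::Executed
}

fn main() {
    let mut registry = Registry::default();
    registry.subnet_members.insert(
        "6pbhf".to_string(),
        vec!["node-israel".to_string(), "node-other".to_string()], // hypothetical node ids
    );

    match execute_change_subnet_membership(&mut registry, "6pbhf", "node-israel", "node-georgia") {
        ProposalStatus::Executed => {
            println!("proposal executed at registry version {}", registry.version)
        }
        ProposalStatus::Failed(reason) => println!("proposal failed: {reason}"),
    }
}
```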
This proposal sets the notarisation delay of the subnet to 300ms, down from 600ms. The change will increase the block rate of the subnet, with the aim of reducing the latency of update calls.
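As a rough back-of-envelope estimate (assuming the notarisation delay is the dominant per-round wait and everything else stays constant), halving the delay roughly doubles the ceiling on the block rate:

$$\frac{1}{0.6\,\text{s}} \approx 1.7\ \text{blocks/s} \quad\longrightarrow\quad \frac{1}{0.3\,\text{s}} \approx 3.3\ \text{blocks/s}$$

Actual block and finalisation rates also depend on network latency and load, which is part of why the per-subnet metrics below are worth watching.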
Here are the current metrics for this subnet. A question I'll be asking on the Subnet Management General thread (see reference below this post) is why this update is being rolled out to so many subnets at once, each with different finalisation rates and transaction profiles (e.g. peaks and troughs, whereas the canary subnet was always steady, even prior to the update). I'm wondering if this limits the representativeness of the results on that canary subnet.
@sat This node appears as "Status: Active" on the dashboard and currently appears as "status: UP" using the decentralization tool, but appeared as "alert: IC_Networking_StateSyncLoop" and "status: DEGRADED" when I ran an earlier check at 06:21 UTC. Should we presume that the node has now recovered its health?
This node being "l2bzb"?
That particular node keeps going degraded and recovering.
You can check that here: Node Provider Rewards
Or by fetching and analyzing trustworthy node metrics yourself.
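For anyone who prefers the second option, here's a minimal sketch of how a canister could fetch and summarise those metrics itself. It's written from memory and under assumptions: that the management canister exposes a node_metrics_history method with roughly the argument and response shapes shown below (please check the current interface spec before relying on the field names), and that, as far as I know, this endpoint can only be called from a canister rather than via ingress.

```rust
// Hedged sketch of a canister pulling trustworthy node metrics. The method
// name `node_metrics_history` and the field names below are assumptions to
// verify against the current IC interface specification.
use candid::{CandidType, Deserialize, Principal};

#[derive(CandidType)]
struct NodeMetricsHistoryArgs {
    subnet_id: Principal,
    start_at_timestamp_nanos: u64,
}

#[derive(CandidType, Deserialize)]
struct NodeMetrics {
    node_id: Principal,
    num_blocks_proposed_total: u64,
    num_block_failures_total: u64,
}

#[derive(CandidType, Deserialize)]
struct NodeMetricsHistoryRecord {
    timestamp_nanos: u64,
    node_metrics: Vec<NodeMetrics>,
}

/// Returns (node_id, block failure rate) pairs from the most recent snapshot
/// for the given subnet. Must run inside a canister.
#[ic_cdk::update]
async fn failure_rates(subnet_id: Principal) -> Vec<(Principal, f64)> {
    let args = NodeMetricsHistoryArgs {
        subnet_id,
        start_at_timestamp_nanos: 0, // ask for the full available history
    };

    // Inter-canister call to the management canister (aaaaa-aa).
    let (history,): (Vec<NodeMetricsHistoryRecord>,) = ic_cdk::call(
        Principal::management_canister(),
        "node_metrics_history",
        (args,),
    )
    .await
    .expect("call to the management canister failed");

    let Some(latest) = history.last() else {
        return Vec::new();
    };

    // Per-node share of failed blocks in the latest snapshot.
    latest
        .node_metrics
        .iter()
        .map(|m| {
            let total = m.num_blocks_proposed_total + m.num_block_failures_total;
            let rate = if total == 0 {
                0.0
            } else {
                m.num_block_failures_total as f64 / total as f64
            };
            (m.node_id.clone(), rate)
        })
        .collect()
}
```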