This topic is intended to capture Subnet Management activities over time for the w4rem subnet, providing a place to ask questions and make observations about the management of this subnet.
DFINITY will submit a proposal today to reduce the notarization delay on the subnet w4rem, similar to what has happened on other subnets in recent weeks (you can find all details in this forum thread).
The table below shows the trustworthy metrics failure rate for nodes on the w4rem subnet.
There has been a step change on the 28th of October. The failure rate for all nodes on this subnet increased to unacceptably high levels. Rewards for some of the node providers on this subnet will be deducted if this trend continues.
It looks like something changed on the subnet or at the protocol level, because all the nodes on this subnet are experiencing the same symptoms. Even the failure rate of the nodes in Belgium increased overnight, from 0.08% to 2.24%.
It also looks as if the effect is much greater on nodes far away from the “majority”: Costa Rica (48%), Australia (19%), Singapore (14%), South Africa (11%). Could this be a network latency issue?
Do you have any idea what is going on here?
@dsharifi @Lorimer @sat - your thoughts please
Nice observations. I noted a little while ago, when the notarisation delay was reduced on a subnet, that one of the existing nodes needed swapping out of the subnet because it was unable to keep up with the new block rate. Essentially its performance hadn’t changed, but relative to the other nodes, which were now producing blocks faster, the node looked much worse and was considered degraded.
In summary, the reduction in notarisation delay that has now occurred on all subnets can result in a situation where node performance has not decreased in absolute terms, but has decreased in relative terms.
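To make the relative-vs-absolute point concrete, here’s a minimal sketch with purely hypothetical numbers (not actual subnet metrics or the real reward formula): a node whose absolute throughput never changes can still see its failure rate jump when the subnet’s block rate rises.

```python
# Hypothetical illustration: a node that can handle at most 1.0 blocks/s.
# Its absolute capacity stays constant, but the share of rounds it misses
# depends on how fast the rest of the subnet is producing blocks.

def failure_rate(node_capacity_bps: float, subnet_rate_bps: float) -> float:
    """Fraction of blocks the node misses when the subnet outpaces it."""
    if subnet_rate_bps <= node_capacity_bps:
        return 0.0
    return (subnet_rate_bps - node_capacity_bps) / subnet_rate_bps

# Before the notarisation-delay reduction: subnet runs at 0.95 blocks/s.
before = failure_rate(1.0, 0.95)
# After the reduction: subnet runs at 1.25 blocks/s; the same node now lags.
after = failure_rate(1.0, 1.25)

print(f"before: {before:.0%}, after: {after:.0%}")  # before: 0%, after: 20%
```

Same node, same hardware, but once the subnet speeds up it appears degraded purely in relative terms.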
Does that make sense?
@Lorimer Your reply doesn’t make sense:
The changes you referred to on this subnet were made on Sep 12 – this is something else.
Maybe the Dfinity engineers can point the node providers in the right direction.
Regardless, we can’t have subnet or protocol settings that cause 4 out of 13 nodes to be unable to keep up with the new block rate.
4 out of 13 node providers on this subnet will be penalised due to a block failure rate of more than 10%, and this looks like a protocol issue.
Please see the node IDs for your reference:
The point I’m making isn’t unique to one type of performance improvement proposal. Improvements are being made all the time. My point is simply that this can expose nodes that are unable to keep up with the rest. It’s happened numerous times before.
I accept that this may not be the case here. I haven’t looked into this specific case.
From the 28th on, we started deploying https://dashboard.internetcomputer.org/release/75dd48c38f296fc907c269263f96633fa8a29d0e to subnets. However, this version landed on w4rem on 2024-10-31, 9:21:39 AM UTC, and I don’t see the error rate spiking on this subnet on the 31st compared to the 30th. The error rate spiked the day after the subnet was upgraded to the new version, i.e. on the 1st of November.
It could still be the case that this new version introduces an issue or makes the problems more likely, but it certainly isn’t obvious to me.