Hi there!
Some time ago the trustworthy node metrics were enabled on the IC Mainnet.
As a continuation of that work, and in collaboration with the node providers and the security teams within DFINITY, we are excited to announce that, after extensive discussions and preparatory work, development is complete and the community has adopted the proposals to open up some node metrics to the public. Public node metrics will be enabled in the upcoming HostOS rollout. You can follow the rollout by checking out the NNS proposals from this topic.
Public node metrics allow anyone to scrape metrics and check node health without relying on the IC-API, which is maintained by DFINITY. We carefully scrutinized which metrics are publicly visible and added a caching and filtering layer in front of them, so the metrics cannot be used to break into the IC nodes. This was hard to achieve, but it is a major step forward in opening up IC observability: it enables fully decentralized IC maintenance without compromising security.
What does this mean for me?
It will now be possible for anyone in the community (including you!) to check the liveness of mainnet nodes and read some of their metrics directly. For instance, for node uaxbw, you can get the following node metrics in the Prometheus text format.
HostOS node exporter metrics:
https://[2a0b:21c0:b002:2:6800:e5ff:fecc:efe4]:42372/metrics/hostos_node_exporter
GuestOS node exporter metrics:
https://[2a0b:21c0:b002:2:6800:e5ff:fecc:efe4]:42372/metrics/guestos_node_exporter
GuestOS replica metrics (not available for this node at this time since the replica service only starts when the node joins a subnet):
https://[2a0b:21c0:b002:2:6800:e5ff:fecc:efe4]:42372/metrics/guestos_replica
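If you would rather script this than open the URLs in a browser, here is a minimal Python sketch of how one of these endpoints could be fetched and parsed. It uses only the standard library. The function names (fetch_metrics, parse_prometheus_text) are made up for this example and are not part of any DFINITY tooling, and the parser is deliberately simplified. The sketch also assumes your system trusts the node's TLS certificate; if it does not, the request will fail with an SSL error, and handling that is left out here.

```python
import urllib.request

# URL copied from the post above (HostOS node exporter metrics of node uaxbw).
METRICS_URL = (
    "https://[2a0b:21c0:b002:2:6800:e5ff:fecc:efe4]:42372"
    "/metrics/hostos_node_exporter"
)


def fetch_metrics(url=METRICS_URL, timeout=10):
    """Download the raw Prometheus text from a metrics endpoint.

    Hypothetical helper: assumes the endpoint is reachable over IPv6 and
    that the node's TLS certificate is accepted by the local trust store.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(text):
    """Parse Prometheus text format into a {series: value} dict.

    Simplified on purpose: comment lines (# HELP / # TYPE) are skipped,
    optional timestamps are ignored, and lines that do not parse are dropped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Form: name{labels} value [timestamp]; split on the last "}".
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            # Form: name value [timestamp]
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[series] = float(fields[0])
        except ValueError:
            continue
    return samples


if __name__ == "__main__":
    metrics = parse_prometheus_text(fetch_metrics())
    print(f"Got {len(metrics)} samples from {METRICS_URL}")
```

The same two functions should work for the guestos_node_exporter and guestos_replica paths by passing a different URL to fetch_metrics. A node that answers at all is a basic liveness signal; which metrics you actually get back depends on the filtering described above.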
What's next?
The DRE team has been busy preparing and opening up a full observability stack that will be able to quickly and easily scrape metrics from all mainnet nodes. The observability stack is code complete but has not yet been tested with the entire IC Mainnet dataset, since we haven’t had the full dataset so far. Please feel free to reach out to us on GitHub (issues and discussions) if you notice any issues or would like to contribute to the project.