Hey,
There were some threads recently about canisters that max out the available WASM memory, and about asset canisters that got “suddenly” deleted because they only served query calls, which are (currently) not affected by the freezing_threshold.
We also have similar issues with the cycles faucet, which is unavailable way too often. Some of the engineers at DFINITY pointed me towards how we monitor the canisters on the NNS, and I created a minimal example project that you might find useful.
The example project consists of a Motoko canister that exposes some metrics (cycle balance, memory sizes) via an HTTP endpoint, and a Prometheus instance that regularly fetches those metrics. In addition, a few alerts are configured, and Prometheus has many integrations for delivering those alerts.
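To give a rough idea of what Prometheus expects from such an endpoint, here is a small Python sketch that renders a few values in the Prometheus text exposition format. This is not the code from the example project (which is written in Motoko); the function name, metric names and values are made up for illustration.

```python
def render_metrics(metrics):
    """Render {name: (help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        # A gauge fits values that can go up and down, such as a balance.
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names and values, for illustration only.
example = {
    "canister_balance_cycles": ("Cycle balance of the canister.", 4_000_000_000_000),
    "canister_memory_bytes": ("Memory in use by the canister.", 1_048_576),
}

if __name__ == "__main__":
    print(render_metrics(example), end="")
```

A canister would return text like this as the body of its HTTP response, and Prometheus would scrape it at a regular interval.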
The metrics endpoint can optionally be secured with an API key. In the future, it would be useful to have a Prometheus plugin that talks directly to the IC using an agent, which would allow cryptographic authentication.
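As a sketch of how such an API key check could look, the Python snippet below compares a bearer token from the request headers against the expected key. How the example project actually transmits and checks the key is not described here, so the header name, the "Bearer" scheme and the function name are all assumptions. Prometheus scrape configs can send a bearer token, which is why that scheme is used in this sketch.

```python
import hmac


def is_authorized(headers, expected_key):
    """Return True if the request carries the expected API key as a bearer token.

    Assumption: the key arrives in an "Authorization: Bearer <key>" header.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("authorization", "")
    prefix = "Bearer "
    if not supplied.startswith(prefix):
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        supplied[len(prefix):].encode(), expected_key.encode()
    )
```

For example, `is_authorized({"Authorization": "Bearer s3cret"}, "s3cret")` passes, while a missing header or a wrong key is rejected.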
Another useful avenue for future work would be to include a Grafana instance to create a nicer dashboard.
If you’re interested in monitoring canisters, you should also have a look at CanisterGeek for on-chain monitoring.
Hope you find this useful and looking forward to your comments!