The NNS motion proposal is live, and voting is open for the next 4 days.
I agree with this change and endorse it.
I come from a Web2 background where I did a lot of integrations, and much of the “pain” came from handling all the possible errors and edge cases. We could never “prepare well enough”, so it was crucial to have proper alerting and logging systems.
For the “best effort” system to be a success, we definitely need “canister logging on traps” and proper handling of these failures (sleep plus a number of retries, as mentioned).
Please kindly prioritise these before the release of wide changes in networking.
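To make the "sleep + retry" idea concrete, here is a minimal sketch in plain Rust. It is not taken from any IC library: `retry_with_backoff` and its parameters are hypothetical names, and the wait is passed in as a closure because a canister would most likely schedule the next attempt with a timer rather than block.

```rust
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` attempts have been made.
/// `op` receives the 1-based attempt number.
/// `sleep` is injected so the caller decides how to wait (hypothetical design,
/// e.g. a blocking sleep off-chain, or a timer-based wait in a canister).
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Double the delay after each failure: base, 2*base, 4*base, ...
                let factor = 2u32.saturating_pow(attempt - 1);
                sleep(base_delay.saturating_mul(factor));
                attempt += 1;
            }
        }
    }
}
```

The important part is that the number of attempts is bounded and the final error is returned, so the caller can log it or raise an alert instead of retrying forever.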
There is already work in progress to preserve and expose logs, with explicit coverage for traps.
For alerting, you can expose standard Prometheus metrics via an HTTP endpoint (e.g. here are the NNS governance canister metrics). The one thing to be careful about is attaching an explicit timestamp to every sample, so that if you hit a replica that is significantly behind the rest of the subnet, you get a gap instead of an out-of-order sample.
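As a rough illustration of the timestamp advice, the sketch below writes one gauge in the Prometheus text exposition format with an explicit timestamp. The function names and the metric name are made up for this example, and it assumes the canister reads its current time in nanoseconds; Prometheus expects milliseconds since the Unix epoch, hence the conversion.

```rust
use std::fmt::Write;

/// Converts a nanosecond timestamp (assumed source: the canister's system
/// time) to the milliseconds that the Prometheus text format expects.
pub fn nanos_to_millis(nanos: u64) -> u64 {
    nanos / 1_000_000
}

/// Appends one gauge sample to `out`, including an explicit timestamp.
/// Hypothetical helper, not part of any existing metrics library.
pub fn encode_gauge(out: &mut String, name: &str, help: &str, value: f64, timestamp_ms: u64) {
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {} {}", name, help);
    let _ = writeln!(out, "# TYPE {} gauge", name);
    // Sample line: <name> <value> <timestamp in ms>
    let _ = writeln!(out, "{} {} {}", name, value, timestamp_ms);
}
```

The resulting string would be returned as the body of the metrics HTTP response. Because each sample carries the time at which the replica produced it, a lagging replica yields an older timestamp rather than a fresh-looking but stale value.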
@here - if you’re interested in this topic, we’ll have a presentation and discussion about it this Thursday.