Hey guys,
This afternoon (around 3pm), for roughly 20-30 minutes, a lot of our canisters returned the famous “no healthy nodes” error.
This is bad for trust: often you want to demo your product built on dfinity, and a canister is down for some random reason (Bob is killing your subnet, or a bad production release kills your subnet nodes…).
We have been facing a lot of issues for the past ~2-3 weeks. How can we help to improve dfinity stability?
The subnet upgraded to a new version around 3pm. So a short period of time - usually around 1 minute - of inaccessibility is expected. (The error message is bad, this will be made more informative shortly.) Such a state for 20-30 minutes is completely unexpected.
Update: Our observability stack noticed a downtime of 3-5 minutes at 3pm, related to the subnet upgrade. So longer than the ~1 minute I wrote above, but quite far from the 20-30 minutes you reported. Could the excess time maybe be explained by some caching or similar on your side?
UPDATE: All errors we saw were between 15:08 and 15:12.
I can see logs at 15:17 with the same error message on our side.
Btw, this morning the Gold DAO dashboard is down, again with the same error (no healthy nodes), and some users are complaining about it on Telegram.
This is also likely related to a subnet upgrade proposal. We’ve been looking into this in a bit more detail and there was a performance regression in subnet upgrades, taking them from 1 minute to 3-5 minutes right now. The issue has been identified and should be resolved soon – which means upgrades should be faster again.
The current update mechanism for subnets unfortunately requires some downtime. Making that time short again (back to ~1 minute where it used to be) and providing better responses, which is what is being worked on and can be provided short term, helps to some extent, but it won’t make this entirely go away. A complete solution requires deeper changes on the protocol side and thus has a longer timeframe.
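Since some downtime per upgrade is expected for now, one client-side mitigation is to retry transient failures with exponential backoff instead of surfacing the first 503 to users. Below is a minimal sketch of that idea, not an official recommendation: `TransientError`, `call_with_backoff` and all parameter values are hypothetical names and defaults chosen for illustration, and how you map a `no_healthy_nodes` response onto `TransientError` depends on the agent library you use.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. a 503 with body no_healthy_nodes."""


def call_with_backoff(call, max_attempts=6, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Invoke call() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts, capped at max_delay. Only TransientError is retried; any
    other exception propagates immediately.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            # Out of attempts: give up and let the caller see the error.
            if attempt == max_attempts - 1:
                raise
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

With these defaults the waits add up to 1 + 2 + 4 + 8 + 16 = 31 seconds, which would not cover a 3-5 minute window, so `max_attempts` and `max_delay` would need tuning to the downtime you actually observe. Only idempotent calls should be retried blindly.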
Caused by: The replica returned an HTTP Error: Http Error: status 503 Service Unavailable, content type “text/plain; charset=utf-8”, content: no_healthy_nodes
Also had a lot of:
Error: Failed update call.
Caused by: The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0xf3b28252d42ec51d449871b80c6d09ad09e77c84b36ae2441f8a6359b6239326 timed out waiting to start executing., error code None
These came from running commands via dfx. The only way to get rid of those errors here was to play with compute allocation.
This is what I mean when I talk about stability issues. For 3 weeks I have not had a single day without a stability problem, related to the subnet, unhealthy nodes, or whatever…
Failed to call add_to_whitelist: CertifiedReject(RejectResponse { reject_code: SysTransient, reject_message: “Ingress message 0x3bc46bfe44b966123eb6331e11410199076d8c5bcc142735a4d41f58c6cf5bff timed out waiting to start executing.”, error_code: None })
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
This is an error from the boundary nodes: The boundary node doesn’t see any healthy replica node within the destination subnet. This happens during a subnet upgrade (about once a week) and is independent of the “subnet load” issue.
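Because the two failure modes in this thread have different causes, it can help to count them separately in your own logs before reporting. The snippet below is only a rough sketch that matches on the error text as pasted in this thread; the function name `classify` and the category labels are made up for illustration, and the message wording may change once the boundary node response is made more informative.

```python
import re

# Matches the replica rejection text quoted in this thread and captures the message id.
INGRESS_TIMEOUT = re.compile(
    r"Ingress message (0x[0-9a-f]+) timed out waiting to start executing"
)


def classify(error_text):
    """Return (category, ingress_message_id or None) for a raw error string."""
    # 503 from the boundary node: no healthy replica seen in the subnet.
    if "no_healthy_nodes" in error_text:
        return ("boundary_no_healthy_nodes", None)
    # SysTransient rejection: the ingress message never started executing.
    match = INGRESS_TIMEOUT.search(error_text)
    if match:
        return ("ingress_timeout", match.group(1))
    return ("other", None)
```

Grouping log lines by the first element of the result gives a quick view of whether an incident lines up with an upgrade window or with the load issue.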
Failed to call store_file: CertifiedReject(RejectResponse { reject_code: SysTransient, reject_message: “Ingress message 0xd0ea1f8392825b88e9f21955ca4445b69a57540b894a52986d7e35f128b96a8e timed out waiting to start executing.”, error_code: None })
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
I have practically not been able to use subnet e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe in the last couple of days.
Facing the same error on subnet opn46-zyspe-hhmyp-4zu6u-7sbrh-dok77-m7dch-im62f-vyimr-a3n2c-4ae when trying to send an update call to canister 2f5ll-gqaaa-aaaak-qcfuq-cai
Error: Failed update call.
Caused by: The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0xe462c7652d33f6ce7398b5b1cf42821295235012597356a10136d040bc8168f0 timed out waiting to start executing., error code None
This started around 15 minutes ago and is still ongoing.
Just got an error when calling authentication at https://identity.ic0.app/#authorize: Internal Server Error: HttpError(Http Error: status 503 Service Unavailable, content type “text/plain; charset=utf-8”, content: no_healthy_nodes)