Stability issue

Gwojda · October 2, 2024, 2:51pm

Hey guys,
This afternoon (around 3pm), for roughly 20-30 minutes, many of our canisters returned the famous “no healthy nodes” error.
This hurts trust: often, when you want to demo a product built on dfinity, a canister is down for some random reason (Bob is killing your subnet, or a bad production release kills your subnet nodes…).
We have been facing a lot of issues for about 2-3 weeks. How can we help improve dfinity stability?

The canisters where we had an issue were:

obapm-2iaaa-aaaak-qcgca-cai ( subnet opn46-zyspe-hhmyp-4zu6u-7sbrh-dok77-m7dch-im62f-vyimr-a3n2c-4ae )
m45be-jaaaa-aaaak-qcgnq-cai ( subnet opn46-zyspe-hhmyp-4zu6u-7sbrh-dok77-m7dch-im62f-vyimr-a3n2c-4ae )
So I imagine it was directly related to this subnet.

Thanks,
Gautier

bjoern · October 2, 2024, 3:03pm

The subnet upgraded to a new version around 3pm. So a short period of time - usually around 1 minute - of inaccessibility is expected. (The error message is bad, this will be made more informative shortly.) Such a state for 20-30 minutes is completely unexpected.

bjoern · October 2, 2024, 3:19pm

Update: Our observability stack noticed a downtime of 3-5 minutes at 3pm, related to the subnet upgrade. So longer than the ~1 minute I wrote above, but still far from the 20-30 minutes you reported. Could the excess time maybe be explained by some caching or similar on your side?

UPDATE: All errors we saw were between 15:08 and 15:12.

Gwojda · October 3, 2024, 7:33am

I can see logs at 15:17 with the same error message on our side.
Btw, this morning the gold dao dashboard is down again with the same error (no healthy nodes), and some users are complaining about it on Telegram.

(at 9:23am this morning)

My post is not about exactly how many minutes the canister was unreachable, but about finding solutions to avoid this downtime.

Canister ID impacted this morning: rbsh4-yyaaa-aaaal-qdigq-cai, subnet: e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe

More and more people are now coming to dfinity applications, and stability is key if we want mass adoption.

frederico02 · October 3, 2024, 11:08am

Something is clearly going wrong. I’ve seen others report this issue in different telegram chats for different projects.

bjoern · October 3, 2024, 12:21pm

This is also likely related to a subnet upgrade proposal. We’ve been looking into this in a bit more detail, and there was a performance regression in subnet updates that currently takes them from ~1 minute to 3-5 minutes. The issue has been identified and should be resolved soon, which means updates should be faster again.

The current update mechanism for subnets unfortunately requires some downtime. Making that time short again (back to ~1 minute where it used to be) and providing better responses, which is what is being worked on and can be provided short term, helps to some extent, but it won’t make this entirely go away. A complete solution requires deeper changes on the protocol side and thus has a longer timeframe.
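Until that downtime shrinks, a client can ride out a short upgrade window by retrying. Below is a minimal sketch of capped exponential backoff, not an official recommendation. `TransientUnavailable` and `call_with_retry` are hypothetical names, not part of any IC library; the assumption is that your own client wrapper raises `TransientUnavailable` when it sees HTTP 503 with the body no_healthy_nodes. The default budget of 600 seconds is also an assumption, chosen only to exceed the 3-5 minutes mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientUnavailable(Exception):
    """Hypothetical error type: raised by the caller's own client wrapper when a
    request fails with HTTP 503 and the body "no_healthy_nodes"."""


def call_with_retry(
    call: Callable[[], T],
    max_wait_s: float = 600.0,   # assumed budget, above the 3-5 minutes reported
    base_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with capped exponential backoff until it succeeds or the
    next sleep would push the total time slept past `max_wait_s`."""
    waited = 0.0
    delay = base_delay_s
    while True:
        try:
            return call()
        except TransientUnavailable:
            if waited + delay > max_wait_s:
                raise  # budget exhausted: surface the error to the caller
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)  # double the delay, up to the cap
```

Note that blindly retrying an update call is only reasonable if the call is idempotent or deduplicated on the canister side; for query calls the retry is harmless.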

Gwojda · October 4, 2024, 8:30am

Caused by: The replica returned an HTTP Error: Http Error: status 503 Service Unavailable, content type “text/plain; charset=utf-8”, content: no_healthy_nodes

https://dashboard.internetcomputer.org/subnet/opn46-zyspe-hhmyp-4zu6u-7sbrh-dok77-m7dch-im62f-vyimr-a3n2c-4ae

This happened while deploying a migration script in production.

Gwojda · October 4, 2024, 8:31am

I also had a lot of:
Error: Failed update call.
Caused by: The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0xf3b28252d42ec51d449871b80c6d09ad09e77c84b36ae2441f8a6359b6239326 timed out waiting to start executing., error code None

This was while running commands via dfx. The only way to resolve those errors was to play with compute allocation.

This is what I mean when I talk about stability issues. For 3 weeks I have not had a single day without a stability problem, whether related to the subnet, unhealthy nodes, or whatever…

Samer · October 4, 2024, 11:11am

I am experiencing similar issues on subnet e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe

Samer · October 6, 2024, 8:42am

Failed to call add_to_whitelist: CertifiedReject(RejectResponse { reject_code: SysTransient, reject_message: “Ingress message 0x3bc46bfe44b966123eb6331e11410199076d8c5bcc142735a4d41f58c6cf5bff timed out waiting to start executing.”, error_code: None })
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

subnet e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe

rbirkner · October 7, 2024, 6:04am

The no_healthy_nodes error comes from the boundary nodes: the boundary node doesn’t see any healthy replica node within the destination subnet. This happens during a subnet upgrade (about once a week) and is independent of the “subnet load” issue.
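Since two different errors are mixed together in this thread, it may help to separate them in your own logs. A minimal sketch, using only the error strings quoted in the posts above: `Cause` and `classify_error` are hypothetical names, and treating the "timed out waiting to start executing" rejection as the "subnet load" issue is an assumption that the thread suggests but does not state outright.

```python
from enum import Enum


class Cause(Enum):
    SUBNET_UPGRADE = "boundary node sees no healthy replica (e.g. subnet upgrade)"
    SUBNET_LOAD = "ingress message timed out waiting to start executing"
    UNKNOWN = "unknown"


def classify_error(message: str) -> Cause:
    """Map a raw error string to a likely cause by substring matching."""
    text = message.lower()
    if "no_healthy_nodes" in text:
        # HTTP 503 returned by the boundary node
        return Cause.SUBNET_UPGRADE
    if "timed out waiting to start executing" in text:
        # SysTransient rejection; assumed here to be the "subnet load" issue
        return Cause.SUBNET_LOAD
    return Cause.UNKNOWN
```

Counting log lines per cause would show whether a given outage lines up with an upgrade window or with sustained load on the subnet.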

Samer · October 7, 2024, 9:33am

Failed to call store_file: CertifiedReject(RejectResponse { reject_code: SysTransient, reject_message: “Ingress message 0xd0ea1f8392825b88e9f21955ca4445b69a57540b894a52986d7e35f128b96a8e timed out waiting to start executing.”, error_code: None })
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

I have practically not been able to use subnet e66qm-3cydn-nkf4i-ml4rb-4ro6o-srm5s-x5hwq-hnprz-3meqp-s7vks-5qe in the last couple of days.

Samer · October 7, 2024, 9:34am

Is that the bob.fun subnet?

Dustin · October 8, 2024, 3:28pm

Facing the same error on subnet opn46-zyspe-hhmyp-4zu6u-7sbrh-dok77-m7dch-im62f-vyimr-a3n2c-4ae when trying to send an update call to canister 2f5ll-gqaaa-aaaak-qcfuq-cai

Error: Failed update call.
Caused by: The replica returned a rejection error: reject code SysTransient, reject message Ingress message 0xe462c7652d33f6ce7398b5b1cf42821295235012597356a10136d040bc8168f0 timed out waiting to start executing., error code None

This started around 15 minutes ago and is still ongoing.

bytesun · October 9, 2024, 3:18pm

Just got an error when calling authentication at https://identity.ic0.app/#authorize: Internal Server Error: HttpError(Http Error: status 503 Service Unavailable, content type “text/plain; charset=utf-8”, content: no_healthy_nodes)
