Hi, we are noticing that our application https://id.decideai.xyz has not been accessible for the last few hours. This is occurring on the subnet pjljw-kztyl-46ud4-ofrj6-nzkhm-3n4nt-wi3jt-ypmav-ijqkt-gjf66-uae. Is this a known issue?
Do you mean your frontend? I can access it. That said, there was a caching issue: I had to Shift+Refresh the window to reload the resources. In incognito mode it worked on the first try.
It was both the frontend and backend canisters. The issue seems to be resolved now. I was seeing a 503 error on the frontend, and when calling status on the canister I received this message:
dfx canister --network ic status decideid
Error: Failed to get canister status for 'decideid'.
Caused by: Failed to get canister status of rlz47-aqaaa-aaaah-qdcra-cai.
Caused by: Failed to call update function 'canister_status' regarding canister 'rlz47-aqaaa-aaaah-qdcra-cai'.
Caused by: Update call (without wallet) failed.
Caused by: The replica returned an HTTP Error: Http Error: status 503 Service Unavailable, content type "text/plain; charset=utf-8", content: error: no_healthy_nodes
details: There are currently no healthy replica nodes available to handle the request. This may be due to an ongoing upgrade of the replica software in the subnet. Please try again later.
Now calling status works and I receive the results.
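For what it's worth, since the error above is a transient 503 (no_healthy_nodes), a small retry loop around the status call is a handy way to watch for recovery. A minimal sketch, assuming the decideid canister alias from the command above; the attempt count and 30-second interval are just illustrative values:

```bash
#!/usr/bin/env bash
# Poll canister status until the subnet responds again.
# 'decideid' comes from the command above; 20 attempts / 30s are placeholders.
set -u

for attempt in $(seq 1 20); do
  if dfx canister --network ic status decideid; then
    echo "Canister status returned successfully (attempt ${attempt})."
    break
  fi
  echo "Status call failed (attempt ${attempt}); retrying in 30s..."
  sleep 30
done
```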
Hey @modclub
Thanks for reporting the issue, and sorry for the trouble! We've identified the cause and implemented a fix.
One of the data centers (at2) lost Internet connectivity, taking all its nodes offline, including an API boundary node. However, only IPv6 was affected; IPv4 remained functional. Since our HTTP gateways rely on health checks that detected IPv4 connectivity, they incorrectly deemed the node healthy and continued routing requests to it. Unfortunately, because the Internet Computer core is IPv6-only, those requests couldn't be processed, causing failures.
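To make the failure mode concrete: a health check that only probes a node over IPv4 can report it as healthy even when its IPv6 path, the one the IC core actually uses, is down. A rough sketch of a dual-stack probe, assuming the API boundary node answers /api/v2/status over HTTPS; the hostname here is a placeholder, not a real node:

```bash
#!/usr/bin/env bash
# Probe an API boundary node over both address families.
# NODE is a placeholder hostname; /api/v2/status is assumed reachable on it.
NODE="example-api-bn.ic0.app"

check() {
  local family_flag="$1" label="$2"
  if curl "${family_flag}" --silent --fail --max-time 5 \
       "https://${NODE}/api/v2/status" > /dev/null; then
    echo "${label}: healthy"
  else
    echo "${label}: unreachable"
  fi
}

check -4 "IPv4"
check -6 "IPv6"
# Only treat the node as healthy if the IPv6 probe succeeds,
# since replica traffic is IPv6-only.
```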
Since HTTP gateways perform proximity-based routing, users in the US were more affected than others. Ultimately, 1 out of the 20 API boundary nodes had issues, and requests hitting that node failed. The challenge is that when loading a frontend, it's not enough for most requests to succeed: those few failed requests could be critical (e.g., index.html or important JavaScript files), leading to a broken experience.
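Along the same lines, a quick way to check whether a single flaky gateway breaks a page load is to fetch a critical asset with retries. A small sketch, using the frontend URL from the original post purely as an example; the retry settings are illustrative and --retry-all-errors needs curl 7.71 or newer:

```bash
#!/usr/bin/env bash
# Fetch a critical frontend asset, retrying on any error (including 503s).
curl --silent --show-error --fail \
     --retry 5 --retry-delay 2 --retry-all-errors \
     "https://id.decideai.xyz/index.html" > /dev/null \
  && echo "index.html fetched successfully" \
  || echo "index.html still failing after retries"
```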
While previous full node outages were successfully handled, this was our first partial outage. Weâre now improving our health checks to prevent similar issues in the future.