Hi, we are noticing that our application https://id.decideai.xyz has not been accessible for the last few hours. This is occurring on the subnet pjljw-kztyl-46ud4-ofrj6-nzkhm-3n4nt-wi3jt-ypmav-ijqkt-gjf66-uae. Is this a known issue?
Do you mean your frontend? I can access it. That said, there was a caching issue: I had to Shift+Refresh the window to reload the resources. In incognito mode it worked on the first try.
It was both the frontend and backend canisters. The issue seems to be resolved now. I was seeing a 503 error on the frontend, and when calling status on the canister I received this message:
dfx canister --network ic status decideid
Error: Failed to get canister status for 'decideid'.
Caused by: Failed to get canister status of rlz47-aqaaa-aaaah-qdcra-cai.
Caused by: Failed to call update function 'canister_status' regarding canister 'rlz47-aqaaa-aaaah-qdcra-cai'.
Caused by: Update call (without wallet) failed.
Caused by: The replica returned an HTTP Error: Http Error: status 503 Service Unavailable, content type "text/plain; charset=utf-8", content: error: no_healthy_nodes
details: There are currently no healthy replica nodes available to handle the request. This may be due to an ongoing upgrade of the replica software in the subnet. Please try again later.
Now calling status works and I receive the results.
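For what it's worth, since the error above is a transient 503 (no_healthy_nodes), a small retry loop around the status call is a handy way to watch for recovery. A minimal sketch, assuming the decideid canister alias from the command above; the attempt count and 30-second interval are just illustrative values:

```bash
#!/usr/bin/env bash
# Poll canister status until the subnet responds again.
# 'decideid' comes from the command above; 20 attempts / 30s are placeholders.
set -u

for attempt in $(seq 1 20); do
  if dfx canister --network ic status decideid; then
    echo "Canister status returned successfully (attempt ${attempt})."
    break
  fi
  echo "Status call failed (attempt ${attempt}); retrying in 30s..."
  sleep 30
done
```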
Hey @modclub
Thanks for reporting the issue, and sorry for the trouble! We've identified the cause and implemented a fix.
One of the data centers (at2) lost Internet connectivity, taking all its nodes offline, including an API boundary node. However, only IPv6 was affected; IPv4 remained functional. Since our HTTP gateways rely on health checks that detected IPv4 connectivity, they incorrectly deemed the node healthy and continued routing requests to it. Unfortunately, because the Internet Computer core is IPv6-only, those requests couldn't be processed, causing failures.
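To make the failure mode concrete: a health check that only probes a node over IPv4 can report it as healthy even when its IPv6 path, the one the IC core actually uses, is down. A rough sketch of a dual-stack probe, assuming the API boundary node answers /api/v2/status over HTTPS; the hostname here is a placeholder, not a real node:

```bash
#!/usr/bin/env bash
# Probe an API boundary node over both address families.
# NODE is a placeholder hostname; /api/v2/status is assumed reachable on it.
NODE="example-api-bn.ic0.app"

check() {
  local family_flag="$1" label="$2"
  if curl "${family_flag}" --silent --fail --max-time 5 \
       "https://${NODE}/api/v2/status" > /dev/null; then
    echo "${label}: healthy"
  else
    echo "${label}: unreachable"
  fi
}

check -4 "IPv4"
check -6 "IPv6"
# Only treat the node as healthy if the IPv6 probe succeeds,
# since replica traffic is IPv6-only.
```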
Since HTTP gateways perform proximity-based routing, users in the US were more affected than others. Ultimately, 1 out of the 20 API boundary nodes had issues, and requests hitting that node failed. The challenge is that when loading a frontend, it's not enough for most requests to succeed: those few failed requests could be critical (e.g., index.html or important JavaScript files), leading to a broken experience.
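Along the same lines, a quick way to check whether a single flaky gateway breaks a page load is to fetch a critical asset with retries. A small sketch, using the frontend URL from the original post purely as an example; the retry settings are illustrative and --retry-all-errors needs curl 7.71 or newer:

```bash
#!/usr/bin/env bash
# Fetch a critical frontend asset, retrying on any error (including 503s).
curl --silent --show-error --fail \
     --retry 5 --retry-delay 2 --retry-all-errors \
     "https://id.decideai.xyz/index.html" > /dev/null \
  && echo "index.html fetched successfully" \
  || echo "index.html still failing after retries"
```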
While previous full node outages were successfully handled, this was our first partial outage. Weâre now improving our health checks to prevent similar issues in the future.