Replica is unhealthy: WaitingForCertifiedState

lastmjs · April 25, 2024, 6:34pm

TL;DR

Can someone help me to understand what is going on in the replica when I see the error Replica is unhealthy: WaitingForCertifiedState? We have a relatively small PR that we don’t think should be causing this error, as it isn’t on other branches.

Longer Story

I could really use some help here. We have a relatively small PR that we’re trying to merge into Azle main, but one of our tests is consistently failing. This test passes just fine on other branches. We’ve even tried updating dfx to a later beta/unstable version to see if it’s a problem with dfx 0.19.0.

The problem is that our large file upload test, which uploads many hundreds of MiB of files to an Azle canister, sets off many timers, and generally makes a lot of concurrent update calls, is now failing with the error below. The failure happens after all of the file upload requests have been performed, while the test is waiting for an internal hashing process to finish:

Finished uploading files. Waiting for file hashing to finish...
/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:543
    throw new AgentHTTPResponseError(errorMessage, {
          ^
AgentHTTPResponseError [AgentError]: Server returned an error:
  Code: 503 (Service Unavailable)
  Body: Replica is unhealthy: WaitingForCertifiedState. Check the /api/v2/status for more information.

    at HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:543:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:540:14)
    at async HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:540:14)
    at async HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:540:14)
    at async HttpAgent.readState (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:771:22)
    at async pollForResponse (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/polling/index.ts:36:17)
    at async caller (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/actor.ts:478:29)
    at async cleanup (/home/runner/work/azle/azle/src/compiler/file_uploader/on_before_exit.ts:46:9)
    at async process.<anonymous> (/home/runner/work/azle/azle/src/compiler/file_uploader/on_before_exit.ts:21:13) {
  response: {
    ok: false,
    status: 503,
    statusText: 'Service Unavailable',
    headers: [ [Array], [Array], [Array], [Array] ]
  }
}
Error: Failed while trying to deploy canisters.
Caused by: Failed while trying to deploy canisters.
  Failed while trying to install all canisters.
    Failed to install wasm module to canister 'backend'.
      Failed to run post-install tasks
        Failed to run post-install task npx azle upload-assets backend
          The post-install task `npx azle upload-assets backend` failed with exit code 1

jennifertran · April 25, 2024, 8:28pm

This means that either you are hitting a dead node or the node is slightly behind in the subnet.

I would first check which subnet the canister is on using the IC Data Dashboard, and then see if there are any degraded nodes within that subnet.

Does that help?

lastmjs · April 25, 2024, 9:24pm

It’s just the local replica running with dfx start in tests in GitHub Actions

jennifertran · April 25, 2024, 10:53pm

Let me double-check and get back to you.

jennifertran · April 27, 2024, 1:22am

The replica is in some form of corrupted state. You can try restarting the replica from a clean slate with dfx start --clean.

Please let us know if that works.
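
If a clean restart does not help and the 503 turns out to be transient (for example, the local replica falling behind while the test is hammering it with update calls), one possible workaround is to retry the failing call for a short while instead of letting the first error abort the deploy. The following is only a rough sketch under that assumption. The helper names isWaitingForCertifiedState and retryWhileUnhealthy, the attempt count, and the delay are all hypothetical; none of this comes from Azle or @dfinity/agent. It relies only on the error message containing the text WaitingForCertifiedState, as it does in the log above.

```typescript
// Hypothetical helpers; not part of Azle or @dfinity/agent.

// Returns true when an error message mentions the unhealthy-replica state
// shown in the log above.
function isWaitingForCertifiedState(error: unknown): boolean {
    const message = error instanceof Error ? error.message : String(error);
    return message.includes('WaitingForCertifiedState');
}

// Calls `call` up to `maxAttempts` times, waiting `delayMs` between attempts.
// Only the WaitingForCertifiedState error is retried; anything else is
// rethrown immediately, as is the last failure once attempts run out.
async function retryWhileUnhealthy<T>(
    call: () => Promise<T>,
    maxAttempts = 5, // assumed value, tune for your test
    delayMs = 2_000 // assumed value, tune for your test
): Promise<T> {
    for (let attempt = 1; ; attempt++) {
        try {
            return await call();
        } catch (error) {
            if (!isWaitingForCertifiedState(error) || attempt >= maxAttempts) {
                throw error;
            }
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
}
```

You would wrap whichever actor call is polling for the hashing result, e.g. await retryWhileUnhealthy(() => actor.someMethod()), where someMethod stands in for your own canister method. If the replica really is corrupted, as suggested above, the retries will simply run out and the original error is thrown unchanged.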
