Replica is unhealthy: WaitingForCertifiedState

TL;DR

Can someone help me understand what is going on in the replica when I see the error Replica is unhealthy: WaitingForCertifiedState? We have a relatively small PR that we don’t think should be causing this error, since it doesn’t occur on other branches.

Longer Story

I could really use some help here. We have a relatively small PR that we’re trying to merge into Azle main, but one of our tests is consistently failing. This test passes just fine on other branches. We’ve even tried updating dfx to a later beta/unstable version to see if it’s a problem with dfx 0.19.0.

The problem is with our large file upload test, which uploads many hundreds of MiB of files to an Azle canister, fires off many timers, and generally makes a lot of update calls concurrently. It now fails with the error below after all of the file upload requests have been performed, while it’s waiting for an internal hashing process to finish:

Finished uploading files. Waiting for file hashing to finish...
/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:543
    throw new AgentHTTPResponseError(errorMessage, {
          ^
AgentHTTPResponseError [AgentError]: Server returned an error:
  Code: 503 (Service Unavailable)
  Body: Replica is unhealthy: WaitingForCertifiedState. Check the /api/v2/status for more information.

    at HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:543:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:540:14)
    at async HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:540:14)
    at async HttpAgent._requestAndRetry (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:540:14)
    at async HttpAgent.readState (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/agent/http/index.ts:771:22)
    at async pollForResponse (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/polling/index.ts:36:17)
    at async caller (/home/runner/work/azle/azle/node_modules/@dfinity/agent/src/actor.ts:478:29)
    at async cleanup (/home/runner/work/azle/azle/src/compiler/file_uploader/on_before_exit.ts:46:9)
    at async process.<anonymous> (/home/runner/work/azle/azle/src/compiler/file_uploader/on_before_exit.ts:21:13) {
  response: {
    ok: false,
    status: 503,
    statusText: 'Service Unavailable',
    headers: [ [Array], [Array], [Array], [Array] ]
  }
}
Error: Failed while trying to deploy canisters.
Caused by: Failed while trying to deploy canisters.
  Failed while trying to install all canisters.
    Failed to install wasm module to canister 'backend'.
      Failed to run post-install tasks
        Failed to run post-install task npx azle upload-assets backend
          The post-install task `npx azle upload-assets backend` failed with exit code 1

This means that either you are hitting a dead node or the node is slightly behind in the subnet.
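
If the node is only slightly behind, the 503 may be transient, so one thing to try is retrying the failing call with a longer wait. The stack trace above suggests the agent already retries a few times on its own, so the sketch below is an extra caller-level layer, not a replacement. All names here (isTransientReplicaError, backoffDelayMs, withReplicaRetry) and the default timings are hypothetical and not part of @dfinity/agent or Azle. The error matching is based only on the message text shown in the log.

```typescript
// Hypothetical helpers; none of these names exist in @dfinity/agent or Azle.

/** True when an error looks like the transient 503 from the log above. */
function isTransientReplicaError(err: unknown): boolean {
    const message = err instanceof Error ? err.message : String(err);
    return message.includes('503') && message.includes('Replica is unhealthy');
}

/** Exponential backoff delay for a zero-based attempt number, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 10_000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

/** Runs `call`, retrying only on transient replica errors, up to maxAttempts in total. */
async function withReplicaRetry<T>(
    call: () => Promise<T>,
    maxAttempts = 5
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await call();
        } catch (err) {
            const lastAttempt = attempt >= maxAttempts - 1;
            if (lastAttempt || !isTransientReplicaError(err)) {
                // Give up: either out of attempts or not a replica health error.
                throw err;
            }
            await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
        }
    }
}
```

Usage would look something like await withReplicaRetry(() => actor.someMethod()), where actor.someMethod stands in for whatever canister call is failing. If the replica never recovers, the original error is rethrown after the last attempt, so this only helps when the unhealthy state is temporary.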

I would first check which subnet the canister is on using the IC Data Dashboard, and then see whether there are any degraded nodes within that subnet.

Does that help?

It’s just the local replica, started with dfx start, running in tests in GitHub Actions.

Let me double-check and get back to you.

The replica is in some form of corrupted state. You can try restarting the replica from a clean slate with dfx start --clean.

Please let us know if that works.
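
After a clean restart in CI, it may also help to wait until the replica reports itself healthy before deploying. Below is a minimal sketch of that idea. It assumes the status object exposes a replica_health_status field with the value "healthy", which is my reading of the status endpoint named in the error message, and that the agent offers an async status() method (HttpAgent in @dfinity/agent does, to the best of my knowledge). StatusSource, isReplicaHealthy, waitForHealthyReplica, and the timeouts are hypothetical names and values chosen for illustration.

```typescript
// Minimal structural type so the sketch compiles without @dfinity/agent installed.
// Assumption: an HttpAgent instance can be passed here because it has status().
type StatusSource = { status(): Promise<Record<string, unknown>> };

/** True when a status object reports the replica as healthy. */
function isReplicaHealthy(status: Record<string, unknown>): boolean {
    // Assumed field name and value; anything else is treated as not healthy.
    return status['replica_health_status'] === 'healthy';
}

/** Polls status() until the replica is healthy or the timeout elapses. */
async function waitForHealthyReplica(
    source: StatusSource,
    timeoutMs = 60_000,
    intervalMs = 1_000
): Promise<boolean> {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        try {
            if (isReplicaHealthy(await source.status())) {
                return true;
            }
        } catch {
            // The replica may not be accepting connections yet; keep polling.
        }
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    return false;
}
```

A test setup step could call waitForHealthyReplica(agent) and fail fast with a clear message if it returns false, rather than hitting the 503 in the middle of the upload.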