Embedding wasm - dfx 0.17.0 crashing where previous version works

Just hit this issue with dfx 0.16.1. Previous deploys (using this same version of dfx) have succeeded.

Here were my steps (I’m using dfxvm):

  1. I briefly upgraded from dfx 0.16.1 to 0.18.0 in order to deploy a canister on the same subnet as another canister.
  2. I downgraded back to dfx 0.16.1, and attempted to deploy a new wasm for an existing canister. I received this error:

Failed during wasm installation call: The replica returned a replica error: reject code CanisterError, reject message Canister <canister_id> trapped explicitly: canister_pre_upgrade attempted with outstanding message callbacks (try stopping the canister before upgrade), error code None

At first I freaked out thinking my canister had trapped in the pre-upgrade hook, but based on the message I don’t think that’s the case. I’m also using Motoko stable variables for all of my types, so serialization is handled by the language in the background. Finally, none of these stable types changed in the upgrade, so no data was lost (I verified this).

  3. After ensuring nothing had been lost, I tried upgrading my canister again and received a different error this time (it looks similar to others in this thread):

Failed during wasm installation call: Candid returned an error: input: 4449444c036d016c01cedfa0a804026d7b010003_204b7cb7e28621a6f1110913a33e11455d407f584d3cfe91edf6c12c75b6bd00a2204f7deaf907cea359eda7caa340b204e1558cf451da3fe7582a4cb010a57290d020f52735197797ef2a9e1d2b5d7ed8c13bcd1f41fc65a4731742c12b87f2346bc3
table: type table0 = vec table1
type table1 = record { 1_158_164_430 : table2 }
type table2 = vec nat8
wire_type: record { 1_158_164_430 : table2 }, expect_type: vec nat8

Again, both of these deployment attempts come when using dfx 0.16.1 after downgrading from dfx 0.18.0 back to 0.16.1 using dfxvm.

This error means that your canister is still awaiting a response to another call and can’t be upgraded yet. This shouldn’t have anything to do with the dfx version.

1_158_164_430 is a Candid field-name hash, so I think this is a problem with the install call, and it seems to be trying to use the chunked installer. But if you downgraded to 0.16.1 this shouldn’t happen. What does dfx --version report? Did the downgrade actually work?

As Jordan and Severin pointed out, the candid decoding error record { 1_158_164_430 : table2 } should be about the type chunk_hash = record { hash: blob;} in the management canister API.
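To see why that connection holds: the label 1_158_164_430 in the decode error is the Candid hash of the field name `hash`. A minimal sketch of the hash function (fold `h = h * 223 + byte` over the ASCII bytes, mod 2^32):

```shell
# Candid field-name hash: h = (h * 223 + byte) mod 2^32 over the name's bytes.
# Hashing the name "hash" reproduces the numeric label from the error above.
h=0
for ch in h a s h; do
  byte=$(printf '%d' "'$ch")
  h=$(( (h * 223 + byte) % 4294967296 ))
done
echo "$h"   # 1158164430
```

So `record { 1_158_164_430 : vec nat8 }` is just `record { hash : blob }` with the field name replaced by its hash.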

The Conjecture

Adam: install_chunked_code has gone through some rocky API evolution and the error is likely related to this.

Let me elaborate a bit about the circumstance.

At the time of the dfx v0.16.0 release, the IC replica implemented the new install_chunked_code API in a way that didn’t conform to the specification. And dfx/agent-rs at that time added support for that “wrong” replica implementation.

To avoid breaking users who had been installing large wasm modules with dfx/agent-rs, we decided to keep support for both the “wrong” and the “correct” install_chunked_code API on the replica side.

The replica fix was deployed to the mainnet on 04/05/2024, which was later than the release of dfx v0.19.0. So the dfx-bundled replica never had the fix (v0.16.0 ~ v0.19.0).

So it is possible that some client-side (agent/CDK) install_chunked_code calls conformed to the specification but failed against the mainnet (before 04/05/2024) or against a local replica (dfx v0.16.0 ~ v0.19.0).
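The divergence can be sketched in Candid. The spec-conformant shape is from the IC interface spec; the “old” shape is reconstructed from the decode error earlier in this thread, so treat it as an assumption:

```candid
// Spec-conformant shape (current interface spec):
type chunk_hash = record { hash : blob };
type install_chunked_code_args = record {
  // ... other fields elided ...
  chunk_hashes_list : vec chunk_hash;
};

// Early (pre-fix) replica behavior, reconstructed from the
// "wire_type: record { 1_158_164_430 : ... }, expect_type: vec nat8"
// decode error: plain blobs with no record wrapper, i.e.
// chunk_hashes_list : vec blob;
```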

Action

We, the SDK team, have already:

  • Updated the dfx-bundled replica to include the fix as deployed on the mainnet.
  • Updated dfx and agent-rs to conform to the specification.
  • Released Rust CDK v0.13.2, which adds support for chunk uploads conforming to the specification (no ic-cdk version had the incorrect support).

If building dfx from source code is feasible, you can try it now and see if the errors disappear.

You may also choose to wait for the upcoming dfx release next week.


From our side, we were never using the chunked code uploads (our wasm has always been under 2MB). Maybe that changed during the last commit, but we never intentionally used any chunking functionality :man_shrugging:

Is there any way to downgrade dfx to get around this issue? I’d prefer not to need to upgrade dfx to the latest version unless necessary.

We added support for installing large wasm modules (over 2 MiB) in dfx v0.16.0. Since then, dfx automatically uses the chunked upload when the wasm module is larger than 1.85 MiB. So it might be the case that your wasm grew beyond 1.85 MiB, which triggered the chunked upload.
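A quick way to check whether your build would cross that threshold (a hedged sketch; the wasm path below is a placeholder for your own project layout):

```shell
# dfx (>= v0.16.0) switches to the chunked upload above 1.85 MiB.
THRESHOLD=$((185 * 1024 * 1024 / 100))   # 1.85 MiB, in bytes
echo "chunked-upload threshold: ${THRESHOLD} bytes"

# Placeholder path; adjust canister name and network directory to your project.
WASM=.dfx/ic/canisters/my_canister/my_canister.wasm
if [ -f "$WASM" ] && [ "$(wc -c < "$WASM")" -gt "$THRESHOLD" ]; then
  echo "dfx would use the chunked upload for $WASM"
fi
```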

To completely avoid the chunked upload, you may have to downgrade to a version before dfx v0.16.0, e.g. v0.15.0.

P.S. I strongly suggest using the latest dfx whenever possible. If you later need a feature from a newer release, it may be harder to upgrade from a very old version.

:disappointed: I think the way this was fixed on the backend to preserve large wasm module support effectively invalidates dfx 0.16.0–0.19.0 for us, even though we weren’t intentionally using the chunked wasm endpoint, since we can’t deploy any updates to our canister on these versions.

It turned out that over the past few months our wasm has grown over 2MB and we didn’t notice that it was using the chunked upload endpoint.

Huge thanks to @lwshang, who immediately reached out to me and hopped on a call to unblock us by suggesting we use the dfx wasm gzip option to reduce the size of the wasm.

That’s rockstar :guitar: developer customer service!

Although we’re not sure what the exact source of the deployment bug is, with the gzipped wasm we just successfully deployed the canister upgrade and are unblocked :tada:
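For anyone else hitting the size limit, here’s a hedged sketch of what the gzip option looks like in dfx.json (the canister name and paths are placeholders; check the schema for your dfx version):

```json
{
  "canisters": {
    "my_canister": {
      "type": "motoko",
      "main": "src/my_canister/main.mo",
      "gzip": true
    }
  }
}
```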


I finally managed to reproduce the error locally. So I believe I found the real cause of the issue.

Bug

In the ic-utils crate, there was a bug in InstallBuilder::call_and_wait_with_progress(). It couldn’t handle the case of a non-empty chunk store.

dfx v0.16.0 to v0.19.0 depended on this buggy ic-utils when installing a wasm larger than 1.85 MiB.

So for the canister installation failures discussed in this forum thread, it’s likely that the chunk store was not empty before the installation.

Fix

The merged agent-rs PR and sdk PR fixed the issue. I verified that the dfx built from sdk master branch can handle such cases properly.

EDITED:

I opened a PR which adds a test to make sure that we fixed the real issue.
I reproduced the error in CI by downgrading agent-rs to a previous buggy version.

The upcoming dfx release next week will certainly have the fix.

Before that, there is an easy remedy to unblock your current work.


Check

You can check whether your canister’s chunk store is empty with this command (replace the canister ID with yours):

dfx canister call --ic aaaaa-aa stored_chunks '(record { canister_id = principal "bkyz2-fmaaa-aaaaa-qaaaq-cai" })'

The result will be like:

(
  vec {
    record {
      1_158_164_430 = blob "\96\a2\96\d2\24\f2\85\c6\7b\ee\93\c3\0f\8a\30\91\57\f0\da\a3\5d\c5\b8\7e\41\0b\78\63\0a\09\cf\c7";
    };
  },
)

A non-empty vec means that the chunk store is non-empty.

Remedy

If the check above shows a non-empty chunk store, it is easy to fix.

You only need to execute this command (replace the canister id with yours):

dfx canister call --ic aaaaa-aa clear_chunk_store '(record { canister_id = principal "bkyz2-fmaaa-aaaaa-qaaaq-cai" })'

Then you should be able to install your canister.


Thanks for all your patience!
