Threshold Key Derivation - VetKD / VetKey - Message did not complete execution and certification within the replica defined timeout

Hey Guys,

this is Dominic from diode.io. We're using Threshold Key Derivation, aka VetKeys, in production and have by now deployed 50+ canisters with this capability.

One thing we're seeing quite regularly is timeout messages like this during key derivation:

Message did not complete execution and certification within the replica defined timeout.

Specifically, this happens when calling VetKD.system_api.vetkd_derive_key in this line: https://github.com/diodechain/zone_availability_canister/blob/master/src/MetaData.mo#L40

@kristofer, apart from implementing retry logic, is there anything we can do so that this works more often on the first try?

Cheers!

To the devs: maybe add the ability to extend the timeout if the underlying operation is computationally that slow?

I am looking into this, stay tuned.

Dominic, you are seeing the error message in the frontend, yes? Making a call to the canister that in turn makes the call to vetkd_derive_key()?

Are you using @dfinity/agent as the library to make the canister call, or are you using the VetKey KeyManager?

The error appears in our desktop app Diode Collab, so yes, in the “frontend”. The agent library we use is the Elixir client: icp_agent/lib/icp_agent.ex at main · diodechain/icp_agent · GitHub

But the error is generated and returned from within the IC. E.g. you can see that error message string originating here: ic/rs/http_endpoints/public/src/call/call_v3.rs at 297e165f3f241a3ed27b0f39fb9f0d03568a2ace · dfinity/ic · GitHub

Our canister function is really limited to the key derivation and does not do much else:

The entry point is in https://github.com/diodechain/zone_availability_canister/blob/master/src/ZoneAvailabilityCanister.mo:

  public shared (msg) func derive_vet_protector_key(transport_public_key : Blob, target_public_key : Blob) : async ?Blob {
    assert_membership(msg.caller);
    ignore await request_topup_if_low();
    await MetaData.derive_vet_protector_key(meta_data, transport_public_key, target_public_key);
  };

That then calls into https://github.com/diodechain/zone_availability_canister/blob/master/src/MetaData.mo:

  public func derive_vet_protector_key(_meta_data : MetaData, transport_public_key : Blob, target_public_key : Blob) : async ?Blob {
    let result = await (with cycles = 26_153_846_153) VetKD.system_api.vetkd_derive_key({
      input = "meta_data_encrpytion_key";
      context = target_public_key;
      transport_public_key = transport_public_key;
      key_id = { curve = #bls12_381_g2; name = "key_1" };
    });
    ?result.encrypted_key;
  };

Hope that helps.

Thanks @dominicletz, this is very helpful! You likely see the error because the retry functionality is implemented differently in the Elixir agent compared to, for instance, the JS agent.

When a v3 synchronous call times out, the agent should retry, either by falling back to the v2 polling behaviour or by issuing the same call again. I am not 100% sure yet which; I am having an active discussion with the core team about what the expected best practice is. I will pick up the conversation on Monday and get back to you with more details. In the meantime, if you plan on doing weekend hacking, I would recommend taking a closer look at how the JS agent implements its retry behaviour and mimicking that more closely.

The timeout itself is another question. Should it happen, and if so, how often? I will dig deeper into this as well. But given that your canisters are likely not deployed on the fiduciary subnet, cross-subnet calls need to be made when deriving VetKeys, and delays can occur there.

Currently, the implementation in @dfinity/agent behaves in these ways when making an update call:

  • if the response from the IC has status 202, it falls back to read_state polling (source)
  • if the response from the IC has a status code other than 200 or 202, or the request fails for another reason (e.g. a network error), it retries up to the configured retryTimes following the configured backoffStrategy (exponential backoff by default) (source); see the configuration sketch after this list
  • if the response from the IC has status 404 and the request was made to the /api/v3/... endpoint, it falls back to calling the /api/v2/... endpoint (source)
  • if the response from the IC contains Invalid request expiry in the body, it syncs time and retries (source)
  • in all other cases, it throws an AgentError with a specific code and kind that you can distinguish programmatically in your own logic
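
For reference, here is a minimal TypeScript sketch of how those knobs are typically wired up on the JS side. The canister ID and the idlFactory import path are hypothetical placeholders, and the exact option shapes can differ between agent-js versions:

  import { Actor, HttpAgent } from "@dfinity/agent";
  // Hypothetical path to the candid bindings generated by dfx.
  import { idlFactory } from "./declarations/zone_availability_canister";

  const agent = new HttpAgent({
    host: "https://icp-api.io",
    // Number of times the agent retries a failed update call before surfacing
    // the error; the backoffStrategy option (exponential backoff by default)
    // controls the delay between those retries.
    retryTimes: 5,
  });

  const actor = Actor.createActor(idlFactory, {
    agent,
    canisterId: "aaaaa-aa", // hypothetical placeholder, use the real canister id
  });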

@dominicletz I think you should try calling the read_state endpoint after you receive a replica timeout.
As an alternative, since delays are expected for cross-subnet requests, as @kristofer said, you could try making your update call as an asynchronous call directly.
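
Until that is in place in the Elixir agent, a simple application-level retry also helps. Below is a minimal TypeScript sketch of that pattern (the helper name, attempt count, and backoff numbers are assumptions, not part of any agent library); the Elixir client would need the equivalent loop:

  // Hypothetical retry wrapper around the canister call; the actor is assumed to
  // expose derive_vet_protector_key as generated from the Motoko entry point above.
  async function deriveWithRetry<T>(
    call: () => Promise<T>,
    maxAttempts = 4,
  ): Promise<T> {
    let delayMs = 500;
    for (let attempt = 1; ; attempt++) {
      try {
        return await call();
      } catch (err) {
        if (attempt >= maxAttempts) throw err; // give up after the last attempt
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        delayMs *= 2; // exponential backoff between attempts
      }
    }
  }

  // Usage (hypothetical actor created as in the earlier sketch):
  // const key = await deriveWithRetry(() =>
  //   actor.derive_vet_protector_key(transportPublicKey, targetPublicKey),
  // );

Note that re-issuing an update call executes it again on the canister (and attaches cycles to the vetKD call again), whereas polling read_state for the original request, as suggested above, avoids that re-execution.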

GM Dominic, here is some additional information, shared by @rbirkner.

When the “sync” call endpoint was introduced, the decision was made not to wait “forever” for the call to complete, but to set an upper limit. Keeping the connection open while waiting for the response occupies resources (at the HTTP gateway, the API boundary node, and the replica) and could be exploited for a DoS. The 10 s were chosen as a tradeoff between the time within which the vast majority of update calls complete and not having too many open connections.

If the execution of the call finishes before the 10 s are up, the replica returns a 200 with all the information. If it takes longer, the replica returns a 202 (this is what it would return immediately for an async call) and the agent is expected to start polling.
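
Put differently, the contract of the v3 call endpoint looks roughly like the following sketch (hypothetical code, not the actual agent implementation; the CBOR encoding and response parsing are left out):

  // Hypothetical sketch of the v3 "sync" call contract described above.
  async function submitCallV3(url: string, cborBody: Uint8Array): Promise<"certified" | "poll"> {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/cbor" },
      body: cborBody,
    });
    if (res.status === 200) {
      // Execution and certification finished within the ~10 s window;
      // the certified response is contained in the body.
      return "certified";
    }
    if (res.status === 202) {
      // The call was accepted but did not finish in time;
      // fall back to read_state polling, as an async (v2) call would.
      return "poll";
    }
    throw new Error(`call failed with HTTP status ${res.status}`);
  }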

Amazing! Looking into this now