After upgrading agent-js to v1.2 we started to observe the following error while fetching the ICP token balance:
Timestamp failed to pass the watermark after retrying the configured 3 times. We cannot guarantee the integrity of the response since it could be a replay attack.
Other tokens mostly work fine - we can usually fetch their balances without this error, but it occasionally shows up for them as well. The error seems to appear roughly 95% of the time for ICP and around 10% of the time for other tokens.
Maybe something is wrong with our latency expectations. How do we fix that?
Interesting! It may have something to do with the volume of transactions coming through the ICP canister. This is great feedback, and there are a few ways we can handle this.
Things you can do right now:
You could wait a second and retry the query if it fails the watermark check
You could set a higher retry count (the agent's retryTimes option) - see the sketch below for both
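A minimal sketch of both options, assuming the retryTimes option on HttpAgent and using a placeholder fetchIcpBalance query for your own balance call:

```typescript
import { HttpAgent } from '@dfinity/agent';

// Hypothetical helper (not part of agent-js): re-run an async query a few
// times, sleeping between attempts so a lagging node can catch up or a
// different node handles the next attempt.
async function queryWithRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only retry the watermark failure; rethrow anything else.
      if (!String(err).includes('watermark')) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Raising the agent's built-in retry count (retryTimes defaults to 3):
const agent = new HttpAgent({ host: 'https://icp-api.io', retryTimes: 5 });

// Usage (fetchIcpBalance is a placeholder for your own balance query):
// const balance = await queryWithRetry(() => fetchIcpBalance(agent));
```

The string match on "watermark" is just a crude way to avoid retrying unrelated failures; match on whatever error shape your code actually receives.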
Things I can do:
Add additional / exponential delay to the retries (sketched after this list)
Test against the ICP ledger on mainnet and set a more reliable default
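For illustration, the extra / exponential delay between retries could look roughly like this; it's a sketch of the idea only, not the actual agent-js polling internals:

```typescript
// Sketch of exponential backoff with jitter between watermark retries
// (illustrative only; not the actual agent-js code).
async function backoffDelay(attempt: number, baseMs = 500, capMs = 8000): Promise<void> {
  const delay = Math.min(baseMs * 2 ** attempt, capMs);
  // Jitter avoids many clients retrying in lockstep under load.
  const jitter = Math.random() * 0.3 * delay;
  await new Promise((resolve) => setTimeout(resolve, delay + jitter));
}

// e.g. attempt 0 -> ~0.5 s, attempt 1 -> ~1 s, attempt 2 -> ~2 s, ...
```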
What is this "watermark protection against replay attacks / stale data"? Is it documented somewhere? I would like to understand what causes the error and what throws it. Is it the gateway, a boundary node, or the replica? And under what circumstances exactly?
I see two potential places where this problem could happen.
Here blsVerify is passed instead of the actual request. TypeScript doesn't catch that, because the request is of type any in the definition of pollForResponse.
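To illustrate why the compiler stays quiet (a simplified stand-in, not the real pollForResponse signature):

```typescript
// Simplified illustration, not the actual agent-js signature: because the
// request parameter is typed `any`, TypeScript accepts a function (the
// blsVerify callback) in that position, so the mix-up compiles cleanly.
function pollForResponseLike(request: any, blsVerify?: () => boolean): void {
  // ... polling logic would live here ...
}

const verify = () => true;
// Meant to be pollForResponseLike(someRequest, verify), but this compiles too:
pollForResponseLike(verify);
```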
This is an agent error. The check was introduced to stop stale data from getting through when its timestamp is older than the last known block the agent saw returned from a call.
This protects against both ordinary stale data and a malicious MITM replay attack. Since a node can fall behind, a valid canister signature may still come back, but we know the state may have changed in a more recent block.
In theory, another request or two with a slight delay should hit a different node, or allow the behind node to catch up.
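Conceptually, the check works along these lines; this is a rough sketch of the idea, not the actual agent code:

```typescript
// Conceptual sketch of the time watermark (not the actual agent-js code).
// The agent remembers the newest certified time it has seen from a call,
// and refuses read/poll responses certified before that time.
class TimeWatermark {
  private watermarkMs = 0;

  // Record the certified time from an update call response.
  noteCallTime(certifiedTimeMs: number): void {
    this.watermarkMs = Math.max(this.watermarkMs, certifiedTimeMs);
  }

  // Validate the certified time of a later read/poll response. A node that
  // has fallen behind can return validly signed but older state, so such a
  // response is rejected (and retried) rather than trusted.
  assertFresh(certifiedTimeMs: number): void {
    if (certifiedTimeMs < this.watermarkMs) {
      throw new Error(
        'Timestamp failed to pass the watermark; the response may be stale or replayed.',
      );
    }
  }
}
```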
Despite the security advantages of this feature, it has been leading to more client errors and degrading the user experience.
@timo also called this out to me. It's on my radar and important, but I have to get a couple of other things taken care of before I can investigate fully. It's possible this is happening more frequently under higher load, but I'll also hunt for a flaw in the logic.