I’d like to bring up a slightly different error that I’ve seen: TypeError: fetch failed
This is not the same as the TypeError: failed to fetch issue mentioned in this post.
This issue has been popping up almost daily for the past few months (sometimes several times a day).
I have a job (not in a browser) that polls a canister on the IC once a minute. If I can't reach the canister, I email myself the error.
Our infrastructure is set up such that it doesn't really matter if communication with the IC fails intermittently (we just keep the values in a queue and try again a minute later), but the frequency of these errors is the concerning part; it's something I've been meaning to look into.
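For context, the job is roughly shaped like the simplified sketch below; pushValue, readLatestValue, and sendAlertEmail are just placeholders for our real canister call, data source, and email alert.

```ts
// Simplified sketch of the once-a-minute cron process.
const queue: string[] = [];

async function pushValue(value: string): Promise<void> {
  // ... the agent-js actor call to the canister goes here ...
}

function readLatestValue(): string {
  return new Date().toISOString(); // placeholder payload
}

async function sendAlertEmail(message: string): Promise<void> {
  console.error('ALERT:', message); // placeholder for the real email alert
}

async function tick(): Promise<void> {
  queue.push(readLatestValue());
  try {
    // Drain the queue; anything that fails stays queued for the next attempt.
    while (queue.length > 0) {
      await pushValue(queue[0]);
      queue.shift();
    }
  } catch (err) {
    // Intermittent failures land here, including "TypeError: fetch failed".
    await sendAlertEmail(String(err));
  }
}

setInterval(tick, 60_000); // try again a minute later
```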
@NathanosDev (or the boundary node team) I wanted to follow up here to see if there are any insights on your side that might help. I received 3 of these errors on September 4th (California, US time) at 8:32PM, 10:30PM, and 11:17PM. Let me know if that helps track down the issue.
The error count is consistently down now (back to the rate we were seeing previously), but I am still receiving these errors 2-3 times a day. The last two errors were received on January 11th at 13:27 and 13:28 UTC.
It would be great if you could share more details that would help us pinpoint the problem: do you know which boundary node you are hitting (IP address)? If you receive an erroneous response and the x-request-id header is present, could you share that request ID with us?
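For example, something like the following run on your cron host would show which boundary node IPs the machine can be routed to. This assumes Node.js and that your agent points at ic0.app; adjust the hostname if you use a different host.

```ts
import { resolve4 } from 'node:dns/promises';

// Print the boundary node IPs that this machine resolves for the API host.
// 'ic0.app' is an assumption: use whatever host your agent is configured with.
async function logBoundaryNodeIps(host: string): Promise<void> {
  const addresses = await resolve4(host);
  console.log(`${host} resolves to: ${addresses.join(', ')}`);
}

logBoundaryNodeIps('ic0.app').catch(console.error);
```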
On Feb 7th, from 7:17-7:31 UTC, we again saw unresponsive behavior from the boundary nodes every single minute (16 straight minutes). Hopefully this information will help narrow things down.
These errors are coming from a standalone cron process that is using agent-js, in a us-west-2 data center, which is located in Oregon. Since I'm using agent-js, I'm not exactly sure how I'd extract HTTP headers from the request error.
@kpeacock the code I'm running uses agent-js 0.15.2; is there a recommended way to expose the error headers and request ID from the agent?
I'd imagine, given the origin of the request, it's hitting either the boundary node in Seattle (1 node) or Palo Alto (2 nodes).
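In the meantime, here's the rough approach I'm considering for surfacing the header: pass a wrapping fetch into HttpAgent (my understanding is the agent accepts a custom fetch option; please correct me if there's a better way). Note this only helps for requests that actually get a response back; a hard "fetch failed" never produces any headers.

```ts
import { HttpAgent } from '@dfinity/agent';

// Wrap the global fetch so the x-request-id header is logged for any
// response the boundary node returns with an error status.
const loggingFetch: typeof fetch = async (input, init) => {
  const response = await fetch(input, init);
  if (!response.ok) {
    const requestId = response.headers.get('x-request-id');
    console.error(
      `Boundary node returned ${response.status}; x-request-id: ${requestId ?? 'not present'}`
    );
  }
  return response;
};

// 'https://ic0.app' is a placeholder; in practice this would be whatever
// host the cron job already targets.
const agent = new HttpAgent({ host: 'https://ic0.app', fetch: loggingFetch });
```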
During the time that you mention, we experienced extremely high load on the boundary nodes, which exhausted their resources.
We have been improving the performance of the boundary nodes and putting mitigations in place. This has already helped us handle some surges with minimal interruptions (for example, one yesterday and two earlier today). However, the third one today was too much and led to the problems you experienced. Sorry!
We are looking into what happened and how to handle it better. It just takes time!