504 (Gateway Time-out) while backfilling data

I was backfilling several hundred MB of data to the IC, running two parallel processes that backfilled data to two different canisters on the same subnet. Between 9 and 9:30am UTC I received these errors :point_down:

I don’t know exactly when the first error occurred, as I was tabbed away at the time, but when I came back to my script I saw the following errors:

On process one, the backfill had halted with this error:

Error: Server returned an error:
  Code: 504 (Gateway Time-out)
  Body: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.21.3</center>
</body>
</html>

Shortly after, on process two, the backfill had halted with this error:

Error: Request timed out after 300000 msec:
  Request ID: e708d5d5fd8d5c7767d75ee50547f3f1039406268229dc1cfb3556847ad0b222
  Request status: processing

I will definitely account for retries next time :sweat_smile:, but I’m wondering why this may have occurred, and how frequently these 500-like errors happen.
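For posterity, here’s the rough shape of the retry wrapper I have in mind. This is just a sketch: `sendBatch` is a hypothetical stand-in for my actual canister update call, and the attempt count and backoff numbers are arbitrary.

```ts
// Sketch of a retry-with-exponential-backoff wrapper (sendBatch is a
// hypothetical stand-in for the actual canister update call).
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const delayMs = 1000 * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      console.warn(`Attempt ${attempt} failed; retrying in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage:
// await withRetries(() => sendBatch(batch));
```

One caveat I’ll need to handle: the second error above says `Request status: processing`, so a timed-out update may still execute on the subnet. The batch insert has to be idempotent before it’s safe to blindly retry.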

Hi @icme,
Can you give some details about the process? What is it written in? How are you running it? Which agent are you using? Does it make one large request or many smaller requests?

The second snippet you have there looks like it’s coming from an agent - maybe a timeout after waiting for an update to complete.

From the boundary node metrics, it looks like the 504s are coming from one specific node:


Hi @raymondk, thanks for digging and finding the chart!

It’s written in Motoko.

I’m running it locally, with a Node script and an old version of the agent - I think 0.11.1.

With each API call, I was batch-sending 500 metadata records at a time (each record is quite small; you can see the size of each record by performing a query here), and then, within that single call, inserting all 500 records into my “db”.

As an estimate, I’d say each batch update call takes anywhere between 6 and 20 seconds to complete.
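To make the shape of the loop concrete, here’s roughly what the script does. `actor.insertBatch` and `MetadataRecord` are illustrative stand-ins for my actual canister method and record type, not a real CanDB or agent API:

```ts
type MetadataRecord = Record<string, unknown>; // placeholder shape
declare const actor: { insertBatch(batch: MetadataRecord[]): Promise<void> }; // hypothetical

const BATCH_SIZE = 500;

// Walk the records in batches of 500; each iteration is one update call
// that inserts the whole batch into the canister.
async function backfill(records: MetadataRecord[]): Promise<void> {
  for (let i = 0; i < records.length; i += BATCH_SIZE) {
    const batch = records.slice(i, i + BATCH_SIZE);
    await actor.insertBatch(batch);
    console.log(`Inserted ${Math.min(i + BATCH_SIZE, records.length)} of ${records.length}`);
  }
}
```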

One thing to note: when inserting/updating data, the CanDB client under the hood performs a combination of query and update calls in order to target the correct canister for the metadata being inserted, so a single misbehaving node could certainly have caused the update to fail. Interestingly, though, these two errors appeared on two different canisters, so my location must have been hitting the same boundary node for both requests.
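For anyone unfamiliar with that flow, it looks schematically like this. All names here (`indexActor`, `getCanisterForPK`, `createStorageActor`) are illustrative stand-ins, not the real CanDB client API:

```ts
// Schematic of the query-then-update routing described above; the
// declarations below are illustrative stubs, not the real CanDB client.
declare const indexActor: { getCanisterForPK(pk: string): Promise<string> };
declare function createStorageActor(id: string): { put(r: unknown): Promise<void> };

async function routedInsert(pk: string, record: unknown): Promise<void> {
  const canisterId = await indexActor.getCanisterForPK(pk); // query call
  const storage = createStorageActor(canisterId);
  await storage.put(record);                                // update call
}
```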

Apologies, I wasn’t clear. The graph shows a single “replica node” (not a boundary node) returning 504s - the error is just being bubbled up by the boundary node.

If you were making multiple calls within the same subnet, it’s likely you were getting errors from the same replica node even if the calls were to different canisters.
