On June 8, 2022, two users on the developer forum reported being unable to access the IC (“Internal Server Error. Failed to fetch response: TypeError: Failed to fetch”).
DFINITY engineers examined the logs of the boundary node (BN) machines and found that the Marseille BN was down and that, on other machines, the filesystems containing the logs were close to overflowing. The engineers reproduced the errors by filling the file system.
The overflow was caused by a Journalbeat bug, JB-23627, which sends the log shipper into a log-spewing loop that fills all available log disk space when it encounters log corruption. As a fix, a workaround was deployed (a cron job that restarts Journalbeat) and the overflowing file system was cleared out. That destroyed the existing log files, and with them some of the data that would have been useful for post-mortem analysis. The Journalbeat bug is fixed in version 7.13.0, but the Equinix boundary node infrastructure still runs 7.5. The new VMs run 7.14, so the problem should no longer exist in the new BN-VMs.
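The deployed workaround can be sketched as follows. This is a hedged illustration, not the team's actual change: the cron file path, the 15-minute schedule, and the `systemctl` unit name are assumptions (the text only says "a cron job to restart journalbeat").

```shell
# Sketch of the workaround: a cron entry that periodically restarts
# Journalbeat so the JB-23627 spew loop cannot run long enough to fill
# the log partition. Path and schedule are assumptions for illustration.
CRON_FILE="${CRON_FILE:-/tmp/restart-journalbeat.cron}"   # /etc/cron.d/... on a real BN
cat > "$CRON_FILE" <<'EOF'
# Restart journalbeat every 15 minutes to break the JB-23627 log-spew loop
*/15 * * * * root systemctl restart journalbeat
EOF
cat "$CRON_FILE"
```

Because this was applied by hand rather than through a versioned deployment, newly provisioned nodes silently lacked it, which is exactly what the "patches applied manually and forgotten" contributing factor below refers to.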
Since the Cloudflare health check only exercises the replica status method, which does not rely on caching (and in turn on the filesystem), this problem did not trigger a failover that would have directed users to the next BN.
A parallel Plug wallet issue (triggered by a low IC finalization rate) that caused a “Too Many Requests” error made this problem look bigger than it was.
May 22, 2022: the Journalbeat log-spewing problem was first encountered and manually fixed.
June 8, 2022, 02:04 PST: the Marseille BN went down, causing the outage. The error was not caught by the load balancer monitoring.
Jun 8, 10:38 AM PST: an engineer noticed forum reports such as “Is the network down for anyone else?” on the developer forum.
Jun 8, 11:05 AM PST: a Slack thread was started and an investigation was launched.
Jun 8, 11:05 AM PST: the issue was isolated and resolved by manually failing over to another boundary node. Restarting the journalbeat service and deleting the logs that had filled the file system then brought the original node back.
Jun 8, 1:30 PM PST: DFINITY was back to status green after confirming operational status.
Jun 11, 17:00 PST: the team reproduced the problem on ic0.dev.
Jun 12, 21:54 PST: it was discovered that the BN1 (Boston) boundary node’s monitoring was misconfigured.
Regions around the Atlantic Ocean (Eastern US, Western Europe, and Latin America, served by BO1, LN1, and MR1) were cut off from the IC boundary nodes. The outage lasted from around 3 AM to 10:30 AM PDT on June 8. All the principals served by those BNs were affected. That said, the actual impact is hard to gauge because logs and metrics were lost due to the disk space overflow.
- Patches were applied manually in the past and thus forgotten in newer deployments.
- “Disk full” alerts were not configured correctly.
- The disk space for logging and caching was shared and undersized.
- Confusion around a parallel outage at Plug.
- It is hard to gauge impact from the current logging and metrics coverage.
- The heartbeat check does not cover all the components of the BN.
- The team detected the outage quickly.
- The fix was applied very quickly; 30 minutes after detection.
- The new BoundaryOS already has all the fixes incorporated.
JB-23627 + Nginx Caching + Cloudflare Failover
Two boundary nodes serving live traffic were hit by a known Journalbeat bug, JB-23627. The fallout of the bug is excessive logging that causes storage to run out of space. The exact time required to hit the out-of-space condition depends on the provisioned log partition size.
Boundary nodes employ NGINX caching for query responses. Query results from replicas are populated into an on-disk cache with a TTL of 1 second. Unfortunately, NGINX is designed to fail responses with an HTTP 500 error code if cache population fails, and cache population failed once the machine ran out of disk space due to JB-23627. Most IC dapps are a mix of queries and updates; if queries fail, most dapps start to fail.
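The caching setup described above can be sketched as an NGINX fragment. Only the on-disk cache and the 1-second TTL come from the text; the cache path, zone name, sizes, the `/api/v2/canister/` location, and the `replica_upstream` name are all assumptions for illustration:

```shell
# Write a hypothetical NGINX query-cache snippet to a file for inspection.
# The directives are real NGINX syntax; the specific values are assumptions.
NGINX_SNIPPET="${NGINX_SNIPPET:-/tmp/bn-query-cache.conf}"
cat > "$NGINX_SNIPPET" <<'EOF'
# On-disk cache for query responses. The cache shares the /var partition
# with logging, so a full disk makes the cache write (and the response) fail
# with HTTP 500.
proxy_cache_path /var/cache/nginx/query levels=1:2 keys_zone=query_cache:10m
                 max_size=1g inactive=10s use_temp_path=off;

location /api/v2/canister/ {
    proxy_cache       query_cache;
    proxy_cache_valid 200 1s;        # query results are cached for 1 second
    proxy_pass        http://replica_upstream;
}
EOF
cat "$NGINX_SNIPPET"
```

The design point is that a 1 s TTL turns the cache into a short-lived request collapser; the failure mode is that cache writes are on the response path, coupling response success to disk availability.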
Boundary nodes are stateless and are configured to be fault tolerant and highly available. The fault tolerance mechanism employs a heartbeat to detect service disruptions at a particular boundary node. Cloudflare initiates a heartbeat from multiple datacenters every 30 seconds, probing boundary node health with a GET request to https://ic0.app/api/v2/status. This request is passed straight through to a replica node, so the disk-full condition caused by the Journalbeat bug did not affect the heartbeat.
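A minimal shell sketch of what that heartbeat effectively checks (the real probe is configured inside Cloudflare, not shell; the URL comes from the text, while the 5-second timeout and the function name are assumptions):

```shell
# Hypothetical health probe equivalent to the Cloudflare heartbeat.
bn_healthy() {
  # GET /api/v2/status is passed straight through to a replica, so it can
  # succeed even while the BN's own log/cache disk is full -- which is why
  # no failover was triggered during this incident.
  curl -fsS --max-time 5 "${1:-https://ic0.app/api/v2/status}" -o /dev/null
}

bn_healthy && echo "BN healthy" || echo "BN unhealthy: fail over to next BN"
```

This illustrates the gap named in the action items below: a probe that bypasses the cache and filesystem cannot detect a disk-full BN, so the check would need to exercise more BN components than replica connectivity alone.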
In addition to the above issue, the Boston BN’s monitoring was misconfigured on Cloudflare: Boston was using the ic0.dev monitoring configuration, which differs from that of the other production nodes. This issue did not cause any outage, however, since the disk-full condition never triggered the failover system.
All of the above behavior was reproduced in a local development environment.
Yes, in an incident on May 22, 2022. The issue was fixed at the time by manually deploying the cron job that restarts Journalbeat. When new nodes were deployed, this change was not applied.
This was a process failure: patches to the infrastructure were not version controlled but applied manually, and when DFINITY engineers deployed new nodes this step was forgotten.
It would be possible to create tests for this. There are not yet tests for Cloudflare failover, nor for disk-full conditions. The scenario could be reproduced in a test environment by executing a command that fills the disk.
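Such a test could be as simple as the following sketch. The directory, file name, and size are test-only assumptions; on a real BN the target would be the shared /var partition and the write would run until ENOSPC:

```shell
# Hypothetical disk-fill reproduction: create a large filler file in the
# target directory to simulate Journalbeat's log spew.
FILL_DIR="${FILL_DIR:-/tmp/bn-disk-fill-test}"
mkdir -p "$FILL_DIR"
# Write 16 MiB of zeros; on a real node this would continue until the
# filesystem is full and NGINX cache writes start failing with 500s.
dd if=/dev/zero of="$FILL_DIR/filler" bs=1M count=16 2>/dev/null
ls -l "$FILL_DIR/filler"
# A failover test would now assert that the heartbeat marks this BN
# unhealthy, then remove the filler to let the node recover:
#   rm -f "$FILL_DIR/filler"
```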
- Boundary node deployments must have audit trails and versioning. The Boundary VM project (which is almost complete) will enable this.
- Alerts we believed to be in place were not configured. We need to test alerts after deployment.
- A configuration management system with source-controlled code and automated procedures is essential for running the IC. This is being done with the BN-VM effort.
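The missing disk-full alert could look like the following. This is a hedged sketch assuming a Prometheus/node_exporter monitoring stack; the metric names are standard node_exporter metrics, but the mountpoint, threshold, duration, and labels are assumptions:

```shell
# Write a hypothetical Prometheus alert rule for a nearly-full /var
# partition (the partition shared by logging and the NGINX cache).
ALERT_FILE="${ALERT_FILE:-/tmp/bn-disk-alerts.yml}"
cat > "$ALERT_FILE" <<'EOF'
groups:
  - name: boundary-node-disk
    rules:
      - alert: BNDiskAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/var"}
            / node_filesystem_size_bytes{mountpoint="/var"} < 0.10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Boundary node /var is over 90% full"
EOF
cat "$ALERT_FILE"
```

Whatever the exact stack, the lesson from this incident is that the rule must be verified to fire after each deployment, not merely assumed to exist.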
|Priority||Action item|
|---|---|---|
|P0||Incident resolved. [DONE]|
|P0||Apply mitigation to all production boundary nodes that are taking active load. [DONE]|
|P0||Add the missing disk-full alerts on boundary nodes.|
|P0||Set up logrotate on boundary VMs.|
|P0||Increase /var partition space.|
|P0||Update Journalbeat to the latest version, which contains the fix for JB-23627.|
|P1||Improve the heartbeat mechanism: not only replica connectivity but also the BN components have to be checked.|
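The logrotate action item above could be implemented along these lines. The log path, size trigger, and rotation count are assumptions, not the team's actual policy:

```shell
# Write a hypothetical logrotate policy that bounds Journalbeat's disk
# usage so a spew loop cannot fill the partition.
LOGROTATE_FILE="${LOGROTATE_FILE:-/tmp/journalbeat.logrotate}"
cat > "$LOGROTATE_FILE" <<'EOF'
/var/log/journalbeat/*.log {
    size 100M        # rotate well before the log partition fills
    rotate 4         # keep at most 4 rotated logs
    compress
    missingok
    notifempty
    copytruncate     # rotate without breaking the writer's open file handle
}
EOF
cat "$LOGROTATE_FILE"
```

Note that rotation caps disk usage but does not fix the underlying spew loop; the Journalbeat version upgrade remains the actual fix.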