Hello ICP Community,
Following the recent lhg73 subnet incident, our team conducted a post-mortem analysis to identify and address the root causes that impeded the recovery process. One action item was to understand why DFINITY engineers were unable to SSH into the lhg73 nodes during recovery, even after the adoption of Proposal 134623. As part of the subnet recovery process, DFINITY engineers SSH as read-only users into all subnet nodes to compare states between nodes and ensure that the recovery data has not been tampered with and is consistent across all nodes.
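To illustrate the kind of cross-node comparison described above, here is a minimal sketch. It is not the actual recovery tooling: the function name, node names, and hash values are hypothetical, and it assumes that a state hash has already been collected from each node over read-only SSH.

```python
from collections import Counter


def find_divergent_nodes(state_hashes):
    """Return the most common state hash and the nodes that disagree with it.

    state_hashes maps a node identifier to the state hash read from that node.
    """
    if not state_hashes:
        raise ValueError("no state hashes supplied")
    counts = Counter(state_hashes.values())
    majority_hash, _ = counts.most_common(1)[0]
    divergent = sorted(
        node for node, h in state_hashes.items() if h != majority_hash
    )
    return majority_hash, divergent


# Hypothetical sample data, for illustration only.
sample = {
    "node-a": "abc123",
    "node-b": "abc123",
    "node-c": "def456",
    "node-d": "abc123",
}
majority, divergent = find_divergent_nodes(sample)
print(majority, divergent)
```

An empty list of divergent nodes would indicate that all inspected nodes report the same state. Any divergent node would need closer inspection before the recovery proceeds.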
Findings
Our investigation revealed that the primary reason for the SSH connectivity issues was that the firewall rules for the SH1 nodes were configured inconsistently with those for other DFINITY-owned data centers (DCs).
Until recently, DFINITY operated four data centers; it now operates only two. With fewer DCs, a recovery is more likely to be initiated from the SH1 DC, which in turn makes the SSH accessibility issue more likely to surface.
Action Plan
DFINITY will submit a proposal to update the firewall settings to permit SSH connections from the SH1 DC to all mainnet nodes. DFINITY will still not have SSH access to the mainnet nodes until a proposal similar to Proposal 134623 is adopted, which allows a particular identity (SSH public key) to SSH into the nodes of a particular subnet during the recovery process. As usual, a follow-up proposal such as Proposal 134632 will revoke read-only SSH access after the recovery.
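The two independent conditions described above (the firewall must admit the source address, and the SSH key must be authorized by an adopted proposal) can be sketched as follows. This is an illustrative model only, not the actual firewall or node configuration: the function names are hypothetical, the key string is a placeholder, and the prefixes come from the IPv6 documentation range rather than any real DC.

```python
import ipaddress


def source_allowed(source_ip, allowed_prefixes):
    """Return True if source_ip falls inside any allowed prefix."""
    addr = ipaddress.ip_address(source_ip)
    for prefix in allowed_prefixes:
        net = ipaddress.ip_network(prefix)
        if addr.version == net.version and addr in net:
            return True
    return False


def can_ssh(source_ip, key, allowed_prefixes, authorized_keys):
    """Both gates must pass: firewall prefix match and key authorization."""
    return source_allowed(source_ip, allowed_prefixes) and key in authorized_keys


# Hypothetical values, for illustration only.
ALLOWED_PREFIXES = ["2001:db8:1::/48"]
RECOVERY_KEY = "ssh-ed25519 AAAA-example-recovery-key"

# Firewall open, but no key authorized yet: access is still denied.
print(can_ssh("2001:db8:1::10", RECOVERY_KEY, ALLOWED_PREFIXES, set()))  # False
# Firewall open and key authorized for the recovery: access is granted.
print(can_ssh("2001:db8:1::10", RECOVERY_KEY, ALLOWED_PREFIXES, {RECOVERY_KEY}))  # True
# Key authorized, but source outside the allowed prefix: access is denied.
print(can_ssh("2001:db8:2::10", RECOVERY_KEY, ALLOWED_PREFIXES, {RECOVERY_KEY}))  # False
```

The third case corresponds to the situation found in the post-mortem: the key had been authorized, but the firewall did not admit connections from the originating DC.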
Next Steps
In the near future, we will submit a proposal to ensure that SSH access is correctly configured for future recovery processes. This will improve our ability to respond swiftly and effectively to similar incidents.
Yours truly,
DFINITY DRE Team