Post-Mortem Update: Review of the SSH Accessibility after the `lhg73` Subnet Recovery

Hello ICP Community,

Following the recent lhg73 subnet incident, our team conducted a post-mortem analysis to identify and address the root causes that impeded the recovery process. One action item was to understand why DFINITY engineers were unable to SSH into the lhg73 nodes during recovery, even after the adoption of Proposal 134623. As part of the subnet recovery process, DFINITY engineers SSH as read-only users into all subnet nodes to compare states between nodes and ensure that the recovery data is not tampered with and is consistent across all nodes.

Findings

Our investigation revealed that the primary reason for the SSH connectivity issues was the inconsistent configuration of firewall rules for the SH1 nodes compared to other DFINITY-owned data centers (DCs).

Until recently, DFINITY operated four data centers; it now operates only two. With fewer DCs, a recovery is more likely to be initiated from the SH1 DC, which in turn makes the SSH accessibility issue more likely to surface.

Action Plan

DFINITY will submit a proposal to update the firewall settings to permit SSH connections from the SH1 DC to all mainnet nodes. DFINITY will still not have SSH access to the mainnet nodes until a proposal similar to Proposal 134623 is adopted, which allows a particular identity (SSH public key) to SSH into the nodes of a particular subnet during the recovery process. As usual, a follow-up proposal such as Proposal 134632 will revoke read-only SSH access after the recovery.
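
The two layers described above can be pictured as two independent checks that must both pass. The following is a minimal illustrative sketch only, not the actual node firewall or SSH implementation. The function `can_ssh`, the key string, and the sample source address are hypothetical; the only value taken from this thread is the SH1 prefix 2001:4c08:2003:b09::/64 from the proposal payload.

```python
from ipaddress import ip_address, ip_network


def can_ssh(source_ip, public_key, allowed_prefixes, readonly_keys):
    """Return True only if both independent gates are open (illustrative model)."""
    source = ip_address(source_ip)
    # Gate 1: a firewall rule admits the source network (e.g. the SH1 prefix).
    firewall_ok = any(source in ip_network(p) for p in allowed_prefixes)
    # Gate 2: a separate, recovery-specific proposal has granted this key
    # read-only access; a follow-up proposal removes it again.
    key_ok = public_key in readonly_keys
    return firewall_ok and key_ok


prefixes = ["2001:4c08:2003:b09::/64"]   # SH1 prefix from the payload
source = "2001:4c08:2003:b09::10"        # hypothetical address inside SH1
key = "ssh-ed25519 AAAA-example"         # hypothetical public key

# Firewall open, but no key granted yet: no access.
print(can_ssh(source, key, prefixes, set()))
# Firewall open and key granted by a recovery proposal: access.
print(can_ssh(source, key, prefixes, {key}))
```

In this model, the firewall proposal on its own changes nothing about who can log in; it only removes the network-level obstacle that blocked access during the lhg73 recovery.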

Next Steps

In the near future, we will submit a proposal to ensure that SSH access is correctly configured for future recovery processes. This will enhance our ability to respond swiftly and effectively to any similar incidents.

Yours truly,
DFINITY DRE Team

9 Likes

Wow, good info, as I didn't know Dfinity operated DCs. Does the reduction mean that the protocol is more robust?

Apologies, I know this is off topic but I became curious when reading the aforementioned.

1 Like

@BANG DFINITY is one of the Node Providers, and follows pretty much the same rules as the other Node Providers (minus the subnet recovery process, which DFINITY always conducts since other NPs don't yet have the skills to conduct it, but we're working towards that as well).
See below for the full list of node providers:
https://dashboard.internetcomputer.org/providers?s=100

4 Likes

Ahh, ok ok. I knew that Dfinity operated nodes; I misinterpreted this, as I thought that Dfinity operated the entire data center as opposed to nodes.

Thank you for the clarification and update to skill up NP. btw, voted yes.

2 Likes

Proposal 134921 | Tim - CodeGov

Vote: Adopt

This proposal is intended to make it possible for Dfinity to be granted SSH access to mainnet nodes via the SH1 data centre. According to the information given, this change is needed in order to allow subnet recovery to be conducted in a timely manner, and an additional proposal would still be needed before this access is put into effect.

About CodeGov…

CodeGov has a team of developers who review and vote independently on the following proposal topics: IC-OS Version Election, Protocol Canister Management, Subnet Management, Node Admin, and Participant Management. The CodeGov NNS known neuron is configured to follow our reviewers on these topics and Synapse on most other topics. We strive to be a credible and reliable Followee option that votes on every proposal and every proposal topic in the NNS. We also support decentralisation of SNS projects such as WaterNeuron and KongSwap with a known neuron and credible Followees.

Learn more about CodeGov and its mission at codegov.org.

3 Likes

You are right, the terminology that was introduced at launch time (Data Center in this case) isn't quite correct. It is actually a rack or a few racks in a specific Data Center. So no, it's not an entire Data Center. Recommendations for better naming are welcome!

2 Likes

Proposal #134921 – Zack | CodeGov

Vote: Adopted

Reason:
This is currently the best way moving forward for a fast subnet recovery. We will closely monitor the upcoming proposals as well.
Thank you for the very detailed explanation, especially regarding next steps. This is a MUST read for anyone who just skimmed over this post, which is included in the proposal's summary, and is looking for a deeper understanding of the issues that led to this.

3 Likes

How about Cloud Storage Units or Server Housing Racks. Triple C = Cloud Compute Cabinets. Cloud-Ready Racks. Enterprise Rack Solutions. Dynamic Cloud Rack Solutions. Cloud-Ready Infrastructure Modules. Enterprise Cloud Cabinets. Cloud Scalable Data Hubs.

I really like Triple C's or Cloud Compute Cabinets. I am leaning more towards names with cloud & compute in them, as that's what ICP is.

T3C

1 Like

Proposal 134921 – LaCosta | CodeGov

Vote: ADOPT

The proposal updates the firewall settings to allow DFINITY to have SSH access to the mainnet nodes from the SH1 DC during a subnet recovery process. It is important to note that DFINITY will only have SSH access to mainnet nodes after a proposal similar to Proposal 134623, which halts a subnet, is adopted, and that this access is revoked by another proposal after the recovery process.
In the payload, ipv6_prefixes contains the value 2001:4c08:2003:b09::/64, which matches the first 64 bits of the IP addresses of the nodes in the SH1 DC, as can be checked in the dashboard. Port 22 for SSH connections is also present in the ports field.
The payload matches the proposal summary, so I've voted to adopt.
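
The prefix check described above can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch: the helper names `in_prefix` and `first_64_bits` and the sample node addresses are hypothetical (real SH1 node addresses would come from the dashboard); only the prefix itself is taken from the payload.

```python
from ipaddress import IPv6Address, IPv6Network

# Prefix taken from the proposal payload.
SH1_PREFIX = IPv6Network("2001:4c08:2003:b09::/64")


def in_prefix(addr: str, prefix: IPv6Network = SH1_PREFIX) -> bool:
    """True if addr falls inside the given IPv6 prefix."""
    return IPv6Address(addr) in prefix


def first_64_bits(addr: str) -> int:
    """The upper 64 bits (the network part of a /64) as an integer."""
    return int(IPv6Address(addr)) >> 64


# Hypothetical addresses, for illustration only.
inside = "2001:4c08:2003:b09:6801:5aff:fe00:1"
outside = "2001:4c08:2003:b0a::1"

print(in_prefix(inside), in_prefix(outside))
print(first_64_bits(inside) == int(SH1_PREFIX.network_address) >> 64)
```

Comparing the upper 64 bits directly is equivalent to the membership test for a /64; the membership test is the more general form because it also works for the /48, /56 and /36 prefixes that appear elsewhere in the rule.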

3 Likes

I've just voted to adopt on my lunch break. I'll post my full write-up after work later.

This is my first time reviewing a proposal of this particular NNS function, so I've taken the last few days (in my spare time) reviewing the 19 proposals of the same/similar kind that have come before this one historically (UpdateFirewallRules, AddFirewallRules, SetFirewallConfig), along with the context they sit within. LGTM (more detailed review to come)…

Proposal 134921 Review | LORIMER Known Neuron

VOTE: YES

TLDR: DFINITY occasionally need to raise targeted proposals that enable them to SSH into "allowed" nodes of a target subnet in order to establish a CUP (during disaster recovery scenarios). This proposal ensures that SH1 nodes (which are DFINITY-owned) are correctly recognised as "allowed". Note that all subnets should have at least 1 DFINITY-owned node to facilitate subnet recovery (and slightly more for larger subnets).

The proposal payload updates the "ReplicaNodes"-scoped firewall rule at position 0. Note that action 1 means "Allow". The last time this firewall rule was updated was 2024-03-14 by Proposal 128303. I've diffed the payloads to confirm the config that has changed, which is just the addition of 1 IPv6 subnet (the networking kind):

2001:4c08:2003:b09::/64

{
  "expected_hash": "8F53770034886C7B6DC6524F2CF4D1FD5B5AC104955AC0CF97B3B8DE24504FD3",
  "positions": [
    0
  ],
  "rules": [
    {
      "action": 1,
      "comment": "Firewall rules for all replica nodes",
      "direction": null,
      "ipv4_prefixes": [],
      "ipv6_prefixes": [
        "2001:4c08:2003:b09::/64",      <------ Inserted as the only change
        "2001:438:fffd:11c::/64",
        "2001:470:1:c76::/64",
        "2001:4d78:400:10a::/64",
        "2001:4d78:40d::/48",
        "2001:920:401a:1706::/64",
        "2001:920:401a:1708::/64",
        "2001:920:401a:1710::/64",
        "2401:3f00:1000:22::/64",
        "2401:3f00:1000:23::/64",
        "2401:3f00:1000:24::/64",
        "2600:2c01:21::/64",
        "2600:3000:1300:1300::/64",
        "2600:3000:6100:200::/64",
        "2600:3004:1200:1200::/56",
        "2600:3006:1400:1500::/64",
        "2600:c02:b002:15::/64",
        "2600:c0d:3002:4::/64",
        "2602:fb2b::/36",
        "2602:ffe4:801:16::/64",
        "2602:ffe4:801:17::/64",
        "2602:ffe4:801:18::/64",
        "2604:1380:4091:3000::/48",
        "2604:1380:40e1:4700::/48",
        "2604:1380:40f1:1700::/64",
        "2604:1380:45d1:bf00::/64",
        "2604:1380:45e1:a600::/48",
        "2604:1380:45f1:9400::/64",
        "2604:1380:4601:6200::/48",
        "2604:1380:4641:6100::/48",
        "2604:3fc0:2001::/48",
        "2604:3fc0:3002::/48",
        "2604:6800:258:1::/64",
        "2604:7e00:30:3::/64",
        "2604:7e00:50::/64",
        "2604:b900:4001:76::/64",
        "2607:f1d0:10:1::/64",
        "2607:f6f0:3004::/48",
        "2607:f758:1220::/64",
        "2607:f758:c300::/64",
        "2607:fb58:9005::/48",
        "2607:ff70:3:2::/64",
        "2610:190:6000:1::/64",
        "2610:190:df01:5::/64",
        "2a00:fa0:3::/48",
        "2a00:fb01:400:200::/64",
        "2a00:fb01:400::/56",
        "2a00:fc0:5000:300::/64",
        "2a01:138:900a::/48",
        "2a01:2a8:a13c:1::/64",
        "2a01:2a8:a13d:1::/64",
        "2a01:2a8:a13e:1::/64",
        "2a02:418:3002:0::/64",
        "2a02:41b:300e::/48",
        "2a02:800:2:2003::/64",
        "2a04:9dc0:0:108::/64",
        "2a05:d01c:e2c:a700::/56",
        "2a0b:21c0:b002:2::/64",
        "2a0f:cd00:0002::/56",
        "fd00:2:1:1::/64"
      ],
      "ports": [
        22,
        2497,
        4100,
        8080,
        9090,
        9091,
        9100,
        19100,
        19531
      ],
      "user": null
    }
  ],
  "scope": "ReplicaNodes"
}
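
A payload diff of the kind described above can be sketched as a set comparison over normalised prefixes. This is not the tooling the reviewer used; `diff_prefixes` is a hypothetical helper, and the two lists below are a short sample of the payload rather than the full rule. The old list is an assumption built by dropping the inserted prefix from the new one.

```python
from ipaddress import ip_network


def diff_prefixes(old, new):
    """Return (added, removed) prefixes between two ipv6_prefixes lists.

    Prefixes are parsed first, so textual variants of the same network
    (e.g. "2a0f:cd00:0002::/56" vs "2a0f:cd00:2::/56") compare equal.
    """
    old_set = {ip_network(p) for p in old}
    new_set = {ip_network(p) for p in new}
    added = sorted(str(n) for n in new_set - old_set)
    removed = sorted(str(n) for n in old_set - new_set)
    return added, removed


# Sample only: three prefixes from the rule, not the full list.
old_sample = [
    "2001:438:fffd:11c::/64",
    "2a0f:cd00:0002::/56",
    "fd00:2:1:1::/64",
]
new_sample = ["2001:4c08:2003:b09::/64"] + old_sample

added, removed = diff_prefixes(old_sample, new_sample)
print("added:", added)
print("removed:", removed)
```

Parsing before comparing avoids false positives from formatting differences between two proposals, which a plain text diff would report as changes.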

All 28 nodes of the SH1 data centre are at this network address (see first portion of IP addresses here). These are indeed DFINITY nodes, provided by the "DFINITY Stiftung" NP.


Side note: I had intended to write a utility that scans all firewall rules and pairs the network addresses with the relevant data centres and node providers, for the sake of understanding the current firewall landscape a little better. However, I ran out of time, hence my vote towards the end of the voting period. I have automated notifications to avoid missing the deadline. In any case, apologies for cutting it close. I intend to do some follow-up analysis when I have the time (I'll save some questions that I have until then).

1 Like