Subnet `lhg73` is stalled

sat · September 15, 2024, 6:47pm

Hello everyone.

We got alerted that subnet lhg73 is stalled, and we discovered that the nodes in the subnet are in a crash loop. We have identified the root cause and we have already started a subnet recovery process. There is no data loss on the subnet, and it seems to be only a minor problem based on the preliminary analysis, so we expect a straightforward recovery. The whole recovery process should take up to a few hours.

We will soon submit a series of proposals.

UPDATE: In accordance with the security policy, we will disclose the details of the fix within the next 10 days, or as soon as possible. You will be able to retroactively verify build reproducibility to ensure that the deployed binaries align with the code changes.

sat · September 15, 2024, 6:50pm

The first proposal is out, to halt the subnet:

ICPSwap · September 15, 2024, 7:04pm

Saluting! Thanks for your hard work!

Lorimer · September 15, 2024, 7:11pm

Thanks for this announcement @Sat. Out of interest, what’s the best way of verifying the condition of the subnet? There’s nothing to indicate the problem on the subnet dashboard (other than that the subnet must now be halted as the block, transaction and cycles rates have gone to zero).

I see that there’s an announcement on the status page. Is there anything else that a reviewer can check to confirm things?

ZackDS · September 15, 2024, 7:13pm

I guess we just vote yes as for any other hotfix and later retroactively verify it when it gets released ?
Edit: voted to adopt this halt proposal.

sat · September 15, 2024, 7:38pm

Yes, sadly I don’t have a better recommendation at the moment. In the future we could consider and discuss having a private communication channel (I’d be in favour of this, for instance, but I’m not sure of the others), where we could discuss confidential details without risking public exposure.

sat · September 15, 2024, 7:40pm

@Lorimer if you check the subnet page on the public dashboard, you can see that the subnet is stalled.

I can’t share a lot more info without risking public exposure of the security issue. Sorry.

Skyclad0bserver · September 15, 2024, 7:52pm

I think important services like ICPSwap should be mirrored across a couple subnets if possible.

ICPSwap · September 15, 2024, 8:00pm

I’ve heard the Devs team mention this before, but it seems like interaction can’t be achieved through a multi-subnet architecture. I’m not sure, though. I’ve shared your idea with the Devs team, and after this issue, we’ll also discuss with the DFINITY Devs team to see if there’s a good solution!

Lorimer · September 15, 2024, 8:09pm

Thanks @Sat, my thinking was that a stalled subnet and a halted subnet are probably indistinguishable via the dashboard. But of course I can see the timestamps for the relevant data points.

Assuming this timestamp is local time for me (BST), then the flatling of transactions occurred just before 19:00 UTC, which pre-dates the execution of the proposal that halted the subnet (2024-09-15, 18:53:11 UTC).

However, I would have expected the timestamps displayed on the dashboard to be in UTC, particularly given that UTC is used in numerous other places on the dashboard (such as the execution timestamps for proposals).

If the metrics are presented in UTC (which is what I would suspect), then what this shows me is that the transaction rate fell to zero as a result of the subnet having been halted (indicating that transactions were being processed before hand).

I can see that the finalization rate fell to practically zero much earlier in the afternoon though.

It looks like the subnet was processing transactions, but notorisation and/or finalisation was struggling. Looks like the notorisation delay for this subnet is still set to 600ms (many other subnets have had this reduced to 300ms due to performance improvements). I’m curious to find out more when the details can be released.

I’ve adopted the proposal to halt the subnet (which has already executed).

salman · September 15, 2024, 8:29pm

what is going on. the proposal is executed but the subnet is still down. are you guys working on another fix?
this puts a real dent when difinty claims ICP is unstoppable. we need a quick fix and then a RCA soon

Lorimer · September 15, 2024, 8:48pm

The subnet has been halted (that’s what the latest proposal was about). The next step is to establish a catch up package for the subnet.

Fault tolerance | Internet Computer

The claim that ICP is unstoppable means that no jurisdiction (or even handful of jurisdictions) and/or data centres have the power to stop a subnet. In this sense, this claim is a fair portrayal of the tech. It’s obvious that this is a figure of speech though (nothing is truly unstoppable).

ICP tech moves fast. The rate of commits and progress is truly impressive. The downside of moving fast is that sometimes you break things (there’s an intentional trade-off being struck here). Moving fast means the IC is likely to get to a point sooner where many of these things will be capable of being automated. I’d love to see failover subnets, automated proposal submission for Subnet Management fixes etc. This sort of thing will make ICP even more unstoppable, but it requires lots of dev work to get to that stage (and occasionally breaking things along the way is inevitable).

sat · September 15, 2024, 10:20pm

Another proposal is out, for electing a new replica version with the fix

A clarification is due for this one: due to CI issues, the proposal has only one link instead of the usual two. The CI job was refusing to upload an image so we had to bypass it and didn’t have time to bypass both uploads. This should not change the verification process though, as one image is enough.

ZackDS · September 15, 2024, 10:26pm

Voted to adopt. In accordance with the Security Patch Policy and Procedure.

sat · September 15, 2024, 10:27pm

We’re now upgrading the subnet to the version with the fix:

sat · September 15, 2024, 10:36pm

And another proposal to recover the subnet is out:

Lorimer · September 15, 2024, 10:38pm

I voted to reject that IC OS eletion proposal. It didn’t provide enough information to identify the release commit that needs building to verify the build. Judging by the proposal summary, this is intentional.

the source code that was used to build this release will be exposed at the latest 10 days after the fix is rolled out to all subnets

I understand the need for this sort of proposal, and I support it in principle. But of course, I can’t verify a proposal of this sort. If I can’t verify it, I can’t cast an informed decision about it. I rejected so that this is made clear.

I would like to add that the communication around this incident is excellent and I respect and appreciate the work that DFINITY is putting in to resolve this issue.

Lorimer · September 15, 2024, 10:43pm

Similarly, I voted to reject the CUP proposal, the reasoning for which I’ve previously discussed with DFINITY - Subnet Management - fuqsr (Application) - Governance / NNS proposal discussions - Internet Computer Developer Forum (dfinity.org)

ZackDS · September 15, 2024, 10:53pm

Voted yes to adopt in order to move forward with the recovery. While I agree with Subnet Management - fuqsr (Application) - #11 by Manu and look forward to continue with this discussion on how to improve it if possible, I also agree with @lorimer 's point of view.

sat · September 15, 2024, 11:00pm

The proposal for unhalting the subnet is out:

Topic		Replies	Views
Subnet lhg73 is stalled NNS proposal discussions Subnet-management	13	407	December 29, 2024
Subnet Management - lhg73 (Application) NNS proposal discussions nns , Governance , Subnet-management	93	816	August 22, 2025
Subnets `mpubz`, `brlsh`, and `pjljw` Incident Retrospective - Friday October 15, 2021 Developers	19	1322	October 20, 2021
Subnet Management - uzr34 (II) NNS proposal discussions nns , Governance , Subnet-management	182	1642	August 10, 2025
Updating the list of public subnets NNS proposal discussions Subnet-management	13	199	April 17, 2025

Subnet `lhg73` is stalled

Related topics