We got alerted that subnet lhg73 is stalled, and we discovered that the nodes in the subnet are in a crash loop. We have identified the root cause and we have already started a subnet recovery process. There is no data loss on the subnet, and it seems to be only a minor problem based on the preliminary analysis, so we expect a straightforward recovery. The whole recovery process should take up to a few hours.
We will soon submit a series of proposals.
UPDATE: In accordance with the security policy, we will disclose the details of the fix within the next 10 days, or as soon as possible. You will be able to retroactively verify build reproducibility to ensure that the deployed binaries align with the code changes.
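For anyone unfamiliar with what that verification amounts to: reproducibility checks ultimately come down to comparing the SHA-256 of a locally built (or downloaded) image against the hash published in the election proposal. A minimal sketch in Python follows; the file name and hash in the usage comment are placeholders for illustration, not values from this incident:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, streaming in chunks
    so that large IC OS images don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (placeholder names, not values from this incident):
#   sha256_of("update-img.tar.gz") == "<sha256 listed in the proposal>"
```

If the digests match, the binaries you (or anyone else) built from the disclosed commit are byte-identical to what was deployed.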
Thanks for this announcement @Sat. Out of interest, what’s the best way of verifying the condition of the subnet? There’s nothing to indicate the problem on the subnet dashboard (other than that the subnet must now be halted as the block, transaction and cycles rates have gone to zero).
I see that there’s an announcement on the status page. Is there anything else that a reviewer can check to confirm things?
Yes, sadly I don’t have a better recommendation at the moment. In the future we could consider and discuss having a private communication channel (I’d be in favour of this, for instance, but I’m not sure of the others), where we could discuss confidential details without risking public exposure.
I’ve heard the Devs team mention this before, but it seems like interaction can’t be achieved through a multi-subnet architecture. I’m not sure, though. I’ve shared your idea with the Devs team, and after this issue, we’ll also discuss with the DFINITY Devs team to see if there’s a good solution!
Thanks @Sat, my thinking was that a stalled subnet and a halted subnet are probably indistinguishable via the dashboard. But of course I can see the timestamps for the relevant data points.
Assuming this timestamp is local time for me (BST), then the flatlining of transactions occurred just before 18:00 UTC, which pre-dates the execution of the proposal that halted the subnet (2024-09-15, 18:53:11 UTC).
However, I would have expected the timestamps displayed on the dashboard to be in UTC, particularly given that UTC is used in numerous other places on the dashboard (such as the execution timestamps for proposals).
If the metrics are presented in UTC (which is what I would suspect), then what this shows me is that the transaction rate fell to zero as a result of the subnet having been halted (indicating that transactions were being processed beforehand).
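The offset arithmetic is simple enough to check mechanically. A small sketch, using a fixed UTC+1 offset for BST; the "just before 19:00" chart reading is illustrative, while the halt timestamp is the one from the proposal above:

```python
from datetime import datetime, timedelta, timezone

BST = timezone(timedelta(hours=1), name="BST")  # British Summer Time = UTC+1

# Execution time of the halt proposal (from the dashboard, in UTC).
halt_utc = datetime(2024, 9, 15, 18, 53, 11, tzinfo=timezone.utc)

# A data point displayed as "just before 19:00", under each interpretation:
shown_bst = datetime(2024, 9, 15, 18, 55, tzinfo=BST)           # dashboard in local BST
shown_utc = datetime(2024, 9, 15, 18, 55, tzinfo=timezone.utc)  # dashboard in UTC

print(shown_bst.astimezone(timezone.utc))  # 17:55 UTC -> pre-dates the halt
print(shown_utc > halt_utc)                # True -> flatline follows the halt
```

So the two readings lead to opposite conclusions: a BST dashboard would imply the stall came first, a UTC dashboard that the halt did.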
I can see that the finalization rate fell to practically zero much earlier in the afternoon though.
It looks like the subnet was processing transactions, but notarisation and/or finalisation was struggling. Looks like the notarisation delay for this subnet is still set to 600ms (many other subnets have had this reduced to 300ms due to performance improvements). I’m curious to find out more when the details can be released.
I’ve adopted the proposal to halt the subnet (which has already executed).
What is going on? The proposal has executed, but the subnet is still down. Are you working on another fix?
This puts a real dent in DFINITY’s claim that ICP is unstoppable. We need a quick fix, and then an RCA soon after.
The claim that ICP is unstoppable means that no jurisdiction (or even handful of jurisdictions) and/or data centres have the power to stop a subnet. In this sense, this claim is a fair portrayal of the tech. It’s obvious that this is a figure of speech though (nothing is truly unstoppable).
ICP tech moves fast. The rate of commits and progress is truly impressive. The downside of moving fast is that sometimes you break things (there’s an intentional trade-off being struck here). Moving fast means the IC is likely to get to a point sooner where many of these things will be capable of being automated. I’d love to see failover subnets, automated proposal submission for Subnet Management fixes etc. This sort of thing will make ICP even more unstoppable, but it requires lots of dev work to get to that stage (and occasionally breaking things along the way is inevitable).
Another proposal is out, for electing a new replica version with the fix.
A clarification is due for this one: due to CI issues, the proposal has only one link instead of the usual two. The CI job was refusing to upload an image so we had to bypass it and didn’t have time to bypass both uploads. This should not change the verification process though, as one image is enough.
I voted to reject that IC OS election proposal. It didn’t provide enough information to identify the release commit that needs building to verify the build. Judging by the proposal summary, this is intentional.
the source code that was used to build this release will be exposed at the latest 10 days after the fix is rolled out to all subnets
I understand the need for this sort of proposal, and I support it in principle. But of course, I can’t verify a proposal of this sort, and if I can’t verify it, I can’t make an informed decision about it. I voted to reject so that this is made clear.
I would like to add that the communication around this incident is excellent and I respect and appreciate the work that DFINITY is putting in to resolve this issue.
Voted yes to adopt in order to move forward with the recovery. While I agree with Subnet Management - fuqsr (Application) - #11 by Manu and look forward to continuing this discussion on how to improve things if possible, I also agree with @lorimer’s point of view.