Subnet lhg73 is stalled

sat · December 26, 2024, 6:44pm

The issue was already mentioned in another forum thread.

Seems like this is a very unlikely corner case in the consensus code. We will submit an NNS proposal to recover the subnet, and continue debugging the root cause with the data after the incident is resolved.

sat · December 26, 2024, 9:04pm

So we just ran a simple subnet recovery with the following proposals:

halt the subnet: https://dashboard.internetcomputer.org/proposal/134605
prepare the catch-up package: https://dashboard.internetcomputer.org/proposal/134606 (which took some time due to the size of the subnet)
unhalt the subnet: https://dashboard.internetcomputer.org/proposal/134607

This recovered the subnet, but we did not resolve the root cause. The consensus team will try to get to the bottom of it, although they have only limited resources over the holidays.
There will also likely be a post-mortem once we identify and resolve the root cause.

timk11 · December 27, 2024, 12:47am

Voted to adopt proposals 134605, 134606 and 134607. Critical fix as explained above, already executed. Great work @sat and team on getting this happening so fast!

Lorimer · December 27, 2024, 4:41am

sat · December 27, 2024, 1:50pm

Here is a follow-up (related) forum thread:

Mar · December 27, 2024, 5:20pm

It seems to be stalled again.

ICPSwap · December 27, 2024, 5:30pm

Subnet ‘lhg73’ stalled again within 24 hours. Any insights on this?

sat · December 27, 2024, 6:04pm

Yes, we recovered the subnet yesterday without fixing the root cause, and just when the fix for the root cause was about to be rolled out to the subnet, it got stuck again. Bad luck.

sat · December 27, 2024, 6:05pm

We started the recovery process again. Recovery should be faster this time.

sat · December 27, 2024, 7:22pm

The lhg73 subnet is back up, and running the version with the fix for the root cause. We’ll proceed with the rollout of the fix to the other subnets.

Mar · December 27, 2024, 9:26pm

For what it’s worth, this subnet is unusable for us: latency is too high and too unpredictable for a social network. As a result, we will be moving to another one, with all the inconveniences that creates. I’m not sure if this is the expected behavior.

timk11 · December 28, 2024, 9:01am

Voted to adopt proposals 134623, 134629 and 134632. Critical hotfix for subnet lhg73, which stalled again; already executed, pending further investigation into the cause as explained above.

Lorimer · December 28, 2024, 9:19am

LaCosta · December 29, 2024, 5:21pm

Voted to adopt proposals 134623, 134629 and 134632.
This subnet stalled not long ago and was recovered without fixing the root cause, and it stalled again just before the fix was rolled out. The issue was discussed here, and the IC OS Version Election proposals 134608 and 134609 were made and have been executed, so the issue should be solved. Great work @sat and team.
