Is it ture that only 13 nodes in one subnet is enough?

The answer is NO.

But @dfinity and the team thinks it is enough.

Is this alien tech?

@dom always saying, ICP never stop. The fact is that that is a lie.

2 Likes

I am very interested to read the post mortem on this one.

I also think deep analysis of subnet size would be very beneficial, as it seems somewhat nebulous and unclear what the appropriate size of a subnet should be for a given use case.

And I agree that saying that ICP has never gone down or never goes down might not be very accurate…I’m not sure what has been meant by that in the past, as individual subnets can and do go down or degrade in performance.

1 Like

As far as I can remember, this shouldn’t be the first time.
I hope ICP fundamentally solves the rigid, fixed-node subnet architecture.

How do you think about the rigid, fixed-node subnet architecture.?

1 Like

The subnet stalled due to a software bug in the replica implementation. A fix has been prepared, is currently being tested and should be rolled out to the subnet next. More details will be provided in a post mortem in the next few days, right now the priority is on recovering the affected subnet.

3 Likes

I’m glad to see that the team responded promptly.
However, this does not fundamentally solve the problem.
The subnet may still go down for other reasons next time.

And to expand a little bit: since the problem is in the replica implementation (instead of a number of nodes spontaneously going offline or becoming malicious) having more nodes in the subnet would not help since all (not sure about ‘all’, but certainly too many to make progress) replicas run into the same bug

4 Likes

Is it possible to abandon the rigid structure of fixed nodes?
I am not talking about the number of nodes, but whether the protocol can detect down nodes and re-introduce healthy nodes to form a subnet?

1 Like

I agree with your high-level intent and i think it makes sense.

While I disagree with your proposed implementation, I agree with the high-level goals of reliability and improved self-healing. I believe most people would support these goals. I suspect that knowing others share your high-level intentions was the main point you wanted to convey.

I do have some bias:

  1. Almost (if not all) incidents comes from a bug being being introduced via a proposal to a subnet. Note that a proposal is VOTED on. so once a proposal is blessed, it gets deployed to all the nodes in a subnet as the new blessed version.

  2. Subnet size would not help this.

  3. I agree with the high-level intent though. The goal is reliability. Are there any historical incidents that could have been self-detected and self-fixed by the subnet (or NNS) itself without human interaction? Are automatic rollbacks helpful? Would that be possible? If it is possible, would it hav been wise? All great questions I do not know the answer to.

2 Likes

After a problem occurs, it is important to find the direct cause and solve it.

However, if you do not prevent it from happening again, the same problem will happen again.

As for this problem, has it been solved fundamentally?

Absolutely not! Will it happen again? Of course!

2 Likes

YES, as predicted, It happened again.

You can’t imagine that 3 out of a 13-node subnet are down.

This happened in ICP.

Just to be clear: this had nothing to do with the size of the subnet. It was a bug in the implementation of the protocol. Post mortem report will explain the exact cause.

1 Like