Yes, that is exactly what happened. The only change relative to the version that the NNS was running before the upgrade is the LSMT flag.
You are right that we had to push a hotfix version without giving the community time to evaluate it fully. I’m sure there will be more on that in the post-mortem.
Unfortunately that would not work. It’s important that all replicas behave the same, but restarts are non-deterministic. Any machine might crash for any reason and restart itself. What we would need is that the flag flips not at a restart, but at a version change (as they happen at the same height on all replicas). And the way we achieve this is by having a separate version.
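To make the determinism argument concrete, here is a minimal toy sketch in Python. It is not replica code: the heights, the restart schedule, and the function names are all made up for illustration. It only shows why a flag that flips on restart can leave replicas disagreeing at a given height, while a flag tied to a version change cannot.

```python
# Toy model, not replica code. All names and numbers are hypothetical.

UPGRADE_HEIGHT = 100  # assumed height at which every replica switches version

# Each machine crashes and restarts at its own, unpredictable height.
RESTART_HEIGHTS = [90, 97, 120, 131]


def flag_by_restart(height, restart_height):
    """Flag is picked up whenever this particular machine restarts."""
    return height >= restart_height


def flag_by_version(height, upgrade_height=UPGRADE_HEIGHT):
    """Flag is tied to the version, which changes at one agreed height."""
    return height >= upgrade_height


def replicas_agree(flags):
    """True if every replica sees the same flag value."""
    return len(set(flags)) == 1


height = 100
restart_flags = [flag_by_restart(height, r) for r in RESTART_HEIGHTS]
version_flags = [flag_by_version(height) for _ in RESTART_HEIGHTS]

print(replicas_agree(restart_flags))  # replicas 3 and 4 have not restarted yet
print(replicas_agree(version_flags))  # all replicas flipped at the same height
```

In this toy setup the restart-based flag splits the replicas into two groups at height 100, which is exactly the divergence we cannot allow.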
To a degree, what version the NNS is running is orthogonal to the storage limit. The performance is not a function of the limit, but of the actual state size. The NNS subnet is a closed set of canisters, and they are not going to grow in size much, no matter what the limit is. Note that the new storage layer is already running on all other subnets, and there are no plans to change that.
As for the plan going forward, we will definitely increase it in small increments. As for timing, I can’t promise anything concrete. It’s part of the upcoming Stellarator milestone, which is still a work in progress. The storage layer was the big blocker, but we also wanna get rid of the small blockers. Plus, like you said, there is no hurry if there are still question marks around LSMT.
It’s important that all replicas behave the same, but restarts are non-deterministic. Any machine might crash for any reason and restart itself. What we would need is that the flag flips not at a restart, but at a version change (as they happen at the same height on all replicas). And the way we achieve this is by having a separate version.
Do you think it would be feasible to introduce the concept of a setup config to achieve this without requiring a version election? Could SetupOS be responsible for reading the LSMT flag (or a similar config in a future scenario) and passing it as an argument during the GuestOS install (rather than GuestOS being responsible for reading certain config directly)? Then an NNS function that triggers a reinstall of the existing replica after updating the subnet config could be used instead of forcing a new version into the registry and deploying it (via two big separate proposals). A subnet config update seems more desirable if it can be achieved, as it’s somewhat smaller in scope.
I suspect there’s something I’m not considering with this suggestion. I learn a lot through your responses! Thanks in advance
Good question @Lorimer and unfortunately I cannot give you a satisfying answer.
I remember the same discussions came up internally (long before LSMT) when we established the process of what to do when we wanted partial rollouts.
I am a bit out of my depth here, but the main constraint that came up is that everything, including any configs, needs to be NNS controlled. And for that we have mainly two levers: the replica version and registry parameters. The replica version is what we are using now, and registry parameters are a poor fit for a couple of reasons.
The first is that it would need some considerable work and complexity to introduce upgrade-only registry parameters beyond the existing one (namely the required registry version). I don’t know the details here but it was dismissed at the time by people more knowledgeable about this than me. The link of the LSMT flag to restarts is also kinda special, so any work in generalizing these parameters might translate poorly to future features anyway.
The second and probably more important reason is that the registry is really only used for permanent parameters. So the parameter itself is the feature, and we particularly use it for things where we want different parameters for different subnets. For example, most subnets have a checkpoint interval of 500, but a few have 200. That is a registry parameter, and we might change these numbers at some point in the future on a per-subnet basis. But the requirements for a staged rollout, like the one for the new storage layer, are really different. During a staged rollout we want a parameter that exists for a limited amount of time, while we gain confidence in the feature. Eventually the flag becomes dead code, and we remove it in the replica itself. In the registry however, as far as I understand, we can’t really remove old flags; they just stick around forever.
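A rough Python sketch of that distinction, purely for illustration. The subnet names, version strings, and data layout below are hypothetical and do not reflect the actual registry schema. The checkpoint interval is a long-lived per-subnet value, while the rollout flag lives inside the build and disappears once a later version drops it.

```python
from dataclasses import dataclass

# Hypothetical layout; not the real registry schema.

# Permanent, per-subnet parameter: the value itself is the feature.
checkpoint_interval = {"subnet-a": 500, "subnet-b": 200}


@dataclass(frozen=True)
class ReplicaBuild:
    version: str
    new_storage_layer: bool  # temporary rollout flag, baked into the build


# Two elected versions that differ only in the flag (names are made up).
elected = {
    "v1": ReplicaBuild("v1", new_storage_layer=False),
    "v1-new-storage": ReplicaBuild("v1-new-storage", new_storage_layer=True),
}

# Staged rollout: each subnet is pointed at one of the elected versions.
subnet_version = {"subnet-a": "v1-new-storage", "subnet-b": "v1"}


def uses_new_storage(subnet_id: str) -> bool:
    """Look up the flag through the subnet's replica version."""
    return elected[subnet_version[subnet_id]].new_storage_layer


print(uses_new_storage("subnet-a"), checkpoint_interval["subnet-a"])
print(uses_new_storage("subnet-b"), checkpoint_interval["subnet-b"])
```

Once the rollout is done, a later build can simply drop the `new_storage_layer` field, whereas the `checkpoint_interval` entries stay around for as long as the subnets exist.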
The third option of introducing a setup option different from the registry likely just inherits the issues of using the registry. The same requirements for what has to be permanent will likely apply, the work involved is similar if not greater, and the overall usefulness going forward is questionable.
Thanks a lot @stefan.schneider, that’s very useful context! Seems like a topic to potentially revisit in the future if/when it becomes relevant again. Thanks for going through some of the details.