One potential path to make progress on this front would be along these lines:
- let replicas somehow expose the checkpoint hashes that they have
- in case of an incident that requires recovery, introduce a proposal that lets replicas on a subnet create a special checkpoint at a certain height and stop there
With that, if a subnet stalls or ends up in a crash loop, we could submit a proposal that makes replicas take a checkpoint at the latest computable state. Because the checkpoint hashes are exposed, users can then see the resulting hash, and finally a proposal can set a new recovery CUP with that state hash.
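
To make the flow a bit more concrete, here is a rough Python sketch of how the pieces could fit together. Everything in it is hypothetical: `Replica`, `take_checkpoint_and_halt`, `exposed_checkpoint_hashes`, `agreed_hash`, `RecoveryCupProposal` and `propose_recovery_cup` are illustrative names, not existing replica or governance APIs. The agreement `threshold` is also my assumption, since the idea above does not say how many replicas would need to report the same hash.

```python
from __future__ import annotations

import hashlib
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for a replica; not the real replica API."""
    node_id: str
    checkpoints: dict[int, str] = field(default_factory=dict)  # height -> state hash
    halted: bool = False

    def take_checkpoint_and_halt(self, height: int, state: bytes) -> None:
        # Effect of the "special checkpoint" proposal: checkpoint at `height`, then stop.
        self.checkpoints[height] = hashlib.sha256(state).hexdigest()
        self.halted = True

    def exposed_checkpoint_hashes(self) -> dict[int, str]:
        # The checkpoint hashes this replica makes visible to observers.
        return dict(self.checkpoints)


@dataclass(frozen=True)
class RecoveryCupProposal:
    """Hypothetical payload of the final proposal setting the recovery CUP."""
    height: int
    state_hash: str


def agreed_hash(replicas: list[Replica], height: int, threshold: int) -> str | None:
    """Return the hash reported by at least `threshold` replicas at `height`, else None."""
    counts = Counter(
        hashes[height]
        for hashes in (r.exposed_checkpoint_hashes() for r in replicas)
        if height in hashes
    )
    if not counts:
        return None
    state_hash, votes = counts.most_common(1)[0]
    return state_hash if votes >= threshold else None


def propose_recovery_cup(
    replicas: list[Replica], height: int, threshold: int
) -> RecoveryCupProposal:
    """Build the recovery CUP proposal from the publicly visible checkpoint hashes."""
    state_hash = agreed_hash(replicas, height, threshold)
    if state_hash is None:
        raise ValueError(f"no checkpoint hash with {threshold} matching reports at height {height}")
    return RecoveryCupProposal(height=height, state_hash=state_hash)


if __name__ == "__main__":
    # Four replicas; three reach the same state at height 100, one has diverged.
    replicas = [Replica(f"node-{i}") for i in range(4)]
    for r in replicas[:3]:
        r.take_checkpoint_and_halt(100, b"latest computable state")
    replicas[3].take_checkpoint_and_halt(100, b"diverged state")
    print(propose_recovery_cup(replicas, height=100, threshold=3))
```

The point of the sketch is only that the state hash in the final proposal would come from what the replicas themselves expose, so anyone voting on it can check it against the same data.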