I’m struggling with keeping my dApp on ICP running 24/7/365.
The problem is that constant subnet upgrades take my canisters DOWN. It often happens multiple times a day, on multiple subnets at the same time.
Symptoms: the client starts returning 503 no_healthy_nodes.
In my experience, the downtime can last up to 5 minutes.
This is pretty much unacceptable in the modern software development world. Off-chain this can easily be solved by running replicated front-end instances and using a fault-tolerant database.
I have absolutely no idea how to solve this on ICP. Subnets are supposed to be fault-tolerant and upgraded using rolling upgrades.
How do we run something that requires constant uptime?
What is your dapp, which subnet or subnets is it running on, and how often is it going down?
This seems odd, because it implies that most major dapps on ICP are going down constantly as well. I don’t see the NNS, ICPSwap, KongSwap, BOB, Taggr, Yral, OpenChat, etc. going down all the time. Do you?
My guess is that whatever is going on is a localized issue, possibly with a new node provider that has just set up shop.
Try running uptime monitoring 24/7 (I do): query any dApp on the IC every second and you will understand what I am talking about.
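A monitor like this can be sketched in a few lines. This is a minimal illustration, not production code: the `probe` callable stands in for an actual HTTPS request to your canister's endpoint (which you would make with `urllib` or a similar client), and the function names `classify` and `monitor` are mine, not part of any IC SDK.

```python
import time

def classify(status_code: int, body: str) -> str:
    """Classify a single probe result. A 503 whose body mentions
    no_healthy_nodes is the symptom described in this thread."""
    if status_code == 200:
        return "up"
    if status_code == 503 and "no_healthy_nodes" in body:
        return "upgrade_outage"
    return "error"

def monitor(probe, interval_s=1.0, max_probes=None):
    """Call `probe()` (returning a (status_code, body) tuple) every
    `interval_s` seconds and yield (timestamp, classification) pairs."""
    n = 0
    while max_probes is None or n < max_probes:
        status, body = probe()
        yield time.time(), classify(status, body)
        n += 1
        if max_probes is None or n < max_probes:
            time.sleep(interval_s)
```

Logging the yielded classifications with timestamps is enough to see each upgrade window appear as a contiguous run of `upgrade_outage` results.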
Even the NNS goes down from time to time because of non-rolling subnet upgrades.
The incident isn’t limited to a single subnet; it’s a widespread, rotten practice of upgrading subnets without rolling upgrades. It temporarily takes down ALL canisters on the affected subnet.
You can easily miss this because the downtime window is usually 2-3 minutes, but it’s still unacceptable for any serious usage.
Subnet upgrades on ICP today are not rolling: all nodes load the latest software in a coordinated fashion and then continue working. This has a big advantage in that compatibility issues between replicas are far less likely to occur, but it has the downside that the subnet briefly stops making progress during an upgrade. Concretely, these upgrades happen once a week per subnet (see https://dashboard.internetcomputer.org/releases) and lead to a short downtime. It looks like this downtime has increased a bit lately and is now a couple of minutes.
Subnet upgrades typically happen once a week, so if you see this more often, there may be another problem. If you can share when you saw it multiple times per day, we can take a better look at whether anything else was going on.
DFINITY is not currently working on improving subnet upgrades, but I do think several improvements could be made that I hope we’ll get to in the near future:
- We could optimize the current approach further to make the downtime shorter; I imagine we could get it down to the order of 1 minute. With one upgrade per week, that would mean 99.99% availability.
- We could probably go even lower by separating GuestOS upgrades from replica binary upgrades: GuestOS upgrades could conceivably be done rolling, while perhaps only replica binary upgrades need to be coordinated. Upgrade downtime could then be in the seconds, so 99.999% uptime is possible.
- We could separately improve how noticeable this is for users, e.g. by ensuring some replicas keep answering queries and accepting update calls, even though for a brief moment those messages will need to wait to be processed.
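The availability figures above follow from simple arithmetic over a weekly upgrade cadence. A quick check (the specific downtime durations are illustrative assumptions, not measurements):

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800

def availability(downtime_s_per_week: float) -> float:
    """Fraction of the week the subnet is making progress,
    assuming one coordinated upgrade per week."""
    return 1.0 - downtime_s_per_week / SECONDS_PER_WEEK

# Today: a couple of minutes of downtime per weekly upgrade.
print(f"{availability(150):.4%}")
# Optimized coordinated upgrade: about 1 minute.
print(f"{availability(60):.4%}")
# Rolling GuestOS plus coordinated replica swap: a few seconds.
print(f"{availability(5):.4%}")
```

One minute per week lands just above the 99.99% ("four nines") threshold, and single-digit seconds per week clears 99.999%, matching the estimates above.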
Your idea of scheduling subnet upgrades and exposing an API for them is also interesting, and something we could explore further. You could imagine subnet upgrade proposals carrying a time, and clients being able to query upcoming subnet upgrades.
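To make the idea concrete, here is one hypothetical client-side shape such an API could take. None of this exists on ICP today: the `ScheduledUpgrade` record and `in_maintenance_window` helper are entirely invented to sketch how a client could queue work around an announced upgrade window instead of hitting 503 no_healthy_nodes.

```python
from dataclasses import dataclass

@dataclass
class ScheduledUpgrade:
    """Hypothetical record a schedule-query API might return."""
    subnet_id: str
    starts_at: float          # unix timestamp the upgrade begins
    expected_duration_s: float

def in_maintenance_window(upgrades, subnet_id, now, margin_s=30.0):
    """True if `now` falls inside (or within `margin_s` of) a scheduled
    upgrade window for the given subnet. A client could then queue
    update calls locally instead of failing them."""
    for u in upgrades:
        if u.subnet_id != subnet_id:
            continue
        if u.starts_at - margin_s <= now <= u.starts_at + u.expected_duration_s + margin_s:
            return True
    return False
```

The safety margin accounts for the fact that any announced duration would only be an estimate.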