ICP Uptime Guarantees

Hello there.

I’m struggling with keeping my dApp on ICP running 24/7/365.

The problem is that there are constant subnet upgrades which take my canisters DOWN. It often happens multiple times a day, and on multiple subnets at the same time.

Symptoms: the client starts returning 503 no_healthy_nodes errors.

In my experience, the downtime can last up to 5 minutes.

This is pretty much unacceptable in the modern software development world. Off-chain this can easily be solved by running replicated front-end instances and using a fault-tolerant database.

I have absolutely no idea how to solve this on ICP. Subnets are supposed to be fault-tolerant and upgraded using rolling upgrades.

How do we run something that requires constant uptime?

7 Likes

If rolling upgrades are not possible, can we at least have some kind of API to fetch “scheduled maintenance upgrades” and their downtime windows?

That way dApps could prepare for downtime. For example, I could disable ledger operations inside my dApp during the scheduled downtime windows.
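To make that idea concrete, here is a minimal front-end sketch, assuming a hypothetical API that returns upcoming upgrade windows for a subnet (no such endpoint exists today; the types, URL, and function names are placeholders):

```typescript
// Hypothetical sketch: assumes a future API that returns scheduled upgrade
// windows for a subnet. Nothing below corresponds to an existing IC endpoint.

interface MaintenanceWindow {
  startMs: number; // window start, Unix epoch milliseconds
  endMs: number;   // window end, Unix epoch milliseconds
}

// Placeholder for the API this post is asking for.
async function fetchScheduledWindows(subnetId: string): Promise<MaintenanceWindow[]> {
  // e.g. fetch(`https://example.invalid/subnets/${subnetId}/upgrades`) once such an API exists
  return [];
}

// Ledger operations are allowed only when "now" is outside every scheduled window.
async function ledgerOperationsEnabled(subnetId: string): Promise<boolean> {
  const windows = await fetchScheduledWindows(subnetId);
  const now = Date.now();
  return !windows.some((w) => now >= w.startMs && now <= w.endMs);
}

// Usage inside the dApp: gate transfers behind the check.
async function tryTransfer(subnetId: string, doTransfer: () => Promise<void>): Promise<void> {
  if (!(await ledgerOperationsEnabled(subnetId))) {
    console.warn("Scheduled subnet upgrade in progress; ledger operations are paused.");
    return;
  }
  await doTransfer();
}
```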

4 Likes

What is your dapp, which subnet or subnets is it running on, and how often is it going down?

This seems odd, because you are implying that most major dapps on ICP are going down constantly as well. I don’t see the NNS, ICPSwap, KongSwap, BOB, Taggr, Yral, OpenChat, etc. going down all the time, do you?

I would guess that whatever is going on is a localized issue, possibly with a new node provider that has just set up shop?

1 Like

Try running uptime monitoring 24/7 (I do) and querying any dApp on the IC every second, and you will understand what I am talking about :slight_smile:
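For anyone who wants to reproduce this, a minimal probe along those lines could look like the following (the URL is a placeholder for your own canister; a real setup would ship the results to a monitoring backend rather than the console):

```typescript
// Simple uptime probe: hit a dApp's HTTP gateway URL once per second and log
// any non-OK response, e.g. the 503 no_healthy_nodes errors described here.

const TARGET = "https://<your-canister-id>.icp0.io/"; // placeholder

async function probeOnce(): Promise<void> {
  try {
    const res = await fetch(TARGET, { method: "GET", cache: "no-store" });
    if (!res.ok) {
      const body = await res.text();
      console.error(`${new Date().toISOString()} HTTP ${res.status}: ${body.slice(0, 200)}`);
    }
  } catch (err) {
    console.error(`${new Date().toISOString()} request failed:`, err);
  }
}

// Fire a probe every second, 24/7.
setInterval(probeOnce, 1_000);
```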

Even the NNS goes down from time to time because of non-rolling subnet upgrades.

The issue isn’t limited to a single subnet; it’s a widespread rotten practice of upgrading subnets without doing rolling upgrades. It literally takes down ALL canisters on the affected subnet temporarily.

You can easily miss this because the downtime window is usually 2-3 minutes, but it’s still unacceptable for any serious usage.

The specific subnet I’m complaining about is https://dashboard.internetcomputer.org/network/subnets/nl6hn-ja4yw-wvmpy-3z2jx-ymc34-pisx3-3cp5z-3oj4a-qzzny-jbsv3-4qe

but it’s not limited to just this subnet; I can observe the same on https://dashboard.internetcomputer.org/network/subnets/snjp4-xlbw4-mnbog-ddwy6-6ckfd-2w5a2-eipqo-7l436-pxqkh-l6fuv-vae

Btw, this downtime is -never- reflected on ANY IC dashboards; to catch it you actually have to have your own monitoring in place (which I do).

When this happens I get 503 no_healthy_nodes, and other subnets also start going down in a similar way… I don’t know what’s up.

I know because I’m always devving my dApp and can often see stuff becoming temporarily unavailable in the dev console (plus Sentry uptime monitoring reports).
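One partial client-side mitigation is to retry failed calls with exponential backoff so a 2-3 minute window doesn’t immediately surface as a hard error. A generic sketch (the wrapper is agnostic to whether the call is an agent-js query or a plain fetch; the timings are illustrative, not recommendations):

```typescript
// Generic retry-with-backoff wrapper. With the defaults below the retries span
// roughly 2.5 minutes (5 + 10 + 20 + 40 + 80 seconds), enough to bridge the
// short upgrade windows discussed in this thread.

async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 6,
  baseDelayMs = 5_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // out of attempts
      const delayMs = baseDelayMs * 2 ** attempt; // 5s, 10s, 20s, ...
      console.warn(`call failed (attempt ${attempt + 1}/${maxAttempts}), retrying in ${delayMs} ms`, err);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```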

1 Like

But really, I’d like someone from @dfinity to answer: when are we going to see rolling upgrades for subnets?

Otherwise not a single enterprise will ever host on ICP; uptime is very important.

Off-chain even hobby developers can effortlessly get higher uptime than we currently have on-chain :frowning:

1 Like

I believe you. Hopefully someone like @Manu can weigh in and give you an answer to this complaint…


1 Like

Subnet upgrades on ICP today are not rolling: all nodes load the latest software in a coordinated fashion and then continue working. This has a big advantage in that compatibility issues between replicas are far less likely to occur, but has the downside that the subnet briefly stops making progress during an upgrade. Concretely, these upgrades happen once a week per subnet (see https://dashboard.internetcomputer.org/releases) and lead to a short downtime. It looks like this downtime has increased a bit lately and is now a couple of minutes.

Subnet upgrades typically happen once a week, so if you see this more often, there may be another problem. If you can share when you saw this multiple times per day, we can take a closer look at whether anything else was going on.

DFINITY is not currently working on improving subnet upgrades, but I do think several improvements can be made that I hope we’ll get to in the near future:

  • We could optimize the current approach further to make the downtime shorter; I imagine we could get it down to the order of 1 minute. With one upgrade per week, that would mean 99.99% availability (see the worked numbers after this list).
  • We could probably go even lower by separating Guest OS upgrades from replica binary upgrades: you could imagine Guest OS upgrades being done in a rolling fashion, while perhaps only replica binary upgrades need to be coordinated. Upgrade downtime could then be in the range of seconds, so 99.999% uptime is possible.
  • We could separately improve how noticeable this is for users, e.g. by ensuring some replicas keep answering queries and accepting update calls, even though those messages will briefly have to wait to get processed.
  • Your idea of scheduling subnet upgrades and exposing some API for them is also interesting and something we could explore further. You could imagine subnet upgrade proposals containing a time, so that you can query upcoming subnet upgrades.
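For context, here is a quick worked version of the availability figures in the list above, assuming exactly one upgrade per week and no other downtime:

```typescript
// Availability = 1 - (downtime per week / seconds per week).

const SECONDS_PER_WEEK = 7 * 24 * 60 * 60; // 604,800

function weeklyAvailability(downtimeSeconds: number): number {
  return 1 - downtimeSeconds / SECONDS_PER_WEEK;
}

console.log(weeklyAvailability(180).toFixed(6)); // ~3 min today  -> 0.999702 (~99.97%)
console.log(weeklyAvailability(60).toFixed(6));  // ~1 min        -> 0.999901 (~99.99%)
console.log(weeklyAvailability(6).toFixed(6));   // a few seconds -> 0.999990 (~99.999%)
```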
6 Likes

Then it must not be just subnet upgrades I’m seeing…

One prominent downtime I saw (across multiple subnets) was yesterday (01 Oct) at ~3:20 PM (UTC+4).

There were multiple 503 errors with no_healthy_nodes (js agent, front-end).

1 Like