Repeated downtime on front end and backend canister for DSocial

Uptime for IC front end of DSocial (dwqte-viaaa-aaaai-qaufq-cai) has been 99.91% for last 7 days days, 99.97% for last 30 days. Roughly 4 outages in July, 4 in June, 503 for 1-4 minutes each time.


Screenshot 2022-07-18 at 15.03.50

Now I am starting to get the same for my backend contracts (mw5fv-3qaaa-aaaaj-aacba-cai), having to build retry logic in front end to make basic query calls. Users come to homepage of DSocial and get a blank screen - backend call failed.

Are dfinity eng team aware of this? Is this just me or system wide?

BTW, this is not deployments, this is just the system running on it’s own. Canisters are running out of memory of cycles either.

Outage log file is as follows:

Event,Date-Time,Reason,Duration,"Duration (in mins.)","Monitor URL"
Up,"2022-07-18 11:39:26",OK,"1 hrs, 29 mins",89,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-07-18 11:38:27","Internal Server Error","0 hrs, 0 mins",1,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-07-13 16:07:44",OK,"115 hrs, 30 mins",6931,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-07-13 16:06:44","Internal Server Error","0 hrs, 1 mins",1,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-07-13 13:56:24",OK,"2 hrs, 10 mins",130,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-07-13 13:52:24","Internal Server Error","0 hrs, 4 mins",4,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-07-12 21:19:03",OK,"16 hrs, 33 mins",993,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-07-12 21:18:04","Not Found","0 hrs, 0 mins",1,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-07-11 16:35:49",OK,"28 hrs, 42 mins",1722,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-07-11 16:33:49","Internal Server Error","0 hrs, 2 mins",2,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-06-30 12:27:19",OK,"268 hrs, 6 mins",16087,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-06-30 12:25:19","Internal Server Error","0 hrs, 2 mins",2,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-06-17 12:19:27",OK,"312 hrs, 5 mins",18726,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-06-17 12:15:27","Internal Server Error","0 hrs, 4 mins",4,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-06-10 12:21:48",OK,"167 hrs, 53 mins",10074,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-06-10 12:19:33","Internal Server Error","0 hrs, 2 mins",2,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-06-02 16:24:17",OK,"187 hrs, 55 mins",11275,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Down,"2022-06-02 16:23:17","Internal Server Error","0 hrs, 1 mins",1,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Up,"2022-05-31 15:02:18",OK,"49 hrs, 20 mins",2961,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
Started,"2022-05-31 15:01:39",Started,"0 hrs, 0 mins",1,https://dwqte-viaaa-aaaai-qaufq-cai.raw.ic0.app/
5 Likes

I believe what you’re observing here is IC upgrades. The number of events you list for the months of June and July matches the IC’s upgrade frequency (i.e. usually once per week but we had to deploy some hotfixes last week and there were more than usual). During an IC upgrade, replicas have to be stopped and then restarted with the new binary and that results in the canisters hosted on the subnet being offline for a few minutes (also matching your observed 1-4 minutes “outage”).

7 Likes

Can’t the ic upgrade be seamless without any interruption? How can a critical system run on IC if the upgrades will cause temp outage? I’m not a developer but I guess most if not all, would want to ensure there is minimal unplanned interruption (eg. ic upgrade) to their app.

5 Likes

You make some very valid points. Indeed, the situation is not ideal, but the IC protocol poses some unique challenges that make this problem (zero downtime for canisters during IC upgrades) non-trivial to solve.

First, let’s clarify that because replicas need to restart with a new binary during an upgrade it’s a given that there’s gonna be some time that a replica is down. That’s unavoidable. However, there are potential ways to “hide” this from users and provide a better experience.

In a traditional cloud platform, (e.g. AWS), such upgrades are not affecting your service because they usually rely on some form of rolling upgrade (which means that not all machines restart at the same time). This allows your service to reply to requests during a software upgrade of the underlying platform.

In the IC, such a rolling upgrade is substantially harder to pull off. The requirements imposed by our consensus protocol (essentially a majority of replicas needs to agree on some state) make rolling upgrades, where some replicas restart before others, very hard to work under any scenario (happy case is relatively easy to imagine how it can be made to work).

That said, there are potential designs that could make a form of rolling upgrades possible. There have been ideas of having a dedicated set of replicas (for each subnet) that would be serving just queries (so they wouldn’t participate in the consensus protocol) which can be used to scale query capacity. In such a world, you could imagine that we can upgrade the consensus-replicas first, while query-replicas are still serving queries, then upgrade the query-replicas while the consensus-replicas can serve queries. This would essentially give you no down time for queries which for most cases is more than enough to keep a substantial functional part of an IC dapp online (updates would still be blocked while the consensus-replicas are restarting). Of course, this is just an idea as I said, there are still a few details that would need to be figured out but I hope it gives you an idea of what could be possible.

Let me close with a final remark: The current downtime of restarting replicas is (as observed) in the order of couple minutes. I think there’s room for improvement and these can maybe even get down to seconds in the future. Combined with potential improvements on the boundary nodes to be made aware of such cases and retry requests, we can probably improve a lot on the current UX.

10 Likes

Seconds would be a big improvement, but I think this also depends on the frequency of upgrades. It sounds like this would be the easier option to achieve (vs. special read replicas).

What about scheduling the “downtime” and providing such an API or at least announcing the timing of the replica upgrade so that apps can plan for it?

1 Like

What about scheduling the “downtime” and providing such an API or at least announcing the timing of the replica upgrade so that apps can plan for it?

The problem with this is that we might need to rollout hotfixes (and sometimes even security related hotfixes) which are unplanned and we can’t give advance notice unfortunately. AFAIK, the regular schedule is also sometimes subject to change (e.g. some subnet might be upgraded before another or the other way around), so, even for the normal weekly upgrade, I’m not sure how strict of a schedule we can commit to. It’s definitely a valid idea though and I can bring back this feedback to the team.

providing such an API or at least announcing the timing of the replica upgrade so that apps can plan for it

One approach you can follow already today is to watch out for proposals that upgrade the subnet your dapp is running on (example) and expect that once its status is “Executed”, the replicas will restart within 15 minutes or so.

4 Likes

Thanks for the replies. It seems that these downtimes have been more frequent recently. I’m not actually checking the backend automatically, but multiple times I’ve randomly opened the website and just got a blank screen to find the backend canister is down. So if I’m seeing this so will the users, and thus destroy engagement.

Is there anything that be done quickly to reduce the downtime to seconds?

If we can get down to a few seconds my retry policies will probably work and but I’m probably still going to need to migrate the front end to centralised provider unfortunately.

Thanks in advance!

It seems that these downtimes have been more frequent recently.

As I mentioned in a previous message, we had to perform some hotfixes last week and that meant more upgrades than the usual 1 per week. I’d recommend monitoring how things play out the next couple of weeks (until end of July/early August) with hopefully only planned upgrades. If you still see frequent “outages”, there might be something else that we need to take a closer look at.

Is there anything that be done quickly to reduce the downtime to seconds?

I’m afraid I can’t give a good estimate on this. I’m not super familiar with the parts that are slow during a replica restart and how they can be improved and even then it’s always a matter of other priorities for the team(s) that would need to work on these improvements. I can bring your feedback back to our internal teams though and see what we can do.

If we can get down to a few seconds my retry policies will probably work and but I’m probably still going to need to migrate the front end to centralised provider unfortunately.

I’m not sure I see how moving your front-end to a centralised provider solves the problem really. The backend would still be running on the IC, right?

@dsarlis

What happens if I deploy an upgrade to my canister wasm that starts right before a replica upgrade?

Let’s say I push out an upgrade canister and it’s in the midst of the pre-upgrade method serializing my wasm heap to stable memory when the local replica get upgraded and that replica upgrade is pushed out to my canister. Will it wait for my wasm upgrade to complete?

What about the reverse scenario where the replica upgrade is in progress and I attempt to upgrade my canister’s wasm?

1 Like

@justmythoughts

Good question!

Let’s take the first scenario. An upgrade of your canister is in the middle of being executed and there’s a replica upgrade that is about to happen. The replica will upgrade (i.e. restart with a new binary) only after having created a CUP (Catch Up Package) which is essentially a certified version of its state which happens on some regular interval (every X blocks). This can only be created after the execution round has ended, which means that your canister upgrade will have finished before the replica can create the required CUP (either in this round or some previous round) and restart.

Conversely, and that leads us to the second scenario you were asking about, if the replica upgrade starts first, since the replica has to restart (as I explained in an earlier message), it will not be available for you to submit your upgrade request and your request will likely be rejected. You’ll have to retry after the replica has finished upgrading. If you manage to submit your request right before the replica restarts, your request will be waiting in the queue until the replica restarts successfully with the new binary and then at some point process it.

4 Likes

I understand, let’s see how the next few weeks go. As for the front end migration, the reason I’ll probably need to migrate is because it going down so regularly.

People are going to DSocial.app and getting not found or server error. That’s not good :wink:

1 Like

We did a review of the configuration of boundary nodes and loosened some OS and Nginx parameters leading to a nice reduction in 500 errors.

This should help. If you get these errors in the future please check the response in the 500 error - there may be a useful message in there as to what is going wrong.

Looking at the access/error log it looks as if there might have been an upgrade affecting your canister:

2022/07/29 19:18:15 [info] 49490#49490: *30100941 client canceled stream 5 while connecting to upstream, client: 127.0.0.1, server: boundary.dfinity.network, request: "POST /api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query HTTP/2.0", upstream: "https://[2604:7e00:50:0:5000:64ff:fea3:ccaa]:8080/api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query", host: "ic0.app"
2022/07/29 16:40:08 [warn] 44895#44895: *50894397 send() to syslog failed while logging request, client: 109.29.102.10, server: boundary.dfinity.network, request: "POST /api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query HTTP/2.0", upstream: "https://[2a02:418:3002:0:5000:78ff:fe58:3650]:8080/api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query", host: "ic0.app", referrer: "https://dwqte-viaaa-aaaai-qaufq-cai.ic0.app/"
2022/07/29 16:40:08 [warn] 44895#44895: *50894397 send() to syslog failed while logging request, client: 109.29.102.10, server: boundary.dfinity.network, request: "POST /api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query HTTP/2.0", upstream: "https://[2a00:fa0:3:0:5000:2dff:fee9:b0e]:8080/api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query", host: "ic0.app", referrer: "https://dwqte-viaaa-aaaai-qaufq-cai.ic0.app/"
2022/07/29 10:52:22 [error] 31906#31906: *18676115 upstream timed out (110: Connection timed out) while connecting to upstream, client: 127.0.0.1, server: boundary.dfinity.network, request: "POST /api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query HTTP/2.0", upstream: "https://[2600:3000:6100:200:5000:36ff:fe30:93d]:8080/api/v2/canister/dwqte-viaaa-aaaai-qaufq-cai/query", host: "ic0.app"
8 Likes

EPIC

At 100% for last 7 days, let’s see how it goes. This looks like it’s fixed the issue.

5 Likes