Canister returns 504, too much traffic on the subnet?

Hi, this is the developer from DFinance. We launched our testnet today and everything went smoothly until a while ago, when users started saying that loading was too slow. I tried to call our backend canister with dfx, but I got a 504 error. Is this because of too much traffic? I thought the IC had load balancing?

Canister id: lf23w-ciaaa-aaaah-qaeya-cai (ic.rocks)
We are in this subnet: pjljw-kztyl-46ud4-ofrj6-nzkhm-3n4nt-wi3jt-ypmav-ijqkt-gjf66-uae (ic.rocks)

6 Likes

I believe this is a problem with your subnet. Other projects reported the same issues, maybe you're all on the same subnet? (Entrepot & ICPunks)

https://dashboard.internetcomputer.org/subnet/pjljw-kztyl-46ud4-ofrj6-nzkhm-3n4nt-wi3jt-ypmav-ijqkt-gjf66-uae

Yes I believe so, we are in the same subnet.

1 Like

gg, you broke the IC!

1 Like

Haha, a stress test, find the problem early so we can grow stronger.
It seems that the current subnet architecture cannot handle too much traffic.

4 Likes

Luckily the IC can scale and add nodes to the subnets :slight_smile:!

2 Likes

We've also launched a whitelist event, which involved multiple people logging in to our website icpunks.com, and we had the exact same problem. It looks like we did a stress test :smiley:

1 Like

About 9k users participated in our testnet event. For the first several hours it was working smoothly, then it got stuck.
[Screenshot: Screen Shot 2021-08-26 at 12.52.33 AM]

1 Like

So you are the culprit ;). I'm glad that it happened today and not on the claiming day :D.

As far as we can tell, this has to do with the fact that subnet pjljw handles by far the most queries, combined with an issue we discovered recently, where we only cache the compiled Wasm when executing updates, not queries. This means that before an update is executed on a canister, each query will compile the Wasm from scratch. Because of the load, some queries will be really slow, while other queries will time out before they even get a chance to execute.

The reason why this started suddenly is that we upgraded the replica version on subnet pjljw a few hours ago and the caches got purged. Why this behavior wasn't noticed until now (on previous replica upgrades) is unclear.

We're working on a fix, but it may take a while to deploy. In the meantime, the more canisters on pjljw handle at least one update, the less contention among queries, so this should become better over time.
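As a rough sketch of what "handle at least one update" means in practice: a single update call via dfx should be enough to get the compiled Wasm cached for a canister. The method name `register` below is only a placeholder for whichever update method your canister actually exposes, and the canister id is the one from the original post.

```
# Placeholder update call: swap in a real update method and arguments for your canister.
# Updates go through consensus, and executing one populates the compiled-Wasm cache,
# so subsequent queries on that canister no longer recompile from scratch.
dfx canister --network ic call lf23w-ciaaa-aaaah-qaeya-cai register '()'
```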

11 Likes

Thanks for the fast response. Is there anything that we can do right now, or just wait?

Make sure to run an update query on each of your canisters. (o:

More seriously though, I don't think there's anything you can do. I'll try to figure out if I can find a way to run a replicated query (i.e. run a query via an ingress message) on all canisters on the subnet, to prime the cache. Actually, now that I think of it, anyone could do it.
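For anyone who wants to try, something along these lines should work, assuming your dfx version supports the `--update` flag on `dfx canister call`. The canister id list and the method name are placeholders; substitute canisters and query methods you actually know about.

```
# Canister ids and the query method name below are placeholders, not real endpoints.
CANISTERS="lf23w-ciaaa-aaaah-qaeya-cai"
for id in $CANISTERS; do
  # --update forces the query to execute as a replicated (ingress) call,
  # which warms the compiled-Wasm cache just like a normal update would.
  dfx canister --network ic call --update "$id" some_query_method '()'
done
```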

3 Likes

We have the same problem accessing our application, which runs in the same subnet.

IC Drive was also affected: the app was loading super slowly and our public file links were hit as well. Seems it's fine now.

The issue still seems to be happening; I'm getting net::ERR_ABORTED 504 with my asset canister too right now.

Yes, it's happening again

1 Like

We noticed the issue happened again. The root cause has been identified and the fix is on the way. Apologies for the inconvenience and thanks for your patience.

7 Likes

Hi Shuo, do you know when it will be fixed? It's getting hard to test anything.

Thanks for asking, I've pinged the team to see who can give an update. Sorry I can't help, I'm not familiar with the status.