RCA for ICPunk website move away from IC 4 hrs prior to launch

Lol, we’re all watching this play out along with you at Dfinity. We’ve got a bunch of new data from this drop, and we have new targets to design around. I’m still pretty positive though - things slowed down quite a bit under all the traffic, but nothing actually broke

18 Likes

The “nothing actually broke” is SO incredibly important here… Many other chains have had NFT drops that brought them to their knees. This drop was “more” than the system could handle, but it didn’t die. That is very very good. Additionally, it was seemingly just the subnets that were at the highest usage that were really being brought down. Not the entire Internet Computer. Although, I clearly have less purview than the fine folks behind the curtain.

4 Likes

Thank you for this important update. Our community is eagerly waiting for the postmotem for this incident InternetComputer Status - Multiple subnets are experiencing degraded performance and user traffic throttling due to high traffic.. I hope it will be transparent analysis and clear future plan for improvement.

3 Likes

The developer perception of IC is everything. So what if “nothing actually broke”? The users were furious. From their perspective, the system did die. They couldn’t care less about subnets or whether the IC as a whole was alive and well. They couldn’t claim the punks and that was all that mattered to them.

IC needs to be transparent about what happened and what and how (and when) it is expected to be fixed.

2 Likes

Agree. Transparency is needed right now

Hey folks,

Wow. IC Punks is a crazy wild event with incredible demand, so while I am not aware of the details (or even high level), it would be far to say the following:

  1. Demand and traffic was crazy high, and while there appear to be throttling and performance issues, the subnets never failed over (a technical nuance I find important)

  2. Yes, there are engineers working tirelessly right now to diagnose, understand, and triage the issues. They will update https://status.internetcomputer.org/ when things are more stable.

  3. As I know more, I will let the community know.

I admit, I have been more focused on public roadmap transparency and community-wide designs, so I am less informed on this topic than I should be given my role, I have been letting the R&D team do their thing. Sorry, I do not know more at the moment, but it’s important to be honest when one does not know.

For transparency, this is a copy/paste of what I posted here, but I wanted folks to see my stand on it: https://www.reddit.com/r/dfinity/comments/pg31ey/icpunks_postmortem/hb9xltu?utm_source=share&utm_medium=web2x&context=3

6 Likes

It seems that the status has been updated. “Subnets are now functioning normally and traffic has returned to normal levels.”… which would be correct; because the traffic has gone away.

Now that the engineers have gotten a breather, can someone update this post with the knowledge of what happened and how (and when and what) will be need to be changed / modified so that this situation does not repeat?

1 Like

No, it was a full success. What failed was providing of acceptable user experience while being hit by a huge load.

As you might imagine no matter what tech stack or architecture you use, there’s always a maximum throughput of ingress and outgress data unless you start scaling your service (e.g. horizontally and what not). In this particular case, we are talking not about IC as a whole, but about a single subnet, being hit the hardest. But it survived without any scaling or interactions from our side (!) and as Kyle pointed out we got a lot of new interesting data, which we’ll analyze and apply the lessons. Scaling a single canister dynamically is not a trivial task, and is not a feature IC supports yet, but Rome was not built in a day.

6 Likes

Thanks, Christian , for the update. The community experience of using IC has been far less charitable than deeming it a “full success”. If this was just expected to be an experiment to get some data, I suppose one could call it a success.

While we might have gotten some interesting data, please remember that we got this data at the expense of by making 100k users, shall we say, furious. Not the most optimal nor wise way of getting data.

Secondly , if it was known that this was not going to scale apriori (as your post seems to imply), why was it not communicated to icpunks. They were tweeting at difinity, Dominic et al for at least 24+ hours. icpunks , if I may remind you, is a recipient of dfinity grant. It seems silly to give someone a grant if one is not interested in the project that is the subject of the grant.

Thirdly what is the path forward? What is the communication strategy to deal with this situation? How are we planning to address the developer community? While Rome was certainly not built in a day, ICP has been touted to be ready to take on the toughest challenges. It clearly is not, today. Think about it from this perspective. If IC cannot sustain a website traffic of 100k users, how will the community trust it with the BTC integration?

Instead of analyzing the data and applying the lessons in private, open up a little. Engage with the community explaining what just happened. Who knows, you might get pointer or two on solving issues before they happen? Distributed computing is hard; but you guys have been at this for 5 years.

Oh and talking about the lessons learnt, I am surprised that there isn’t a load generator that should be used to simulate the load. Perhaps that could be the first thing to built so that whenever the fixes do come in, they can be tested.

3 Likes

I fully agree with you on the user experience part, as I mentioned in the previous message. I cannot speak for the entire org, but I think it’s fair to assume that this drop was not supposed to run like this. I’m not even sure there was a way to predict what happens and how high the load would be (I’m sorry if I say stupid things, I’m not involved with icpunks). I merely tried to put things into the proper context and look at it from the half-full perspective. Your criticism is absolutely reasonable, but I personally had slightly different expectations: you might know that IC is still in a beta phase (as also pointed out in many places with a beta label), so I personally think hiccups like this are unavoidable in the early days of project with the complexity of IC (to be frank, I expected much more of them).

I also think your suggestions (especially regarding being more open) are perfectly valid and I believe no one at the org would disagree, and we are moving towards this, just step by step. Experiencing it first hand, it turns out many things look from the outside very easy at a high level, but then actually reveal themselves a complex interconnection of many more issues (technical challenges, deadlines, community expectations, communications, legal, etc), which you cannot approach in isolation neglecting the rest of them.

Wrt. path forward let’s just wait for more information to come.

5 Likes

Time does not offer second chances but it does offer additional opportunities.

The math kept the network together as it was supposed to. That is a damn good day.

We need to forge this network by taking the stress up to 11. Tip of the chapeau to the punks for their creative destruction.

4 Likes

Thank you Christian. I am definitely waiting for more information about this incident. Really curious about why the CORS error happened and how IC can mitigate this in the future.

About you said “IC is still in a beta phase”, I just remembered nns.ic0.app has a beta sign. I didn’t realize Genesis launch in May is actually beta version and it seems all the Dfinity articles didn’t emphasize this is beta. Do you know any ETA for the prod (official) version? What does it really mean that current version is beta? And what does Dfinity expect the Prod version to be?

2 Likes

Please see here High User Traffic Incident Retrospective - Thursday September 2, 2021 for root cause analysis.

It is awesomely professional! Thank you so much!

7 Likes

@tcpim I guess I know as much as you do :slight_smile: I remember Dom was mentioning the beta thing in one of the official Genesis videos + here you can see that the Mercury release is marked as “Beta Mainnet + Genesis” on the release schedule. I’m not aware of any ETA’s on when we drop the beta status, but considering that we entered the GA status just a bit longer than a month ago, I think it’s safe to say IC is still in its infancy.

1 Like

The reality is a little unexpected

1 Like

I see huge market potential

1 Like

Hi,

I was responsible for preparation of the site of ICPunks for the launch. While I think the launch was a big succes (9600 different principals claimed tokens), there are things that could be done better.

The launch was really wild, we did not expect such traffic. The actual amount of people trying to claim ICPunks was huge, during the launching day more than 180k people tried to reach our website.

Statistics from DFINITY also helps to see what was the route cause of the whole problem. 35k+ request per second is really high, no single server can handle such amount of requests (of course it is possible to handle such amount of trafic, however it requires a setup of several servers, loadbalancers, CDN, GeoIP and other techniques). Single server, when optimized can handle up to 10k+ requests per second.

When we are at the statistics to get a better perspective, google currently handles ~100k searches per second. Of course google searches are way more complicated than simple claiming of NFT, however we are talking about blockchain technology, which ensures that once the data was saved it will no be changed in an unauthorized way.

I understand that many people were frustrated about the launch, given the number of people trying to claim it was 1 to 18 chance of actually getting one.

I have few ideas on what can be improved on the development side to make it better for users.

  1. Right now there is no possibility to make load balancers, readonly nodes and advanced caching on IC so big launches will cause 504 errors (most users will not be able to reach website and be furious about it). We will not make the same mistake again and no launches at given time for many users are set in near future.
  2. Offload as many static data as possible to static assets, the issues that occurred on our site (broken timer, no way to click claim) was due to the fact that the site had to reach first to IC to check status and then make it possible to claim. Expect that IC will not return any data at all, or randomly failed calls
  3. Be prepared that IC will not respond from time to time. While we implemented retries for failed requests, right now it looks like a bad design choice. Automatic retries will increase the problem, instead other notifications and some easing techniques should be used.
  4. When IC returns 504 for change calls it does not mean the call has failed. During the high traffic I noticed several times that changes in data returned 504, however the change was actually made.
  5. The read and write calls are bound to the same rules, so if IC starts returning 504 it means all calls may fail.
  6. Make more notifications for users about failed calls. In other blockchains once you send your transaction it is there, it may not be processed in near future but it will not be lost. There is no similar mechanism yet in IC. Would it be possible to have different canister types? Like read-only and write-only, and synchronize data between them when it is possible. I think that we could manage much higher load of users this way. This way we could ensure that all people could put their request in queue and get information about success of failure later.

The fact that IC did not mangle the contract data, and actually enabled claiming 10k tokens even during such high load was a great success from technology point of view.

I strongly believe that given time we can develop together good practices and implementation patterns that will improve user experience for high load canisters.

9 Likes

Reality is the tail risk ( we live in the pattern of folly and bubbles ) or to say it another way … chaos from equilibrium gives models the long finger ( hello flash crash ) ….

But here …the model held, the math prevailed, the levee did not break. Damn impressive.

1 Like

Halo,

It’s been a long time since I did math again. And I have no PhD or st like that, but you say chance was 1 to 18? 1 to 18? Let’s say it’s 1 to 18, but for one event only.
I thought you clearly stated that there were some stages in the process to finally got one punk.
At least I can point out 4 stages: reach website → login to 1 of 2 wallets (for Plug it is easy, for Identity Anchor, it’s another stage with new tab to login) → claim button appear → claim your punk successfully
So for each stage/event, you can fail at any stage/event and have to start it all again. Let’s say it’s 1 to 18 for each event or maybe number of ppl narrowing through this process, whatever. So is it 1 to 18 to finally be able to claim one punk now?

Thanks.

1 Like

From my personal experience, I spam reloading like 10 tabs in Chrome, and the furthest I could reach is Claiming button appear when there was about 1k-2k punks, at the end I could not claim one. Bad luck then.

1 Like