RCA for ICPunk website move away from IC 4 hrs prior to launch

Sad to hear that ICPunks were forced to move their website hosting from the IC to their own server, apparently because of performance issues caused by too many users signing up. ICPunks are one of the recipients of DFINITY Foundation grants.

This was after they had tweeted at DFINITY et al. yesterday about the impending stress test.

Are we listening?

Is there a plan for providing an RCA of what happened?

4 Likes

Currently, the boundary nodes pass every request through to a canister. A large drop like this is a huge amount of traffic that traditional servers employing the most battle-tested caching strategies and CDN’s still crash under. Think about all the websites that buckled under the pressure of the GPU launch last year. There will be a point when the IC ecosystem can handle launches like this, but the ICPunks team made an extremely reasonable call to deliver their website using a traditional stack.
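To make the pass-through point concrete, here is a minimal sketch of what a cache in front of a backend buys you. It is not how the boundary nodes work or will work; `TTLCache` and `hypothetical_canister` are names made up for this illustration, and the 60-second TTL is an arbitrary assumption.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache placed in front of an origin handler."""

    def __init__(self, origin: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.origin = origin
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}
        self.origin_calls = 0  # how many requests actually reached the origin

    def get(self, path: str) -> str:
        now = self.clock()
        hit = self.entries.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh entry: served from cache, origin untouched
        # Miss or stale entry: forward to the origin and remember the answer.
        self.origin_calls += 1
        body = self.origin(path)
        self.entries[path] = (now, body)
        return body


def hypothetical_canister(path: str) -> str:
    # Stand-in for the backend that would otherwise see every request.
    return f"<html>asset for {path}</html>"


def simulate(requests: int = 1000) -> int:
    """Send many requests for one static asset; return origin hits."""
    cache = TTLCache(hypothetical_canister, ttl_seconds=60.0)
    for _ in range(requests):
        cache.get("/index.html")
    return cache.origin_calls
```

Without the cache, all 1000 requests in `simulate` would reach the backend. With it, only the first one does until the entry expires. This only helps for static assets; calls that change state (like claiming a punk) cannot be cached this way.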

The Punks themselves and all the server logic are still on the IC, and we will still be seeing thousands of NFT’s changing ownership in the next couple hours, so we will still get a nice stress test out of this launch

1 Like

I think that we all understand that it is extremely early in the ICP launch cycle. That is NOT the issue. It is also NOT an issue about ICPunks making the decision that is in their best interests.

The issue is: what warning did WE give the recipient of a DFINITY Foundation grant that this might not scale?

What other developer community would now need to watch their backs for website development on IC?

Technically, why did this stress test not scale on the IC, and what, when, and how will this be fixed?

Website development on IC is a key linchpin for totally unstoppable dapps (remember the Uniswap website thing?).

3 Likes

The stress test was a failure.

The ICPunks developers’ feedback (source: Discord server, announcements):
“By the way – our website works pretty good. The Internet Computer just stopped responding and is very laggy”

Lobbing comments against the ICPunks devs will be a waste of time. We should take the feedback (The Internet Computer just stopped responding and is very laggy) with all seriousness, but not throw in the towel. Let’s do the RCA and come back with what went wrong, what needs fixing, and how and when it will be fixed.

1 Like

Lol, we’re all watching this play out along with you at Dfinity. We’ve got a bunch of new data from this drop, and we have new targets to design around. I’m still pretty positive though - things slowed down quite a bit under all the traffic, but nothing actually broke

18 Likes

The “nothing actually broke” is SO incredibly important here… Many other chains have had NFT drops that brought them to their knees. This drop was “more” than the system could handle, but it didn’t die. That is very very good. Additionally, it was seemingly just the subnets that were at the highest usage that were really being brought down. Not the entire Internet Computer. Although, I clearly have less purview than the fine folks behind the curtain.

4 Likes

Thank you for this important update. Our community is eagerly waiting for the postmortem for this incident (InternetComputer Status: “Multiple subnets are experiencing degraded performance and user traffic throttling due to high traffic.”). I hope it will be a transparent analysis with a clear future plan for improvement.

3 Likes

The developer perception of IC is everything. So what if “nothing actually broke”? The users were furious. From their perspective, the system did die. They couldn’t care less about subnets or whether the IC as a whole was alive and well. They couldn’t claim the punks and that was all that mattered to them.

IC needs to be transparent about what happened and what and how (and when) it is expected to be fixed.

2 Likes

Agree. Transparency is needed right now

Hey folks,

Wow. IC Punks is a crazy wild event with incredible demand, so while I am not aware of the details (or even the high level), it would be fair to say the following:

  1. Demand and traffic were crazy high, and while there appear to be throttling and performance issues, the subnets never failed over (a technical nuance I find important)

  2. Yes, there are engineers working tirelessly right now to diagnose, understand, and triage the issues. They will update https://status.internetcomputer.org/ when things are more stable.

  3. As I know more, I will let the community know.

I admit, I have been more focused on public roadmap transparency and community-wide designs, so I am less informed on this topic than I should be given my role. I have been letting the R&D team do their thing. Sorry, I do not know more at the moment, but it’s important to be honest when one does not know.

For transparency, this is a copy/paste of what I posted here, but I wanted folks to see my stand on it: https://www.reddit.com/r/dfinity/comments/pg31ey/icpunks_postmortem/hb9xltu?utm_source=share&utm_medium=web2x&context=3

6 Likes

It seems that the status has been updated. “Subnets are now functioning normally and traffic has returned to normal levels.”… which would be correct; because the traffic has gone away.

Now that the engineers have gotten a breather, can someone update this post with what happened, and what will need to be changed (and how and when) so that this situation does not repeat?

1 Like

No, it was a full success. What failed was providing an acceptable user experience while being hit by a huge load.

As you might imagine, no matter what tech stack or architecture you use, there’s always a maximum throughput of ingress and egress data unless you start scaling your service (e.g. horizontally and what not). In this particular case, we are talking not about the IC as a whole, but about a single subnet being hit the hardest. But it survived without any scaling or interactions from our side (!) and, as Kyle pointed out, we got a lot of new interesting data, which we’ll analyze and learn from. Scaling a single canister dynamically is not a trivial task, and is not a feature the IC supports yet, but Rome was not built in a day.
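As a rough illustration of the maximum throughput point (not a model of how a subnet actually throttles), here is a small token-bucket sketch. The rates are invented numbers, and `TokenBucket` and `served_fraction` are hypothetical names for this example only.

```python
class TokenBucket:
    """Admits at most `rate_per_sec` requests per second, plus a small burst."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # simulated time of the previous request

    def allow(self, now: float) -> bool:
        # Refill according to elapsed simulated time, capped at capacity.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # request is served
        return False      # request is throttled


def served_fraction(offered_per_sec: int, capacity_per_sec: int,
                    seconds: int = 10) -> float:
    """Fraction of evenly spaced requests served by a fixed-capacity service."""
    bucket = TokenBucket(rate_per_sec=capacity_per_sec, burst=capacity_per_sec)
    total = offered_per_sec * seconds
    served = 0
    for i in range(total):
        now = i / offered_per_sec  # simulated clock, no real sleeping
        if bucket.allow(now):
            served += 1
    return served / total
```

With offered load at or below capacity, `served_fraction` comes out as 1.0. With ten times the capacity offered, roughly one request in ten gets through and the rest are throttled. The service stays up the whole time, which is the "nothing broke" part, but most users still see failures, which is the user experience part.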

6 Likes

Thanks, Christian, for the update. The community’s experience of using the IC has been far less charitable than deeming it a “full success”. If this was just expected to be an experiment to get some data, I suppose one could call it a success.

While we might have gotten some interesting data, please remember that we got this data at the expense of making 100k users, shall we say, furious. Not the most optimal or wise way of getting data.

Secondly, if it was known a priori that this was not going to scale (as your post seems to imply), why was it not communicated to ICPunks? They were tweeting at DFINITY, Dominic et al. for at least 24 hours. ICPunks, if I may remind you, is a recipient of a DFINITY grant. It seems silly to give someone a grant if one is not interested in the project that is the subject of the grant.

Thirdly, what is the path forward? What is the communication strategy to deal with this situation? How are we planning to address the developer community? While Rome was certainly not built in a day, ICP has been touted to be ready to take on the toughest challenges. It clearly is not, today. Think about it from this perspective. If the IC cannot sustain website traffic of 100k users, how will the community trust it with the BTC integration?

Instead of analyzing the data and applying the lessons in private, open up a little. Engage with the community and explain what just happened. Who knows, you might get a pointer or two on solving issues before they happen. Distributed computing is hard, but you guys have been at this for 5 years.

Oh, and talking about the lessons learnt, I am surprised that there isn’t a load generator that could be used to simulate the load. Perhaps that could be the first thing to build so that whenever the fixes do come in, they can be tested.
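For what I mean by a load generator, here is a minimal sketch. It fires concurrent calls at a target and reports failures and p95 latency. `run_load` and `hypothetical_claim` are names I made up; the target here is a local stand-in, and a real test would replace it with calls to a test deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_load(target: Callable[[int], None], total_requests: int,
             concurrency: int) -> Dict[str, float]:
    """Call `target` total_requests times using `concurrency` worker threads."""

    def one_call(i: int) -> Tuple[bool, float]:
        start = time.perf_counter()
        try:
            target(i)
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(total_requests)))

    latencies = sorted(latency for _, latency in results)
    failures = sum(1 for ok, _ in results if not ok)
    if latencies:
        p95 = latencies[max(0, int(len(latencies) * 0.95) - 1)]
    else:
        p95 = 0.0
    return {
        "requests": float(total_requests),
        "failures": float(failures),
        "p95_seconds": p95,
    }


def hypothetical_claim(i: int) -> None:
    # Stand-in for a real request; every tenth call fails to mimic throttling.
    if i % 10 == 0:
        raise RuntimeError("simulated throttled request")
    time.sleep(0.001)
```

Something like `run_load(hypothetical_claim, total_requests=100, concurrency=8)` gives a baseline that can be rerun after each fix, so improvements are measured rather than guessed.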

3 Likes

I fully agree with you on the user experience part, as I mentioned in the previous message. I cannot speak for the entire org, but I think it’s fair to assume that this drop was not supposed to run like this. I’m not even sure there was a way to predict what would happen and how high the load would be (I’m sorry if I say stupid things, I’m not involved with ICPunks). I merely tried to put things into the proper context and look at it from the half-full perspective. Your criticism is absolutely reasonable, but I personally had slightly different expectations: you might know that the IC is still in a beta phase (as also pointed out in many places with a beta label), so I personally think hiccups like this are unavoidable in the early days of a project with the complexity of the IC (to be frank, I expected many more of them).

I also think your suggestions (especially regarding being more open) are perfectly valid and I believe no one at the org would disagree, and we are moving towards this, just step by step. Experiencing it first hand, it turns out that many things that look very easy from the outside at a high level actually reveal themselves as a complex interconnection of many more issues (technical challenges, deadlines, community expectations, communications, legal, etc.), which you cannot approach in isolation while neglecting the rest.

Wrt. path forward let’s just wait for more information to come.

5 Likes

Time does not offer second chances but it does offer additional opportunities.

The math kept the network together as it was supposed to. That is a damn good day.

We need to forge this network by taking the stress up to 11. Tip of the chapeau to the punks for their creative destruction.

4 Likes

Thank you Christian. I am definitely waiting for more information about this incident. Really curious about why the CORS error happened and how IC can mitigate this in the future.

About what you said, “IC is still in a beta phase”: I just remembered that nns.ic0.app has a beta sign. I didn’t realize the Genesis launch in May was actually a beta version, and it seems the DFINITY articles didn’t emphasize that this is beta. Do you know of any ETA for the prod (official) version? What does it really mean that the current version is beta? And what does DFINITY expect the prod version to be?

2 Likes

Please see the “High User Traffic Incident Retrospective - Thursday September 2, 2021” for the root cause analysis.

It is awesomely professional! Thank you so much!

7 Likes

@tcpim I guess I know as much as you do :) I remember Dom was mentioning the beta thing in one of the official Genesis videos + here you can see that the Mercury release is marked as “Beta Mainnet + Genesis” on the release schedule. I’m not aware of any ETA’s on when we drop the beta status, but considering that we entered the GA status just a bit longer than a month ago, I think it’s safe to say IC is still in its infancy.

1 Like

The reality is a little unexpected

1 Like

I see huge market potential

1 Like