RCA for ICPunk website move away from IC 4 hrs prior to launch

Hi,

I was responsible for preparing the ICPunks site for the launch. While I think the launch was a big success (9600 different principals claimed tokens), there are things that could have been done better.

The launch was really wild; we did not expect such traffic. The number of people trying to claim ICPunks was huge: on launch day, more than 180k people tried to reach our website.

Statistics from DFINITY also help show the root cause of the whole problem. 35k+ requests per second is really high; no single server can handle that many requests. (It is of course possible to handle that much traffic, but it requires a setup of several servers, load balancers, a CDN, GeoIP routing and other techniques.) A single server, when optimized, can handle roughly 10k requests per second.

While we are on statistics, for perspective: Google currently handles ~100k searches per second. Of course, Google searches are far more complicated than a simple NFT claim; however, we are talking about blockchain technology, which ensures that once data is saved it will not be changed in an unauthorized way.

I understand that many people were frustrated about the launch: with more than 180k people trying to reach the site and 10k tokens available, the chance of actually getting one was about 1 in 18.

I have a few ideas on what can be improved on the development side to make it better for users.

  1. Right now there is no way to set up load balancers, read-only nodes or advanced caching on the IC, so big launches will cause 504 errors (most users will not be able to reach the website and will be furious about it). We will not make the same mistake again, and we have no fixed-time launches for large numbers of users planned in the near future.
  2. Offload as much static data as possible to static assets. The issues that occurred on our site (broken timer, no way to click claim) were due to the fact that the site first had to reach the IC to check the status before it could enable claiming. Expect that the IC may return no data at all, or that calls will fail at random.
  3. Be prepared for the IC not responding from time to time. We implemented retries for failed requests, but in hindsight that looks like a bad design choice: automatic retries make the problem worse. Instead, user notifications and some easing techniques should be used.
  4. When the IC returns 504 for a change (update) call, it does not mean the call has failed. During the high traffic I noticed several times that a data change returned 504 even though the change was actually made.
  5. Read and write calls are bound by the same rules, so if the IC starts returning 504, all calls may fail.
  6. Give users more notifications about failed calls. On other blockchains, once you send your transaction it is there; it may not be processed in the near future, but it will not be lost. There is no similar mechanism in the IC yet. Would it be possible to have different canister types, such as read-only and write-only, and synchronize data between them when possible? I think we could handle a much higher load of users this way, and ensure that everyone could put their request in a queue and get information about success or failure later.
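
Points 3 and 4 above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code used on the ICPunks site and not any IC library API: `GatewayTimeout`, `send_claim` and `already_claimed` are hypothetical names standing in for a 504 response, the change call and a read call. It is written in Python for brevity; the same logic would apply in frontend code. The idea is to wait longer between attempts (capped exponential backoff with random jitter) and, after a 504 on the change call, to read the state before sending the change again.

```python
import random
import time


class GatewayTimeout(Exception):
    """Hypothetical stand-in for a 504 response; not a real library error."""


def backoff_delay(retry, base=0.5, cap=30.0, rng=random.random):
    """Capped exponential backoff with full jitter, in seconds."""
    return rng() * min(cap, base * 2 ** retry)


def claim_with_verification(send_claim, already_claimed, max_attempts=5,
                            sleep=time.sleep, rng=random.random):
    """Return True once the claim is confirmed, False if still unconfirmed.

    send_claim and already_claimed are hypothetical callables wrapping the
    change call and a read call; both may raise GatewayTimeout.
    """
    outcome_unknown = False
    for attempt in range(max_attempts):
        if attempt > 0:
            # Ease off: wait longer (with jitter) before each further attempt.
            sleep(backoff_delay(attempt - 1, rng=rng))
        try:
            # After a 504 on the change call, read the state first:
            # the change may have been applied despite the error.
            if outcome_unknown and already_claimed():
                return True
            outcome_unknown = False  # confirmed not applied, or first try
            send_claim()
            return True
        except GatewayTimeout:
            # Raised by either call; an earlier send's outcome stays unknown.
            outcome_unknown = True
    return False
```

When the function returns False the outcome is still unknown, so the site would show the user a notification (point 6) rather than keep retrying silently. The base delay, cap and attempt count here are placeholder values, not tuned recommendations.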

The fact that the IC did not mangle the contract data, and actually enabled claiming 10k tokens even under such high load, was a great success from a technology point of view.

I strongly believe that, given time, we can together develop good practices and implementation patterns that will improve the user experience for high-load canisters.


Reality is the tail risk (we live in the pattern of folly and bubbles) or, to say it another way … chaos from equilibrium gives models the long finger (hello flash crash) …

But here … the model held, the math prevailed, the levee did not break. Damn impressive.


Hello,

It’s been a long time since I last did math, and I have no PhD or anything like that, but you say the chance was 1 in 18? 1 in 18? Let’s say it is 1 in 18, but for one event only.
I thought you clearly stated that there were several stages in the process to finally get one punk.
At least I can point out 4 stages: reach the website → log in to 1 of 2 wallets (for Plug it is easy; for Identity Anchor it’s another stage, with a new tab to log in) → the claim button appears → claim your punk successfully.
So you can fail at any stage and have to start all over again. Let’s say it’s 1 in 18 for each stage, or maybe the number of people narrows through the process, whatever. So is it still 1 in 18 to finally be able to claim one punk?

Thanks.


From my personal experience: I was spam-reloading about 10 tabs in Chrome, and the furthest I got was the Claim button appearing when there were about 1k-2k punks. In the end I could not claim one. Bad luck, then.
