Internet Identity Incident Retrospective - January 31, 2022

Last Monday, Internet Identity malfunctioned for about half an hour after an upgrade. The cause was a Content-Security-Policy issue that prevented browsers from making calls to the IC. The team rolled back the upgrade after 34 minutes, and the fix was ready almost immediately. The team spent most of the week testing the code to make sure this doesn’t happen again. For more details, read on!

Impact

Internet Identity was malfunctioning for 34 minutes on January 31st, 2022, from 3:01pm to 3:35pm (UTC). Creating anchors and authenticating through Internet Identity was not possible during that time.

Timeline on January 31st (UTC)

  • 14:35: upgrade proposal 42415 is submitted
  • 15:01: upgrade proposal 42415 is accepted and executed; from this point on, Internet Identity is unusable
  • 15:11: Internet Identity team is made aware of the issue
  • 15:20: rollback proposal 42421 is submitted
  • 15:35: rollback proposal 42421 is accepted and executed; from here on, Internet Identity is usable again

What happened?

The main issue was that our Content Security Policy (CSP) only allowed requests to https://identity.ic0.app, while the frontend code tried to make requests to https://ic0.app. This happened because the upgrade included an update of the agent-js library which, for efficiency reasons, started making requests to https://ic0.app instead of directly to the host serving the assets (in our case https://identity.ic0.app, generally https://<canister-id>.ic0.app).
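To make the mismatch concrete: the connect-src directive of a CSP is what governs where fetch/XMLHttpRequest calls may go. A policy of roughly the following shape blocks the new requests (this is a simplified sketch, not the exact header Internet Identity serves, which contains more directives):

    Content-Security-Policy: connect-src 'self' https://identity.ic0.app

Also listing the bare domain would have allowed the calls agent-js now makes:

    Content-Security-Policy: connect-src 'self' https://identity.ic0.app https://ic0.app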

The team unfortunately didn’t make the connection with the CSP while reading the release notes of agent-js v0.10.2. Mistakes like this happen and are usually caught by test suites or manual test playbooks; unfortunately, in this case the particular way the URL rewrite (i.e. https://<canister-id>.ic0.app -> https://ic0.app) was implemented meant that it would only trigger on mainnet, i.e. with a URL of the form *.ic0.app. The tests are as production-like as possible, but one difference is that the subnets we use internally for testing do not live on ic0.app, meaning the rewrite didn’t trigger.
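To illustrate why the rewrite stayed invisible in testing, its behaviour is roughly the following (a simplified TypeScript sketch; the actual agent-js implementation differs in detail, and the function name is made up):

    // Simplified, hypothetical sketch of the host-based rewrite described
    // above – not the actual agent-js code.
    function apiOrigin(location: { hostname: string; origin: string }): string {
      // Only hosts of the form <canister-id>.ic0.app are rewritten, so the
      // behaviour only shows up on mainnet.
      if (location.hostname.endsWith(".ic0.app")) {
        return "https://ic0.app";
      }
      // On any other host (e.g. our internal test subnets) requests keep
      // going to the origin that served the assets, so the CSP mismatch
      // never surfaced in our tests.
      return location.origin;
    }

    // On https://identity.ic0.app the agent would now call https://ic0.app,
    // which the CSP did not allow.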

The outage lasted longer than we would have liked, for two main reasons.

  1. The first one is that it took the team 10 minutes to notice the issue.
  2. The team could not roll back the change as quickly as it would have liked because most people vote through the NNS dapp, and in order to vote, one needs to authenticate through Internet Identity… However, the rollback proposal could still be voted on via the dfx CLI, which took a bit longer.

What now?

The two main takeaways here are that we need even more robust and even more production-like testing, and that the team needs to be aware of outages as soon as they start.

Production-like testing

Since the outage, the team has introduced two major changes to improve our testing capabilities.

  1. First, we now run our CI tests using the same hostname as in production, meaning the frontend code thinks it is being served on ic0.app. We made sure we could reproduce the issue on CI before pushing the fix.

  2. Second, we now read the canister ID from the canister itself – as opposed to baking it into the frontend – meaning the team can test the official Wasm canister code in a different canister on mainnet before pushing it to rdmx6-jaaaa-aaaaa-aaadq-cai (which you know as identity.ic0.app). A rough sketch of the idea follows this list.
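As a rough illustration of the second point, the idea is to determine the canister ID at runtime instead of compiling a constant into the frontend bundle. The sketch below is hypothetical (the real frontend may obtain the ID differently, e.g. by asking the canister itself); it simply extracts the ID from the host serving the assets:

    // Hypothetical sketch – the actual Internet Identity frontend may obtain
    // the canister ID differently. The point is that the same bundle then
    // works both for a test canister and for rdmx6-jaaaa-aaaaa-aaadq-cai.
    function canisterIdFromHost(hostname: string): string | undefined {
      // Mainnet canister hosts look like <canister-id>.ic0.app.
      const match = hostname.match(/^([a-z0-9-]+)\.ic0\.app$/);
      return match ? match[1] : undefined;
    }

    // Example: canisterIdFromHost("rdmx6-jaaaa-aaaaa-aaadq-cai.ic0.app")
    // returns "rdmx6-jaaaa-aaaaa-aaadq-cai".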

Over the next few weeks, the team will experiment with local networks (think: docker compose) to make sure we emulate the IC as best we can.

Quicker handling of outages

As mentioned above, it took the team ten minutes to notice there was a production issue, which should never happen again. For that reason, the standard rollout procedure has been updated to make sure that new canister code is tested extensively as soon as it hits production, so that the team can start rolling back immediately if necessary. The team will also dedicate time in the coming weeks to setting up automated production health checks and alerts – these already exist for “the IC” as a whole but not yet for every individual service like Internet Identity.
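To give a flavour of what such a check could look like, the hypothetical probe below simply verifies that the production frontend responds at all; the monitoring we actually set up will be more thorough (auth flows, latency, alerting integration):

    // Hypothetical minimal health probe – not the actual monitoring setup.
    // Requires a runtime with a global fetch (e.g. Node 18+ or a browser).
    async function checkInternetIdentity(): Promise<void> {
      const response = await fetch("https://identity.ic0.app/");
      if (!response.ok) {
        // A real setup would page the on-call engineer instead of logging.
        console.error(`Internet Identity returned HTTP ${response.status}`);
      }
    }

    // Run the probe once a minute.
    setInterval(() => {
      checkInternetIdentity().catch((err) => console.error("Probe failed:", err));
    }, 60_000);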

Conclusion

On behalf of the Internet Identity team, please accept our apologies! We know that Internet Identity is a service used by many, many of you – and by us – and that it is core to lots of people’s workflows. We’re working hard to make everything as smooth as possible.

Let us know if you have any questions, feedback, or thoughts on this incident; we’ll be around to answer!


Thank you for the detailed post; it helps us understand how DFINITY teams work internally.

I have one question though: do you do traffic switching when you deploy new code to production? For example, do you deploy first to 1% of users, observe performance, and then gradually increase to 100%, or is this generally not applicable on the IC?


Thank you @nmattia
Have you thought about deploying a test-internet-identity canister with another principal to mainnet? This could make it easier to test in production.


Great question!

We could definitely do that, at least for the frontend code; we’ll look into this in the next few weeks. Unfortunately we don’t really have a way yet of “observing performance”, since we don’t even have “access to the machine” in the typical sense.


That’s a great idea! And I must be terrible at expressing myself, because that’s exactly what we did, if I understand you correctly :slight_smile:

  “Second, we now read the canister ID from the canister itself – as opposed to baking it into the frontend – meaning the team can test the official Wasm canister code in a different canister on mainnet before pushing it to rdmx6-jaaaa-aaaaa-aaadq-cai (which you know as identity.ic0.app).”

Oh, sorry, it’s my bad (English, lol).
Well, it’s good that our thoughts coincided, thank you! :slight_smile:


I believe it’s for security reasons. As far as I understand, the insecure /api pass-through in the service worker is a (hopefully mostly theoretical) attack vector, and will be phased out.

If these tests are to work end-to-end, including logging in, we need support for canister signatures from delegated (non-root) subnets. Is that finally in the pipeline then (or maybe already present and I didn’t notice)? Or will the tests happen on the root subnet?


Yup, to clarify, we only ran limited testing on the mainnet canister clone, namely loading of assets and the basic auth flow (key creation), which was enough to make sure this particular CSP issue was gone!


As far as I can tell, the boundary node performs the redirect anyway – for security reasons – so I’m assuming the change in agent-js was made for performance reasons, to avoid that redirect.

My understanding is that the boundary node supports /api on certified domains because, well, it did originally and we didn’t see the problem with that in time. We want to eventually remove the redirect on the boundary node for security reasons. This would break existing agents, so agent-js was updated to use https://ic0.app instead of the certified domain; once that is deployed everywhere, the boundary node can stop doing the dangerous redirect. But I might be missing something; this is all based on vague recollection.
