Last Monday, Internet Identity was malfunctioning for about half an hour after an upgrade. The cause was a Content-Security-Policy issue that prevented browsers from making calls to the IC. The team rolled back the upgrade after 34 minutes and the fix was ready almost immediately. The team spent most of the week testing the code to make sure this doesn’t happen again. For more details, read on!
Impact
Internet Identity was malfunctioning for 34 minutes on January 31st 2022, from 3:01pm until 3:35pm (UTC). Creating anchors and authenticating through Internet Identity was not possible during that time.
Timeline on January 31st (UTC)
- 14:35: upgrade proposal 42415 is submitted
- 15:01: upgrade proposal 42415 is accepted and executed; from this point on, Internet Identity is unusable
- 15:11: Internet Identity team is made aware of the issue
- 15:20: rollback proposal 42421 is submitted
- 15:35: rollback proposal 42421 is accepted and executed; from here on, Internet Identity is usable again
What happened?
The main issue here was that our CSP policy only allowed requests to https://identity.ic0.app while the frontend code tried to make requests to https://ic0.app. This happened because the upgrade included an update of the agent-js library which, for efficiency reasons, started making requests to https://ic0.app instead of directly to the host serving the assets (in our case https://identity.ic0.app, generally https://<canister-id>.ic0.app).
The team unfortunately didn’t make the connection with the CSP policy while reading the release notes of agent-js v0.10.2. Mistakes like this happen and are usually caught by the test suites or manual test playbooks; unfortunately, in this case the particular way the URL rewrite (i.e. https://<canister-id>.ic0.app -> https://ic0.app) was implemented meant that it would only trigger on mainnet, i.e. with a URL of the form *.ic0.app. The tests are as production-like as possible, but one difference is that the subnets we use internally for testing do not live on ic0.app, meaning the rewrite didn’t trigger.
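To make the failure mode concrete, here is a minimal sketch of how a host rewrite that only applies to *.ic0.app can pass on a test network and still break in production. This is not the actual agent-js code: the function names, the simplified exact-origin CSP check, and the test hostname identity.testnet.example are all assumptions made for illustration.

```typescript
// Hypothetical sketch, NOT the real agent-js implementation.
// Rewrites https://<canister-id>.ic0.app to https://ic0.app and
// leaves every other host untouched.
function resolveApiHost(pageOrigin: string): string {
  const url = new URL(pageOrigin);
  if (/\.ic0\.app$/.test(url.hostname)) {
    return url.protocol + "//ic0.app";
  }
  return url.origin;
}

// Simplified stand-in for a CSP connect-src check: exact origin
// matches only (no wildcards, no 'self' keyword).
function allowedByConnectSrc(target: string, connectSrc: string[]): boolean {
  const targetOrigin = new URL(target).origin;
  return connectSrc.some((source) => new URL(source).origin === targetOrigin);
}

// Production: the page is served from identity.ic0.app, so the rewrite fires.
const prodHost = resolveApiHost("https://identity.ic0.app");
const prodAllowed = allowedByConnectSrc(prodHost, ["https://identity.ic0.app"]);

// Test network (assumed hostname): the rewrite does not fire, so the
// request goes to the host that the policy already allows.
const testHost = resolveApiHost("https://identity.testnet.example");
const testAllowed = allowedByConnectSrc(testHost, ["https://identity.testnet.example"]);

console.log({ prodHost, prodAllowed, testHost, testAllowed });
```

Under these assumptions the production request is rewritten to https://ic0.app and rejected by the policy, while the test request keeps its original host and is allowed, which is why the test suites did not catch the problem.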
The outage lasted longer than we would have liked, for two main reasons.
- The first one is that it took the team 10 minutes to notice the issue.
- The second is that the team could not roll back the change as quickly as it would have liked, because most people vote through the NNS dapp and, in order to vote, one needs to authenticate through Internet Identity. The rollback proposal could still be voted on via the dfx CLI, but that took a bit longer.
What now?
The two main takeaways here are that we need even more robust and ever more production-like testing, and that the team needs to be aware of outages as soon as they start.
Production-like testing
Since the outage, the team introduced two major changes impacting our testing capabilities.
- First, we now run our CI tests using the same hostname as in production, meaning the frontend code thinks it is being served on ic0.app. We made sure we could reproduce the issue on CI before pushing the fix.
- Second, we now read the canister ID from the canister itself – as opposed to baking it in the frontend – meaning the team can test the official Wasm canister code in a different canister on mainnet before pushing it to rdmx6-jaaaa-aaaaa-aaadq-cai (which you know as identity.ic0.app).
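As an illustration of the second change, the sketch below shows one way a frontend could pick up the canister ID at runtime instead of having it baked in at build time. The mechanism is an assumption: the post does not say how the ID is exposed, so the data-canister-id attribute, the markup, and the test-canister-id placeholder are hypothetical.

```typescript
// Hypothetical sketch: the attribute name and markup are assumptions,
// not the actual Internet Identity implementation.
function readCanisterId(html: string): string | undefined {
  const match = /data-canister-id="([a-z0-9-]+)"/.exec(html);
  return match ? match[1] : undefined;
}

// Each canister would inject its own ID into the page it serves, so the
// same Wasm and frontend bundle work unchanged in any canister.
const servedByProd = '<body data-canister-id="rdmx6-jaaaa-aaaaa-aaadq-cai">';
const servedByTest = '<body data-canister-id="test-canister-id">'; // placeholder ID

const prodId = readCanisterId(servedByProd);
const testId = readCanisterId(servedByTest);

console.log({ prodId, testId });
```

The point of the design is that nothing in the bundle depends on where it is deployed, so the exact artifact that was tested in a separate mainnet canister can then be proposed for the production one.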
Over the next few weeks, the team will experiment with local networks (think: docker compose) to make sure we emulate the IC as best as we can.
Quicker handling of outages
As mentioned above, it took the team ten minutes to notice there was a production issue, which should never happen again. For that reason, the standard rollout procedure has been updated to make sure that new canister code is tested extensively as soon as it hits production, so that the team can start rolling back immediately if necessary. The team will also dedicate time in the coming weeks to setting up automated production health checks and alerts – which already exist for “the IC” but not yet for every individual service like Internet Identity.
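To give an idea of what such a health check could look like, here is a minimal sketch of the alerting logic only. It is not the team's actual tooling: the HealthMonitor class, the threshold of three consecutive failures, and the shape of the probe results are assumptions. The probe itself (for example, a periodic request against the service) is deliberately left out.

```typescript
// Hypothetical sketch: tracks probe results and decides when to alert.
class HealthMonitor {
  private consecutiveFailures = 0;
  private readonly failureThreshold: number;

  constructor(failureThreshold: number) {
    this.failureThreshold = failureThreshold;
  }

  // Records one probe result; returns true when an alert should fire.
  record(probeSucceeded: boolean): boolean {
    if (probeSucceeded) {
      this.consecutiveFailures = 0;
      return false;
    }
    this.consecutiveFailures += 1;
    // Alert exactly once, when the threshold is first reached.
    return this.consecutiveFailures === this.failureThreshold;
  }
}

// Example run: one healthy probe, four failures, then recovery.
const monitor = new HealthMonitor(3);
const alerts = [true, false, false, false, false, true].map((ok) => monitor.record(ok));

console.log(alerts);
```

With probes running every minute and a threshold of three, an outage like this one would raise an alert after roughly three minutes rather than ten; the right interval and threshold are a trade-off between detection speed and false alarms.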
Conclusion
On behalf of the Internet Identity team, please accept our apologies! We know that Internet Identity is a service used by many, many of you – and us – and that it is core to lots of people’s workflow. We’re working hard on making everything as smooth as possible.
Let us know if you have any questions, feedback, or thoughts on this incident; we’ll be around to answer!