Over the weekend of the 30th / 31st January, we spun up an additional set of infrastructure to conduct final testing of our February release (above and beyond our typical test, staging and production environments) ahead of its deployment to production. Whilst the configuration of that infrastructure is almost identical to our staging and production environments, a small change had been made to the build process: using systemd to supervise our application processes, in place of our existing process monitoring tool, bluepill.
To account for that change, a minor parameter in the web server configuration was altered in the release environment so that the web servers would serve requests correctly under the new supervision setup. The change worked as intended: we completed our pre-deployment testing, approved the release for deployment to production, and deployed it to production at 10:38pm on Sunday 31st January.
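To give a sense of the kind of difference involved (as a purely illustrative sketch, not our actual configuration): a common point of divergence when moving a web server from a process monitor such as bluepill to systemd is whether the process daemonizes itself, since systemd typically supervises a foreground process directly, while bluepill often tracks a self-daemonizing process via a PID file. A hypothetical systemd unit for such a web server might look like this:

```ini
# Hypothetical unit file for illustration only; the real web server,
# its paths and the specific parameter that was changed are not named here.
[Unit]
Description=Example web application server
After=network.target

[Service]
# Under systemd the process stays in the foreground and is supervised
# directly; under bluepill the equivalent setup would typically
# daemonize and write a PID file instead.
Type=simple
ExecStart=/srv/app/bin/web-server --config /srv/app/config/server.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```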
Our post-deployment testing completed with no issues identified, and application performance through the remainder of Sunday evening and most of Monday 1st February remained in line with normal levels.
At around 4:00pm UTC (one of our standard peak-load periods, given our UK / US focus), traffic and the associated load began to increase rapidly. At 4:08pm UTC, one of our web server nodes performed an automated restart of its web server process, transferring its load to the other nodes automatically. The web server process on that node then failed to restart: it reported a configuration error and was unable to begin serving requests again. As load continued to increase, the remaining nodes each took on a progressively larger share of traffic and went through the same restart and subsequent failure, resulting in a system-wide outage by 4:09pm UTC.
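For context on why a sustained increase in load can trigger an automated restart: bluepill, which supervises the web server processes in production, is commonly configured with resource checks that restart a supervised process once it breaches defined thresholds, and a restart like the one described above is consistent with that kind of check. The following is a hypothetical sketch of that pattern only; the application name, commands, paths and thresholds are assumptions, not our actual bluepill configuration:

```ruby
# Hypothetical bluepill configuration, for illustration only; the
# application name, commands, paths and thresholds are assumptions.
Bluepill.application("example_app") do |app|
  app.process("web") do |process|
    # With no daemonize option set, bluepill expects the command to
    # daemonize itself and write the PID file it watches.
    process.start_command = "/srv/app/bin/web-server --daemonize --pid /srv/app/tmp/web.pid"
    process.pid_file      = "/srv/app/tmp/web.pid"

    # Restart the web server process if it exceeds memory or CPU
    # thresholds for several consecutive checks, e.g. under heavy load.
    process.checks :mem_usage, :every => 10.seconds, :below => 512.megabytes, :times => [3, 5]
    process.checks :cpu_usage, :every => 10.seconds, :below => 85, :times => 3
  end
end
```

When a check of this kind fails repeatedly, bluepill stops and restarts the process it supervises, which matches the automated restart behaviour described above.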
Our team began investigating immediately, and an initial review and diagnosis made it clear that the issue arose from a disparity between the server configurations of the release environment and the production environment: in short, the release had been deployed into production without changing its configuration to reflect the fact that production still uses bluepill rather than systemd. As a result, when the web server processes tried to restart (for the first time since the deployment) under this significant load, the configuration they applied was invalid for that environment, and the processes were unable to begin serving requests again.
Once this was tested and confirmed to be the cause, the team put a fix in place, reverting the configuration to the correct values, and rolled it out across each of the web server nodes. In all, it took just over 30 minutes to identify and confirm the underlying issue, and just over 15 minutes to apply the fix across the web server estate, resulting in a total downtime of 47 minutes (4:09pm to 4:56pm UTC).
Our team are currently conducting a thorough post-mortem of the incident. We will learn from this (preventable) downtime and will implement a series of updates to our change management, testing, deployment and infrastructure processes; we may provide further updates on these in due course.
In the interim, we wanted to communicate as openly as possible about how and why the incident occurred, and to lay out a clear set of steps we are taking to address this specific issue: