Over the weekend of the 30th / 31st January, we spun up an additional set of infrastructure to conduct final testing of our February release (above and beyond our typical test, staging and production environments) ahead of its deployment to production. Whilst the configuration of that infrastructure is almost identical to our staging and production environments, a small change had been made to the build process: using systemd to supervise our application processes, in place of our existing process monitoring tool, bluepill.
To account for that change, a minor parameter in the web server configuration was altered in the release environment so that the web servers would serve requests correctly under the new supervision setup. The change worked as intended: we completed our pre-deployment testing, approved the release for deployment to production, and deployed it to production at 10:38pm on Sunday 31st January.
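To give a sense of the kind of difference involved (as a purely illustrative sketch, not our actual configuration): a common point of divergence when moving a web server from a process monitor such as bluepill to systemd is whether the process daemonizes itself, since systemd typically supervises a foreground process directly, while bluepill often tracks a self-daemonizing process via a PID file. A hypothetical systemd unit for such a web server might look like this:

```ini
# Hypothetical unit file for illustration only; the real web server,
# its paths and the specific parameter that was changed are not named here.
[Unit]
Description=Example web application server
After=network.target

[Service]
# Under systemd the process stays in the foreground and is supervised
# directly; under bluepill the equivalent setup would typically
# daemonize and write a PID file instead.
Type=simple
ExecStart=/srv/app/bin/web-server --config /srv/app/config/server.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```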
Our post-deployment testing completed with no issues identified, and application performance through the remainder of Sunday evening and most of Monday 1st February remained in line with normal levels.
At around 4:00pm UTC (one of our standard peak-load periods, given our UK / US focus), traffic and the associated load began to increase rapidly. At 4:08pm UTC, one of our web server nodes performed an automated restart of its web server process, transferring its load to the other nodes automatically. The web server process on that node then failed to restart: it reported a configuration error and was unable to begin serving requests again. As load continued to increase, the remaining nodes each took on a progressively larger share of traffic and went through the same restart and subsequent failure, resulting in a system-wide outage by 4:09pm UTC.
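For context on why a sustained increase in load can trigger an automated restart: bluepill, which supervises the web server processes in production, is commonly configured with resource checks that restart a supervised process once it breaches defined thresholds, and a restart like the one described above is consistent with that kind of check. The following is a hypothetical sketch of that pattern only; the application name, commands, paths and thresholds are assumptions, not our actual bluepill configuration:

```ruby
# Hypothetical bluepill configuration, for illustration only; the
# application name, commands, paths and thresholds are assumptions.
Bluepill.application("example_app") do |app|
  app.process("web") do |process|
    # With no daemonize option set, bluepill expects the command to
    # daemonize itself and write the PID file it watches.
    process.start_command = "/srv/app/bin/web-server --daemonize --pid /srv/app/tmp/web.pid"
    process.pid_file      = "/srv/app/tmp/web.pid"

    # Restart the web server process if it exceeds memory or CPU
    # thresholds for several consecutive checks, e.g. under heavy load.
    process.checks :mem_usage, :every => 10.seconds, :below => 512.megabytes, :times => [3, 5]
    process.checks :cpu_usage, :every => 10.seconds, :below => 85, :times => 3
  end
end
```

When a check of this kind fails repeatedly, bluepill stops and restarts the process it supervises, which matches the automated restart behaviour described above.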
Our team began investigating immediately, and an initial review and diagnosis made it clear that the issue arose from a disparity between the server configurations of the release environment and the production environment: in short, the release had been deployed into production without changing its configuration to reflect the fact that production still uses bluepill rather than systemd. As a result, when the web server processes tried to restart (for the first time since the deployment) under this significant load, the configuration they applied was invalid for that environment, and the processes were unable to begin serving requests again.
Once this was tested and confirmed to be the cause, the team put a fix in place, reverting the configuration to the correct values, and rolled it out across each of the web server nodes. In all, it took just over 30 minutes to identify and confirm the underlying issue, and just over 15 minutes to apply the fix across the web server estate, resulting in a total downtime of 47 minutes (4:09pm to 4:56pm UTC).
Our team are currently conducting a thorough post-mortem of the incident. We will learn from this (preventable) downtime and will implement a series of updates to our change management, testing, deployment and infrastructure processes; we may provide further updates on these in due course.
In the interim, we wanted to communicate as openly as possible about how and why the incident occurred, and to lay out a clear set of steps we are taking to address this specific issue: