Known 503 error

Incident Report for Pinpoint

Postmortem

Date: 1 May 2025

Duration: 50 minutes (15:11 – 16:01)

What Happened

Pinpoint experienced a complete outage of the admin/API service and a partial outage of the careers site from 15:11 to 16:01 BST on 1 May. During this 50-minute window, users could not access the platform and saw error messages when attempting to log in or use any part of the service.

Timeline (BST)

15:11 – Monitoring alerts triggered for a service interruption.
15:13 – Issue traced to database connection failures.
15:20 – Contacted managed database provider; issue confirmed as infrastructure-related.
15:45 – Provider identified a disk controller failure and began working on remediation; we started a cluster migration ourselves in case of further provider issues.
15:50 – We escalated the issue with the provider because their automated failover had not activated.
16:00 – Provider migrated the database to new hardware; connectivity restored.
16:01 – Full service restoration confirmed across all components.

Why It Happened

The root cause was a hardware failure on the part of our managed database provider. Specifically, a critical disk controller in their storage infrastructure failed, resulting in incorrect disk space reporting at the database level. This led our primary database instance to believe it had run out of storage space, and it began rejecting all connections, including those from our application servers.
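To illustrate how this presented at the application layer, here is a minimal sketch of the connection handling involved. It is illustrative only, not Pinpoint's production code, and it assumes a PostgreSQL primary reached through psycopg2; the hostname, DSN, and helper names are hypothetical. Once the primary rejects every connection, each attempt fails with an operational error, and the sensible response is to fail fast and alert rather than keep retrying.

```python
# Illustrative sketch only -- not Pinpoint's production code.
# Assumes a PostgreSQL primary reached via psycopg2; hostnames are hypothetical.
import time

import psycopg2

DSN = "host=db-primary.example.internal dbname=app user=app"  # hypothetical

def get_connection(retries: int = 3, backoff_s: float = 0.5):
    """Open a connection to the primary, backing off between attempts.

    During the incident the primary rejected every connection (believing its
    disk was full), so all attempts raise OperationalError and the caller
    should alert and fail fast rather than queue further retries.
    """
    for attempt in range(1, retries + 1):
        try:
            return psycopg2.connect(DSN, connect_timeout=5)
        except psycopg2.OperationalError as exc:
            if attempt == retries:
                # Surface the outage to monitoring instead of retrying forever.
                raise RuntimeError(f"database unavailable after {retries} attempts") from exc
            time.sleep(backoff_s * attempt)
```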

Although our agreement with the provider includes premium-grade redundancy and automated failover mechanisms, these safeguards failed to activate due to an undetected fault in their failover detection logic. As a result, traffic was not redirected to the standby replica, and manual intervention from their engineering team was required to restore normal operations. Unfortunately, this intervention took longer than acceptable, contributing to the total duration of the outage.
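The provider has not shared the internals of their failover system with us, so the sketch below is only a generic illustration of the class of detection logic involved, with hypothetical names throughout: a watcher probes the primary and promotes the standby after a run of consecutive failed probes. A fault anywhere in that path (a probe that is too shallow, a counter that resets incorrectly, a promotion hook that never fires) would leave the standby idle and traffic pointed at the failed primary, which matches what we observed.

```python
# Generic illustration of failover detection logic; the provider's actual
# system is proprietary and its internals were not shared with us.
import time

import psycopg2

PRIMARY_DSN = "host=db-primary.example.internal dbname=postgres"  # hypothetical
FAILURE_THRESHOLD = 3   # consecutive failed probes before promoting the standby
PROBE_INTERVAL_S = 10

def primary_is_healthy() -> bool:
    """Probe the primary with a real query rather than a bare TCP check."""
    try:
        conn = psycopg2.connect(PRIMARY_DSN, connect_timeout=5)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
        finally:
            conn.close()
    except psycopg2.Error:
        return False

def watch_and_failover(promote_standby) -> None:
    """Promote the standby after FAILURE_THRESHOLD consecutive failed probes.

    If this loop never reaches promote_standby(), traffic stays pointed at
    the failed primary -- the failure mode described above.
    """
    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            promote_standby()
            return
        time.sleep(PROBE_INTERVAL_S)
```

The probe runs a real query rather than only checking that the port is open, because a primary that accepts TCP connections but rejects sessions at the database level would otherwise look healthy.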

What We’re Doing About It

  • Hardware Migration Completed: Our provider has successfully migrated our primary database to a new, healthy hardware environment. Service has been fully restored and is operating normally.
  • Evaluating New Providers: We chose our current provider for their strong reputation and reliability, but in light of this incident, we’re reassessing our options to ensure our infrastructure meets the highest standards of resilience and support.
  • Working with Current Provider: We’re working with our current provider’s engineering and account teams to understand why their failover system did not function as promised, and to ensure such a failure cannot recur before any migration to an alternative provider is complete.

We sincerely apologise for the disruption and appreciate your patience as we work to strengthen our platform's stability.

Posted May 02, 2025 - 15:22 UTC

Resolved

This has been resolved. A full postmortem will follow soon.
Posted May 01, 2025 - 16:19 UTC

Monitoring

A fix has been rolled out. All services are currently operational and we are continuing to monitor.
Posted May 01, 2025 - 15:21 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted May 01, 2025 - 15:04 UTC

Update

We have identified an issue with our primary database cluster and are working with our infrastructure provider to determine the root cause.
Posted May 01, 2025 - 14:45 UTC

Investigating

We are currently investigating the issue.
Posted May 01, 2025 - 14:31 UTC
This incident affected: Pinpoint.