Date: 1 May 2025
Duration: 50 minutes (15:11 – 16:01)
Pinpoint experienced a complete admin/API service outage and a partial careers site outage from 15:11 to 16:01 BST on 1 May 2025. During this 50-minute window, users could not access the platform and saw error messages when attempting to log in or use any part of the service.
15:11 – Monitoring alerts triggered for a service interruption.
15:13 – Immediate cause identified as database connection failures.
15:20 – Contacted managed database provider; issue confirmed as infrastructure-related.
15:45 – Provider identified a disk controller failure and began working on remediation; in parallel, we started our own cluster migration as a contingency against further provider issues.
15:50 – We escalated the issue with the provider because automated failover had not activated.
16:00 – Provider migrated the database to new hardware; connectivity restored.
16:01 – Full service restoration confirmed across all components.
The root cause was a hardware failure at our managed database provider. Specifically, a critical disk controller in their storage infrastructure failed, which caused disk space to be reported incorrectly at the database level. Our primary database instance therefore behaved as though it had run out of storage space and began rejecting all connections, including those from our application servers.
Although our agreement with the provider includes premium-grade redundancy and automated failover mechanisms, these safeguards failed to activate due to an undetected fault in their failover detection logic. As a result, traffic was not redirected to the standby replica, and manual intervention from their engineering team was required to restore normal operations. Unfortunately, this intervention took longer than acceptable, contributing to the total duration of the outage.
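For illustration only, the kind of check that failover detection relies on can be sketched as follows. This is a minimal sketch under stated assumptions, not the provider's actual failover logic: the class name FailoverDetector, the probe callable, and the threshold of three consecutive failures are all hypothetical. The idea shown is that a primary which rejects connections should count as unhealthy regardless of the reason, so that traffic can be redirected to the standby replica.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverDetector:
    """Signals failover after a run of consecutive failed probes (hypothetical design)."""

    # Hypothetical probe: returns True if the primary accepts a connection.
    probe: Callable[[], bool]
    # Assumed threshold; not a value taken from the incident.
    failure_threshold: int = 3
    consecutive_failures: int = 0

    def check(self) -> bool:
        """Run one probe and return True when failover should be triggered."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (for example, a rejected connection)
            # is counted as a failure rather than crashing the detector.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def rejecting_primary() -> bool:
    # Simulates a primary that refuses every connection.
    raise ConnectionError("connection rejected")


detector = FailoverDetector(probe=rejecting_primary)
results = [detector.check() for _ in range(3)]
print(results)  # [False, False, True]
```

Requiring several consecutive failures before acting is a common way to avoid failing over on a single transient error; the trade-off is a slightly longer detection time.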
We sincerely apologise for the disruption and appreciate your patience as we work to strengthen our platform's stability.