How We Stabilized a £2M/Day E-Commerce Checkout That Was Losing 8% of Orders
The Situation
A UK e-commerce company with £2M in daily transaction volume was silently losing 8% of checkout completions. The issue had existed for 11 months. Three investigations had found nothing. Customer complaints were attributed to "user error."
The Problem
The checkout had been patched by five contractors over three years. The payment processor integration used an unofficial workaround that silently failed 8% of the time — no errors logged, no alerts fired, success responses returned while payments failed asynchronously.
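The failure mode described here, where the checkout records success while the payment later fails asynchronously, is the kind of gap a reconciliation check can surface. Below is a minimal sketch in Python. It assumes a hypothetical order record and a hypothetical list of processor-confirmed payment IDs; neither the data model nor the names come from the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    # Hypothetical order record; field names are illustrative only.
    order_id: str
    payment_id: str
    marked_paid: bool  # what the checkout recorded after a "success" response


def find_silent_failures(orders, confirmed_payment_ids):
    """Return orders the checkout marked as paid that the processor never confirmed."""
    confirmed = set(confirmed_payment_ids)
    return [o for o in orders if o.marked_paid and o.payment_id not in confirmed]


def silent_failure_rate(orders, confirmed_payment_ids):
    """Fraction of marked-paid orders with no confirmed payment (0.0 if none were marked paid)."""
    marked = [o for o in orders if o.marked_paid]
    if not marked:
        return 0.0
    return len(find_silent_failures(marked, confirmed_payment_ids)) / len(marked)
```

Run on a schedule and exported as a metric, a rate like this would make the discrepancy visible even when no error is ever logged.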
What We Did
Started with humans, not code. Interviews revealed that the workaround was an emergency fix from Black Friday 2023 that had been left in place permanently.
Week 1: Added observability only, with no code changes. The async failure pattern appeared within 48 hours.
Weeks 2–3: Replaced the workaround with the processor's documented webhook approach. The change was small, tested, and deployed on a Wednesday morning with a rollback ready.
Week 4: Wrote comprehensive integration tests and architecture documentation.
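To illustrate the webhook approach in general terms, here is a rough sketch of a handler that verifies a signature, ignores duplicate deliveries, and records the final payment outcome. The processor is not named in this write-up, so everything specific is an assumption: the HMAC-SHA256 hex signature scheme, the event fields (`id`, `type`, `order_id`), and the event type names are all hypothetical, not the processor's real API.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    # Assumed scheme: hex-encoded HMAC-SHA256 over the raw request body.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class PaymentWebhookHandler:
    """In-memory sketch; a real handler would persist state in a database."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.order_status = {}       # order_id -> "paid" or "failed"
        self.seen_event_ids = set()  # used to make duplicate deliveries harmless

    def handle(self, body: bytes, signature_hex: str) -> int:
        """Process one webhook delivery and return an HTTP-style status code."""
        if not verify_signature(self.secret, body, signature_hex):
            return 401
        event = json.loads(body)
        if event["id"] in self.seen_event_ids:
            return 200  # already processed; acknowledge without reapplying
        self.seen_event_ids.add(event["id"])
        # Hypothetical event type names.
        if event["type"] == "payment.succeeded":
            self.order_status[event["order_id"]] = "paid"
        elif event["type"] == "payment.failed":
            self.order_status[event["order_id"]] = "failed"
        return 200
```

The design point is that the order's final status comes from the processor's asynchronous notification rather than from the initial synchronous response, so an asynchronous failure leaves a record instead of disappearing.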
The Result
Checkout completion: 92% → 99.4% in 72 hours. Estimated recovery: £160,000/day. Fix took 3 days. Investigation took 1 week.
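The £160,000/day figure is consistent with applying the 8% loss rate to the £2M daily volume. The snippet below checks that arithmetic. It assumes that £2M is the base the percentage applies to, which this write-up does not state explicitly.

```python
from decimal import Decimal

DAILY_VOLUME_GBP = Decimal("2000000")  # daily transaction volume from the case study
LOSS_RATE = Decimal("0.08")            # 8% of checkout completions lost

# Assumption: the loss rate applies directly to the £2M daily volume.
estimated_daily_loss = DAILY_VOLUME_GBP * LOSS_RATE

# Observed lift in completion rate: 92% -> 99.4%, i.e. 7.4 percentage points.
observed_lift = (Decimal("99.4") - Decimal("92")) / Decimal("100")
recovery_at_observed_lift = DAILY_VOLUME_GBP * observed_lift

print(f"Loss at 8%: £{estimated_daily_loss:,.0f}/day")
print(f"Recovery at observed lift: £{recovery_at_observed_lift:,.0f}/day")
```

Under the same assumption, the measured 7.4-point lift works out to about £148,000/day, so the £160,000 estimate appears to reflect the full 8% gap rather than the measured improvement alone.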
Silent failures are the most expensive kind — the system looked healthy because nobody had instrumented the right things.