The Biggest AI Blunders Companies Don’t Talk About

February 12, 2026 · 8 min read · Siddharth Puri

Behind every "AI-powered" press release sits a stack of quiet disasters nobody posts on LinkedIn — the chatbot that recommended a competitor, the email that implied a refund policy that does not exist, the personalisation engine that rediscovered last year's bias. This post walks through the failure modes that keep repeating, the shared root cause (it is almost never the model), and the one question that separates "AI roadmap" from "expensive demo": where is your evaluation plan?

No one tweets about the AI feature that shipped, embarrassed the company for a week, and got quietly rolled back at 2 AM on a Thursday. No one writes a LinkedIn post saying "we burned six months and $300k and then turned it off." So let me, because the pattern keeps repeating, and it is not actually that subtle once you have seen it enough times.

The classic failure modes

  • Chatbots that confidently recommend competitor products because the training data included reviews
  • AI-generated emails that imply a refund policy or warranty the company never promised
  • Personalisation systems that rediscover last year's bias in a new way every quarter
  • Demos that are 80% hand-coded, 20% LLM call, and get called "AI-powered" in the press release
  • Summarisation features that confidently invent quotes nobody said
  • Support bots that escalate nothing, making angry users call three times before reaching a human
  • Hiring screens that quietly filter out good candidates for reasons nobody can explain

The common root cause

Most of these are not AI problems. They are "we shipped without evaluation" problems. Let me say that more precisely, because I have said it in client meetings at least forty times: if you cannot describe how your AI feature is tested, it is not a feature. It is a future incident.

The companies that get AI right have three things most companies skip entirely:

  • A set of real examples that represent what customers will actually ask (not what the demo script asked)
  • A defined bar for "good enough" that everyone agreed to before shipping
  • A continuous eval loop that tells them when quality drops, without relying on customers to complain
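
That third bullet is the one teams nod at and rarely build, so here is a minimal, self-contained sketch of the whole loop: a golden set of real inputs, an agreed bar, and a check that fails loudly. Every name in it (the golden set, the scorer, the 0.85 bar) is an invented stand-in, not something lifted from any particular client's stack:

```python
# Minimal sketch of a pre-ship eval gate. Everything named here
# (the golden set, the scorer, the bar) is a stand-in.

PASS_BAR = 0.85  # the "good enough" bar everyone agreed to before shipping

# In practice this is a versioned file of real, weird customer inputs,
# not three inline examples.
GOLDEN_SET = [
    {"input": "cancel NOW or i dispute the charge", "expected": "escalate"},
    {"input": "u said id get a refund last week??", "expected": "refund_status"},
    {"input": "is the competitor's plan cheaper?", "expected": "pricing"},
]

def call_feature(user_input: str) -> str:
    """Stand-in for the real AI feature (chatbot, summariser, router...)."""
    return "escalate"  # replace with the actual model call

def score(output: str, expected: str) -> float:
    """Stand-in scorer. Exact match here; in practice a rubric, a
    classifier, or an LLM judge checked against human labels."""
    return 1.0 if output == expected else 0.0

def run_eval() -> float:
    """Run the feature over the golden set and return the average score."""
    scores = [score(call_feature(c["input"]), c["expected"]) for c in GOLDEN_SET]
    avg = sum(scores) / len(scores)
    print(f"{len(GOLDEN_SET)} cases, average score {avg:.2f}, bar {PASS_BAR}")
    return avg

if __name__ == "__main__":
    # Gate the release: exit non-zero if quality is below the agreed bar.
    if run_eval() < PASS_BAR:
        raise SystemExit("Below the bar: do not ship.")
```

The point is not these twenty lines. The point is that "good enough" is written down, and the release gates on it.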

Case study: the helpful chatbot that hurt retention

A company I worked with shipped a support chatbot to take pressure off their Tier 1 team. Launch week: queue dropped 40%. Leadership celebrated. Three months later, retention had quietly dropped 6%, and nobody could figure out why.

The chatbot was resolving tickets — technically. But the resolutions were shallow. Users got a polite dismissal from the bot, felt unheard, and churned silently. The proxy metric ("tickets resolved") said everything was great. The metric that mattered ("are customers renewing?") said the opposite. The real bug was that "tickets resolved" was never the real goal.

Case study: the personalisation engine that rediscovered bias

Another client built a fancy personalisation engine. It learned from user behaviour, updated weekly, shipped recommendations. Inside six months it had rediscovered every historical bias in their dataset and was showing women different products than men in ways that made the marketing team wince.

The model was working exactly as designed. The design did not include "do not make things worse than the status quo." That is not an AI mistake. That is a specification mistake.

Case study: the enterprise demo that could not survive a Tuesday

Sales demo looks incredible. Customer signs a $2M contract. In week three of deployment, real customer data is weirder than demo data, queries break, confidence evaporates. The customer renews for a fraction of the original contract. Everyone involved learns an expensive lesson about the gap between demo data and real data.

Demos are selected. Real life is unselected. This gap is where most AI products die.

How to not be the next case study

  • Before you build, write down ten weird real inputs the feature will see. Not five happy-path ones — ten weird ones
  • Define what "good" looks like in measurable terms. If you cannot measure it, you cannot ship it
  • Ship to 1% first. Not because it is hard to deploy to 100%, but because it is hard to roll back from 100%
  • Build a "quality-drop alarm" — an automated check that screams when outputs degrade (see the sketch after this list)
  • Write the rollback plan before the launch plan. If your rollback is "we panic," you are not ready
  • Treat your AI feature like a third-party vendor — expect it to go down, expect it to hallucinate, design accordingly
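
Since the quality-drop alarm is the item teams most often skip, here is what the shape can look like: sample live outputs, score them the same way the pre-ship evals do, and wake someone up when a rolling average dips. Every function name and threshold below is invented for illustration:

```python
from collections import deque

WINDOW = 200        # how many recent production outputs the average covers
ALARM_BELOW = 0.80  # invented threshold; calibrate against your own baseline

recent_scores = deque(maxlen=WINDOW)

def score_live_output(output: str) -> float:
    """Stand-in: ideally the same scorer your pre-ship evals use."""
    return 1.0  # replace with the real check

def alert(message: str) -> None:
    """Stand-in for whatever actually wakes a human up (pager, Slack...)."""
    print(f"ALARM: {message}")

def record(output: str) -> None:
    """Call this on a sample of production outputs as they are generated."""
    recent_scores.append(score_live_output(output))
    if len(recent_scores) == WINDOW:
        avg = sum(recent_scores) / WINDOW
        if avg < ALARM_BELOW:
            alert(f"rolling quality {avg:.2f} fell below {ALARM_BELOW}")
```

Swap the print for whatever actually pages a human; the shape stays the same.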
If your AI roadmap has no evaluation plan, you do not have an AI roadmap. You have a demo.

None of this is glamorous. None of it makes good press-release material. But the companies that do this quietly are the ones whose AI features are still shipping a year later, with customers who trust them. Everybody else is on their fourth rebrand of the same feature.
