Why AI systems fail in month three

There is a pattern worth naming, because it is common enough to be predictable. An AI system is built. It is shown to the people who will decide whether to approve it, and it performs well. It is approved. It goes into real use. For a while it is fine. Then, somewhere around the third month, it returns a wrong answer on a case that matters, or behaves in a way no one expected, and the people who depend on it stop trusting it.

It is tempting to read this as an AI problem: the model was not good enough, the technology is not ready. That reading is usually wrong, and it is worth being precise about why.

A demo is a controlled setting. The inputs are the inputs someone chose to show. The questions are asked the way the builder expected them to be asked. Production is not controlled. It brings the messy input, the case nobody planned for, the question phrased in a way no one predicted. A deployed system meets all of that, and the AI part of it, by design, will attempt an answer rather than stop. If nothing contains that part, it will eventually attempt an answer it should not, and do so with full confidence.

The third month is not magic. It is roughly how long it takes for real use to surface the cases a demo did not. The failure was present at launch; it simply had not been triggered yet.

The fix is not a better model. It is two pieces of engineering that are usually skipped. The first is the line: a clear, documented decision about which parts of the system must be exact and which should use judgment. Most failures begin with that line drawn carelessly, a judgment part placed where an exact part belonged. The second is the guardrails: the limits, the grounding, the defined behaviour under uncertainty, and the tests that measure whether answers are good. Guardrails are what keep a judgment part from quietly doing the wrong thing when production hands it something the demo never did.
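To make the two pieces concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen for illustration: the names (Answer, guarded, pass_rate, invoice_total_cents), the 0.8 threshold, and the assumption that the model reports a confidence score and the passage it relied on. It is one way the line and the guardrails might look in code, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# All names below are hypothetical; none come from a real library.
ESCALATE = "ESCALATE_TO_HUMAN"  # defined behaviour under uncertainty


@dataclass
class Answer:
    text: str
    confidence: float       # assumed: the model reports a score in [0, 1]
    source: Optional[str]   # assumed: the passage the answer claims to rest on


def invoice_total_cents(line_items_cents: list[int]) -> int:
    """Exact part: integer arithmetic, never delegated to a model."""
    return sum(line_items_cents)


def guarded(model: Callable[[str], Answer], question: str,
            documents: list[str], min_confidence: float = 0.8) -> str:
    """Judgment part, wrapped in guardrails."""
    answer = model(question)
    # Limit: below the threshold, stop rather than guess.
    if answer.confidence < min_confidence:
        return ESCALATE
    # Grounding: the cited passage must actually appear in a known document.
    if answer.source is None or not any(answer.source in d for d in documents):
        return ESCALATE
    return answer.text


def pass_rate(model: Callable[[str], Answer],
              cases: list[tuple[str, str]], documents: list[str]) -> float:
    """Test: fraction of labelled cases where the guarded output is as expected."""
    hits = sum(guarded(model, q, documents) == expected for q, expected in cases)
    return hits / len(cases)


# A stand-in for a real model, so the sketch runs on its own.
def stub_model(question: str) -> Answer:
    if "refund" in question:
        return Answer("Refunds are issued within 14 days.", 0.93,
                      "issued within 14 days")
    return Answer("Probably 30 days.", 0.41, None)


DOCS = ["Refunds are issued within 14 days of a return."]
CASES = [
    ("What is the refund window?", "Refunds are issued within 14 days."),
    ("What is the warranty period?", ESCALATE),  # the case the demo never showed
]

if __name__ == "__main__":
    print(invoice_total_cents([1999, 501]))
    print(pass_rate(stub_model, CASES, DOCS))
```

The second labelled case is the important one: it records that the correct behaviour for an unplanned question is to stop, so a regression there shows up in the pass rate rather than in production.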

A system built with both holds up in the third month because the third month brings nothing the engineering did not already account for. That is the whole difference. It is not cleverness. It is the unglamorous work done before launch, and continued after it.

