Building AI Products, Part 2: The Discipline of Evals

Last month I watched a demo of an AI feature that looked incredible. The presenter typed a complex query, the model produced a beautiful, detailed response, and the room applauded. Then I asked: "How often does it do that?" Silence.

That silence is the gap between a demo and a product. A demo shows what AI can do. An eval shows what it reliably does. And in AI product development, reliability is the whole game.

Photo: a developer reviewing code and test results on a screen. Photo by Christina Morillo on Pexels.

Why AI products need a different quality bar

Traditional software is deterministic. The same input always produces the same output. If the checkout flow works on Tuesday, it works on Wednesday. You test it, verify it, ship it.

AI features are non-deterministic. The same prompt can produce different results. Quality varies by input, by context, by model version, by temperature setting. A feature that works beautifully for one user's query might produce hallucinated garbage for another's. You can't "test and ship" the way you can with traditional software. You need to evaluate continuously.

This is why evals, short for evaluations, have become the most important discipline in AI product development. An eval is a systematic way of measuring how well your AI feature performs across a representative set of inputs, not just the cherry-picked ones from the demo.

Three levels of evals

The teams I've seen do this best run evals at three levels:

Offline evals (before shipping). You build a test set: a collection of representative inputs with known-good outputs. Every time you change a prompt, swap a model, or adjust a parameter, you run the test set and measure the results. Did accuracy go up or down? Did the tone shift? Did hallucination rate change? This is the equivalent of a unit test suite for AI. It catches regressions before they reach users.
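That "unit test suite for AI" can be sketched concretely. The snippet below is a minimal illustration, not a real framework: `EvalCase`, `keyword_score`, and `run_eval` are hypothetical names, and keyword matching stands in for whatever scoring method (exact match, LLM-as-judge, human rubric) fits your feature.

```python
# Minimal offline-eval sketch. All names here are illustrative;
# keyword matching is the simplest possible scorer, standing in for
# whatever "did the output get it right?" check your feature needs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_keywords: list[str]  # facts a good output must mention

def keyword_score(output: str, case: EvalCase) -> float:
    # Fraction of expected facts that appear in the output.
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in output.lower())
    return hits / len(case.expected_keywords)

def run_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    # Mean score across the whole test set. Run this after every prompt,
    # model, or parameter change and compare against the last baseline.
    scores = [keyword_score(generate(c.input), c) for c in cases]
    return sum(scores) / len(scores)
```

The point isn't the scorer; it's the habit: a fixed set of cases, a number that comes out, and a comparison against the previous run before anything ships.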

Pre-launch evals (before going wide). Before a feature goes to 100% of users, you run a broader evaluation: edge cases, adversarial inputs, safety checks. What happens when a user asks something the feature wasn't designed for? What happens with ambiguous inputs? What happens in languages or domains the model handles less well? This is where you find the failure modes you didn't anticipate.
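One way to structure that broader sweep is to tag each case with its failure-mode category, so results surface per category rather than as a single aggregate that hides a weak spot. A hedged sketch, with hypothetical names and categories:

```python
# Pre-launch sweep sketch: group edge cases by category so a low pass
# rate in, say, "adversarial" is visible on its own. Names and the
# category labels are assumptions, not a prescribed taxonomy.
from collections import defaultdict
from typing import Callable

def sweep(generate: Callable[[str], str],
          cases: list[tuple[str, str, Callable[[str], bool]]]) -> dict[str, float]:
    # Each case: (category, input, pass-check on the output).
    results: dict[str, list[bool]] = defaultdict(list)
    for category, prompt, ok in cases:
        results[category].append(ok(generate(prompt)))
    # Pass rate per category, e.g. "adversarial", "ambiguous", "out-of-scope".
    return {cat: sum(r) / len(r) for cat, r in results.items()}
```

A per-category report makes the go/no-go conversation concrete: 98% overall can still mean 60% on adversarial inputs, and that's a different launch decision.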

In-production monitoring (after shipping). The real world is messier than any test set. In production, you monitor: user satisfaction signals (thumbs up/down, regeneration rate, abandonment), output quality metrics (hallucination rate, relevance scores), and drift (is performance degrading over time as the model or user behavior changes?). Teams that skip this step are flying blind.
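The satisfaction signals above reduce to simple rates over logged feedback events. A sketch of that aggregation, assuming a hypothetical event schema (the field names `thumbs_up`, `regenerated`, `abandoned` are illustrative):

```python
# Hedged sketch of aggregating production quality signals from logged
# user-feedback events. The event fields are assumed, not a real schema.
def quality_signals(events: list[dict]) -> dict[str, float]:
    n = len(events)
    if n == 0:
        return {"thumbs_up_rate": 0.0, "regeneration_rate": 0.0, "abandonment_rate": 0.0}
    return {
        # Explicit positive feedback on the output.
        "thumbs_up_rate": sum(e.get("thumbs_up", False) for e in events) / n,
        # User asked for another attempt: a quiet dissatisfaction signal.
        "regeneration_rate": sum(e.get("regenerated", False) for e in events) / n,
        # User gave up without using the output at all.
        "abandonment_rate": sum(e.get("abandoned", False) for e in events) / n,
    }
```

Tracked over time (per day, per model version), these rates are also your drift detector: a regeneration rate that creeps up after a model swap is a regression no offline test set caught.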

Evals as a product practice, not an engineering task

Here's the part that matters for PMs: evals aren't just an engineering concern. They're a product practice.

The PM defines what "good" looks like. What counts as a successful summarization? What tone is appropriate? What hallucination rate is acceptable for this use case? These are product decisions that can't be delegated to the model or to the engineering team. If the PM doesn't define the quality bar, the team optimizes for whatever is easiest to measure, which is usually not what the user cares about.

I've started requiring that every AI feature spec includes an eval plan: what we're measuring, what the baseline is, and what would make us pull the feature back. It felt bureaucratic at first. Now it's the most useful section of the document, because it forces clarity about what success actually means before anyone writes a line of code.
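An eval plan can be concrete enough to act on automatically. Here's a sketch of what one might look like as structured data, with a rollback check attached; every name, metric, and threshold below is illustrative, not a recommendation:

```python
# Illustrative eval plan for a hypothetical summarization feature.
# Thresholds and metric names are made up for the example.
EVAL_PLAN = {
    "feature": "thread-summarization",
    "metrics": {
        "factual_accuracy": {"baseline": 0.92, "rollback_below": 0.88},
        "hallucination_rate": {"baseline": 0.03, "rollback_above": 0.05},
    },
    "test_set": "evals/summarization_v1.jsonl",
    "cadence": "every prompt or model change",
}

def should_rollback(plan: dict, observed: dict) -> bool:
    # The "what would make us pull the feature back" clause, as code:
    # any metric crossing its rollback threshold triggers a rollback.
    for name, m in plan["metrics"].items():
        value = observed.get(name)
        if value is None:
            continue
        if "rollback_below" in m and value < m["rollback_below"]:
            return True
        if "rollback_above" in m and value > m["rollback_above"]:
            return True
    return False
```

Writing the plan this way forces the spec's three questions into the open: what we measure, where the baseline sits, and exactly which number pulls the feature back.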

The compound effect

Teams that invest in evals early see a compound effect. Each new test case makes the suite more comprehensive. Each eval run surfaces edge cases that improve the product. Over time, the team builds an institutional understanding of how the model behaves, not just in the demo, but in the wild.

The teams that skip evals ship faster initially. But they spend more time firefighting, more time responding to user complaints about quality, and more time debugging issues they could have caught before launch.

In AI product development, speed without rigor isn't fast. It's expensive.

Next: Part 3, building responsibly.
