A production AI feature I own worked yesterday. Today it fails one call in four. No code change. No prompt change. No model swap. The retrieval index hasn’t been rebuilt. Nothing in the system, by any reading of the diff, has moved. And yet a quarter of the answers are wrong, in a quietly confident way that no user is going to flag as a bug — they’ll just notice the product is dumber than it was a week ago and leave.
This is what life without evals looks like. And the reason most teams shipping AI in 2026 still don’t have evals is that the discipline is harder than it sounds, more expensive than it appears, and produces no satisfying green checkmark. It’s also the single most important engineering practice separating teams that ship AI products from teams that ship AI demos.
This is the third leg of a trilogy. Context engineering is what the runtime assembles at the moment a model is called. Spec-driven development is what the team writes that the runtime ends up assembling from. Evals are how you know any of it is working. Without the third leg, the first two collapse — the spec is faith, the context is theater, and your “AI feature” is a thing that worked the three times you demoed it.
The thesis is sharper than people are comfortable with. If you ship AI features without evals, you don’t have a product. You have a demo with users.
Why traditional testing doesn’t apply
Every engineer reading this has thirty years of testing intuition built on a load-bearing assumption: the system under test is deterministic. Given the same input, it produces the same output. A unit test is an (input, expected-output) pair you can pin to a commit and run forever.
That assumption is now wrong for a meaningful slice of your codebase.
A model call is not a function. It’s a probability distribution over outputs, sampled. Two identical calls produce different strings, both potentially valid. A call that produced the right answer in March may produce a different right answer in April — or a wrong one — because Anthropic, or OpenAI, or your retrieval index, or your system prompt, or your tool descriptions, or your model version, or your max-tokens setting, or your temperature changed by a hair. The whole test-pyramid model — unit, integration, end-to-end, all green, ship it — was built on a premise that’s now partial. The unit-test layer at the bottom of the pyramid still works for everything below the model boundary. Above it, the pyramid inverts and gets weird.
The temptation is to wave this away. “We’ll set temperature to zero and write assertions.” This buys you about three weeks of false confidence. Temperature-zero is not determinism — it’s low variance. Tool-call ordering still varies. Retrieval can return different chunks for the same query because your index has a new doc in it. The model behind the API endpoint can be updated without notice. The illusion of determinism is what makes the bug, when it arrives, untraceable.
Evals are the discipline of testing a non-deterministic system on purpose, with the right tools for the job. They are not unit tests with extra steps. They’re a different shape.
What an eval actually is
An eval is an (input, behavior assertion, scoring rubric) triple that runs the full system (model call, tools, retrieval, history, all of it) and scores the output against a rubric that tolerates the system’s natural variance while still catching real regressions.
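As a rough sketch of that triple in code: the names `EvalCase`, `run_case`, and `fake_system` are my own illustrations, not from any library, and `fake_system` stands in for the full pipeline. The assertion is the hard requirement; the rubric is the graded part that absorbs normal variance.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval: an input, a behavior assertion, and a scoring rubric."""
    input: str
    assertion: Callable[[str], bool]  # hard requirement, must hold on every run
    rubric: Callable[[str], float]    # graded score in [0, 1], tolerates variance


def run_case(case: EvalCase, system: Callable[[str], str], min_score: float = 0.7) -> dict:
    """Run the system once on the case input and score the output."""
    output = system(case.input)
    ok = case.assertion(output)
    score = case.rubric(output) if ok else 0.0
    return {"passed": ok and score >= min_score, "score": score, "output": output}


# Hypothetical stand-in for the real pipeline (model call, tools, retrieval, history).
def fake_system(prompt: str) -> str:
    return "Refunds are accepted within 30 days of purchase."


case = EvalCase(
    input="What is your refund window?",
    assertion=lambda out: "30 days" in out,              # the fact must be present
    rubric=lambda out: 1.0 if len(out) < 200 else 0.5,   # toy rubric: prefer concise answers
)
result = run_case(case, fake_system)
```

In a real suite the rubric would usually be a structured check or a judge-model call rather than a length heuristic; the shape of the triple is the point here.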
You’ll see four shapes in mature codebases, and they exist on a spectrum from cheap-and-brittle to expensive-and-honest.
| Shape | What it does | When to use | Cost | What it catches |
|---|---|---|---|---|
| String / regex match | Asserts presence or absence of literal patterns in the output | High-stakes refusals, fixed-format outputs, structured-field presence | Free | Catastrophic misses, format breaks |
| Structured field check | Parses output as JSON/XML and asserts on specific fields | Tool-call shape, schema-bound outputs, classification labels | Cheap | Schema drift, label flips, malformed responses |
| LLM-as-judge | A second (usually stronger, different-vendor) model scores the output against a rubric | Open-ended outputs, tone, helpfulness, multi-criterion quality | Medium ($) | Quality regressions, subtle drift, persona breaks |
| Human review | An actual human grades a sample | Gold-standard calibration, judge-model calibration, novel failure modes | High ($$$) | Everything else; the floor you calibrate other shapes against |
The mistake is picking one and treating it as sufficient. String matches are nearly free but tell you almost nothing about whether the output is good — only whether it’s not catastrophically broken. LLM-as-judge captures nuance but inherits its own model’s biases and failure modes. Human review is the only honest answer to “is this any good?” but doesn’t scale to CI. Mature eval suites layer all four: cheap checks gate every run, LLM-judges sample a percentage, humans calibrate the judge against a held-out gold set on a quarterly cadence.
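The two cheapest shapes, plus the "judge samples a percentage" idea, can be sketched with the standard library alone. The refusal pattern, the field names, and the hash-based sampling scheme are assumptions for illustration, not a prescribed design.

```python
import hashlib
import json
import re

# Hypothetical refusal pattern; a real one would be tuned to your product's wording.
REFUSAL_PATTERN = re.compile(r"\b(can't|cannot|unable to) help\b", re.IGNORECASE)


def regex_check(output: str, pattern: re.Pattern, must_match: bool = True) -> bool:
    """String/regex shape: assert presence (or absence) of a literal pattern."""
    return bool(pattern.search(output)) == must_match


def field_check(output: str, required: dict) -> bool:
    """Structured shape: parse as JSON and assert on specific fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # malformed output fails the check
    if not isinstance(data, dict):
        return False
    return all(data.get(key) == value for key, value in required.items())


def should_judge(case_id: str, sample_rate: float) -> bool:
    """Pick a stable fraction of cases for the more expensive LLM-judge pass.

    Hashing the case id keeps the sample the same from run to run.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32 < sample_rate
```

The cheap checks run on every output; `should_judge` decides which outputs also go to the judge model, so the expensive layer stays a budgeted sample rather than a default.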
The four layers of evals
The trap I’ve watched team after team fall into is thinking eval is one thing. It’s at least four, and each layer fails differently when you skip it. The framing parallels the five layers of context on purpose — readers who internalized that frame can use it here too.
Capability evals. Does the model itself do the thing, at all, in isolation? These are the evals Anthropic and OpenAI run to decide whether a new model can replace the previous one. You don’t write these. You consume them — and you pay attention when a model card publishes scores that move materially on a benchmark close to your use case. Capability evals are the ground floor. If the model can’t do the task on its own, no amount of context engineering will rescue you.
Behavior evals. Given your prompt and your context, does the model produce the shape you want? This is the layer most teams stop at, and it’s the cheapest. Behavior evals catch “the JSON has the right fields,” “the refusal triggers when it should,” “the model picks tool A and not tool B.” If you only run one layer, run this one. But behavior evals are not enough — they verify form, not substance.
System evals. Does the full pipeline — retrieval, planning, tool calls, history, summarization, refusal, all of it — produce the right end-to-end outcome on a realistic input? System evals are where most real bugs live. The retrieval returned the wrong chunk. The history summarizer dropped a constraint. The planner called the right tool with the wrong arguments. The behavior eval on each component passed; the end-to-end behavior failed anyway. Most teams have never run a system eval. It shows.
Regression evals. When something changes (a system prompt edit, a new tool, a model version bump, a retrieval index rebuild, a CLAUDE.md tweak), what previously-passing examples now fail? Regression evals are the layer that catches “we shipped a prompt change on Tuesday and silently broke 12% of customer flows.” Without regression evals, you don’t know what your changes broke. You find out from users. You then find out the change you blame is not the change that broke it, because three changes shipped that week. Regression evals are the single highest-leverage layer to add if you don’t have it.
The pattern: most teams do partial behavior evals, no system evals, and ad-hoc regression. The teams that ship reliable AI features do all four, with a clear ownership story for each.
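The regression layer in particular reduces to a diff between two sets of results. A minimal sketch, assuming you already store a per-case pass rate for the baseline and for the candidate change; the function name and the 0.10 tolerance are illustrative choices, not recommendations.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.10,
) -> list[str]:
    """Return ids of cases whose pass rate dropped by more than `tolerance`."""
    regressions = []
    for case_id, old_rate in baseline.items():
        new_rate = candidate.get(case_id)
        if new_rate is None:
            continue  # case missing from the candidate run; report that separately
        if old_rate - new_rate > tolerance:
            regressions.append(case_id)
    return sorted(regressions)


# Hypothetical pass rates before and after a system prompt edit.
before = {"refund_policy": 1.0, "tool_routing": 0.95, "tone": 0.90}
after = {"refund_policy": 1.0, "tool_routing": 0.60, "tone": 0.85}
broken = find_regressions(before, after)
```

Running this per change, rather than per week, is what lets you attribute a break to the change that caused it.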
The four hard problems of eval engineering
Mirroring the context-engineering frame on purpose: the discipline boils down to four problems that show up on every non-trivial AI feature. None is solved with more eval cases. All four are systems-engineering problems.
Coverage: how do you know your eval set represents what users actually do?
The eval set you wrote on day one was based on the inputs you imagined users would send. The inputs they actually send are different in ways you didn’t predict — different lengths, different languages, different politeness, different attempts to break the system, different domains you hadn’t considered. An eval set that doesn’t reflect production traffic is decoration.
The teams that solve this treat eval-set curation as an ongoing data-pipeline problem. Sanitized production prompts get sampled into the eval set continuously. Edge cases that broke once get pinned in. The set grows; old cases get pruned when they’re no longer representative; coverage gets measured against actual traffic distribution. The teams that don’t solve it write twenty cases on day one, ship, and act surprised when the system fails on the twenty-first.
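One way to measure coverage against traffic, sketched under the assumption that both eval cases and sanitized production prompts carry a category label (intent, language, whatever you bucket by). The labels and the 0.5 ratio are hypothetical.

```python
from collections import Counter


def distribution(labels: list[str]) -> dict[str, float]:
    """Share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def coverage_gaps(
    eval_labels: list[str],
    traffic_labels: list[str],
    min_ratio: float = 0.5,
) -> dict[str, tuple[float, float]]:
    """Categories whose eval share is below `min_ratio` times their traffic share.

    Returns {category: (eval_share, traffic_share)} for each under-covered category.
    """
    eval_dist = distribution(eval_labels)
    traffic_dist = distribution(traffic_labels)
    gaps = {}
    for category, traffic_share in traffic_dist.items():
        eval_share = eval_dist.get(category, 0.0)
        if eval_share < traffic_share * min_ratio:
            gaps[category] = (eval_share, traffic_share)
    return gaps


# Hypothetical labels: production sees non-English prompts, the eval set has none.
traffic = ["billing"] * 50 + ["refund"] * 30 + ["non_english"] * 20
evals = ["billing"] * 15 + ["refund"] * 5
gaps = coverage_gaps(evals, traffic)
```

Here `gaps` flags only `non_english`, which is exactly the kind of category nobody imagined on day one.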
Stability: how do you tell a real regression from noise?
A non-deterministic system gives different answers each run. Your eval scoring will fluctuate even with no changes underneath. Some of that fluctuation is meaningless variance; some of it is the leading edge of a real regression you should chase. Telling the two apart is harder than it looks.
The answer is statistical, not algorithmic. Run each eval case N times (5–20 depending on cost), aggregate, and treat pass-rate (not pass/fail) as the metric. Set thresholds with the underlying variance in mind — a flip from 95% to 92% over a single run is noise; a sustained flip from 95% to 85% over three runs is a regression. This is closer to A/B-test discipline than to unit-test discipline. Engineers who haven’t done experiment design before find this part the most uncomfortable.
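To make the noise-versus-regression call concrete, here is a sketch using a one-sided two-proportion z-test. It assumes trials are independent, which is a simplification when several samples share a case, and the 1.645 threshold (roughly 5% one-sided) is a conventional choice of mine, not something this discipline mandates.

```python
import math


def pass_rate(results: list[bool]) -> float:
    """Fraction of sampled runs that passed."""
    return sum(results) / len(results)


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z statistic for the drop from baseline (a) to candidate (b)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both runs all-pass or all-fail: no measurable difference
    return (p_a - p_b) / se


def is_regression(pass_a: int, n_a: int, pass_b: int, n_b: int,
                  z_threshold: float = 1.645) -> bool:
    """True when the candidate is worse than baseline beyond the noise threshold."""
    return two_proportion_z(pass_a, n_a, pass_b, n_b) > z_threshold


# With 100 trials per side (say 20 cases x 5 samples):
noise = is_regression(95, 100, 92, 100)   # z is about 0.86, below threshold
real = is_regression(95, 100, 85, 100)    # z is about 2.36, above threshold
```

The same three-point drop that is noise at 100 trials becomes detectable with enough trials, so the threshold and the sample count have to be chosen together.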
Cost: how do you run a thousand evals without going broke?
Every eval run is one or more model calls. Take a serious eval suite: three hundred cases, five samples each, with LLM-as-judge on top. That is fifteen hundred sampled runs, and each one costs two model calls (the system under test, plus the judge), so three thousand calls per run of the suite. At realistic 2026 prices that’s a non-trivial bill per CI build. Run it on every PR and you’ll get a finance question by the end of the quarter.
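The arithmetic above fits in a tiny calculator. The per-call price below is a placeholder I made up to show the shape of the calculation; substitute your own measured average.

```python
def model_calls_per_run(cases: int, samples_per_case: int, calls_per_sample: int = 2) -> int:
    """Total model calls for one run of the suite.

    calls_per_sample defaults to 2: the system under test plus the judge.
    """
    return cases * samples_per_case * calls_per_sample


def cost_per_run_usd(total_calls: int, avg_cost_per_call_usd: float) -> float:
    """Rough bill for one run, given an average cost per call."""
    return total_calls * avg_cost_per_call_usd


calls = model_calls_per_run(cases=300, samples_per_case=5)  # 300 * 5 * 2 = 3000
# 0.01 USD per call is a hypothetical placeholder, not a real price.
bill = cost_per_run_usd(calls, avg_cost_per_call_usd=0.01)
```

Multiply the result by PRs per day and the finance question writes itself.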
Mature teams tier the suite: a small smoke set runs on every PR, the full suite runs nightly on main, the largest LLM-judge sample runs weekly. Critical paths get more samples; long-tail cases get fewer. The cost is a budget you manage explicitly, the same way you manage CI minutes. The teams that don’t tier either run nothing or burn money pointlessly.
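A tiering scheme like that can live in a small config. This is a sketch: the tier names, case caps, sample counts, and the `critical` flag are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    trigger: str               # "pull_request", "nightly", or "weekly"
    max_cases: Optional[int]   # None means run the whole set
    samples: int               # samples per ordinary case
    use_judge: bool            # whether the LLM judge runs in this tier


# Hypothetical tiers: smoke set on every PR, full suite nightly, deep run weekly.
TIERS = [
    Tier("smoke", "pull_request", max_cases=25, samples=3, use_judge=False),
    Tier("full", "nightly", max_cases=None, samples=5, use_judge=True),
    Tier("deep", "weekly", max_cases=None, samples=20, use_judge=True),
]


def select_tier(trigger: str) -> Tier:
    for tier in TIERS:
        if tier.trigger == trigger:
            return tier
    raise ValueError(f"no tier configured for trigger {trigger!r}")


def plan_run(tier: Tier, cases: list[dict]) -> list[dict]:
    """Order critical cases first, then cut to the tier's case budget."""
    ordered = sorted(cases, key=lambda c: not c.get("critical", False))
    if tier.max_cases is None:
        return ordered
    return ordered[: tier.max_cases]


def samples_for(tier: Tier, case: dict) -> int:
    """Critical paths get double the samples; long-tail cases get the base count."""
    return tier.samples * 2 if case.get("critical", False) else tier.samples
```

Because the tiers are data, the budget conversation becomes a reviewable diff to this file instead of an argument about whether to run evals at all.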
Drift: what happens when your golden answers go stale?
The answer that was correct in February may be wrong in May because the product changed, the data changed, the policy changed, or the world changed. “Last year’s tax rate” is no longer the right answer. “The CEO of company X” updates without warning. Your eval set rots if you don’t maintain it. And the rot is invisible until a passing eval is actually testing wrong behavior.
The discipline is twofold: pin golden answers to a date and a source where possible, and review the eval set on a cadence. The latter sounds tedious; it is. It’s also the only way to keep the set honest. Treat eval-set maintenance as a recurring engineering line item, not a one-time investment.
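Both halves of that discipline can be encoded in the eval set itself. A minimal sketch, with a made-up `GoldenAnswer` record and a 90-day review window picked only as an example cadence.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class GoldenAnswer:
    case_id: str
    expected: str
    as_of: date   # when the expected answer was last verified
    source: str   # where the expected answer came from


def stale_cases(goldens: list[GoldenAnswer], today: date, max_age_days: int = 90) -> list[str]:
    """Ids of golden answers not re-verified within the review window."""
    cutoff = today - timedelta(days=max_age_days)
    return [g.case_id for g in goldens if g.as_of < cutoff]


# Hypothetical records; the expected values are placeholders.
goldens = [
    GoldenAnswer("tax_rate", "PLACEHOLDER", date(2026, 1, 10), "finance policy doc"),
    GoldenAnswer("refund_window", "PLACEHOLDER", date(2026, 5, 1), "support handbook"),
]
needs_review = stale_cases(goldens, today=date(2026, 6, 1))
```

A scheduled job that fails when `needs_review` is non-empty turns the review cadence from an intention into a ticket.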
Five rules I’ve learned in production
Numbered because they’re earned, not derived.
1. Real prompts, never synthesized. The single most common mistake I see is teams populating their eval set with prompts they wrote themselves, in the voice they imagine users speak. Real users don’t speak like that. They misspell. They paste in markdown. They include irrelevant context. They cut sentences off mid-thought. An eval set built from imagined inputs tells you about an imagined system, not your real one. Always seed the set with sanitized production traffic. Always.
2. LLM-as-judge needs a different model. The temptation is to grade Claude’s output with Claude. It’s cheaper and the prompt is familiar. It’s also a feedback loop that flatters your system. Use a different model, ideally from a different vendor, for the judge — and on the highest-stakes evals, use a stronger model than the one being graded. The same family scoring itself catches form, not substance. Cross-vendor scoring catches things the family is collectively bad at.
3. Treat eval drift as a bug, not noise. When a previously-passing eval starts failing without a code change, the lazy reaction is to retry, observe it passes, and move on. The grown-up reaction is to investigate. Something moved — the model behind the API, your retrieval index, an upstream dependency. “It passed the next time” is not a closed ticket. Quietly degrading production AI features almost always look like silent eval-drift first; the team that learns to chase drift catches incidents three weeks before the team that doesn’t.
4. Run regression evals on prompt and context changes, not just model swaps. Most teams that have any regression discipline at all only run it when changing model versions. A two-word change in a system prompt can fail thirty evals — and will. A new tool added to the toolbox can break tool-routing on inputs that have nothing to do with the new tool. A reorganized AGENTS.md can change how the model interprets every prompt. Treat every change to the context pipeline as a candidate regression, and gate it with the eval suite the same way you gate code with tests.
5. Evals belong in CI. If they’re not blocking, they’re decoration. A nightly eval report nobody reads is not an eval suite. A dashboard with red squares that ships anyway is not an eval suite. The whole point of evals is that they prevent regressions from reaching production — which means they have to block. Yes, this is hard. Yes, the variance makes it harder. Yes, the cost question is real. Solve those problems instead of solving the gate-removal problem. The team that doesn’t gate is doing eval theater, not eval engineering.
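Rule 5 comes down to an exit code. Here is a sketch of a blocking gate, assuming per-case pass rates have already been computed; the function names and the 0.9 default threshold are illustrative.

```python
import sys


def gate(
    pass_rates: dict[str, float],
    thresholds: dict[str, float],
    default_threshold: float = 0.9,
) -> list[str]:
    """Return one message per case whose pass rate is below its required threshold."""
    failures = []
    for case_id, rate in sorted(pass_rates.items()):
        required = thresholds.get(case_id, default_threshold)
        if rate < required:
            failures.append(f"{case_id}: pass rate {rate:.2f} < required {required:.2f}")
    return failures


def main(pass_rates: dict[str, float], thresholds: dict[str, float]) -> int:
    """Print failures to stderr and return a process exit code (0 = ship, 1 = block)."""
    failures = gate(pass_rates, thresholds)
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0


# A CI entry point would end with something like: sys.exit(main(rates, thresholds))
```

Thresholds are per case so that a high-variance case can carry a looser bar without loosening the gate for everything else.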
Where this still falls apart
I’d be selling something if I left it there. Evals are necessary; they are not sufficient; and there are real cases where the discipline doesn’t apply cleanly.
- Truly open-ended outputs. “Write a poem about loss.” There is no rubric. LLM-as-judge can score for tone and adherence to constraints, but it can’t score for “good poem.” For these surfaces, evals catch the floor (safety, format, length) and humans own the ceiling. Pretending otherwise produces evals that grade for the average and punish the exceptional.
- Multi-turn root-cause problems. A seven-turn agent run fails at turn seven. The cause was a state mistake at turn two. The eval flags the failure at turn seven, but the diff between a passing trace and a failing trace is buried five turns deep. You need trace-level evals — scoring the pipeline, not just the final output — and most teams don’t have the tooling. The honest answer in 2026 is that multi-turn eval tooling is still being invented; expect this to be where the field invests next.
- Cost on the largest suites. A serious eval suite for a serious AI product can cost more per CI run than the rest of CI combined. There is no clever trick that makes this go away. You manage the cost with tiering, sampling, and budget discipline — or you accept that your eval coverage is bounded by your bill.
- Eval set as a product. The eval set itself becomes a thing you own, version, review, and maintain. It’s another codebase, hidden inside your codebase. Teams that don’t take this seriously end up with a folder of stale assertions that everyone hates running. Teams that do treat the eval set as a first-class artifact — owned, reviewed, refactored — get compounding returns.
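For the multi-turn case in the list above, the simplest form of a trace-level eval is a set of invariants checked at every turn, reporting the first turn where one breaks. This is a sketch under a strong assumption: that your agent records a per-turn `state` dict. The trace shape and the invariant are hypothetical.

```python
from typing import Callable, Optional


def first_broken_turn(
    trace: list[dict],
    invariants: dict[str, Callable[[dict], bool]],
) -> Optional[tuple[int, str]]:
    """Return (turn_number, invariant_name) for the earliest violation, or None.

    Turn numbers are 1-based to match how traces are usually discussed.
    """
    for turn_number, turn in enumerate(trace, start=1):
        for name, check in invariants.items():
            if not check(turn):
                return turn_number, name
    return None


# Hypothetical seven-turn trace: the user's budget constraint is dropped at turn two.
trace = [{"state": {"budget": 500}}] + [{"state": {}} for _ in range(6)]
invariants = {"budget_preserved": lambda turn: "budget" in turn["state"]}
culprit = first_broken_turn(trace, invariants)
```

The final-output eval would flag turn seven; this points at turn two, which is where the fix belongs.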
The honest framing: evals are non-negotiable for shipped AI products in 2026, and they’re harder than they sound. The teams that own the discipline outship the teams that don’t, not because the engineers are better, but because they actually know what they’re shipping.
The career angle: eval engineer is the new role
Three layers of the ladder, three different stakes.
Juniors. The fastest way to become useful on an AI team is to volunteer to own the eval set. Nobody wants to. The work is unglamorous. It’s also where you learn the system, where you discover the actual failure modes, and where senior engineers will notice your work because you’re catching their bugs before users do. An engineer who shows up in year one and builds a real eval suite for a real feature is doing more useful work than the engineer next to them who shipped twice as many lines of agent-generated code. The leverage compounds the same way it does with tests, only more so, because the cost of a missed regression is higher.
Mid-levels. The career risk is real. If you spent the last three years getting better at writing AI features and you skipped the evals discipline, you’re a worse engineer than you think you are. The recovery move is to pick the most flaky AI feature on your team and own its eval suite end to end for a quarter. You’ll learn more about production AI in three months of eval work than in a year of feature work. You’ll also be the only person on your team who can answer “is this working?” with anything other than vibes — which is the question that gets asked in the rooms where promotions happen.
Seniors and staff. “Show me your eval suite” is the new “show me your test coverage.” It’s the question I now ask in every AI-engineering interview, and the quality of the answer tells me more about the candidate than any other single signal. At your level, the work isn’t writing the evals — it’s owning the discipline. Setting the bar for what’s blocked. Building the cross-functional muscle to keep human review in the loop. Convincing finance to fund the eval bill. Negotiating with product to ship gated on eval pass-rate, not on launch date. These are political and architectural problems disguised as testing problems, and they’re exactly the work seniors are paid to do.
The market has not caught up. The number of engineers who can build a serious eval suite for an agent system is small enough that hiring managers I talk to describe it as a single-digit-percentage skill. The premium on that skill will shrink as the discipline becomes standard practice, the same way it did for unit testing between 2005 and 2015. The window where this skill is rare is roughly now: eighteen months, two years tops.
The deeper point
Spec is the truth. Context is the assembly. Evals are the proof.
These are three different artifacts, three different disciplines, three different muscle groups. Mature 2026 AI engineering does all three. The teams I work with that do only one ship demos. The teams that do two ship products that quietly degrade. The teams that do all three ship products that keep working, which turns out to be the only feature anyone outside engineering can tell apart.
The framing that took me longest to internalize is this: in a non-deterministic system, the test suite is what makes the system real. Without evals, your AI feature is the average of three demos and your hope. With evals, it’s a system you actually understand. The difference is the difference between an engineer and a tour guide.
If you want to go deeper, the courses below cover the parts of this discipline I teach most often: Building LLM-Powered Apps: RAG & Agents for system-eval design across retrieval and agent pipelines, Building with Claude API: Production AI Apps with the Anthropic SDK for the production-runtime side where evals live in CI, Building Agents with the Claude Agent SDK for multi-turn agent eval tooling, and Claude Code Mastery: Agentic Coding for Engineers for the daily workflow of treating the eval suite as a first-class artifact alongside the spec and the context pipeline.