Every firm that takes AI seriously eventually arrives at the same question, usually around the third quarter of their first real build: how do we actually know this thing is working? Not how does it feel in the demo, not how does it score on a vendor benchmark, not what the partner thought the one time they clicked through. How do we know — measurably, repeatably, defensibly — that this agent does the job our firm hired it to do?
That is the eval question. And the firms that get it right are building the only AI moat that actually compounds.
What an eval is, and what it is not
An eval is a graded test for an AI system. Concretely: a fixed input, a way to score the output, and an expected outcome the score is compared against. Run the eval, get a number. Run the eval again next week against a new model or a new prompt or a new tool, get a different number. The point is the comparison.
Evals are not vibes. They are not “I tried it and it seemed fine.” They are not a single golden demo a partner watched once. They are not the model’s self-rating. Every entry in a real eval suite shares three things:
- An input the firm cares about. Drawn from real work — sanitized, redacted, and version-controlled. Not toy examples.
- A scoring function the firm trusts. Sometimes a strict equality check (did we extract the right number?). Sometimes a structured rubric scored by a second model with explicit criteria. Sometimes a one-line human judgment captured in a labeling tool. All three are valid; what matters is that the function is consistent.
- A target the firm has agreed to. “90% pass rate on the redaction eval before we ship” is a target. “It’s pretty good” is not.
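The three ingredients above map onto a small data structure. The following is a minimal sketch in Python, not a prescribed harness: the names `EvalCase`, `exact_match`, `run_suite`, `meets_target`, and `toy_agent` are hypothetical, the scorer is the simplest possible equality check, and the inline cases stand in for sanitized, version-controlled inputs drawn from real work.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One entry in the suite: an input and the outcome we expect."""
    case_id: str
    input_text: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Strict equality scorer, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def run_suite(
    agent: Callable[[str], str],
    cases: Sequence[EvalCase],
    scorer: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run every case through the agent and return the pass rate (0.0 to 1.0)."""
    if not cases:
        raise ValueError("eval suite has no cases")
    passed = sum(scorer(agent(c.input_text), c.expected) for c in cases)
    return passed / len(cases)


def meets_target(pass_rate: float, target: float) -> bool:
    """The agreed target is an explicit number, not a feeling."""
    return pass_rate >= target


def toy_agent(text: str) -> str:
    # Hypothetical stand-in for a real agent: returns the last token.
    return text.split()[-1]


CASES = [
    EvalCase("c1", "Total due: 1200", "1200"),
    EvalCase("c2", "Total due: 450", "450"),
    EvalCase("c3", "Balance 300 USD", "300"),  # toy_agent returns "USD" here
]

pass_rate = run_suite(toy_agent, CASES)
ship = meets_target(pass_rate, target=0.90)
```

Here the toy agent passes two of three cases, so it misses a 90% target. Swapping in a new model or prompt means passing a different `agent` to `run_suite` over the same cases and comparing the two numbers.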
Why this matters more in professional services
In consumer AI, an eval miss means a slightly worse chatbot. In professional services, an eval miss means an answer that walks into a partner’s hands looking confident and wrong. A law firm cannot ship an agent whose accuracy on cite-checking is unknown. An accounting firm cannot ship a workpaper agent whose false-positive rate on reconciliation flags is unmeasured. The downside is asymmetric, and the partner who signs off knows it.
That is why evals are the part of a custom AI build that professional services firms care about disproportionately, and the part most vendors quietly skip. A demo with no eval is a promise the firm cannot enforce.
The four-layer eval stack we ship
Every custom agent we build ships with a four-layer eval suite alongside the code. The layers are stacked from cheapest and fastest to slowest and most expensive. Each layer catches a different class of regression.
- Layer 1 — Schema and contract evals. The fastest tests. Does the agent return the typed shape we promised? Are required fields present? Is the citation array non-empty when the prompt requires citations? These run in seconds on every build.
- Layer 2 — Deterministic output evals. A test set of inputs with known correct answers — extracted values, classifications, routing decisions, exact-match facts. Scored by equality or rule. Catches the obvious regressions when a prompt or model changes.
- Layer 3 — Rubric-graded evals. A second model (or a panel of them) scores the output against an explicit rubric the firm wrote: factual accuracy, citation quality, tone, completeness, refusal correctness. Each criterion is a small, separately-scored axis. Slower and more expensive, but the only way to grade open-ended outputs at scale.
- Layer 4 — Human-in-the-loop spot checks. A small, regular sample reviewed by the partner or senior staffer who actually owns the workflow. Slow, expensive, and the only ground truth we trust for the highest-stakes outputs. The previous three layers exist to make this layer rare.
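To make the two cheapest layers concrete, here is a rough sketch of a Layer 1 contract check and a Layer 2 exact-match check. The output shape (an `answer` string plus a `citations` list) and the function names are assumptions chosen for illustration; a real agent's contract would be whatever typed shape the build promised.

```python
from typing import Any

# Hypothetical contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def check_contract(output: dict[str, Any], citations_required: bool) -> list[str]:
    """Layer 1: return a list of contract violations (empty means pass)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in output:
            errors.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    citations = output.get("citations")
    if citations_required and isinstance(citations, list) and not citations:
        errors.append("citations required but empty")
    return errors


def check_exact(output: dict[str, Any], expected_answer: str) -> bool:
    """Layer 2: deterministic equality against a known correct answer."""
    return output.get("answer") == expected_answer


good = {"answer": "1200", "citations": ["doc-7, p. 3"]}
bad = {"answer": "1200", "citations": []}

good_errors = check_contract(good, citations_required=True)
bad_errors = check_contract(bad, citations_required=True)
```

Both checks are pure functions with no model calls, which is what lets them run in seconds on every build.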
Every layer has a budget. Layer 1 should run on every commit, Layer 2 on every pull request, and Layer 3 on every meaningful prompt or model change. Layer 4 runs at a fixed sampling rate for as long as the agent is in production, usually 1% to 5% of production volume, chosen so the partner is not drowning but is also not flying blind.
Cheap evals run often. Expensive evals run rarely. The job of the cheap layers is to catch the easy regressions before the expensive layers see them.
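One way to express that cadence is a trigger-to-layers table plus a deterministic sampler for Layer 4. This is a sketch under assumptions: the trigger names are made up, running the cheaper layers again on the more expensive triggers is our illustrative choice, and hashing a request ID is just one of several reasonable ways to pick a stable sample.

```python
import hashlib

# Hypothetical trigger names; each trigger also reruns the cheaper layers.
LAYERS_BY_TRIGGER = {
    "commit": [1],
    "pull_request": [1, 2],
    "prompt_or_model_change": [1, 2, 3],
}


def sampled_for_review(request_id: str, rate: float) -> bool:
    """Layer 4: decide whether a production request goes to human review.

    Hashing the ID makes the decision repeatable: the same request is
    always in or out of the sample for a given rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# Example: roughly 5% of 10,000 hypothetical request IDs are selected.
ids = [f"req-{i}" for i in range(10_000)]
selected = [rid for rid in ids if sampled_for_review(rid, rate=0.05)]
```

Because the sample is derived from the ID rather than a random draw, the reviewer's queue can be reconstructed later when someone asks which outputs were actually checked.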
The four failure modes that wreck eval suites
We have inherited more than a few half-built eval suites. The same four mistakes show up every time, and avoiding them is worth more than picking the perfect framework.
1. The eval set drifts. The team adds new test cases as bugs are found, never removes the stale ones, never version-controls the inputs, and after six months the suite is scoring a model against a corpus that no longer reflects real work. Fix: treat the eval set like code. Version it, review changes, and rotate stale items out on a schedule.
2. The model grades itself. A common shortcut is to ask the same model that produced the answer to also rate it. The score looks great until you realize the model is marking its own homework. Fix: rubric grading uses a different family of model than the one being evaluated, and the rubric is explicit enough that two graders agree most of the time.
3. The metric does not match the work. Aggregate accuracy is comforting. It also hides the failure modes the firm actually cares about. A 92% intake-extraction accuracy that is 99% on W-2s and 65% on K-1s is not a 92% agent; it is a 65% agent for the most complex returns. Fix: slice every eval by the categories the partner cares about. Report the worst slice prominently.
4. There is no production eval. The team runs evals before deploy and never again. Production drift — a vendor silently retraining, a new document format showing up, a prompt change with unintended consequences — goes unnoticed for months. Fix: a small, sampled eval runs against production traffic continuously, and the dashboard goes red when a slice regresses.
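The slicing fix in item 3 is simple to sketch. The counts below are invented for illustration (100 W-2 cases and 20 K-1 cases), so the aggregate lands near 93% rather than exactly the 92% in the example above; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_slice(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Group (slice_name, passed) pairs and return the pass rate per slice."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


def worst_slice(by_slice: dict[str, float]) -> tuple[str, float]:
    """Return the slice with the lowest pass rate, to be reported first."""
    return min(by_slice.items(), key=lambda item: item[1])


# Invented results: 99 of 100 W-2 cases pass, 13 of 20 K-1 cases pass.
RESULTS = (
    [("W-2", True)] * 99 + [("W-2", False)] * 1
    + [("K-1", True)] * 13 + [("K-1", False)] * 7
)

overall = sum(passed for _, passed in RESULTS) / len(RESULTS)
by_slice = accuracy_by_slice(RESULTS)
name, score = worst_slice(by_slice)
```

With these numbers the aggregate looks healthy while `worst_slice` surfaces K-1 at 65%, which is the figure the partner needs to see first.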
What ownership looks like
Eval ownership is the question that determines whether the suite survives the first staffing change. The answer is the same in every firm we have shipped to: one named partner owns the rubric, one named engineer owns the harness, and the firm — not the vendor — owns the test set.
That last one is the part most firms forget. If your AI vendor owns the eval set, your AI vendor owns the truth. When you eventually want to switch models, switch vendors, or simply verify a vendor’s claims, the eval set is what makes that possible. Treat it as a strategic asset. Back it up. Keep it in your repo. It is the closest thing the firm has to institutional memory of what “good output” means.
Whoever owns the eval set owns the agent. Make sure that is your firm.
Where evals fit in the build
In our two-week sprints, evals are not the last step. They are the second step. Before we write the first prompt, we write a small Layer 1 + Layer 2 suite, usually 20 to 50 cases drawn from the firm’s real work, and agree on a target the partner has signed off on. Then we build the agent until the suite is passing. Then we add Layers 3 and 4 against the running system.
That order is deliberate. It addresses the reason most year-one custom AI builds fail: not because the model is bad, but because the team never agreed on what “working” meant before they started shipping. Evals make “working” concrete.
What evals do not do
A good eval suite does not replace partner judgment. It does not certify the agent against malpractice claims. It does not prove the system is safe in every edge case. It does not turn a bad workflow into a good one. What it does is give the firm a measurable, defensible answer when someone asks “is this thing actually working?” — and a way to keep answering that question as models, prompts, and the world change around the build.
For a partner deciding whether to greenlight a custom AI build, the eval suite is the artifact that should make the decision feel safe. Not the demo. Not the architecture diagram. The eval suite, the targets, and a clear answer to “who runs this every week.”
Next step
Evals are part of every Brightline engagement, not an upsell. If you want to see what an eval suite for your specific workflow would look like — what the cases would be, who would score them, what the targets should be — that is exactly the kind of conversation we have on a 30-minute bottleneck audit.
