The Three Custom-AI Mistakes That Kill Professional-Services Builds in Year One

The firms that get the most out of bespoke AI in year one do not pick the most exciting workflow or the fanciest model. They avoid three specific mistakes the rest of the market keeps making — none of them technical.

Law firms, Accounting firms, Custom AI, Implementation

I have now watched enough law and accounting firms spin up their first serious custom-AI engagement to see the same three mistakes wreck year one over and over. They are not technical mistakes. A capable engineer can ship a working module in two weeks. That is the easy part. The hard part is everything around it — the decisions a firm makes in the first ninety days that decide whether the software quietly earns its keep or quietly gets abandoned.

None of these mistakes is exotic. Every one of them reads as obvious once you name it. But they sit in a place the vendor deck never highlights, because acknowledging them would slow down the close, and in a place the firm does not instinctively protect, because partners do not usually think of these things as risks. The result is a depressingly consistent pattern: an ambitious first build, a strong first demo, a slow fade, and a partner meeting twelve months later where somebody asks whether the project is actually doing anything.

Here are the three mistakes. They are ranked from most common to most expensive.

Mistake 1: Picking the wrong first workflow

Most firms start with the workflow that has the most visible pain. That instinct is wrong most of the time. Visibility and suitability are two different axes.

The best first workflow for a bespoke AI module has four properties: it repeats frequently, the inputs and outputs are crisply defined, the firm has internal expertise to evaluate the outputs, and the outputs are legible enough that a reviewer can spot a bad draft in under a minute. Miss any one of those and the project can still succeed, but the work to make it succeed expands dramatically.

Where I see firms go wrong: they pick the workflow that their highest-paid partner complains about the most. That workflow is usually highly variable, thin on structured inputs, and reviewed by someone with almost no time to evaluate drafts carefully. It is a fine workflow to automate eventually. It is a terrible one to start with.

The best first workflow is not the loudest one. It is the most repeatable one the firm still does by hand.

The fix is almost embarrassingly simple: spend an hour writing down five candidate workflows, score them against the four properties above, and pick the one that scores highest. Most firms skip this step because it feels like planning, and planning feels like the opposite of shipping. But an hour of honest scoring is worth more than fifteen days of corrective rebuilding after the wrong choice.
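The scoring exercise fits on a legal pad, but here is a minimal Python sketch of it for firms that prefer a script. Everything in it is an assumption made for illustration: the 1-to-5 scale, the tie-break rule, and the five candidate workflows with their scores.

```python
from dataclasses import dataclass

# The four properties named above; each is scored 1 (poor) to 5 (strong).
PROPERTIES = (
    "repeats_frequently",
    "crisp_inputs_outputs",
    "internal_expertise",
    "legible_outputs",
)


@dataclass(frozen=True)
class Candidate:
    name: str
    scores: dict  # property name -> score from 1 to 5, filled in by the firm

    def total(self) -> int:
        return sum(self.scores[p] for p in PROPERTIES)

    def weakest(self) -> int:
        return min(self.scores[p] for p in PROPERTIES)


def rank(candidates):
    # Highest total first; ties go to the candidate whose weakest property is strongest.
    return sorted(candidates, key=lambda c: (c.total(), c.weakest()), reverse=True)


# Hypothetical candidates and scores, for illustration only.
candidates = [
    Candidate("Partner's bespoke client memos", dict(zip(PROPERTIES, (2, 1, 4, 2)))),
    Candidate("Deposition transcript outlines", dict(zip(PROPERTIES, (5, 4, 4, 5)))),
    Candidate("Client document triage", dict(zip(PROPERTIES, (4, 4, 5, 3)))),
    Candidate("Engagement letter drafting", dict(zip(PROPERTIES, (3, 4, 4, 4)))),
    Candidate("Billing narrative cleanup", dict(zip(PROPERTIES, (5, 3, 3, 3)))),
]

ranked = rank(candidates)
for c in ranked:
    print(f"{c.total():>2}  (weakest {c.weakest()})  {c.name}")
```

The tie-break on the weakest property reflects the point that missing any one of the four properties makes the project much harder. The loud, partner-driven workflow tends to land at the bottom of a list like this.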

Mistake 2: Treating the module like a SaaS rollout

Firms know how to roll out SaaS tools. Over the last fifteen years they have installed practice-management systems, document-storage platforms, time-capture tools, and at least three generations of CRM. The playbook is well-worn: pick a vendor, run training sessions, push adoption, measure usage, escalate to leadership when someone is not logging in.

A bespoke AI module is not a SaaS tool. Applying the SaaS playbook to it is the second most expensive mistake firms make.

The difference is structural. A SaaS tool has a large surface area of features and a small surface area of integration with the firm’s actual work. A custom module has the opposite shape: it has a small surface area of features (usually one workflow) and a large surface area of integration with the firm’s actual work — its data, its templates, its review conventions, its client-specific quirks. The value of the module comes from the depth of that integration, and the way to extract that value is emphatically not a training session.

The way to extract the value is to integrate the module into the daily routine of the one person who owns the workflow, watch what breaks, and iterate. Think of a paralegal who does transcript outlines every morning, or a senior preparer who triages client documents the first week of every engagement. The module needs to feel less like a tool and more like an invisible upgrade to that person’s existing rhythm.

Side note

The best validation that a module is working is that nobody at the firm mentions it after the fourth week. It has disappeared into the workflow. SaaS rollouts generate constant feedback because the tool is constantly visible; a good bespoke module becomes furniture fast, and furniture is invisible by design.

Firms that treat the module like a SaaS rollout end up measuring usage instead of outcomes, pushing adoption instead of iterating, and generating training materials instead of fixing the two or three edge cases the module is silently failing on. Twelve months in, usage looks fine on a dashboard. Outcomes look unchanged on the P&L. Nobody knows why.

Mistake 3: Skipping the evaluation layer

The third mistake is the one I worry about most, because it is the hardest to see coming and the most expensive to undo. It is also the one that separates firms that scale bespoke AI from firms that pilot it and stall.

An evaluation layer is the part of a custom AI system that tells you, on an ongoing basis, whether the outputs are still good enough. Not whether the model is running. Not whether users are logging in. Whether the actual work product the module is producing meets the firm’s bar.

Most first-time buyers do not ask for an evaluation layer, and most vendors do not include one by default. The result is a system that works beautifully the first week, drifts imperceptibly for the next six months as model versions change and edge cases accumulate, and one afternoon produces a bad draft that reaches a client before anyone notices. Then the partners blame the tool, the tool gets turned off, and the firm swears off custom AI for eighteen months.

An evaluation layer does not need to be fancy. For most professional-services modules, it is some combination of:

  • A small, maintained gold set of inputs with known-good outputs. New model versions, prompt changes, and data sources get tested against it before they go live.
  • A lightweight review queue that samples real outputs weekly and flags anything a senior reviewer disagrees with. The sample does not need to be large; it needs to be consistent.
  • A clear rubric for what “good enough” means on that workflow, written down, agreed to by the people who will use the module. Not “accurate.” Not “high quality.” A specific bar, with examples.
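To make the first and third items concrete, here is a minimal Python sketch of a gold-set check that runs before a change goes live. It is a sketch under stated assumptions, not a description of any particular vendor's tooling: the `GoldCase` structure, the phrase-matching rubric, the sample cases, and both stand-in module versions are hypothetical. A real rubric would be richer than substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldCase:
    case_id: str
    input_text: str
    required_phrases: tuple  # rubric stand-in: what a good output must mention


def passes_rubric(output: str, case: GoldCase) -> bool:
    # Hypothetical rubric: every required phrase appears in the output.
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in case.required_phrases)


def run_gold_set(module, gold_set, min_pass_rate: float = 1.0):
    # Run every gold case through the module and collect the ones that fail.
    failures = [c.case_id for c in gold_set if not passes_rubric(module(c.input_text), c)]
    pass_rate = 1 - len(failures) / len(gold_set)
    return pass_rate >= min_pass_rate, pass_rate, failures


# Hypothetical gold set; a real one is built and maintained by the firm.
GOLD_SET = [
    GoldCase("depo-001", "Q: Where were you on March 3? A: At the office.",
             ("March 3", "office")),
    GoldCase("depo-002", "Q: Who signed the lease? A: Ms. Ortiz signed it.",
             ("lease", "Ortiz")),
]


def current_version(text: str) -> str:
    # Stand-in for the module as it runs today.
    return "Summary: " + text


def candidate_version(text: str) -> str:
    # Stand-in for a changed prompt or model that silently drops the answers.
    return "Summary: " + text.split(" A:")[0]


current_ok, current_rate, current_failures = run_gold_set(current_version, GOLD_SET)
candidate_ok, candidate_rate, candidate_failures = run_gold_set(candidate_version, GOLD_SET)

print(f"current:   ok={current_ok} pass_rate={current_rate:.0%}")
print(f"candidate: ok={candidate_ok} pass_rate={candidate_rate:.0%} failed={candidate_failures}")
```

The design choice that matters is the gate: a change that fails the gold set does not ship, whatever it looked like in a demo.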

Firms that build an evaluation layer in the first ninety days look, a year later, like firms that have figured out custom AI. Firms that skip it look, a year later, like firms that are quietly deciding the technology is not ready. The technology was ready. The measurement was missing.

What the good firms do instead

The firms that end year one with a working AI program typically do three things in sequence, and they commit to all three before they build anything.

First, they pick a single workflow using the four properties above. They write down why they picked it, and what would count as success. That document is a paragraph, not a hundred-page business case. But it exists, signed, before anyone writes code.

Second, they identify the one person inside the firm who owns that workflow and will use the module day to day. They put that person on the build team, not on a training roster. The module is designed with them, not for them.

Third, they budget the evaluation layer at the start, not the end. The rubric gets written during scoping. The gold set gets built during the first week of development. The review queue goes live the same day the module does.
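A review queue that goes live on day one can be very small. The Python sketch below draws a fixed-size weekly sample of real outputs and lists the ones a senior reviewer disagreed with. The function names, the week label format, the sample size, and the draft IDs are all hypothetical.

```python
import random


def weekly_sample(output_ids, week_label: str, sample_size: int = 10):
    # Seed from the week label so the same week always yields the same sample,
    # regardless of the order the IDs arrive in.
    rng = random.Random(week_label)
    ids = sorted(output_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def flagged_outputs(reviews: dict):
    # reviews maps output ID -> True if the senior reviewer agreed with the draft.
    return sorted(oid for oid, agreed in reviews.items() if not agreed)


# Hypothetical week of module outputs.
week_outputs = [f"draft-{n:03d}" for n in range(1, 41)]
sample = weekly_sample(week_outputs, "2026-W07", sample_size=5)

# Hypothetical review results: the reviewer disagrees with the first sampled draft only.
reviews = {oid: (i != 0) for i, oid in enumerate(sample)}
flagged = flagged_outputs(reviews)

print(f"Sampled {len(sample)} of {len(week_outputs)} outputs; flagged: {flagged}")
```

Seeding from the week label keeps the sample reproducible, which is one way to get the consistency that matters more than sample size.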

None of this is technically hard. It is the opposite of technically hard; it is the part of the project that can happen in a partner’s office with a legal pad. But because it is not the exciting part, it is the part that gets skipped. And because it gets skipped, year one looks mostly the same regardless of what the technology can do: modules that run, nobody sure what they are worth, nobody sure whether to scale the program up or quietly let it shrink.

The technology in 2026 is genuinely good. The first-year failure mode in 2026 is mostly about process. That is, as always, the mode that is easiest to fix if you name it in advance — and the most expensive one to fix after the fact.

If you want a second set of eyes on your first candidate workflow, your rollout plan, or the rubric you haven’t written yet, that is what the thirty-minute bottleneck audit is for. Bring your list, your context, and whatever half-formed rubric you have. I will tell you which workflow I would pick first and why, and I will tell you honestly if none of them are ready to build.
