11 min read · For professional services

The Two Walls Every Enterprise AI Deployment Hits: Reliability and Runaway Cost

Harvey is worth $11B, Legora $5.6B, and Kirkland is spending half a billion to build its own. Underneath the legal-AI gold rush, two hard walls are showing up: agents that fail the math of reliability, and token bills that scale in ways no SaaS budget was built for. Here is what that means for buy-vs-build — and the way through.

enterprise AI, legal AI, Harvey, Legora, build vs buy, AI costs, reliability, software 3.0, professional services

From the outside, enterprise AI looks like an unambiguous boom. Harvey just raised $200 million at an $11 billion valuation. Legora, its Swedish rival, is at $5.6 billion. Kirkland & Ellis is reportedly spending $500 million to build its own. The money says this is settled. It isn't. Underneath the headlines, two hard walls are coming into view at once — and how your firm handles them, not which logo you license, is what will actually decide whether AI pays off.

The first wall is reliability: the uncomfortable arithmetic by which a model that dazzles in a demo can fail a multi-step production workflow nine times out of ten. The second is cost: per-token pricing turns a predictable software budget into a utility bill that nobody can forecast, and it runs on prices that are quietly subsidized today and will not stay that way. Let's walk through both, and then the part nobody selling you a platform wants to dwell on: what to actually do about it.

Wall one: the brutal math of reliability

Enterprises are not slow to adopt AI because they're timid or behind. They're slow because they can do multiplication. If you have an agent that is 80% accurate on a task and you run it ten times in a row, the probability it gets all ten right is not 80%. It's 0.8 to the tenth power, about 10.7%. There's roughly a nine-in-ten chance it fails at least once. Error compounds, and enterprise automation means running a task not ten times but thousands or millions of times. Even a 90%-accurate agent, which is what many vendors quietly ship, gets through a ten-step workflow cleanly only about 35% of the time. It fails roughly two runs in three.
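The multiplication is easy to check for yourself. Here is a minimal sketch, assuming each run succeeds or fails independently of the others (real agent errors are often correlated, so treat this as a rough model, not a measurement). The function name is ours, chosen for illustration.

```python
def all_succeed_probability(per_run_accuracy: float, runs: int) -> float:
    """Probability that every one of `runs` independent attempts succeeds."""
    if not 0.0 <= per_run_accuracy <= 1.0:
        raise ValueError("per_run_accuracy must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return per_run_accuracy ** runs


# Compare a few per-run accuracy levels over ten consecutive runs.
for accuracy in (0.80, 0.90, 0.99, 0.999):
    p = all_succeed_probability(accuracy, 10)
    print(f"{accuracy:.1%} per run -> {p:.1%} chance all 10 succeed")
```

At 80% per run the all-ten figure comes out near 10.7%, and at 90% it is still only about 35%. It takes roughly 99% per run before ten in a row clears 90%.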

Andrej Karpathy has a name for the gap between a demo and a product: the "march of nines." From his years leading Tesla Autopilot, he describes how each additional nine of reliability (90% to 99% to 99.9% to 99.99%) takes roughly as much work as the nine before it, so the effort never tapers off. A demo that works 90% of the time is just the first nine. Hands-off enterprise automation needs three, four, or five nines. Most agents are scratching at one.
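You can run the same arithmetic in reverse to see how many nines a workflow needs. This is a rough sketch, again assuming independent steps. The helper names and the 99% target are illustrative assumptions, not figures from Karpathy or any vendor.

```python
import math


def required_per_step_accuracy(target_success: float, steps: int) -> float:
    """Per-step accuracy needed so that `steps` independent steps all succeed
    with probability `target_success`."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_success ** (1.0 / steps)


def nines(accuracy: float) -> float:
    """Express accuracy as a count of nines: 0.9 -> 1, 0.99 -> 2, 0.999 -> 3."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -math.log10(1.0 - accuracy)


# Hypothetical target: the whole workflow should succeed 99% of the time.
for steps in (10, 100, 1000):
    per_step = required_per_step_accuracy(0.99, steps)
    print(f"{steps} steps -> {per_step:.5%} per step ({nines(per_step):.1f} nines)")
```

A ten-step workflow that should succeed 99% of the time needs roughly 99.9% per step, which is three nines. Every tenfold increase in the number of steps adds about one more.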

If you want that abstraction made concrete, look at the benchmark Harvey itself published this month. Its Legal Agent Benchmark (LAB) graded frontier models on long-horizon legal tasks against tens of thousands of expert-written rubric criteria, under a strict "all-pass" standard. The result: the frontier models tested completed fewer than 10% of complex legal tasks end-to-end. Claude Opus 4.7 led at 7.1%, followed by Sonnet 4.6 at 5.4%, Opus 4.6 at 4.2%, GPT-5.5 at 2.1%, and Gemini 3.5 Flash at 0.8%. That is the market leader telling you, in its own data, that today's models do an okay-ish job and then fail miserably the moment you stop holding their hand.

This is not an argument against using AI. It's an argument against pretending a chatbot is an autonomous worker. The reliability math is exactly why we keep saying that generic chatbots fail in professional-services firms and why evals — not the model — are the real moat. A workflow you can measure, constrain, and verify gets you the extra nines a raw model never will.

Wall two: the cost no SaaS budget was built for

For thirty years, software pricing was simple because software's marginal cost was roughly zero. You bought seats, you renewed seats, you added seats. AI breaks that model in both directions. Agents don't consume seats; they consume tokens, tool calls, and compute cycles. One heavy user can burn more in a week than a light user does in a year, and the spend lands as something closer to a metered utility bill than a predictable line item.

Nobody illustrates this better than Uber. Its President and COO, Andrew Macdonald, told the Rapid Response podcast that watching engineering burn through the company's entire 2026 AI budget in about four months was a "head-exploding moment." This is Uber — a company that understands metered, surge-priced economics better than almost anyone alive. If they can't forecast their own token consumption, the twelve-attorney firm buying a per-seat legal AI license on a flat annual quote should be nervous about what happens when usage actually scales.

And here's the part that's easy to miss while prices look cheap: they are cheap on purpose. Frontier inference today is heavily subsidized by venture and hyperscaler capital — the major labs are widely reported to be spending well more than a dollar to earn each dollar of revenue. That is a land-grab price, not a sustainable one. As the subsidies normalize, the per-token cost that makes a deployment pencil out today can quietly move against you. We wrote about the underlying mechanics in the tokenizer tax: the cost of a workflow is not the sticker price of the model, it's the price of the model times every token it touches, forever.
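To make "price times every token it touches" concrete, here is a small sketch of a workflow cost model. Every number in it is hypothetical: the per-million-token prices, the token counts, the run volume, and the price multipliers are placeholders to swap for your own figures, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per one million tokens. All figures used below are hypothetical."""
    input_per_million: float
    output_per_million: float


def run_cost(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single workflow run: price times every token it touches."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def monthly_cost(
    price: TokenPrice,
    input_tokens: int,
    output_tokens: int,
    runs_per_month: int,
    price_multiplier: float = 1.0,
) -> float:
    """Monthly spend, with an optional multiplier to model a price change."""
    return run_cost(price, input_tokens, output_tokens) * runs_per_month * price_multiplier


# Hypothetical workflow: 40k input tokens and 2k output tokens, 5,000 runs a month.
today = TokenPrice(input_per_million=3.00, output_per_million=15.00)
for multiplier in (1.0, 1.5, 2.0):
    cost = monthly_cost(today, 40_000, 2_000, 5_000, multiplier)
    print(f"price x{multiplier}: ${cost:,.2f} per month")
```

The useful part is the sensitivity, not the absolute number. Spend moves in direct proportion to both usage and price, and a flat per-seat quote hides both of those inputs from you.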

Stack the two walls together and you get the real enterprise problem. The most capable model is the most expensive and the slowest — Harvey's own data shows its top performer cost roughly $50.90 per task and took about 22 minutes per run — and even then it fails the all-pass bar more than 90% of the time. You are paying frontier prices, on subsidized rates that will rise, for output you still have to check. That is a structurally bad trade if you deploy it naively.
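One way to see how the two walls multiply is to ask what a task that fully passes costs. The sketch below assumes you simply retry until a run passes, that attempts are independent, and that the roughly $50.90 per-task cost belongs to the same model that led at 7.1%. The source figures don't confirm that pairing, so treat the output as a back-of-envelope illustration, not a benchmark result.

```python
def expected_attempts(pass_rate: float) -> float:
    """Mean number of independent attempts until the first pass (geometric)."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return 1.0 / pass_rate


def expected_cost_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing run under naive retry-until-pass."""
    return cost_per_attempt * expected_attempts(pass_rate)


# Assumed pairing of the article's figures: $50.90 per run, 7.1% all-pass rate.
cost = expected_cost_per_success(50.90, 0.071)
print(f"about {expected_attempts(0.071):.0f} attempts, about ${cost:,.0f} per passing task")
```

Under those assumptions, a task that runs about $50 a time costs several hundred dollars per clean result. That's before counting the human time needed to tell a pass from a fail.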

The squeeze on the wrappers

Now layer in the market shift. For two years the bet was that a startup could wrap a frontier model in a legal-specific interface and own the relationship. That bet is being pressured from both ends.

From above, the model labs are moving up the stack. Anthropic launched Claude for Legal with practice-area plugins and more than 20 connectors into the tools lawyers already use — the first coordinated move by a model maker straight at the vertical incumbents built on top of it. OpenAI is doing the same in the enterprise generally with Frontier and a Codex business that has grown more than five-fold this year. When the platform you wrapped starts shipping the wrapper, "we put a nice UI on Claude" stops being a moat.

From below, the floor is falling out. A former Latham & Watkins associate, Will Chen, released Mike (MikeOSS) — an open-source legal AI platform that does much of what Harvey and Legora do, free to self-host, built in about two weeks. It collected a thousand GitHub stars in three days. As one commentator put it, it signals the end of legal AI's "secret sauce." If the interface layer can be cloned in a fortnight and given away, the eleven-figure valuation is buying brand and distribution, not defensible technology.

So is there a home for the wrappers? Yes — but a narrower one than the valuations imply. There will always be firms that want a supported, off-the-shelf product and are happy to rent the same capability their competitors rent, on the same terms. What's disappearing is the idea that the wrapper is where the durable advantage lives. The advantage was never the chat box. It's the workflow underneath it — and that is precisely what a horizontal platform optimizes away.

Why Kirkland is building, and what it actually proves

This is the context that makes Kirkland's $500 million decision legible. The highest-grossing law firm in the world, with every option money can buy, looked at the buy-versus-build question and chose to build around its own lawyers' workflows — informed by some 250 of its own attorneys, with outside vendors barred from reselling the result. Chair Jon Ballis framed the goal as taking "the collective intelligence of our institution" and deploying it across the firm.

The lesson is not "spend half a billion dollars." It's that the most sophisticated buyer in the market concluded the value is in the bespoke workflow, not the rented platform — and refused to let that advantage become a product its rivals could license. That logic does not require a Kirkland budget. It scales all the way down to a twenty-person firm. The difference is scope, not strategy.

The way through: deploy like an engineer, not a tourist

Here is the hopeful part, and it's genuinely hopeful. Both walls are engineering problems, and engineering problems have solutions. The firms that win with AI are not the ones who buy the most expensive model. They're the ones who deploy cleverly.

Start with the most important and most ignored principle: you do not need a frontier model for most of the work. A recurring cash-flow forecast, a standard accounting reconciliation, a monthly variance report, a first-pass document classification, a routine intake summary — none of these need a $50-per-task reasoning model. They need a small, cheap, fast model wired into a tightly scoped workflow with the right data and a verification step. Reserve the expensive frontier model for the genuinely hard reasoning, and route everything else to something an order of magnitude cheaper. That single decision — model routing — is often the difference between a deployment that pencils out and one that becomes Uber's head-exploding moment.
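In code, model routing can start out very simple. The sketch below is one possible shape, not a prescribed design. The tier names, the task kinds, and the `needs_open_ended_reasoning` flag are all hypothetical labels we made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"        # cheap, fast model for scoped, routine work
    FRONTIER = "frontier"  # expensive reasoning model, used sparingly


@dataclass(frozen=True)
class Task:
    kind: str
    needs_open_ended_reasoning: bool = False


# Hypothetical task kinds, taken from the routine examples in the text.
ROUTINE_KINDS = frozenset({
    "cash_flow_forecast",
    "reconciliation",
    "variance_report",
    "document_classification",
    "intake_summary",
})


def route(task: Task) -> Tier:
    """Send known routine work to the small tier; everything else escalates."""
    if task.needs_open_ended_reasoning or task.kind not in ROUTINE_KINDS:
        return Tier.FRONTIER
    return Tier.SMALL


print(route(Task("reconciliation")).value)
print(route(Task("novel_contract_dispute")).value)
```

Unknown task kinds go to the frontier tier here. That's a deliberately conservative choice: it favors quality over cost until someone has reviewed the task and added it to the routine list. A firm that cares more about containing spend could flip that default and send unknown work to a human queue.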

The reliability wall yields to the same discipline. You don't get the extra nines from a better prompt; you get them from constraining the problem — narrowing scope, grounding the model in your real data, adding deterministic checks, and building evals that catch regressions before they reach a client. A measured, bounded workflow at 99.9% beats an unbounded chatbot at 90%, every time.
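Here is a minimal sketch of what "deterministic checks" and "evals" can look like for a reconciliation-style output. The output schema (`line_items`, `total`, `period`), the check names, and the baseline threshold are assumptions made up for this example, not a standard.

```python
from typing import Callable, Iterable, Sequence

Check = Callable[[dict], bool]


def has_required_fields(output: dict) -> bool:
    """Structural check: the model output contains every expected key."""
    return all(key in output for key in ("line_items", "total", "period"))


def totals_match(output: dict) -> bool:
    """Arithmetic check: line items must add up to the stated total."""
    items, total = output.get("line_items"), output.get("total")
    if items is None or total is None:
        return False
    return abs(sum(items) - total) < 0.005


CHECKS: Sequence[Check] = (has_required_fields, totals_match)


def verify(output: dict, checks: Iterable[Check] = CHECKS) -> list[str]:
    """Return the names of failed checks; an empty list means the output passes."""
    return [check.__name__ for check in checks if not check(output)]


def pass_rate(outputs: Sequence[dict], checks: Iterable[Check] = CHECKS) -> float:
    """Share of outputs that pass every check; the core of a tiny eval."""
    checks = tuple(checks)
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    return sum(1 for out in outputs if not verify(out, checks)) / len(outputs)


def assert_no_regression(rate: float, baseline: float) -> None:
    """Fail loudly if a new model or prompt scores below the recorded baseline."""
    if rate < baseline:
        raise AssertionError(f"pass rate {rate:.1%} fell below baseline {baseline:.1%}")
```

In use, you keep a fixed set of saved cases, run them through the workflow whenever the prompt or model changes, and call `assert_no_regression(pass_rate(outputs), baseline)` before anything ships. The checks are plain code, so they give the same answer every time. That's what lets them catch the model's occasional wrong one.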

And the cost wall yields to ownership. When the workflow is yours — running on models you can swap as prices and capabilities change, with spend you can see and govern — you are insulated from both the subsidy unwinding and the vendor's next price increase. You stop renting the same averaged-down tool as everyone else and start compounding an asset that fits how your people actually work.
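Ownership, in practice, mostly means two things: the model sits behind an interface you control, and spend is tracked where you can see it. Below is a rough sketch under those assumptions. `Model`, `GovernedWorkflow`, and `StubModel` are hypothetical names, and the stub stands in for whatever real provider client you would wrap.

```python
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Anything that can answer a prompt and report what the call cost."""
    name: str

    def complete(self, prompt: str) -> tuple[str, float]:
        """Return (text, cost_in_usd)."""
        ...


class BudgetExceeded(RuntimeError):
    pass


@dataclass
class GovernedWorkflow:
    model: Model
    monthly_cap_usd: float
    spent_usd: float = 0.0

    def run(self, prompt: str) -> str:
        # Refuse new calls once the cap is reached instead of overspending silently.
        if self.spent_usd >= self.monthly_cap_usd:
            raise BudgetExceeded(f"cap of ${self.monthly_cap_usd:.2f} reached")
        text, cost = self.model.complete(prompt)
        self.spent_usd += cost
        return text


@dataclass
class StubModel:
    """Stand-in for a real provider client; it echoes the prompt at a fixed cost."""
    name: str
    cost_per_call: float

    def complete(self, prompt: str) -> tuple[str, float]:
        return f"[{self.name}] {prompt}", self.cost_per_call


workflow = GovernedWorkflow(model=StubModel("model-a", 0.50), monthly_cap_usd=1.00)
print(workflow.run("summarize intake form"))
workflow.model = StubModel("model-b", 0.10)  # swap models without touching the workflow
print(workflow.run("summarize intake form"))
```

The workflow code never names a vendor, so pointing it at a cheaper model is a one-line change. The cap is also enforced before the call, not discovered on the invoice.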

This is what Software 3.0 is for

That's the whole idea behind Software 3.0: software written in plain language, shaped around your organization, built fast, and owned outright. Not a horizontal platform that averages away the differences between your teams, but a bespoke workflow that leans into them — the specific fifteen minutes your tax group saves differently than your audit group, the exact reconciliation your firm runs, the intake your litigators actually use. Delivered for a fixed price in a few weeks, and yours forever.

That is exactly what Brightline Labs does. We don't sell you a seat on someone else's roadmap. We sit in the weeds of how your firm actually works, build the one or two workflows that matter around it, deploy the right model for each job rather than the most expensive one for all of them, and hand you something you own. The frontier-model boom is real. The way to survive its reliability and cost walls is to stop being a tourist and start deploying like an engineer.

Operator takeaways
  • Reliability compounds against you: an 80%-accurate agent run ten times succeeds all ten only ~10.7% of the time. Harvey's own benchmark shows frontier models finishing fewer than 10% of complex legal tasks end-to-end.
  • Token-based cost scales like a utility bill, not a SaaS seat — and today's prices are venture-subsidized and will rise. Uber blew its entire 2026 AI budget in four months.
  • The wrappers are squeezed from above (Claude for Legal, OpenAI Frontier) and below (open-source clones like MikeOSS). The durable advantage was never the interface — it's the workflow underneath.
  • Don't use a frontier model for forecasting, reconciliations, or routine reports. Route cheap models to easy work; reserve the expensive one for hard reasoning.
  • The extra nines come from scope, grounding, checks, and evals — and cost control comes from owning a workflow you can re-point at cheaper models over time.

If you want help drawing the line between what to rent and what to own — and figuring out which one or two workflows are worth building around how your firm actually works — that's a good first conversation to have.
