Deep models are the future of enterprise AI. I have zero doubts about it. The companies that win with automation will not be the ones that bolt a general-purpose chatbot onto every workflow and hope the model develops taste. They will be the ones that turn their best operating knowledge into specialized systems that are narrow, tested, observable, and extremely good at the job they were hired to do.
In enterprise automation, “pretty good at everything” is often the wrong target. A customer-support agent does not need to be good at sales. It needs to be exceptional at support. It needs to know the product, the escalation policy, the refund rules, the regulated language, the internal tools, the customer tiers, the edge cases, and the exact moment when it should stop and hand the matter to a person.
What I mean by deep models
A deep model is not merely a bigger model. It is a model, or model system, made deep in one business domain: trained, tuned, evaluated, and wrapped around the specific work a company actually performs. It may start with an open-source base model. It may use a frontier model as a teacher. It may combine retrieval, fine-tuning, tool calling, deterministic checks, and human review. The point is not the training recipe. The point is ownership of the capability.
A shallow AI system borrows intelligence from a general model and pushes instructions into a prompt. A deep AI system absorbs the company’s workflow into the product itself: the schemas, the examples, the policies, the historical decisions, the failure cases, the evals, and the telemetry. The intelligence is not only in the model. It is in the whole loop around the model.
That distinction matters because enterprise work is mostly not general. It looks general from the outside because the words are familiar: classify this ticket, draft this email, reconcile this account, triage this intake, summarize this contract. Inside the company, each of those workflows is full of house rules, exceptions, implicit priorities, risk tolerances, and weird-but-important details that never appear in a benchmark.
General-purpose models flatten the work
Frontier general-purpose models are astonishing. They are especially strong in very broad domains like coding, where the public training signal is enormous, the syntax is explicit, and the feedback loop can be tight. They are also excellent for brainstorming, summarization, first drafts, and exploratory reasoning. None of that is the same as being great at a company’s internal workflow.
Enterprise automation punishes shallow understanding. The model must know when a support response should protect against churn risk instead of optimizing for speed. It must know when a finance exception is a real control issue instead of a formatting error. It must know when a legal clause is market-standard, when it is unusual, and when it violates the way this company specifically does business.
A general model can often produce a plausible answer. The problem is that plausible is not the enterprise bar. The bar is repeatable, auditable, policy-aware, cost-aware, and measured against the cases that actually cost the business money when they go wrong.
The future enterprise question is not “which model is smartest in general?” It is “which system is deepest at our work?”
The support-agent example is the whole point
Imagine a support agent for a B2B software company. The demo version answers questions, sounds friendly, searches docs, and maybe creates a ticket. That is enough to impress a room. It is not enough to run the function.
The production version needs a much deeper operating model. It needs to distinguish a confused free user from a renewal-risk enterprise account. It needs to know which bugs have workarounds, which customers are under custom SLAs, which answers legal has approved, which product limitations should be acknowledged plainly, and which issues should route to engineering with a clean reproduction packet. It should not improvise a pricing concession because it noticed a sales opportunity. It should not cross-sell when the customer is angry. It should not optimize a metric that makes the dashboard look good while the account gets worse.
That is not a generic assistant problem. That is a company-specific operating problem. The answer is not a more charming model. The answer is a deeper one.
Why companies will own more of the model layer
As companies become more sophisticated, they will stop treating OpenAI or Anthropic as the only possible production brain for every workflow. Those models will still matter. They may remain the best option for broad reasoning, coding, evaluation, synthetic-data generation, and high-stakes fallback paths. But many production workflows will move toward company-owned or company-controlled models: fine-tuned open-source models, distilled specialists, private adapters, and small models running behind strict contracts.
The reason is practical. A company that owns the specialized layer can control cost, latency, privacy posture, deployment geography, observability, upgrade timing, and degradation behavior. It can decide when a cheaper model is good enough for low-risk tickets and when a frontier model is justified. It can measure performance per customer, per workflow, per tier, per model version. It can swap components without rewriting the business.
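One way to picture the "cheaper model when it is good enough" decision is a small routing policy that picks the least expensive model approved for a ticket's risk level. This is only a sketch under stated assumptions: the model names, the per-call prices, and the risk scoring rule are all hypothetical placeholders, not real products or quotes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_call_usd: float
    max_risk: int  # highest risk level this model is approved to handle


# Hypothetical catalog: names and prices are illustrative placeholders.
CATALOG: List[ModelOption] = [
    ModelOption("small-finetuned-specialist", 0.001, max_risk=1),
    ModelOption("mid-size-distilled", 0.01, max_risk=2),
    ModelOption("frontier-general", 0.10, max_risk=3),
]


def risk_level(customer_tier: str, touches_money: bool) -> int:
    """Assumed scoring rule: start at 1, add 1 for enterprise, add 1 for money."""
    level = 1
    if customer_tier == "enterprise":
        level += 1
    if touches_money:
        level += 1
    return level


def choose_model(customer_tier: str, touches_money: bool,
                 catalog: List[ModelOption] = CATALOG) -> ModelOption:
    """Return the cheapest model that is approved for the ticket's risk level."""
    needed = risk_level(customer_tier, touches_money)
    approved = [m for m in catalog if m.max_risk >= needed]
    if not approved:
        raise LookupError(f"no model approved for risk level {needed}")
    return min(approved, key=lambda m: m.cost_per_call_usd)
```

In a real system the approval ceiling for each model would presumably come from eval results rather than a hand-written constant, but the shape of the decision stays the same: risk first, then cost.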
That is where enterprise AI starts to look less like buying a chatbot and more like building infrastructure. The model becomes one component in a governed system: typed inputs, typed outputs, eval gates, permission checks, audit logs, rollback plans, and cost telemetry.
The frontier labs will keep pushing the ceiling of general capability. Enterprises will increasingly capture that capability in narrower systems they can own, measure, tune, and price.
Evals are what make depth real
Without evals, “deep” is just branding. A model is only deep in a workflow if you can prove it handles the cases that define the workflow. That means a durable eval suite: real examples, structured expected outputs, rubric checks where needed, slice-based reporting, and production monitoring that catches drift after the demo is over.
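To make "slice-based reporting" concrete, here is a minimal sketch of an eval runner that scores a routing function per slice and flags the slices that fall below a threshold. The `EvalCase` fields, the slice names, and the exact-match scoring are assumptions chosen for illustration; a real suite would likely add rubric checks and richer expected outputs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class EvalCase:
    slice_name: str      # e.g. "refunds", "outages" (hypothetical slices)
    ticket_text: str     # the input shown to the system under test
    expected_route: str  # the structured output we expect back


def run_eval(predict: Callable[[str], str],
             cases: Iterable[EvalCase]) -> Dict[str, float]:
    """Return exact-match accuracy per slice for the given predictor."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.slice_name] += 1
        if predict(case.ticket_text) == case.expected_route:
            passed[case.slice_name] += 1
    return {name: passed[name] / total[name] for name in total}


def failing_slices(report: Dict[str, float], threshold: float) -> List[str]:
    """List the slices whose accuracy is below the threshold, sorted by name."""
    return sorted(name for name, acc in report.items() if acc < threshold)
```

Because `predict` is just a callable, the same cases can be run against any candidate model or stack, which is what lets the eval set outlive any single model choice.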
This is why I think evals are the real moat. Once the company owns the tests, it owns the definition of good. It can compare a new frontier model against Claude, a fine-tuned Llama derivative, a tiny domain model, or a hybrid stack without restarting the argument from vibes. The eval set becomes institutional memory.
This is also where fine-tuning becomes less mystical. You do not fine-tune because fine-tuning is fashionable. You fine-tune when the evals show that repeated behavior should be learned into the model instead of re-explained in a prompt. You fine-tune when the workflow has stable patterns, enough examples, a measurable target, and a cost or latency profile that justifies the work.
The stack gets smaller and more specialized
The enterprise AI stack will not be one giant model doing everything. It will be a portfolio. A small classifier routes the work. A fine-tuned extractor handles the stable document pattern. A frontier model resolves ambiguity. A deterministic service validates the output. A human reviews the thin slice where judgment matters. A telemetry layer records cost, latency, model version, confidence, eval status, and user impact.
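The portfolio idea can be sketched as a short pipeline in which each stage is a separate, replaceable component and every step leaves a telemetry record. Everything here is a stand-in: the keyword router, the regex extractor, the version labels, and the validation bounds are hypothetical simplifications, and the frontier-model step for ambiguous inputs is left out, so ambiguous documents go straight to human review.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Telemetry:
    events: List[Dict[str, str]] = field(default_factory=list)

    def record(self, step: str, model_version: str, outcome: str) -> None:
        self.events.append(
            {"step": step, "model_version": model_version, "outcome": outcome}
        )


def route(document: str) -> str:
    """Stand-in for a small classifier model."""
    return "invoice" if "invoice" in document.lower() else "ambiguous"


def extract_total(document: str) -> Optional[float]:
    """Stand-in for a fine-tuned extractor: reads 'Total: <number>'."""
    match = re.search(r"total:\s*(\d+(?:\.\d+)?)", document, re.IGNORECASE)
    return float(match.group(1)) if match else None


def is_valid_total(total: Optional[float]) -> bool:
    """Deterministic check; the upper bound is an arbitrary example limit."""
    return total is not None and 0 < total <= 1_000_000


def process(document: str, telemetry: Telemetry) -> Dict[str, object]:
    label = route(document)
    telemetry.record("route", "router-v1", label)
    if label != "invoice":
        # A frontier model could try to resolve ambiguity here; this sketch escalates.
        telemetry.record("escalate", "none", "human_review")
        return {"status": "human_review", "total": None}

    total = extract_total(document)
    telemetry.record("extract", "extractor-v1",
                     "found" if total is not None else "missing")
    if not is_valid_total(total):
        telemetry.record("escalate", "none", "human_review")
        return {"status": "human_review", "total": total}

    telemetry.record("validate", "rules-v1", "ok")
    return {"status": "done", "total": total}
```

The design choice worth noticing is that the model calls sit behind plain functions with typed inputs and outputs, so any one of them could be swapped for a different model without touching the validation or escalation logic.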
This is less glamorous than the all-knowing agent demo. It is also much closer to how serious software gets adopted. Enterprises do not want magic in their core workflows. They want leverage they can understand, constrain, observe, and improve.
The same logic is driving the headless-agent shift. Agents do not need prettier dashboards. They need durable tool contracts, permissions, logs, and safe ways to act. Deep models fit naturally into that world because they are not trying to be charming generalists. They are trying to execute one slice of work with unusual consistency.
What changes for builders
If deep models are the enterprise direction, the builder’s job changes. The hard part is no longer picking the cleverest prompt. The hard part is designing the capability loop: what data goes in, what shape comes out, how the output is checked, when the system escalates, how model use is logged, how costs are allocated, how behavior degrades when a cheaper model is selected, and how the company knows whether the workflow is getting better.
That is a software-engineering problem, an operations problem, and a product problem at the same time. It is also why the best enterprise AI teams will look less like prompt teams and more like small infrastructure teams with a strong eval habit.
The end state
General-purpose models made this era possible. They proved that one model could reason across domains well enough to change how we build software. But the next stage of enterprise value will come from depth: models and systems that know one company’s work better than any rented generalist can.
That is the future I expect: frontier models at the edge of capability, open-source models inside the enterprise, fine-tuned specialists in the workflow, evals as the control plane, and telemetry underneath everything. The companies that get there first will not just automate more tasks. They will own the operating intelligence that makes those tasks uniquely theirs.
