Running AI in Production: What Actually Works

Most AI projects succeed as demos. Far fewer survive contact with real users, real data, and real operational pressure.

There is a wide gap between an AI demo and a production AI system. The demo impresses a room. The production system runs without supervision, on data it was not tuned for, handling edge cases nobody anticipated, under the scrutiny of users who will lose confidence the moment it gives them something wrong.

Most enterprise AI projects stall somewhere in that gap. Not because the underlying models are inadequate, but because the engineering discipline required to bridge that gap is different from what it takes to build the demo. Different skills, different priorities, different failure modes.

After building AI systems for clients in financial services, legal, and operational settings, here is what we have learned about what production AI actually requires.

Why AI pilots fail to reach production

The most common reason an AI pilot never reaches production is not model quality. It is that nobody built a reliable way to measure whether the system is producing good outputs. When the outputs are poor, nobody catches it until a user does.

A demo is typically evaluated by the people who built it, against data they selected, in a controlled setting. Production is the opposite: uncontrolled data, users who did not design the system, scenarios nobody planned for. The gap between those two evaluations is where most AI projects die.

The second common reason is data quality. Clean, curated data in a Jupyter notebook produces clean, impressive results. The actual data in an enterprise is a different problem entirely: five years of PDFs scanned at varying quality, Excel files with merged cells and inconsistent naming conventions, and database records where the same concept has been entered twelve different ways. A system that was not designed to handle messy data will produce unreliable outputs, and users will notice.

The third reason is integration. An AI capability that exists as a standalone tool, separate from the workflows where it is needed, has a very low adoption ceiling. The value of AI in an enterprise setting comes from embedding it in the places where decisions are actually made. That requires engineering the integrations as well as the model.

Why Most AI Pilots Never Reach Production →

Evaluation is not a launch task

Teams that treat evaluation as something you do before you ship almost always end up with production systems that gradually drift in quality without anyone noticing. Run the benchmark, hit a threshold, declare it ready. That is not enough.

Production AI evaluation is an ongoing operational concern. The questions you need to answer continuously are: Is the system producing outputs of acceptable quality today? Has quality changed since last week? Are there categories of query where it is reliably worse? Are users correcting or discarding the outputs?

That requires measurement infrastructure beyond benchmark scores: evaluation tied to what users actually do with the outputs. If users are correcting AI suggestions 40% of the time in a particular category, that is a quality problem. If you are not measuring correction rates, you will not know.
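
As a minimal sketch of what measuring correction rates could look like, the Python below computes a correction rate per category from logged feedback events and flags any category at or above a threshold. The `Feedback` record, the category names, the 40% default, and the `min_events` cut-off are illustrative assumptions, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    category: str    # hypothetical query category label
    corrected: bool  # True if the user edited or discarded the output


def correction_rates(events):
    """Return {category: fraction of outputs that users corrected}."""
    totals = defaultdict(int)
    corrected = defaultdict(int)
    for e in events:
        totals[e.category] += 1
        corrected[e.category] += int(e.corrected)
    return {c: corrected[c] / totals[c] for c in totals}


def flag_categories(events, threshold=0.4, min_events=20):
    """List categories whose correction rate is at or above the threshold.

    Categories with fewer than min_events events are skipped, since a
    rate computed from a handful of interactions is mostly noise.
    """
    counts = defaultdict(int)
    for e in events:
        counts[e.category] += 1
    rates = correction_rates(events)
    return sorted(
        c for c, r in rates.items()
        if counts[c] >= min_events and r >= threshold
    )
```

In practice the events would come from wherever user edits and discards are already logged. Running the same calculation per week gives a simple answer to whether quality has changed since last week.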

For language model systems specifically, this means building evaluation datasets that reflect your actual domain, not generic benchmarks. It means running evals on a sample of real production traffic alongside synthetic test cases. It means having people in the loop who understand the domain and can assess output quality in a way that an automated scorer cannot.

Monitoring AI Systems: What to Measure Beyond Uptime →

Explainability is an operational requirement

Ask a user to trust an AI system that cannot explain its outputs, and you will learn quickly how fast that goodwill runs out. In regulated sectors, explainability is a compliance requirement. But even outside of compliance, a system that produces outputs users cannot scrutinise or challenge will be worked around rather than adopted.

Explainability looks different depending on the application. For a retrieval-augmented system, it means showing which source documents informed an answer and where in those documents the relevant information came from. For a classification system, it means surfacing the features or signals that drove a decision. For a recommendation system, it means being able to articulate why a particular item or action was suggested.
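
For the retrieval case, one way to make this concrete is to carry the sources alongside the answer as part of the response type, so that an answer without citations cannot be produced by accident. The sketch below assumes retrieved chunks arrive as dictionaries with `document_id`, `page`, and `snippet` keys; those names, and the `Answer` and `Citation` types, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    document_id: str  # hypothetical identifier of the source document
    page: int         # where in the document the passage was found
    snippet: str      # the passage shown to the user


@dataclass
class Answer:
    text: str
    citations: list = field(default_factory=list)

    def render(self):
        """Format the answer followed by numbered sources a user can check."""
        lines = [self.text]
        for i, c in enumerate(self.citations, start=1):
            lines.append(f'[{i}] {c.document_id}, p.{c.page}: "{c.snippet}"')
        return "\n".join(lines)


def build_answer(text, retrieved_chunks):
    """Attach retrieved chunks as citations; decline if there are none."""
    if not retrieved_chunks:
        return Answer("No supporting source was found for this question.")
    cites = [
        Citation(c["document_id"], c["page"], c["snippet"])
        for c in retrieved_chunks
    ]
    return Answer(text, cites)
```

The design choice here is that the explanation travels with the output from the start, rather than being reconstructed afterward by a separate layer.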

Designing for explainability is substantially easier when it is built in from the start. Systems where the model is a black box that produces outputs, with the explainability layer bolted on afterward, tend to provide explanations that are technically accurate but operationally useless. The explainability needs to be part of the architecture, not an afterthought.

Designing Explainable AI: When 'Trust the Model' Isn't Good Enough →

Hallucinations: the production problem that benchmarks miss

Hallucination happens when a language model generates confident, plausible output that is factually wrong. It is among the hardest problems to address in production AI systems. Benchmark accuracy rates give a misleading sense of safety: a system can perform well on standard evaluations and still hallucinate on the specific queries your users actually send.

In production, this requires mitigations at the architectural level as well as the model level. Retrieval-augmented generation grounds the model's answers in a defined corpus rather than leaving it to draw on its full parametric knowledge. Structured outputs and schema validation reduce the surface area for confabulation. Human review workflows catch failures before they reach downstream decisions. Confidence scoring surfaces uncertain outputs for additional scrutiny rather than presenting them with equal weight to reliable outputs.
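
As an illustration of the schema validation step, the sketch below checks a model's JSON output against a small expected shape before anything downstream consumes it. The invoice fields and the currency list are invented for the example, and a real system might use a schema library instead of hand-written checks.

```python
import json

# Hypothetical schema for an invoice extraction task.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "total": (int, float),
    "currency": str,
}
ALLOWED_CURRENCIES = {"GBP", "EUR", "USD"}


def validate_output(raw):
    """Parse model output and return (record, errors).

    An empty error list means the output passed every check.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["expected a JSON object"]

    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            errors.append(f"wrong type for field: {name}")

    currency = record.get("currency")
    if isinstance(currency, str) and currency not in ALLOWED_CURRENCIES:
        errors.append(f"unknown currency: {currency}")

    # Fields the schema does not define are a common sign of invented content.
    for name in record:
        if name not in REQUIRED_FIELDS:
            errors.append(f"unexpected field: {name}")
    return record, errors
```

Validation of this kind catches malformed or out-of-vocabulary output, not a well-formed value that happens to be wrong, which is why it sits alongside review and monitoring rather than replacing them.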

No combination of these removes the problem entirely. What they do is reduce it to a manageable rate and make it visible when it occurs. The systems that fail in production are those that ship without these mitigations and discover the rate is unacceptable through user complaints rather than through instrumented monitoring.

AI Hallucinations in Production: Mitigation Strategies That Work →

Human review as a design choice

There is a tendency to treat human review of AI outputs as a temporary measure, something you do until the model is good enough to remove it. For most enterprise AI applications, this framing is wrong.

Human review is not a fallback for inadequate AI. It is a design choice that reflects where human judgement adds value and where automation does not. A well-designed review system routes confident, low-risk outputs through without review, surfaces uncertain or sensitive outputs for human attention, and captures reviewer decisions in a form that can improve future performance.
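
The routing rule described above can be sketched in a few lines. This assumes each output carries a confidence score between 0 and 1 and a flag for sensitive decisions; the `Output` type, the 0.9 threshold, and the log format are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Output:
    item_id: str
    confidence: float  # assumed to be a calibrated score in [0, 1]
    sensitive: bool    # e.g. touches a regulated or high-value decision


def route(output, auto_threshold=0.9):
    """Return "auto" or "review" for one model output.

    Sensitive outputs always go to a person, whatever the confidence.
    """
    if output.sensitive or output.confidence < auto_threshold:
        return "review"
    return "auto"


def record_review(log, output, accepted, corrected_value=None):
    """Append a reviewer decision to the log for later evaluation."""
    entry = {
        "item_id": output.item_id,
        "confidence": output.confidence,
        "accepted": accepted,
        "corrected_value": corrected_value,
    }
    log.append(entry)
    return entry
```

Starting with a high `auto_threshold` and lowering it as the logged decisions show the model holding up is one way to expand the automation envelope gradually.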

The operational benefit is significant. It allows you to launch with a tighter confidence threshold, expand the automation envelope gradually as live performance is established, and maintain meaningful human oversight of decisions that warrant it without creating a review bottleneck that eliminates the efficiency gains you were after.

It also protects against the worst failure mode: a model that degrades silently and systematically because nobody was looking. When humans are in the loop on a sample of outputs, degradation is visible before it becomes a problem.

Human in the Loop Design for AI Powered Workflows →

Security and data isolation

Most AI security discussions focus on model-level threats: adversarial inputs, prompt injection, jailbreaks. These are real concerns. But in enterprise settings, the more immediate risk is usually simpler: data from one user, tenant, or context reaching another.

In an AI system serving multiple tenants, where multiple clients or users interact with a model that has access to a shared knowledge base, the access control requirements are strict. The retrieval layer must enforce data isolation. The context window must not contain information from another user's session. Logs must not persist query content in a form that could leak across boundaries.

These requirements interact with AI architecture in ways that are not always obvious. Vector databases do not automatically enforce row-level access control. Shared embedding spaces can surface semantically similar content across tenant boundaries. Context caching, if implemented naively, can create contamination across users or tenants. Each of these is solvable, but only if the security model is part of the design from the start, not retrofitted onto a working prototype.
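
To show what enforcing isolation in the retrieval layer means, here is a toy in-memory retriever that filters by tenant before it ranks by similarity. It is a sketch under simplifying assumptions: the `Chunk` type and `retrieve` function are hypothetical, the store is a plain list, and a real vector database would apply the equivalent filter through its own query mechanism.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    tenant_id: str
    text: str
    embedding: list  # vector from whatever embedding model is in use


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store, tenant_id, query_embedding, k=3):
    """Return the top-k chunks for one tenant only.

    The tenant filter runs before similarity ranking, so a close match
    that belongs to another tenant never enters the candidate set.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required for every retrieval call")
    candidates = [c for c in store if c.tenant_id == tenant_id]
    candidates.sort(
        key=lambda c: cosine(c.embedding, query_embedding), reverse=True
    )
    return candidates[:k]
```

Making `tenant_id` a required argument, rather than an optional filter, means a caller that forgets it gets an error instead of a cross-tenant result.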

AI Security: Access Control and Data Isolation for Enterprise Systems →

Cost in production is not the same as cost in the demo

API pricing for language model calls is easy to underestimate at demo scale. A proof of concept that makes a few hundred calls a day has almost no visible cost. A production system that makes that same call for every document in a 100,000 document corpus, for every user session, on every workflow trigger, has a cost that needs to be designed around.

The practical controls are: caching responses for queries that recur frequently; choosing smaller models for tasks where large-model capability is unnecessary; batching calls where latency allows; and monitoring cost per transaction so that unexpected spikes are visible before they become budget problems. Streaming responses reduces perceived wait time, but it does not reduce actual compute costs.
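
Two of these controls, caching and cost monitoring, can be sketched together. The wrapper below caches responses by exact prompt match and tracks spend per request. The `call_model` function is a stand-in for whatever client is in use, and the prices are made-up numbers for illustration, not any provider's rates.

```python
import hashlib

# Made-up prices per 1,000 tokens for illustration only.
PRICE_PER_1K_TOKENS = {"small": 0.0005, "large": 0.01}


class CostTracker:
    """Accumulates spend so cost per request can be monitored."""

    def __init__(self):
        self.paid_calls = 0
        self.cache_hits = 0
        self.spend = 0.0

    def record(self, model, tokens):
        self.paid_calls += 1
        self.spend += tokens / 1000 * PRICE_PER_1K_TOKENS[model]

    def cost_per_request(self):
        requests = self.paid_calls + self.cache_hits
        return self.spend / requests if requests else 0.0


class CachedClient:
    """Wraps a model call with an exact-match response cache."""

    def __init__(self, call_model, tracker):
        # call_model is hypothetical: (model, prompt) -> (text, tokens_used)
        self._call_model = call_model
        self._cache = {}
        self.tracker = tracker

    def complete(self, model, prompt):
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key in self._cache:
            self.tracker.cache_hits += 1
            return self._cache[key]
        text, tokens = self._call_model(model, prompt)
        self.tracker.record(model, tokens)
        self._cache[key] = text
        return text
```

In a system serving multiple tenants, the cache key would also need to include the tenant or user scope, otherwise the cache itself becomes a route for data to cross boundaries.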

Beyond the cost per call, production AI systems often incur significant costs in the vector store, the embedding pipeline, and the reranking layers. These components do not appear in a prototype, but they are necessary at scale. Designing the cost model before the architecture is set means these components get the attention they need, rather than becoming a surprise in the first real invoice.

When AI is not the right answer

This is the question that gets skipped most often, because it tends to arrive after the technology choice has been made.

AI is not the right tool for every problem. Rule-based systems are more predictable, easier to audit, and cheaper to operate for problems where the logic is known and stable. Deterministic algorithms outperform probabilistic models when precision matters more than generalisation. Structured search beats semantic search when users know exactly what they are looking for.

The cases where AI adds genuine value are where the problem involves natural language, where the input space is too large or varied for rules to cover, where some tolerance for uncertainty is acceptable, and where the cost of human handling is higher than the cost of imperfect automation.

When those conditions are met, AI can be transformative. When they are not, the result is usually an expensive system that is difficult to maintain and that a simpler approach would have handled better. Sometimes the right answer is: not AI, or not AI for this part of the problem.

What a production AI system looks like in practice

A production AI system has evaluation running continuously. It has monitoring that distinguishes model quality issues from infrastructure issues from data quality issues. It has clear lines of human oversight for the decisions that warrant it. It has tight access control and data isolation. Its costs are instrumented and bounded. It has a plan for what happens when the model provider changes pricing, deprecates a model, or has an outage.

None of that is exotic engineering. It is the same discipline that any production system requires, applied to a component where uncertainty is a core concern.

The AI projects we have seen succeed in production are the ones where these concerns were part of the design from the first conversation, not the ones where a working demo got thrown into production and the team hoped for the best.
