Monitoring AI Systems: What to Measure Beyond Uptime

A conventional application is healthy when it's up, responding within latency targets, and not throwing errors. An AI system can satisfy all three of those conditions and still be producing outputs that are wrong, degraded, or quietly drifting from the quality it had at launch. Traditional monitoring doesn't catch this, so teams that rely on it alone tend to find out about AI quality degradation through user complaints.

The gap is predictable. Infrastructure monitoring was developed for systems whose behaviour is deterministic: the same input produces the same output, and an error is an error. AI outputs are probabilistic, and quality is a distribution rather than a binary. Monitoring an AI system requires thinking about what healthy looks like across that distribution, beyond whether the system is responding.

Building this kind of monitoring before a system goes live is substantially easier than building it after, because the evaluation infrastructure and the quality definitions have to be established while the team still has clear access to ground truth.

The monitoring gap in AI systems

Infrastructure monitoring for an AI system looks the same as for any API: latency, error rate, throughput, availability. These metrics are necessary but not sufficient. A system with 99.9% uptime and response times under 200 milliseconds is still failing if 30% of its responses contain hallucinations or if retrieval quality has degraded since the knowledge base was last updated.

The monitoring gap is widest in two scenarios. The first is silent quality degradation: the provider ships a new model version that changes behaviour slightly, or accumulated drift in the knowledge base means retrieval is returning stale or less relevant content. Neither produces errors. The system continues to respond normally and the infrastructure metrics look fine.

The second scenario is performance variation across query types. A system that handles one class of query well and another class poorly may appear healthy in aggregate while consistently failing a segment of users. Aggregate quality metrics mask this. Only segmented evaluation reveals it.
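A minimal Python sketch of the masking effect. The segment names and counts are invented for illustration, and the correct/incorrect labels are assumed to come from whatever grading process the team already uses.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Fraction of correct responses across all (segment, is_correct) pairs."""
    results = list(results)
    return sum(1 for _, ok in results if ok) / len(results)


def accuracy_by_segment(results):
    """Fraction of correct responses within each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for segment, ok in results:
        counts[segment][0] += int(ok)
        counts[segment][1] += 1
    return {seg: correct / total for seg, (correct, total) in counts.items()}


# Hypothetical graded sample: 100 "billing" queries and 10 "refunds" queries.
sample = (
    [("billing", True)] * 90 + [("billing", False)] * 10
    + [("refunds", True)] * 4 + [("refunds", False)] * 6
)

print(f"aggregate: {aggregate_accuracy(sample):.2f}")
for segment, acc in sorted(accuracy_by_segment(sample).items()):
    print(f"{segment}: {acc:.2f}")
```

In this invented sample the aggregate comes out at roughly 85% while the smaller segment sits at 40%, which is the pattern an aggregate-only dashboard would hide.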

Offline evaluation: the baseline you run before anything else

The foundation of AI quality monitoring is an offline evaluation pipeline: a fixed dataset of representative queries with verified answers, run against the system periodically and after every significant change. The results are tracked over time and compared across versions.

Building this dataset is the bulk of the work. A representative sample covers the range of topics, question types, and user populations the system will serve, including the cases that are hard to evaluate. Verified answers should be checked against the authoritative source, not copied from the output of a previous model version.

The offline evaluation pipeline catches the quality changes that infrastructure monitoring misses: model updates that change output style or accuracy, retrieval pipeline changes that affect what content reaches the generation layer, and prompt changes that improve performance on some query types while degrading it on others. Without it, these changes are invisible until users complain.
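One way such a pipeline could be sketched in Python is shown here. Everything named in it is hypothetical: `EvalCase`, `run_eval`, and `regressions` are illustrative names, the system under test is assumed to be a plain function from query to answer, and exact match after normalisation stands in for whatever grading a real system would need.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    query: str
    expected: str      # verified answer from the authoritative source
    query_type: str    # used to segment the results


def run_eval(system: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Return the pass rate per query type (exact match is an assumption)."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        ok = system(case.query).strip().lower() == case.expected.strip().lower()
        total[case.query_type] = total.get(case.query_type, 0) + 1
        passed[case.query_type] = passed.get(case.query_type, 0) + int(ok)
    return {qt: passed[qt] / total[qt] for qt in total}


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical allowance for noise
) -> Dict[str, Tuple[float, float]]:
    """Query types whose pass rate dropped by more than the tolerance."""
    return {
        qt: (baseline[qt], candidate.get(qt, 0.0))
        for qt in baseline
        if candidate.get(qt, 0.0) < baseline[qt] - tolerance
    }
```

Storing the per-type pass rates from each run gives the version-over-version comparison described above, and `regressions` flags a change that helps one query type while hurting another.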

Online evaluation: sampling production traffic

Offline evaluation covers the query types you anticipated. Online evaluation covers the ones you did not. Sampling a percentage of production queries, together with their actual outputs, and sending them for quality review provides a continuous signal on whether the system is handling real user behaviour correctly.
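A small Python sketch of one way to pick that percentage. It assumes each request carries a stable identifier (the `request_id` parameter is hypothetical) and hashes it, so the same request is always either in or out of the sample regardless of which server handles it.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing rather than calling a random number generator is a design choice, not a requirement: it makes the sample reproducible, which helps when a reviewed output needs to be traced back to its logs.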

The review can be manual, automated, or a combination. Manual review by domain experts provides the highest-quality signal but does not scale. Automated scoring using an evaluation model provides a rougher signal, but it scales to much higher sample rates. A well-designed system uses automated scoring to triage, routing uncertain outputs for manual review.
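The triage step can be sketched in Python as a pair of thresholds. This assumes the evaluation model returns a score between 0 and 1, and both threshold values are placeholders to be tuned against manually reviewed samples.

```python
def triage(score: float, low: float = 0.3, high: float = 0.8) -> str:
    """Route a sampled output based on its automated quality score.

    Scores at or above `high` are accepted, scores at or below `low` are
    recorded as failures, and everything in between goes to a human.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= high:
        return "pass"
    if score <= low:
        return "fail"
    return "manual_review"
```

Widening the gap between `low` and `high` sends more outputs to reviewers and narrows the range the automated scorer is trusted on, so the thresholds are effectively a staffing decision as much as a technical one.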

Online evaluation also reveals distribution shift: the gradual change in what users are actually asking compared to what the system was designed for. If queries cluster in a new topic area, that is a signal to update the knowledge base and run retrieval quality tests again. If a previously uncommon query type is becoming common, that is a signal to add it to the offline evaluation dataset.
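A rough Python sketch of how this shift could be quantified. It assumes each sampled query has already been assigned a topic label by some classifier or reviewer, and it uses total variation distance as one simple choice of measure; the topic names and the `min_increase` threshold are illustrative.

```python
from collections import Counter


def topic_shares(topics):
    """Convert a list of topic labels into a topic -> share-of-traffic map."""
    counts = Counter(topics)
    n = sum(counts.values())
    return {topic: count / n for topic, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share maps (0 = same, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def emerging_topics(baseline, recent, min_increase=0.05):
    """Topics whose share of traffic grew by at least `min_increase`."""
    return sorted(
        t for t in recent if recent[t] - baseline.get(t, 0.0) >= min_increase
    )


# Hypothetical labels: traffic at launch versus the most recent window.
baseline = topic_shares(["billing"] * 8 + ["login"] * 2)
recent = topic_shares(["billing"] * 5 + ["login"] * 2 + ["exports"] * 3)
print(emerging_topics(baseline, recent))
```

A rising distance says the mix has changed, and the list of emerging topics suggests which queries to add to the offline dataset.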

Cost as a monitoring dimension

AI inference cost is a function of prompt length, response length, and request volume. All three can change in production in ways that aren't visible in infrastructure metrics. A prompt template that grows as context is added, a user population that asks more verbose questions than the evaluation set assumed, or a retrieval pipeline that returns longer chunks than intended can all drive costs upward without any change in functionality.

Cost monitoring should be set up alongside quality monitoring, with alerts on per-query cost as well as total cost. A per-query cost increase that is not explained by a deliberate change is a signal that something in the pipeline has changed. It may improve quality, or it may simply be waste.
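A per-query alert can be sketched in Python as follows. This assumes token-based pricing with separate input and output rates; the prices, token counts, and the 25% alert threshold are all placeholders rather than any provider's actual figures.

```python
from statistics import mean


def query_cost(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request under assumed per-1,000-token pricing."""
    return (
        prompt_tokens / 1000 * price_in_per_1k
        + completion_tokens / 1000 * price_out_per_1k
    )


def cost_alert(baseline_costs, recent_costs, max_ratio=1.25):
    """Return (alert, ratio) comparing mean per-query cost across two windows."""
    ratio = mean(recent_costs) / mean(baseline_costs)
    return ratio > max_ratio, ratio


# Hypothetical case: the prompt template doubled from 1,000 to 2,000 tokens.
baseline = [query_cost(1000, 200, 0.01, 0.03)] * 3
recent = [query_cost(2000, 200, 0.01, 0.03)] * 3
alert, ratio = cost_alert(baseline, recent)
print(alert, round(ratio, 3))
```

Because the comparison is per query, the alert fires even when request volume falls and total spend stays flat.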

Cost optimisation decisions are meaningfully better when made with data, whether the decision is to shorten context windows, use cheaper models for certain query types, or cache responses. Teams with cost monitoring can make these decisions with a clear view of the tradeoff between cost and quality. Teams without it tend to overspend because they do not know the marginal cost of the system, or underinvest in quality because they do not know what a cheaper approach would cost them in accuracy.

Latency as a quality signal

In AI systems, latency and quality correlate in ways they do not in conventional systems. Longer responses require more model computation and take longer to complete. A sudden increase in median response time often reflects a change in the distribution of queries: more complex questions, longer retrieved contexts, or a model behaving differently under a certain class of input.

Tracking latency by query type, rather than just in aggregate, surfaces this. A retrieval latency spike on a specific document type often points to an indexing problem. A generation latency increase for a specific category of query might indicate the model is producing more verbose outputs, which is itself a signal worth investigating.

Setting latency budgets by query type rather than in aggregate is more useful for diagnosis and provides better user experience data: an acceptable response time for a quick factual query is different from an acceptable response time for a complex summary task.
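A short Python sketch of per-type budgets. The query types, budget values, and the choice of the 95th percentile are assumptions for illustration; the percentile uses the nearest-rank method so the reported value is always an observed latency.

```python
def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def over_budget(latencies_ms, budgets_ms, pct=95):
    """Return {query_type: (observed, budget)} for types exceeding their budget."""
    report = {}
    for query_type, samples in latencies_ms.items():
        observed = percentile(samples, pct)
        budget = budgets_ms[query_type]
        if observed > budget:
            report[query_type] = (observed, budget)
    return report


# Hypothetical budgets: quick factual lookups versus long summaries.
budgets = {"factual": 500, "summary": 5000}
latencies = {
    "factual": [100] * 18 + [900] * 2,
    "summary": [2000] * 20,
}
print(over_budget(latencies, budgets))
```

In this invented data the summaries are four times slower in absolute terms yet within budget, while the factual queries are the ones flagged.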

You can't improve what you can't see

AI systems that are monitored properly improve over time. Problems are caught early, before they affect many users. Quality trends are visible, so the team can make informed decisions about model updates, knowledge base maintenance, and prompt engineering.

AI systems that aren't monitored tend to drift. The team learns about quality problems from users, by which time the problem has usually been affecting users for longer than it should have. Building the monitoring infrastructure before launch is the discipline that determines which of these paths a system takes.