Back to Blog

Evals and Monitoring for LLM Features in Production: A Practical Setup

How to build an eval harness and production monitoring for LLM features that actually catches regressions. The test types, the metrics, and the feedback loops that keep AI systems honest over time.

Evals and Monitoring for LLM Features in Production: A Practical Setup
Kai Token
Kai Token
31 Mar 2026 · 8 min read

The pattern that burns most teams

A team ships an LLM feature. It works well in testing. It launches. Users are happy for a month.

Then quality quietly starts to slip. The model provider ships an update. A prompt gets edited in a deploy. A new document source gets added to the RAG corpus. No single change is dramatic, but the composite effect is that outputs are worse than they were.

The team does not notice because there is no system for noticing. Users complain for a while, then stop complaining and start using the feature less. The metrics dashboard shows usage declining and nobody knows why.

This is avoidable. The teams that avoid it have two things in place: an eval harness that runs on every significant change, and production monitoring that catches drift before users do.

This post is the practical setup.

What an eval actually is

An eval is a test, but the shape is different from unit tests.

A unit test has a deterministic input and a deterministic expected output. An eval has a representative input and some definition of "good output." The definition of good can be a string match, a regex, a set of required keywords, a structured output that validates against a schema, or an LLM grading the output against a rubric.

The point is not that every eval passes. The point is that the eval pass rate gives you a signal that captures quality meaningfully.

The three eval types to start with

You do not need a dozen eval categories on day one. Three cover most of what matters.

Regression evals. A set of representative inputs and known-good outputs. This catches "did this prompt edit make things worse on the cases I already cared about." Build this first. Start with 20 to 50 cases drawn from real usage. Grow it as real production issues surface.

Capability evals. Inputs that test specific capabilities the feature should have. If your feature summarizes structured documents, you want evals for "correctly identifies the key fields," "does not fabricate information absent from the source," "handles edge cases like truncated input." Each capability gets a handful of cases.

Refusal evals. Inputs that should be refused or handled specially. Off-topic requests. Adversarial prompts. Prompt injection attempts. If your feature should refuse, make sure it does, reliably.

Three categories, maybe 100 total cases, is a good starting point. You will grow this over time as you see what production actually surfaces.

For agentic systems (tool-using AI, multi-step workflows) the layered model that is becoming standard in 2026 is worth adopting: data certification, unit tests of tool selection and state, integration tests of multi-step workflows, end-to-end simulation with fault injection, and adversarial red-team testing. Production CI/CD gates on the eval score. Most teams do not need all six layers on day one, but the progression is a useful roadmap.

LLM-as-judge, when it actually works

For tasks where the right answer is not a single string, LLM-as-judge grading is useful. Give another LLM the input, the output, and a rubric. Ask it to score on a scale or give a pass/fail.

It works, with caveats.

  • The rubric has to be specific. "Is the summary accurate" is a bad rubric. "Does the summary list the correct primary diagnosis, the correct active medications, and no fabricated information" is a good rubric.
  • Use a different model from the one being evaluated when you can. Otherwise you are testing a model against its own biases.
  • Validate the judge against human labels. Take a sample, grade them manually, and check that the judge agrees with humans at a reasonable rate.
  • Revisit the judge periodically. Model providers ship updates. Judges drift.

LLM-as-judge is not magic but it is the best tool for grading non-deterministic outputs at scale.

When to run evals

A few natural points.

Every prompt change. Prompt engineering is where most regressions happen. Every edit to a production prompt runs the eval suite before merging. Treat this like a test gate.

Every model version change. When you switch from Claude Sonnet 3.5 to Claude Sonnet 4, or from GPT-4o to GPT-4.1, run the full eval suite. The output distribution changes subtly, and some of those changes matter.

Every RAG corpus change. If you are running RAG, corpus updates can change retrieval behavior. Run at least the retrieval evals.

On a schedule. Nightly or weekly. Catches the case where nothing in your code changed but something upstream did. Vendors ship silent updates.

Before every release. Same as a test suite. Green suite is a gate for deploy.

The metrics that catch real problems in production

Evals catch regressions in known scenarios. Production monitoring catches problems in scenarios you did not anticipate.

The dashboard that covers most of what matters:

Volume and latency. Standard stuff. Request rate, P50/P95/P99 latency, error rate. If these are off, something is broken at a level that has nothing to do with AI.

Token cost. Cost per request, total cost, cost by feature. Easy to let this slide. A tiny prompt edit that adds a lot of examples can double your inference costs overnight.

Output length distribution. A sudden change in average output length is a signal. The model is now verbose. Or it is truncating. Either way, investigate.

Refusal rate. How often the system refuses or returns a fallback. A spike usually means something in the input distribution shifted. A drop usually means your refusal logic broke.

User feedback rate. Thumbs up and down on outputs, if you expose them. The absolute rate matters less than the change over time.

Structured output validation rate. If you are returning JSON or structured output, the rate at which the output successfully parses. A drop is an immediate action item.

Grounding signals. For RAG features, retrieval hit rate, chunk relevance scores, the rate at which responses cite retrieved content. A grounding signal that moves is a content or retrieval problem.

None of these alone tell you what is wrong. Together they narrow the search space fast.

The feedback loop that makes the system better

An eval suite is not static. The moat is the feedback loop between production and the suite.

A user flags a bad output. The thumbs-down signal goes into a queue. An engineer reviews the flagged outputs weekly, or a system routes them for review. The ones that represent real failure modes get turned into eval cases.

Now the next time a model update or prompt edit is considered, the eval suite includes this case. The same failure does not happen twice.

Over a year, the eval suite grows from 100 cases to 500 to 2000 cases. It becomes a regression test against every real quality problem the system has ever had. This is the compounding asset. Most teams do not build it because it is tedious. The teams that do pull ahead.

Tooling

The tooling question matters less than the discipline. A short view of the current landscape as of 2026:

  • Homegrown. A Python or TypeScript test runner that calls the model, compares outputs, and reports pass/fail. Integrates with your CI. Fine for most teams starting out.
  • Managed eval and observability platforms. LangSmith, Braintrust, Langfuse, Confident AI, Arize, Helicone, and others. Useful if you are running a lot of evals and want better tooling for dataset management, traces, diffing, LLM-as-judge scoring, and production observability. Each has a different strength (depth of metrics, tracing, cost monitoring, framework-native integration). Pick based on what you actually need, not feature counts.
  • Open source frameworks. DeepEval for pytest-native assertions, Promptfoo for prompt regression testing, RAGAS for RAG-specific metrics, Arize Phoenix for observability. Stackable with any provider.

Pick one and use it. Do not spend two months evaluating eval tools while your production system has no evals.

A threshold-setting note. Teams that run this well tend to gate CI on concrete numbers rather than vibes. Common gates: task success rate at or above a baseline, tool selection accuracy above 0.9 for agentic systems, faithfulness scores above 0.8 for grounded outputs. The exact numbers depend on your domain. The discipline of having numeric gates is what matters.

A minimum setup for a new feature

For a team shipping a new LLM feature next quarter:

Week one. Set up the eval harness. Write the first 20 regression cases from real expected usage. Hook it into CI.

Week two. Add capability evals for the top five capabilities the feature needs. Add refusal evals for the top three scenarios that should be refused.

Week three. Wire up production logging. Token cost, latency, output length, refusal rate, structured validation rate.

Week four. Launch behind a flag. Route user feedback into a review queue.

After launch. Weekly review of flagged outputs. Add the ones that represent real failure modes as eval cases. Ship prompt and model updates only after full eval suite passes.

A month of work to set this up. A year of reliable quality in production. Far cheaper than shipping something that silently degrades.

The thing nobody tells you

Evals are boring to build. Nobody wants to write rubrics. Nobody wants to review flagged outputs. It is the least glamorous part of shipping AI features.

It is also the reason some AI features keep getting better in production and others silently rot. The teams that treat evals as a first class investment end up with systems their users trust. The teams that treat evals as something to get to later end up rewriting features because they cannot figure out why quality dropped.

Build the harness first. Everything else is easier when you have it.


Kai Token leads AI engineering at Fraktional. Works on eval and observability harnesses for production AI features across regulated industries. Believes evals are the most underrated asset in a production AI system.

Related Articles

From seamless integrations to productivity wins and fresh feature drops—these stories show how Pulse empowers teams to save time, collaborate better, and stay ahead in fast-paced work environments.