How Do You Evaluate AI Agents in Production?
How to evaluate AI agents in production: LLM-as-a-judge, trajectory evals, golden datasets, and the CI/CD quality gates that catch regressions before they reach users.
Evaluate AI agents in production and you immediately hit a gap that classical software testing cannot close. The demo works. The sandbox conversations are coherent. Then you go live and discover the same agent that passed every manual test calls the wrong tool on 8% of requests, hallucinates numbers under specific input patterns, and silently degrades in output quality after a prompt update — with no alert firing. Traditional unit and integration tests assume deterministic outputs: same input, same expected output. AI agents break that contract. They generate natural language, route through tool chains, and make context-dependent decisions that vary between identical runs.
The gap between a convincing demo and a reliable production agent is, in most cases, an evaluation architecture problem — not a model capability problem. This guide covers the complete production eval stack: the four evaluation layers that catch different failure modes, how to build a golden dataset from real traffic rather than team assumptions, trajectory evaluation for multi-step agents, LLM-as-a-judge implementation and its failure modes, and how to wire evals into CI/CD so regressions get caught at the pull request stage, not in your users' inboxes. For the deployment and observability infrastructure that evaluation builds on top of, see our LLMOps in enterprise production guide.
Why Evaluating AI Agents Requires a Different Approach
The standard software testing toolkit fails for AI agents for three structural reasons. First, outputs are non-deterministic: an equality assertion rejects a semantically correct response that differs from the expected string by a single word. You need a scoring function, not a string comparison. Second, agents operate through multi-step trajectories: goal intake, tool selection, parameter construction, intermediate reasoning, result synthesis. The final answer can be correct even when intermediate steps were wrong, and trajectory failures are invisible unless you inspect the path. Third, agent failure modes are diverse (wrong tool, correct tool with wrong parameters, hallucinated intermediate facts, reasoning drift mid-chain), and no single metric surfaces all of them. The main categories:
- →Output quality failures: the final response is factually wrong, incomplete, or off-tone — caught by output-level scoring metrics and LLM judge evaluation against a quality rubric
- →Tool-use failures: the agent selected the wrong tool or passed incorrect parameters — only visible in trajectory evaluation of the tool-call sequence, not the final output alone
- →Trajectory inefficiency: the agent reached the right answer via an unnecessarily long or expensive sequence of steps — caught by step-efficiency metrics, and directly correlated with inference cost
- →Regression failures: quality was acceptable before a prompt update or model version change, then degraded silently; caught before release only by CI/CD eval gates running against a fixed golden baseline
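To make the "scoring function, not a string comparison" point concrete, here is a minimal sketch. It uses token-overlap F1 as a deliberately crude stand-in for a real semantic scorer; the function names, the sample strings, and the 0.8 pass threshold are all illustrative assumptions, not values from any particular framework.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between candidate and reference, in [0, 1]."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        # Two empty strings match; one empty string against text does not.
        return float(cand == ref)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical golden answer and agent response.
reference = "Your refund was issued on 3 May."
response = "Your refund was issued on 3 May 2024."

exact_match = response == reference      # the equality assertion fails
score = token_f1(response, reference)    # the scoring function returns a grade
passed = score >= 0.8                    # assumed pass threshold
```

Token overlap is only a first approximation: a response with the wrong date would still score high here, which is exactly why output-level scoring is usually paired with an LLM judge and trajectory checks.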
The Four Evaluation Layers Every Production Agent Needs
A robust production eval pipeline layers four complementary evaluation methods, each catching failure modes the others miss. Using any single method in isolation leaves blind spots that reliably surface as production incidents.
- →Layer 1, deterministic checks (near-zero cost): programmatic assertions on output format, JSON schema validation, required field presence, length constraints, and prohibited content patterns such as PII detectors and domain keyword blocklists. These run in milliseconds and should catch every structurally malformed output before it reaches a scoring step
- →Layer 2, heuristic scoring (low cost): rule-based quality metrics such as ROUGE-L for summarization tasks, keyword presence checks, and custom domain rules for structured outputs. These are useful where ground truth is well-defined, but they cannot assess semantic nuance or novel failure modes outside the defined rules
- →Layer 3, LLM-as-a-judge (medium cost, high signal): a capable LLM receives the agent's input, output, and a scoring rubric, and returns a structured quality score with rationale. It is effective for coherence, faithfulness, helpfulness, and tone, but it requires calibration against human-annotated examples: the judge should achieve a Pearson correlation above 0.7 with domain expert verdicts before being trusted as a quality gate
- →Layer 4, human review (high cost, ground truth): a weekly sample of 50 to 100 production traces reviewed by domain experts or quality annotators. It is used to calibrate LLM judges, validate the golden dataset, and surface failure categories automated evals miss. Every human review finding should update the golden dataset
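A Layer 1 check can be as small as one function. The sketch below assumes the agent returns a JSON object with an `answer` string and a `sources` list; that contract, the length limit, the email pattern, and the blocklist terms are hypothetical placeholders you would replace with your own.

```python
import json
import re

REQUIRED_FIELDS = {"answer": str, "sources": list}   # hypothetical output contract
MAX_ANSWER_CHARS = 1200                               # hypothetical length limit
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simple PII example
BLOCKLIST = ("guaranteed returns", "legal advice")    # hypothetical domain terms


def deterministic_checks(raw_output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]

    failures = []
    # Required fields must be present and have the expected type.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            failures.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            failures.append(f"wrong type for field: {field}")

    # Content checks only make sense when the answer is a string.
    answer = payload.get("answer")
    if isinstance(answer, str):
        if len(answer) > MAX_ANSWER_CHARS:
            failures.append("answer exceeds length limit")
        if EMAIL_PATTERN.search(answer):
            failures.append("answer contains an email address")
        lowered = answer.lower()
        failures.extend(f"blocked phrase: {p}" for p in BLOCKLIST if p in lowered)
    return failures


good = deterministic_checks('{"answer": "Your order ships Monday.", "sources": ["kb-12"]}')
bad = deterministic_checks('{"answer": "Mail jane.doe@example.com for guaranteed returns."}')
```

Returning reasons rather than a bare boolean keeps the failure visible in eval reports, and anything that fails here never needs to spend judge tokens in Layer 3.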
Building a Golden Dataset from Real Production Traffic
A golden dataset is a curated set of (input, expected quality criteria) pairs against which every eval run is scored. It is the most critical asset in your evaluation infrastructure. The most common failure mode is building it entirely from synthetic examples written upfront by the team that built the agent — those examples reflect what the team imagined users would ask, not the actual distribution of production inputs, and systematically miss the edge cases that cause real failures.
The production pattern: instrument your AI layer with OpenTelemetry to capture every trace — inputs, tool calls, intermediate outputs, final responses. Run automated eval scoring on each trace. Flag low-scoring samples for human review. Annotate and add them to the golden dataset. Over four to six weeks of production traffic, this grows a dataset that is genuinely representative of real failure modes. Supplement with human-crafted adversarial examples covering known edge cases, and small quantities of synthetic expansion for underrepresented input patterns. Target 200 to 500 high-quality examples before wiring evals into CI/CD — a focused golden dataset of 200 real examples consistently outperforms 5,000 synthetic ones.
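The flag-and-promote loop described above might look roughly like this. `Trace`, `GoldenExample`, `flag_for_review`, and `promote` are hypothetical names for illustration, and the 0.6 review threshold is an assumed value; in a real pipeline the traces would come from your observability backend rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    trace_id: str
    user_input: str
    final_output: str
    eval_score: float            # automated eval score in [0, 1]


@dataclass
class GoldenExample:
    user_input: str
    quality_criteria: list[str]  # what a good answer must satisfy
    source: str                  # "production", "adversarial", or "synthetic"
    source_trace_id: Optional[str] = None


def flag_for_review(traces, threshold=0.6, limit=50):
    """Pick the lowest-scoring traces under the threshold, skipping duplicate inputs."""
    seen = set()
    flagged = []
    for trace in sorted(traces, key=lambda t: t.eval_score):
        key = trace.user_input.strip().lower()
        if trace.eval_score >= threshold or key in seen:
            continue
        seen.add(key)
        flagged.append(trace)
        if len(flagged) == limit:
            break
    return flagged


def promote(trace, criteria):
    """Turn a human-annotated trace into a golden example."""
    return GoldenExample(trace.user_input, criteria, "production", trace.trace_id)


traces = [
    Trace("t1", "Cancel my order", "Done.", 0.92),
    Trace("t2", "refund for order 1182?", "Refund of $0 issued.", 0.31),
    Trace("t3", "Refund for order 1182?", "I cannot help with that.", 0.44),
    Trace("t4", "Change shipping address", "Address updated to null.", 0.55),
]
flagged = flag_for_review(traces)
golden = [promote(flagged[0], ["States the actual refund amount"])]
```

Note that the golden example stores quality criteria rather than a single expected string, which matches the (input, expected quality criteria) shape described earlier.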
Trajectory Evaluation: Score the Path, Not Just the Answer
Trajectory evaluation scores the full sequence of decisions an agent makes (tool selection, parameter values, intermediate reasoning, retry behavior) rather than only the final response. It is the most important evaluation technique for multi-step agents and the most commonly skipped in early implementations, because it requires structured trace capture rather than just input/output logging. Production data shows that agents scored only on final output pass 20 to 40% more test cases than they do once the full trajectory is scored, because intermediate-step failures are invisible to output-only scoring.
- →Task completion rate: did the agent achieve the user's goal, as determined by a programmatic success condition or a judge evaluating the final state? This is the primary business metric — everything else supports it
- →Tool selection accuracy: for each step in the trajectory, did the agent call the correct tool from the available set? Compared against expected tool calls in golden dataset examples
- →Tool parameter correctness: were the parameters passed to each tool semantically correct — right values, right format, appropriate scope? Evaluated by parameter-level heuristics or an LLM judge on the tool-call span
- →Step efficiency: how many tool calls did the agent make relative to the minimum required to complete the task? Inefficient trajectories indicate reasoning drift and inflate inference cost directly
- →Reasoning quality: is the intermediate reasoning at each step internally consistent, factually grounded, and progressing toward the goal? Evaluated by an LLM judge on the chain-of-thought spans captured in the trace
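Three of the metrics above are cheap to compute once the trace gives you an ordered list of tool calls. The sketch below is one possible set of definitions, not a standard: it aligns actual and expected calls position by position, treats parameters as correct only on exact match, and uses the golden trajectory length as the "minimum required" step count. The tool names and parameters are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    params: dict


def tool_selection_accuracy(actual, expected):
    """Share of steps where the agent called the expected tool, aligned by position.

    Dividing by the longer sequence means extra or missing calls lower the score.
    """
    if not actual and not expected:
        return 1.0
    matches = sum(1 for a, e in zip(actual, expected) if a.tool == e.tool)
    return matches / max(len(actual), len(expected))


def parameter_correctness(actual, expected):
    """Among steps with the right tool, share whose parameters match exactly."""
    pairs = [(a, e) for a, e in zip(actual, expected) if a.tool == e.tool]
    if not pairs:
        return 0.0
    return sum(a.params == e.params for a, e in pairs) / len(pairs)


def step_efficiency(actual, expected):
    """Minimum required steps divided by steps taken, capped at 1.0."""
    if not actual:
        return 0.0
    return min(1.0, len(expected) / len(actual))


# Hypothetical golden trajectory and the trajectory the agent actually produced.
expected = [
    ToolCall("lookup_order", {"order_id": "1182"}),
    ToolCall("check_refund_policy", {"order_id": "1182"}),
    ToolCall("issue_refund", {"order_id": "1182", "amount": 40}),
]
actual = [
    ToolCall("lookup_order", {"order_id": "1182"}),
    ToolCall("check_refund_policy", {"order_id": "1182"}),
    ToolCall("issue_refund", {"order_id": "1182", "amount": 400}),  # wrong amount
    ToolCall("lookup_order", {"order_id": "1182"}),                 # redundant call
]

accuracy = tool_selection_accuracy(actual, expected)
param_score = parameter_correctness(actual, expected)
efficiency = step_efficiency(actual, expected)
```

Positional alignment is strict: one early extra call shifts every later comparison. If your agents legitimately reorder steps, a set-based or sequence-alignment comparison may fit better, and semantic parameter checks would need a heuristic or judge rather than equality.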
Trajectory evaluation requires structured tracing infrastructure. Each tool call, intermediate output, and reasoning step must be captured as a structured span in your observability backend. OpenTelemetry with a GenAI-compatible schema is the production standard. Purpose-built agent tracing tools — LangSmith, Braintrust, Arize Phoenix — build on this foundation and add eval-specific views and regression comparison. This infrastructure requirement connects directly to the orchestration layer covered in our multi-agent AI orchestration guide.
LLM-as-a-Judge: When to Use It and Where It Breaks
LLM-as-a-judge is the most scalable approach for evaluating nuanced quality properties — coherence, faithfulness to source material, helpfulness, tone — that rule-based metrics cannot assess. The judge receives the user input, the agent output, optionally retrieved context, and a structured scoring prompt, and returns a numeric score with a rationale. At the CI/CD stage, running a judge against 200 to 500 golden examples per pull request is tractable in both cost and time. The problems arise when teams skip calibration or apply the judge at production volume without a tiering strategy.
- →Calibrate before you gate: validate the judge against a human-annotated reference set on your specific task domain. A judge calibrated on customer support summarization does not automatically transfer to code generation or structured data extraction — build domain-specific rubrics and re-calibrate per task type
- →Use a different model as the judge: a model evaluating its own outputs has self-favorability bias and scores them higher than human reviewers do. Use a separate frontier model, or a dedicated evaluation model, as the judge to avoid this bias
- →Mitigate positional bias: some models score outputs that appear first in the prompt more favorably. When comparing two candidate outputs, randomize the presentation order and average the scores
- →Do not run a full LLM judge on every production request: every judged request adds at least one more model call, so at production volume the judge bill can rival or exceed the agent's own inference spend. Run cheap deterministic checks first, escalate to the LLM judge on a 1 to 5% traffic sample plus anomaly-flagged traces, and reserve full judge runs for CI/CD gates and weekly audits
- →Watch for verbosity bias: many judge models reward longer outputs regardless of quality. Write rubrics that explicitly penalize unnecessary length and reward conciseness where appropriate for your task type
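The calibration step can be checked with a few lines of arithmetic. This sketch computes the Pearson correlation between human and judge scores on the same examples and applies the 0.7 bar mentioned earlier; the score lists are invented sample data, and `judge_is_calibrated` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def judge_is_calibrated(human_scores, judge_scores, min_r=0.7):
    """True when the judge tracks human verdicts closely enough to gate on."""
    return pearson(human_scores, judge_scores) > min_r


# Invented sample: 1-5 rubric scores from a domain expert and from the judge.
human = [1, 2, 2, 3, 4, 4, 5, 5]
judge = [2, 2, 3, 3, 4, 5, 4, 5]

r = pearson(human, judge)
calibrated = judge_is_calibrated(human, judge)
```

Eight examples is far too few for a real calibration set; the point is only the shape of the check. Run it per task type, since a judge that clears the bar on one domain may not clear it on another.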
Wiring Evals into Your CI/CD Pipeline
An eval pipeline that only runs manually is a monitoring tool, not a quality gate. The production standard wires evaluations into CI/CD: any pull request that touches prompt templates, agent logic, tool definitions, or model version configuration triggers an automatic eval run against the golden dataset, and a regression below the established baseline blocks the merge.
- →Baseline your scores first: before any pipeline change, record the eval scores across all metrics against the golden dataset. Store them in version control alongside your prompt templates and agent configuration so the baseline travels with the code
- →CI step on every relevant PR: a CI job that runs your eval framework — DeepEval, Promptfoo, or a custom harness — against the full golden dataset. Fail the job if any metric drops below the defined threshold
- →Surface trace diffs in PR reviews: the eval report should show which specific golden examples changed score, with before-and-after outputs, so reviewers can judge whether a regression is a bug or an acceptable trade-off
- →Set metric-specific thresholds: a 3% drop in task completion rate is a breaking change; a small drop in step efficiency may be acceptable if output quality improved. Define thresholds per metric rather than a single aggregate score
- →Close the feedback loop from production: when a monitoring alert fires or a user reports a failure, capture the trace, annotate it, and add it to the golden dataset. This progressively makes the CI/CD gate more representative of real usage and reduces the gap between benchmark scores and observed production quality
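A gate with metric-specific thresholds reduces to a comparison between the stored baseline and the current run. In the sketch below the metric names, baseline scores, and allowed drops are all assumptions (drops are read as absolute points on a 0 to 1 scale); in practice the baseline would be loaded from a file in version control and the current scores from your eval framework's report.

```python
# Hypothetical baseline scores, normally loaded from a file stored with the prompts.
BASELINE = {
    "task_completion": 0.91,
    "tool_selection_accuracy": 0.95,
    "output_quality": 0.84,
    "step_efficiency": 0.80,
}

# Hypothetical per-metric tolerance: the largest drop that still passes.
MAX_DROP = {
    "task_completion": 0.02,
    "tool_selection_accuracy": 0.01,
    "output_quality": 0.02,
    "step_efficiency": 0.05,
}


def find_regressions(baseline, current, max_drop):
    """Return one message per metric that is missing or fell past its tolerance."""
    regressions = []
    for metric, base in baseline.items():
        if metric not in current:
            regressions.append(f"{metric}: missing from eval run")
            continue
        drop = base - current[metric]
        if drop > max_drop.get(metric, 0.0):
            regressions.append(f"{metric}: {base:.2f} -> {current[metric]:.2f}")
    return regressions


def gate(current):
    """Print regressions and return a process exit code (0 = pass, 1 = block merge)."""
    regressions = find_regressions(BASELINE, current, MAX_DROP)
    for line in regressions:
        print("REGRESSION", line)
    return 1 if regressions else 0


# Scores from a hypothetical PR that improved quality but broke task completion.
current_run = {
    "task_completion": 0.87,
    "tool_selection_accuracy": 0.95,
    "output_quality": 0.86,
    "step_efficiency": 0.76,
}
exit_code = gate(current_run)   # in a CI job, pass this to sys.exit()
```

Here the step-efficiency dip stays inside its tolerance while the task-completion drop blocks the merge, which is the trade-off a single aggregate score would hide.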
Frequently Asked Questions
What metrics should you use to evaluate AI agents?
The five production-critical metrics are: task completion rate (did the agent achieve the user's goal?), tool selection accuracy (did the agent call the right tools with the right parameters?), output quality score (factual correctness, coherence, helpfulness — evaluated by LLM judge), step efficiency (did the agent take a direct path?), and latency at p95 (is it fast enough for the use case?). Track all five simultaneously — optimizing any single metric in isolation consistently creates blind spots in the others.
What is trajectory evaluation for AI agents?
Trajectory evaluation scores the full sequence of an agent's decisions (tool calls, parameter values, intermediate reasoning, retry behavior) rather than only the final output. It requires structured traces capturing each step. Agents scored only on final output pass 20 to 40% more test cases than they do once the full trajectory is scored, which makes trajectory evaluation essential for multi-step agents where intermediate-step failures are invisible to output-only scoring.
How do you build a golden dataset for LLM evaluation?
Build from three sources: real production traces with PII removed and expert annotation (highest signal), human-crafted adversarial examples covering known failure modes and edge cases, and small quantities of synthetic variants for underrepresented input patterns. Target 200 to 500 high-quality examples. Grow it continuously by feeding annotated production failures back into the dataset — this makes the eval gate progressively more representative of real usage rather than the team's upfront assumptions.
What is LLM-as-a-judge and when should you use it?
LLM-as-a-judge uses a frontier model to score agent outputs against a structured quality rubric. Use it for evaluating nuanced properties — coherence, faithfulness to source material, helpfulness, tone — that deterministic metrics cannot capture. Validate the judge against human-annotated examples first and target Pearson correlation above 0.7 with domain expert scores before using it as a quality gate. Do not run it on every production request at volume — apply it to CI/CD gates and sampled production audits, not real-time request evaluation.
How is evaluating AI agents different from traditional software testing?
Traditional software testing uses equality assertions on deterministic outputs. AI agents produce semantically varied outputs from identical inputs, operate through multi-step tool-use trajectories, and fail in ways unit tests cannot detect — wrong tool parameters, hallucinated intermediate reasoning, silent quality drift after a model update. Agent evaluation requires semantic scoring functions, trajectory tracing, golden datasets, and LLM judges rather than the standard equality-based assertion toolkit.
How Belsoft Builds AI Evaluation Infrastructure
Most engineering teams ship their first production agent before an eval infrastructure is in place and discover the gap when quality incidents surface in production. Belsoft builds the evaluation layer as part of every AI system engagement: golden datasets seeded from real production traffic, eval pipelines wired into CI/CD so regressions block the merge, LLM judges calibrated to the specific task domain, and production monitoring with anomaly alerting on quality metrics. The result is a system where quality is measured continuously rather than assumed.
If you are planning an AI product or agent system and want evaluation built in from day one — not retrofitted after the first production incident — explore our AI & automation engineering service or book a technical session with our team. We scope the eval infrastructure alongside the AI feature itself, so you ship with a quality baseline rather than shipping and hoping.
“A production AI system without evals isn't deployed software — it's a live experiment with no control group.”
Written by Belsoft Team