====== Agent-as-a-Judge ======

**Agent-as-a-Judge** is an AI evaluation paradigm in which autonomous agents, equipped with planning, tool use, memory, and multi-agent collaboration, assess complex AI outputs more reliably than traditional LLM-as-a-Judge methods.

===== LLM-as-a-Judge vs Agent-as-a-Judge =====

The LLM-as-a-Judge approach relies on large language models for scalable, single-pass assessments of AI outputs. While effective for simple tasks, it struggles with biases (position bias, verbosity bias, self-enhancement bias), shallow reasoning, and a lack of real-world verification for complex, multi-step tasks. Agent-as-a-Judge overcomes these limitations by enabling agents to:

  * **Decompose tasks** into sub-evaluations with hierarchical reasoning
  * **Use tools** for evidence collection and verification (code execution, web search, database queries)
  * **Collaborate** via multi-agent setups that bring in diverse perspectives
  * **Maintain persistent memory** for fine-grained, context-aware evaluation

This transforms evaluation from monolithic scoring into an autonomous, hierarchical reasoning process in which agents actively investigate claims, run code, and synthesize evidence into coherent verdicts.

===== Trajectory Evaluation =====

A key strength of Agent-as-a-Judge is evaluating **trajectories**: sequences of multi-step actions or reasoning chains. Where traditional LLM-as-a-Judge provides a single coarse-grained score, Agent-as-a-Judge can:

  * Persist intermediate evaluation states across steps
  * Autonomously plan evaluation strategies across reasoning chains
  * Synthesize evidence from multiple steps into coherent assessments
  * Pinpoint specific flaws in reasoning that coarse-grained scores miss

This capability is particularly valuable for evaluating coding agents (where execution results matter), research agents (where source verification is needed), and planning agents (where step validity is critical).
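As a minimal illustration of the tool-use point above, the sketch below contrasts a single-pass heuristic judge with an agent-style judge that executes a submission to verify its claimed output. The names (''llm_judge'', ''agent_judge'') and the trivial heuristic are illustrative assumptions, not part of any surveyed system.

<code python>
def llm_judge(claim: str) -> bool:
    """Single-pass judge: a naive heuristic stand-in for an LLM scoring call."""
    return "correct" in claim.lower()

def agent_judge(code: str, claimed_output: str) -> bool:
    """Agent-style judge: run the submitted code (a tool call) and compare
    the actual result against the claimed output.

    Note: real systems must sandbox untrusted code before executing it.
    """
    namespace: dict = {}
    exec(code, namespace)                 # evidence gathering via code execution
    actual = namespace["result"]
    return str(actual) == claimed_output

# A submission claiming that sorted([3, 1, 2]) yields [1, 2, 3]
submission = "result = sorted([3, 1, 2])"
print(agent_judge(submission, "[1, 2, 3]"))  # True: claim verified by execution
</code>

The single-pass judge can only react to how a claim is phrased; the agent-style judge grounds its verdict in an observable execution result.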
===== Multi-Agent Judging Panels =====

Multi-agent collaboration mitigates individual biases and refines judgments through coordination. Key approaches include:

  * **Multi-Agent LLM Judge**: iterative prompt refinement in which multiple agents debate scores
  * **SAGEval**: meta-judge oversight, with a supervisory agent that monitors and arbitrates evaluator disagreements
  * **ChatEval**: structured dialogue among evaluator agents to reach consensus
  * **FACT-AUDIT / AGENT-X**: dynamic guidelines with fact-checking loops for domain-specific evaluation
  * **CodeVisionary**: agents that execute code to verify functional-correctness claims

These panels enable adaptive, robust evaluation across domains including code generation, creative writing, fact-checking, and multi-step reasoning.

===== Benchmarks and Evaluation =====

Several benchmarks and frameworks have been developed to assess Agent-as-a-Judge systems:

  * **CodeVisionary**: agent-based code evaluation with execution checks
  * **SAGEval**: multi-agent evaluation with meta-judge oversight
  * **ChatEval**: dialogue-based agent evaluation
  * **AGENT-X**: dynamic guideline-based evaluation
  * **FACT-AUDIT**: fact-checking with autonomous evidence gathering

Key findings indicate that Agent-as-a-Judge delivers more robust and nuanced judgments than LLM-as-a-Judge, but it introduces challenges: higher computational cost, increased latency from sequential tool use, and potential safety and privacy risks from autonomous information gathering.

The simplified pipeline below sketches the core evaluation loop (the scoring and aggregation methods are stubs standing in for LLM calls):

<code python>
# Simplified Agent-as-a-Judge evaluation pipeline
class AgentJudge:
    def __init__(self, tools, memory):
        self.tools = tools    # code executor, web search, etc.
        self.memory = memory  # persistent evaluation state

    def evaluate_trajectory(self, trajectory):
        """Evaluate a multi-step agent trajectory."""
        step_scores = []
        for step in trajectory.steps:
            # Gather evidence using tools
            evidence = self.gather_evidence(step)
            # Score the step in the context of previous steps
            score = self.score_step(step, evidence, self.memory)
            step_scores.append(score)
            self.memory.update(step, score, evidence)
        return self.synthesize_verdict(step_scores)

    def gather_evidence(self, step):
        """Use tools to verify claims in a reasoning step."""
        evidence = []
        for tool in self.tools:
            if tool.is_relevant(step):
                evidence.append(tool.execute(step.claims))
        return evidence

    def score_step(self, step, evidence, memory):
        """Score a single step (stub; a real judge would call an LLM here)."""
        return 1.0 if evidence else 0.5

    def synthesize_verdict(self, step_scores):
        """Aggregate per-step scores into a final verdict (stub: mean score)."""
        return sum(step_scores) / len(step_scores)
</code>

===== References =====

  * [[https://arxiv.org/abs/2601.05111|A Survey on Agent-as-a-Judge (arXiv:2601.05111)]]
  * [[https://scale.stanford.edu/ai/repository/when-ais-judge-ais-rise-agent-judge-evaluation-llms|When AIs Judge AIs: The Rise of Agent-as-Judge (Stanford)]]

===== See Also =====

  * [[agent_index|AI Agent Index]]
  * [[reasoning_reward_models|Reasoning Reward Models]]
  * [[swe_bench|SWE-bench]]