LLM-as-a-Judge

LLM-as-a-Judge is an evaluation paradigm introduced by Zheng et al. (2023) that uses strong large language models (such as GPT-4) to automatically evaluate the quality of LLM-generated responses. The paper introduces two complementary benchmarks – MT-Bench and Chatbot Arena – and systematically analyzes the biases, agreement rates, and practical viability of using LLMs as scalable proxies for human evaluation.

Motivation

Evaluating LLM-based chat assistants faces fundamental challenges:

  - Existing benchmarks (e.g., MMLU, HELM) focus on closed-ended tasks with reference answers and fail to capture human preferences in open-ended, multi-turn dialogue.
  - Human evaluation is the gold standard but is slow, expensive, and difficult to scale.
  - Open-ended responses have no single correct answer, so simple automatic metrics are inadequate.

LLM-as-a-Judge addresses these by leveraging strong LLMs to approximate human preferences at a fraction of the cost while maintaining over 80% agreement with human annotators.

MT-Bench

MT-Bench is a multi-turn question set designed to evaluate chat assistants across 8 diverse categories:

  1. Writing: Creative and professional writing tasks
  2. Roleplay: Character-based conversational scenarios
  3. Extraction: Information extraction from provided text
  4. STEM: Science, technology, engineering, and math questions
  5. Reasoning: Logical and analytical reasoning problems
  6. Coding: Programming tasks and code analysis
  7. Math: Mathematical computation and proof tasks
  8. Humanities: History, philosophy, and social science questions

Each question has two turns – a follow-up question tests the model's ability to maintain context and build on its initial response. This multi-turn design is critical because single-turn evaluation misses crucial aspects of conversational AI quality.
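The two-turn format can be sketched with a hypothetical record and driver. The field names, sample turns, and the `model_fn` callable are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical sketch of an MT-Bench-style record: each question carries
# two turns, and the second turn deliberately builds on the first.
mt_bench_question = {
    "question_id": 81,  # illustrative id
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip to Hawaii.",
        "Rewrite your previous response. Start every sentence with the letter A.",
    ],
}

def run_two_turn(model_fn, question):
    """Run a model through both turns, feeding turn 1's answer back as context.

    model_fn: assumed chat-style callable taking a message history and
    returning the assistant's reply string.
    """
    history = []
    answers = []
    for turn in question["turns"]:
        history.append({"role": "user", "content": turn})
        answer = model_fn(history)
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```

Because the full history is replayed on turn 2, a model that loses track of its first answer will visibly fail the follow-up.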

Chatbot Arena

Chatbot Arena is a crowdsourced battle platform where:

  - Users chat with two anonymous models side by side on the same prompt.
  - After the conversation, users vote for the better response (or a tie); model identities are revealed only after voting.
  - Votes are aggregated into Elo-style ratings to rank models.

This provides a diverse, real-world evaluation signal that complements the controlled MT-Bench setting. With over 30K conversations and human preferences publicly released, it has become a de facto standard for LLM comparison.
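The Elo-style ranking behind Arena leaderboards can be sketched as a minimal online Elo update; the K-factor and initial rating below are illustrative defaults, not the platform's exact configuration:

```python
def update_elo(rating_a, rating_b, winner, k=4):
    """One online Elo update from a single battle (winner: 'A', 'B', or 'tie')."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

def elo_from_battles(battles, init=1000.0, k=4):
    """Fold a sequence of (model_a, model_b, winner) votes into ratings."""
    ratings = {}
    for a, b, winner in battles:
        ra = ratings.get(a, init)
        rb = ratings.get(b, init)
        ratings[a], ratings[b] = update_elo(ra, rb, winner, k)
    return ratings
```

Note that online Elo is order-dependent; later work replaces it with order-independent fits, but the sketch above conveys the core idea.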

Evaluation Modes

The paper explores three LLM-as-a-Judge configurations:

  1. Pairwise comparison: the judge sees a question and two responses and picks the better one (or declares a tie)
  2. Single answer grading: the judge assigns a score to one response in isolation
  3. Reference-guided grading: the judge is additionally given a reference answer, which helps on math and reasoning questions

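A single-answer grading call, where the judge scores a lone response on a 1–10 scale, can be sketched as below; `ask_llm` and the `Rating: [[n]]` extraction pattern are assumptions for illustration:

```python
import re

def llm_judge_single(question, answer, ask_llm):
    """Single answer grading: ask a judge model for a 1-10 score.

    ask_llm: assumed callable that sends a prompt string to the judge
    model and returns its reply text.
    """
    prompt = (
        "Please act as an impartial judge and rate the quality of the "
        "response below on a scale of 1 to 10.\n\n"
        f"[Question]\n{question}\n\n[Response]\n{answer}\n\n"
        'End with your rating in the format: "Rating: [[5]]".'
    )
    reply = ask_llm(prompt)
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", reply)
    score = int(match.group(1)) if match else None  # None if no parseable rating
    return score, reply
```

Single answer grading is cheaper than pairwise comparison (one call per response rather than per pair), at the cost of less stable absolute scores.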
Bias Analysis

Three systematic biases were identified and analyzed:

Position Bias

LLM judges tend to favor responses presented in certain positions (e.g., the first response). GPT-4 shows high consistency across position swaps, but weaker models exhibit significant position bias.

<latex> \text{Consistency} = P(\text{same judgment} \mid \text{position swap}) </latex>

GPT-4 achieves the highest consistency rate among tested judges.
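The consistency rate can be computed from paired verdicts. This sketch assumes verdict labels "A", "B", and "tie", with the swapped run's labels relabeled back before comparison:

```python
def position_consistency(original_verdicts, swapped_verdicts):
    """Fraction of comparisons where the judge keeps the same winner after
    the two responses swap positions.

    swapped_verdicts are relabeled back ('A' in the swapped order means the
    response that was originally 'B', and vice versa) before matching.
    """
    assert len(original_verdicts) == len(swapped_verdicts)
    relabel = {"A": "B", "B": "A", "tie": "tie"}
    same = sum(
        v1 == relabel[v2]
        for v1, v2 in zip(original_verdicts, swapped_verdicts)
    )
    return same / len(original_verdicts)
```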

Verbosity Bias

LLM judges tend to prefer longer, more detailed responses even when the additional content does not improve quality. This can unfairly penalize concise but correct answers.

Self-Enhancement Bias

LLMs show a tendency to favor their own generated outputs – a model used as a judge may rate its own responses higher than those from competitors.

Mitigation Strategies

The paper proposes several ways to address these biases:

  - Swapping positions: run each pairwise comparison in both orders and count a win only when the judge is consistent; otherwise treat the result as a tie
  - Few-shot judge: include example judgments in the prompt to improve consistency, at the cost of longer prompts
  - Chain-of-thought and reference-guided judging: for math and reasoning questions, have the judge reason step by step or generate its own answer first to use as a reference, which substantially reduces grading errors

Agreement with Human Preferences

The central finding is that strong LLM judges achieve human-level agreement:

  - GPT-4's agreement with crowdsourced human preferences exceeds 80%, matching the level of agreement between two independent human annotators (roughly 81%)
  - Agreement is higher still when tie votes are excluded
  - The result holds on both MT-Bench and Chatbot Arena data

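Agreement itself is just the fraction of matching votes; a minimal sketch follows (the `include_ties` switch mirrors the paper's practice of reporting agreement both with and without tie votes):

```python
def agreement_rate(judge_votes, human_votes, include_ties=True):
    """Fraction of pairwise comparisons where judge and human pick the
    same verdict ('A', 'B', or 'tie').

    With include_ties=False, comparisons where either side voted 'tie'
    are dropped before computing the rate.
    """
    pairs = list(zip(judge_votes, human_votes))
    if not include_ties:
        pairs = [(j, h) for j, h in pairs if j != "tie" and h != "tie"]
    if not pairs:
        return float("nan")
    return sum(j == h for j, h in pairs) / len(pairs)
```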
Code Example

import openai
 
def llm_judge_pairwise(question, response_a, response_b, model="gpt-4"):
    """Use LLM-as-a-Judge for pairwise comparison."""
    prompt = (
        "Please act as an impartial judge and evaluate the quality "
        "of the responses provided by two AI assistants to the user "
        "question below.\n\n"
        "Avoid position bias -- evaluate based on content quality only.\n\n"
        f"[User Question]\n{question}\n\n"
        f"[Assistant A]\n{response_a}\n\n"
        f"[Assistant B]\n{response_b}\n\n"
        "[Evaluation]\n"
        "Provide reasoning, then verdict: "
        '"[[A]]" if A is better, "[[B]]" if B is better, "[[C]]" for tie.'
    )
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    judgment = response.choices[0].message.content
    if "[[A]]" in judgment:
        return "A", judgment
    elif "[[B]]" in judgment:
        return "B", judgment
    return "tie", judgment
 
def evaluate_with_position_swap(question, resp_a, resp_b):
    """Mitigate position bias by evaluating in both orders."""
    verdict_1, _ = llm_judge_pairwise(question, resp_a, resp_b)
    verdict_2, _ = llm_judge_pairwise(question, resp_b, resp_a)
    # Reconcile verdicts accounting for the swap
    v2_adjusted = "B" if verdict_2 == "A" else ("A" if verdict_2 == "B" else "tie")
    if verdict_1 == v2_adjusted:
        return verdict_1
    return "tie"  # Disagreement defaults to tie

Impact and Adoption

LLM-as-a-Judge has become a foundational methodology in LLM evaluation:

  - MT-Bench scores and Chatbot Arena ratings are widely reported when new chat models are released
  - The publicly released conversations and human preference votes have been reused to train and evaluate reward models and specialized judge models
  - Pairwise LLM judging underpins many subsequent automatic evaluation pipelines

Limitations

  - The identified biases (position, verbosity, self-enhancement) can be mitigated but not eliminated
  - LLM judges have limited capability in grading math and reasoning questions, sometimes failing to spot errors in answers to problems they could solve themselves
  - Judging with a strong model such as GPT-4 adds API cost and latency, and relying on a closed judge model limits reproducibility

References

  - Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023 Datasets and Benchmarks Track.
