Promptfoo

Promptfoo is an open-source CLI tool and library for testing, evaluating, and red teaming LLM applications.1) With over 18,000 stars on GitHub, it is used by organizations including OpenAI and Anthropic to systematically validate prompt quality, detect regressions, and scan for security vulnerabilities like prompt injection and PII exposure.

Promptfoo brings software testing rigor to LLM development with YAML-based configuration, side-by-side model comparisons, 100+ red teaming attack plugins, and native GitHub Actions integration for CI/CD pipelines.2)

How It Works

Promptfoo uses a declarative configuration approach. You define providers (LLM endpoints), prompts (templates with variables), and tests (input/output assertions) in a YAML config file. The tool runs each prompt through each provider with all test cases, applies assertions to score outputs pass/fail, and generates comparison reports.3)
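The prompt × provider × test matrix described above can be sketched in a few lines of Python. This is illustrative only: the stand-in providers and the simple `contains` check are not promptfoo internals, just a model of the evaluation loop.

```python
from itertools import product

# Stand-in "providers": functions from a rendered prompt to an output.
# In promptfoo these would be real LLM endpoints.
def provider_upper(prompt):
    return prompt.upper()

def provider_echo(prompt):
    return prompt

providers = {"upper": provider_upper, "echo": provider_echo}

# Prompt templates with variables, as in the YAML config.
prompts = ["Summarize: {text}"]

# Each test supplies variable values plus assertions on the output.
tests = [
    {"vars": {"text": "the quick brown fox"},
     "assert": [("contains", "fox")]},
]

def check(assertion, output):
    kind, value = assertion
    if kind == "contains":
        return value.lower() in output.lower()
    raise ValueError(f"unknown assertion type: {kind}")

# Run every prompt through every provider with every test case,
# scoring each output pass/fail.
results = []
for (name, provider), template, test in product(providers.items(), prompts, tests):
    rendered = template.format(**test["vars"])
    output = provider(rendered)
    passed = all(check(a, output) for a in test["assert"])
    results.append((name, rendered, passed))

for name, rendered, passed in results:
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

With two providers, one prompt, and one test case, the loop produces two scored results, mirroring how promptfoo builds its comparison matrix.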

For red teaming, Promptfoo ships with 100+ attack plugins that probe for vulnerabilities such as prompt injection, PII exposure, and excessive agency, and it integrates directly into GitHub Actions to scan pull requests automatically.4)
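The CI/CD integration can be wired up with a plain workflow that runs the CLI on each pull request. This is a sketch: the workflow name, trigger, and secret name are illustrative (promptfoo also publishes an official GitHub Action, which wraps this pattern).

```yaml
# .github/workflows/promptfoo.yml - illustrative sketch
name: LLM eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

A failing assertion makes `promptfoo eval` exit nonzero, which fails the job and blocks the PR.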

Key Features

Installation and Usage

# Install Promptfoo
npm install -g promptfoo
# or use npx
npx promptfoo@latest init

# promptfooconfig.yaml - Example evaluation config
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022

prompts:
  - "Summarize this text in {{style}} style: {{text}}"
  - "Write a {{style}} summary of: {{text}}"

tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog"
      style: "professional"
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "Response should be professional in tone"
      - type: javascript
        value: "output.length < 200"
  - vars:
      text: "Machine learning is transforming healthcare"
      style: "casual"
    assert:
      - type: similar
        value: "ML is changing medicine"
        threshold: 0.7

# Run evaluation
npx promptfoo@latest eval

# Run red teaming scan
npx promptfoo@latest redteam run -c config-id -t target-id -o results.json

# custom_provider.py - Python provider example for custom logic
def call_api(prompt, options, context):
    # your_llm_call is a placeholder for your own model/client invocation
    config = options.get("config", {})
    response = your_llm_call(prompt, **config)
    return {
        "output": response.text,
        "tokenUsage": {
            "prompt": response.prompt_tokens,
            "completion": response.completion_tokens,
            "total": response.total_tokens,
        },
    }
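Assuming promptfoo's documented `file://` provider syntax for Python scripts, a custom provider like the one above would be referenced from the config roughly like this (the filename is the hypothetical `custom_provider.py` from the example):

```yaml
providers:
  - file://custom_provider.py
```

Promptfoo then invokes the script's `call_api` function for each test case in place of a hosted API.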

Architecture

%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|YAML Config| Config[promptfooconfig.yaml]
    Config -->|Providers| Providers{LLM Providers}
    Config -->|Prompts| Prompts[Prompt Templates]
    Config -->|Tests| Tests[Test Cases + Assertions]
    Providers -->|API Calls| OpenAI[OpenAI]
    Providers -->|API Calls| Anthropic[Anthropic]
    Providers -->|API Calls| Local[Local Models]
    Providers -->|API Calls| Custom[Custom Providers]
    EvalEngine[Evaluation Engine] -->|Run| Providers
    EvalEngine -->|Substitute| Prompts
    EvalEngine -->|Assert| Tests
    Tests -->|Scoring| Results[Results Report]
    Results -->|Web UI| WebUI[Interactive Dashboard]
    Results -->|Export| Export[CSV / JSON / HTML]
    RedTeam[Red Team Engine] -->|100+ Plugins| Attacks[Attack Scenarios]
    Attacks -->|Scan| Providers
    Attacks -->|Results| Security[Security Report]
    GHA[GitHub Actions] -->|Trigger| EvalEngine
    GHA -->|Trigger| RedTeam
    GHA -->|PR Comments| PR[Pull Request]

Red Teaming Categories

See Also

References