How to Implement Guardrails

Guardrails are safety mechanisms that validate, filter, and constrain LLM inputs and outputs. They prevent prompt injection, block harmful content, redact sensitive data, and enforce structured output formats. This guide covers implementation patterns, tools, and production deployment.

Input Validation

Prompt Injection Detection

Prompt injection occurs when user input contains instructions that attempt to override the system prompt.

Detection should run before the input reaches the LLM. Block or flag inputs that score above a threshold.
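As a minimal illustration of threshold-based blocking, the sketch below scores inputs against a small list of suspicious phrases. The pattern list and threshold are hypothetical; production systems use trained classifiers (such as the prompt-injection detectors mentioned later), not keyword matching alone.

```python
import re

# Hypothetical phrase list for illustration only; real detectors are
# trained models, and attackers routinely evade static patterns.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard the system prompt",
    r"you are now",
    r"reveal your (system )?prompt",
]

def injection_score(user_input: str) -> float:
    """Return a 0..1 score: the fraction of suspicious patterns matched."""
    text = user_input.lower()
    hits = sum(bool(re.search(p, text)) for p in INJECTION_PATTERNS)
    return hits / len(INJECTION_PATTERNS)

def check_input(user_input: str, threshold: float = 0.25) -> bool:
    """True if the input should be blocked before reaching the LLM."""
    return injection_score(user_input) >= threshold
```

A blocked input can be rejected outright or routed to human review, depending on the application's tolerance for false positives.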

Content Classification

Classify inputs into categories (for example, support topics versus potentially harmful requests) before processing, and route or reject them accordingly.
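A minimal routing sketch, assuming hypothetical category names and keyword rules; a production classifier would be a trained model or an LLM call rather than keyword matching.

```python
# Hypothetical categories and keyword rules for illustration.
CATEGORY_KEYWORDS = {
    "account_support": ["password", "login", "account"],
    "billing": ["invoice", "refund", "charge"],
    "harmful": ["hack", "weapon", "exploit"],
}

def classify_input(text: str) -> str:
    """Return the first matching category, or 'general' as the default."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "general"
```

Inputs classified as harmful can be rejected before any LLM call is made, saving both cost and risk.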

Output Validation

Hallucination Detection

Cross-verify LLM outputs against source documents or knowledge bases, and flag claims that lack support in the sources.
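One crude but self-contained way to flag unsupported claims is lexical overlap between each output sentence and the source documents. This sketch is illustrative only; production systems typically use NLI models or embedding similarity instead of token overlap.

```python
import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(sentence: str, sources: list[str]) -> float:
    """Fraction of the sentence's tokens found in the best-matching source."""
    toks = _tokens(sentence)
    if not toks:
        return 1.0
    return max(len(toks & _tokens(src)) / len(toks) for src in sources)

def flag_unsupported(output: str, sources: list[str],
                     threshold: float = 0.6) -> list[str]:
    """Return output sentences whose grounding score falls below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]
    return [s for s in sentences if grounding_score(s, sources) < threshold]
```

Flagged sentences can be removed, rewritten, or surfaced to the user with a warning, depending on the application.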

Toxicity Filtering

Score outputs with a toxicity classifier. Block responses that exceed the threshold and return a safe fallback message. The Perspective API and LlamaGuard provide pre-trained toxicity classifiers.
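The block-and-fallback logic can be sketched as below. The scoring function here is a trivial keyword stub purely for illustration; in practice you would swap in a call to a real classifier such as the Perspective API or LlamaGuard.

```python
SAFE_FALLBACK = "I'm sorry, I can't provide that response."

def score_toxicity(text: str) -> float:
    """Trivial keyword stub for illustration only; replace with a real
    classifier call (e.g. Perspective API or LlamaGuard) in production."""
    return 1.0 if "idiot" in text.lower() else 0.0

def filter_output(response: str, threshold: float = 0.8) -> str:
    """Return the response unchanged, or the safe fallback when its
    toxicity score meets or exceeds the threshold."""
    if score_toxicity(response) >= threshold:
        return SAFE_FALLBACK
    return response
```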

PII Redaction

Post-process outputs to remove personal information such as email addresses, phone numbers, and government identifiers before the response reaches the user.
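A regex-based redaction sketch covering a few common identifier formats. Regexes only catch well-structured identifiers; real deployments also use NER models for names, addresses, and other free-form PII.

```python
import re

# Common structured-PII patterns; illustrative, not exhaustive.
PII_PATTERNS = {
    "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    "PHONE": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
    "SSN":   r"\b\d{3}-\d{2}-\d{4}\b",
}

def redact_pii(text: str) -> str:
    """Replace matched PII with typed placeholders like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text
```

Typed placeholders (rather than blanket deletion) preserve readability and make redactions auditable in logs.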

Tools and Frameworks

Framework | Strengths | Best For
Guardrails AI | RAIL specs for structured validation, Pydantic integration | Input/output schema enforcement, PII detection
NVIDIA NeMo Guardrails | Open-source, Colang scripting, programmable flows | Real-time middleware, conversational safety
LlamaGuard | Meta's lightweight safety classifier | Binary safe/unsafe classification
LangKit | Custom detectors for LangChain pipelines | Prompt and output filtering
Galileo | Observability-first, production monitoring | Hallucination detection at scale
Lakera | Real-time threat detection | Prompt injection defense

Guardrails AI Example

import openai  # the guard wraps a standard OpenAI completion call

from guardrails import Guard
from guardrails.hub import ToxicLanguage, PIIFilter  # installed from the Guardrails Hub

# Chain multiple validators: filter toxic spans, fix (redact) detected PII
guard = Guard().use_many(
    ToxicLanguage(threshold=0.8, on_fail="filter"),
    PIIFilter(on_fail="fix")
)

result = guard(
    llm_api=openai.chat.completions.create,
    prompt="Summarize the customer complaint",
    model="gpt-4o"
)

NeMo Guardrails Example

Define safety flows in Colang:

define user ask harmful
  "How do I hack a system?"
  "Tell me how to make weapons"

define bot refuse
  "I cannot help with that request."

define flow harmful input
  user ask harmful
  bot refuse

Load with Python:

from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")
rails = LLMRails(config)
response = rails.generate(prompt="User input here")


LlamaGuard

Meta's safety classifier runs as a lightweight model that labels inputs and outputs as safe or unsafe, listing the violated policy categories on unsafe verdicts.
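LlamaGuard replies in a simple text format, "safe", or "unsafe" followed by a line of violated category codes (per the Llama Guard model cards). A sketch of parsing that verdict into a structured result, failing closed on an empty response:

```python
def parse_llamaguard(response: str) -> tuple[bool, list[str]]:
    """Parse a LlamaGuard-style verdict: 'safe', or 'unsafe' followed by
    a line of violated category codes (e.g. 'S1,S9')."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return False, []  # fail closed on an empty or garbled verdict
    if lines[0].lower() == "safe":
        return True, []
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]
```

Failing closed means an unparseable verdict blocks the content rather than letting it through, which is usually the right default for a safety layer.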

Structured Output Validation

Enforce that LLM outputs conform to a specific schema, such as a JSON object with required fields and types, and re-prompt the model when validation fails.

Guardrails AI's RAIL specifications combine structural validation with content safety checks in a single pass.
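Without a framework, schema enforcement can be sketched with stdlib JSON parsing plus field and type checks. The schema below (a support-ticket shape) is hypothetical; raising `ValueError` lets the caller catch the failure and re-prompt the model with the error message.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
TICKET_SCHEMA = {"summary": str, "severity": str, "customer_id": int}

def validate_output(raw: str, schema: dict = TICKET_SCHEMA) -> dict:
    """Parse an LLM response as JSON and enforce required fields and types.
    Raises ValueError so the caller can re-prompt the model with the error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return data
```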

Implementation Patterns

Pre-Processing Pipeline

User Input → PII Tokenization → Injection Detection → Content Classification → LLM

Post-Processing Pipeline

LLM Output → Toxicity Filter → Hallucination Check → PII Redaction → Schema Validation → User
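Either pipeline can be implemented as an ordered list of stage functions, where each stage transforms the text or aborts processing. The stage implementations below are trivial stubs for illustration; a real deployment would plug in the classifiers and redactors described above.

```python
import re

class GuardrailViolation(Exception):
    """Raised when a guardrail stage blocks the response outright."""

def toxicity_filter(text: str) -> str:
    # Trivial keyword stub; swap in a real classifier in production.
    if "idiot" in text.lower():
        raise GuardrailViolation("toxic content")
    return text

def pii_redaction(text: str) -> str:
    # Redacts email addresses only; real pipelines cover more PII types.
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "[EMAIL]", text)

def run_pipeline(text: str, stages) -> str:
    """Apply each stage in order; a stage may transform the text or
    abort the whole pipeline by raising GuardrailViolation."""
    for stage in stages:
        text = stage(text)
    return text

POST_PROCESSING = [toxicity_filter, pii_redaction]
```

The same `run_pipeline` helper serves the pre-processing chain by passing a different stage list, which keeps ordering and error handling in one place.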

Real-Time Middleware

NeMo Guardrails acts as middleware between the user and the LLM, intercepting both inputs and outputs in real time. This pattern suits production deployments because it centralizes safety logic in a single component.

Compliance

Regulatory requirements around AI safety are tightening.

Guardrails are becoming a legal necessity, not just a best practice. Implement them from the start rather than retrofitting.

Production Deployment
