Guardrails are safety mechanisms that validate, filter, and constrain LLM inputs and outputs. They prevent prompt injection, block harmful content, redact sensitive data, and enforce structured output formats. This guide covers implementation patterns, tools, and production deployment.
Prompt injection occurs when user input contains instructions that override the system prompt. A common detection strategy is pattern matching against known injection phrases ("ignore previous instructions", "you are now"). Detection should run before the input reaches the LLM; block or flag inputs that score above a threshold.
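A minimal pattern-matching detector might look like this (the phrase list and threshold are illustrative, not exhaustive):

```python
import re

# Illustrative phrases; production systems combine patterns with ML classifiers.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"disregard (the )?system prompt",
]

def injection_score(text: str) -> float:
    """Fraction of known injection patterns found in the input."""
    hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in INJECTION_PATTERNS)
    return hits / len(INJECTION_PATTERNS)

def is_suspicious(text: str, threshold: float = 0.3) -> bool:
    return injection_score(text) >= threshold
```

Regex matching is cheap enough to run on every request, but it only catches verbatim attacks; pair it with a classifier for paraphrased injections.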
Classify inputs into categories before processing:
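A minimal keyword-based router sketches the idea; the category names and keyword sets here are hypothetical, and production systems use a trained classifier or an LLM call:

```python
# Hypothetical category keywords; real systems use a trained classifier.
CATEGORIES = {
    "account_help": {"password", "login", "account"},
    "billing": {"invoice", "refund", "charge"},
    "harmful": {"exploit", "weapon", "hack"},
}

def classify_input(text: str) -> str:
    """Route an input to the first category whose keywords it mentions."""
    words = set(text.lower().split())
    for category, keywords in CATEGORIES.items():
        if words & keywords:
            return category
    return "general"
```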
Cross-verify LLM outputs against source documents or knowledge bases. Approaches:
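As a toy illustration of the overlap-based approach, a grounding score can measure how many content words of the output also appear in the source (real systems use NLI models or embedding similarity; the stopword list here is illustrative):

```python
def grounding_score(output: str, source: str) -> float:
    """Fraction of content words in the output that also appear in the source."""
    stopwords = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in"}
    out_words = {w for w in output.lower().split() if w not in stopwords}
    src_words = set(source.lower().split())
    if not out_words:
        return 1.0  # nothing to verify
    return len(out_words & src_words) / len(out_words)
```

Outputs scoring below a chosen threshold can be flagged for review or regenerated with a stricter prompt.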
Score outputs with a toxicity classifier. Block responses that exceed the threshold and return a safe fallback message. The Perspective API and LlamaGuard provide pre-trained toxicity classifiers.
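A sketch of the threshold-and-fallback logic, with a placeholder keyword-based scorer standing in for a real classifier such as Perspective or LlamaGuard:

```python
FALLBACK = "I'm sorry, I can't share that response."

def toxicity_score(text: str) -> float:
    # Placeholder scorer: substitute a real classifier in production.
    flagged = {"idiot", "hate"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def filter_output(text: str, threshold: float = 0.1) -> str:
    """Return the text unchanged, or the fallback if it scores too toxic."""
    return FALLBACK if toxicity_score(text) >= threshold else text
```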
Post-process outputs to remove personal information:
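A regex-based sketch for two common PII types (email addresses and US-style phone numbers); production systems use NER-based detectors such as Microsoft Presidio for broader coverage:

```python
import re

# Illustrative patterns only; real detectors cover many more entity types.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[REDACTED_PHONE]"),
]

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```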
Replace detected entities with placeholders such as `[REDACTED]`.

| Framework | Strengths | Best For |
|---|---|---|
| Guardrails AI | RAIL specs for structured validation, Pydantic integration | Input/output schema enforcement, PII detection |
| NVIDIA NeMo Guardrails | Open-source, Colang scripting, programmable flows | Real-time middleware, conversational safety |
| LlamaGuard | Meta's lightweight safety classifier | Binary safe/unsafe classification |
| Langkit | Custom detectors for LangChain pipelines | Prompt and output filtering |
| Galileo | Observability-first, production monitoring | Hallucination detection at scale |
| Lakera | Real-time threat detection | Prompt injection defense |
```python
import openai
from guardrails import Guard
from guardrails.hub import ToxicLanguage, PIIFilter

# Chain multiple validators: filter toxic language, redact detected PII.
guard = Guard().use_many(
    ToxicLanguage(threshold=0.8, on_fail="filter"),
    PIIFilter(on_fail="fix"),
)

# Wrap the LLM call so the output is validated before being returned.
result = guard(
    llm_api=openai.chat.completions.create,
    prompt="Summarize the customer complaint",
    model="gpt-4o",
)
```
Define safety flows in Colang:
```colang
define user ask harmful
  "How do I hack a system?"
  "Tell me how to make weapons"

define flow harmful input
  user ask harmful
  bot refuse "I cannot help with that request."
```
Load with Python:
```python
from nemoguardrails import RailsConfig, LLMRails

# Load the Colang flows and model settings from the ./config directory.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)
response = rails.generate(prompt="User input here")
```
Meta's safety classifier runs as a lightweight model that scores inputs and outputs:
Enforce that LLM outputs conform to a specific schema:
The `strict: true` parameter (in OpenAI's structured outputs) enforces schema compliance at generation time. Guardrails AI's RAIL specifications combine structural validation with content safety checks in a single pass.
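Provider features aside, a post-hoc check can confirm that a raw response parses into the expected shape. A minimal stdlib sketch with a hypothetical three-field schema (Pydantic or a RAIL spec offers richer validation):

```python
import json

# Hypothetical schema: required field names mapped to expected Python types.
SCHEMA = {"summary": str, "sentiment": str, "priority": int}

def validate_output(raw: str) -> dict:
    """Parse a JSON string and check required fields and their types."""
    data = json.loads(raw)
    for field, expected_type in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for {field}")
    return data
```

On failure, typical recovery strategies are re-prompting with the validation error or falling back to a default response.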
User Input → PII Tokenization → Injection Detection → Content Classification → LLM
LLM Output → Toxicity Filter → Hallucination Check → PII Redaction → Schema Validation → User
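Conceptually, each pipeline is a list of text-to-text stages applied in order. A sketch with placeholder stages (each would wrap a real detector from the table above):

```python
# Placeholder stages; each would call a real guardrail and may raise on failure.
def tokenize_pii(text: str) -> str: return text
def detect_injection(text: str) -> str: return text
def classify_content(text: str) -> str: return text

INPUT_STAGES = [tokenize_pii, detect_injection, classify_content]

def run_pipeline(text: str, stages) -> str:
    """Apply each guardrail stage in order; any stage may transform or reject."""
    for stage in stages:
        text = stage(text)
    return text
```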
NeMo Guardrails acts as middleware between the user and the LLM, intercepting both inputs and outputs in real-time. This is the recommended pattern for production because it centralizes all safety logic.
Regulatory requirements are tightening.
Guardrails are becoming a legal necessity, not just a best practice. Implement them from the start rather than retrofitting.