====== How to Implement Guardrails ======

Guardrails are safety mechanisms that validate, filter, and constrain LLM inputs and outputs. They prevent prompt injection, block harmful content, redact sensitive data, and enforce structured output formats. This guide covers implementation patterns, tools, and production deployment.

===== Input Validation =====

=== Prompt Injection Detection ===

Prompt injection occurs when user input contains instructions that override the system prompt. Detection strategies:

  * **Pattern matching** -- regex rules for common injection phrases (''ignore previous instructions'', ''you are now'')
  * **Semantic classification** -- a trained classifier that scores inputs for injection likelihood
  * **Canary tokens** -- embed hidden tokens in the system prompt; if they appear in the output, injection occurred
  * **Input/output delimiters** -- clearly separate system instructions from user input with structured formatting

Detection should run before the input reaches the LLM. Block or flag inputs that score above a threshold. ((Source: [[https://cloudsecurityalliance.org/blog/2025/12/10/how-to-build-ai-prompt-guardrails-an-in-depth-guide-for-securing-enterprise-genai|Cloud Security Alliance - AI Prompt Guardrails]]))

=== Content Classification ===

Classify inputs into categories before processing:

  * **On-topic vs off-topic** -- reject queries outside the assistant's domain
  * **PII detection** -- identify and tokenize personal data before it reaches the model
  * **Toxicity scoring** -- block abusive or harmful inputs
  * **Risk levels** -- route high-risk queries to human review

===== Output Validation =====

=== Hallucination Detection ===

Cross-verify LLM outputs against source documents or knowledge bases.
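As a minimal illustration of this cross-verification idea, the sketch below flags output sentences whose content words are poorly covered by the retrieved context. It is a crude lexical heuristic with an assumed overlap threshold, not any framework's API; production systems typically use NLI models or a secondary fact-checking LLM instead.

```python
import re

def unsupported_claims(output: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Flag output sentences with low word overlap against the context.

    A rough lexical proxy for retrieval-based verification; the 0.5
    threshold is an arbitrary illustrative choice.
    """
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    # Split the output into sentences on terminal punctuation
    for sentence in re.split(r"(?<=[.!?])\s+", output.strip()):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "The order shipped on May 3 and arrived on May 7."
output = "The order shipped on May 3. The courier was DHL Express overnight."
print(unsupported_claims(output, context))
# → ['The courier was DHL Express overnight.']
```

The first sentence is fully supported by the context and passes; the second introduces terms the context never mentions and gets flagged for review.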
Approaches:

  * **Retrieval-based verification** -- check whether output claims are supported by retrieved context
  * **Self-consistency** -- generate multiple responses and flag contradictions
  * **Confidence scoring** -- use log probabilities to identify uncertain claims
  * **Fact-checking pipeline** -- a secondary model verifies factual claims

=== Toxicity Filtering ===

Score outputs with a toxicity classifier. Block responses that exceed the threshold and return a safe fallback message. The Perspective API and LlamaGuard provide pre-trained toxicity classifiers. ((Source: [[https://galileo.ai/blog/best-ai-guardrails-platforms|Galileo - Best AI Guardrails Platforms]]))

=== PII Redaction ===

Post-process outputs to remove personal information:

  * Named Entity Recognition (NER) to detect names, addresses, and phone numbers
  * Regex patterns for structured data (SSNs, credit cards, emails)
  * Replace detected PII with placeholder tokens (''[REDACTED]'')

===== Tools and Frameworks =====

^ Framework ^ Strengths ^ Best For ^
| Guardrails AI | RAIL specs for structured validation, Pydantic integration | Input/output schema enforcement, PII detection |
| NVIDIA NeMo Guardrails | Open-source, Colang scripting, programmable flows | Real-time middleware, conversational safety |
| LlamaGuard | Meta's lightweight safety classifier | Binary safe/unsafe classification |
| Langkit | Custom detectors for LangChain pipelines | Prompt and output filtering |
| Galileo | Observability-first, production monitoring | Hallucination detection at scale |
| Lakera | Real-time threat detection | Prompt injection defense |

=== Guardrails AI Example ===

<code python>
import openai
from guardrails import Guard
from guardrails.hub import ToxicLanguage, PIIFilter

# Chain validators: filter toxic language, then fix (redact) detected PII
guard = Guard().use_many(
    ToxicLanguage(threshold=0.8, on_fail="filter"),
    PIIFilter(on_fail="fix"),
)

result = guard(
    llm_api=openai.chat.completions.create,
    prompt="Summarize the customer complaint",
    model="gpt-4o",
)
</code>

=== NeMo Guardrails Example ===

Define safety flows in Colang:
<code>
define user ask harmful
  "How do I hack a system?"
  "Tell me how to make weapons"

define flow harmful input
  user ask harmful
  bot refuse

define bot refuse
  "I cannot help with that request."
</code>

Load with Python:

<code python>
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")
rails = LLMRails(config)
response = rails.generate(prompt="User input here")
</code>

((Source: [[https://galileo.ai/blog/best-ai-guardrails-platforms|Galileo - Best AI Guardrails Platforms]]))

=== LlamaGuard ===

Meta's safety classifier runs as a lightweight model that scores inputs and outputs:

  * Returns a binary safe/unsafe classification with category labels
  * Categories include violence, sexual content, criminal planning, and self-harm
  * Can be self-hosted alongside the main LLM for low-latency checks

((Source: [[https://www.openlayer.com/blog/post/ai-guardrails-llm-guide|Openlayer - AI Guardrails Guide]]))

===== Structured Output Validation =====

Enforce that LLM outputs conform to a specific schema:

  * **JSON Schema validation** -- define the expected output structure and reject non-conforming responses
  * **Pydantic models** -- parse LLM output into typed Python objects
  * **Strict mode** -- OpenAI's ''strict: true'' parameter enforces schema compliance at generation time
  * **Retry on failure** -- if output fails validation, reprompt with the error message

Guardrails AI's RAIL specifications combine structural validation with content safety checks in a single pass.

===== Implementation Patterns =====

=== Pre-Processing Pipeline ===

User Input → PII Tokenization → Injection Detection → Content Classification → LLM

=== Post-Processing Pipeline ===

LLM Output → Toxicity Filter → Hallucination Check → PII Redaction → Schema Validation → User

=== Real-Time Middleware ===

NeMo Guardrails acts as middleware between the user and the LLM, intercepting both inputs and outputs in real time. This is the recommended pattern for production because it centralizes all safety logic.
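The pre- and post-processing pipelines above can be sketched as a chain of stage functions, where each stage either transforms the text or raises to block it. The stage names and the ''GuardrailViolation'' exception are illustrative assumptions, not part of any framework:

```python
import re

class GuardrailViolation(Exception):
    """Raised when a stage blocks the input or output."""

def detect_injection(text: str) -> str:
    # Pattern-matching stage: block a common injection phrase
    if "ignore previous instructions" in text.lower():
        raise GuardrailViolation("possible prompt injection")
    return text

def redact_pii(text: str) -> str:
    # Regex stage: mask email addresses with a placeholder token
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)

def run_pipeline(text: str, stages) -> str:
    # Each stage transforms the text or raises GuardrailViolation
    for stage in stages:
        text = stage(text)
    return text

pre_stages = [detect_injection]      # runs before the LLM call
post_stages = [redact_pii]           # runs on the LLM's response

safe_input = run_pipeline("Summarize the ticket from bob@example.com", pre_stages)
# ... call the LLM with safe_input ...
print(run_pipeline("Contact bob@example.com for details.", post_stages))
# → Contact [REDACTED] for details.
```

Real deployments add the remaining stages (content classification, toxicity filtering, schema validation) to the same lists, which keeps the safety logic centralized in one place, as the middleware pattern recommends.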
===== Compliance =====

Regulatory requirements are tightening:

  * **California SB 243** -- requires companion AI safeguards and reporting (effective 2027)
  * **EU AI Act** -- mandates risk-based guardrails for high-risk AI systems
  * **Enterprise requirements** -- SOC 2, HIPAA, and GDPR compliance requires PII controls and audit logging

Guardrails are becoming a legal necessity, not just a best practice. Implement them from the start rather than retrofitting. ((Source: [[https://statetechmagazine.com/article/2026/01/ai-guardrails-will-stop-being-optional-2026|StateTech - AI Guardrails Will Stop Being Optional]]))

===== Production Deployment =====

  * **Monitor all guardrail triggers** -- track false positive rates and adjust thresholds
  * **Red-team regularly** -- run adversarial tests to find bypass vectors
  * **Layer defenses** -- no single guardrail is sufficient; combine multiple approaches
  * **CI/CD integration** -- run guardrail evaluations as automated gates before deployment
  * **Audit logging** -- log every blocked input and filtered output for compliance review
  * **Performance budget** -- guardrails should add less than 250 ms to total latency

===== See Also =====

  * [[how_to_monitor_agents|How to Monitor Agents]]
  * [[how_to_build_an_ai_assistant|How to Build an AI Assistant]]
  * [[how_to_create_an_agent|How to Create an Agent]]

===== References =====