Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
LLM agents are reshaping personalized recommendation by replacing static retrieval heuristics with dynamic, reasoning-driven systems that browse, understand user intent, and adaptively rank items. AgentRecBench (NeurIPS 2025) provides the first comprehensive benchmark, while ARAG (Walmart, 2025) demonstrates production-scale agentic recommendation via multi-agent RAG.
AgentRecBench introduces an interactive textual simulator and modular agent framework for evaluating LLM-powered recommender systems. Accepted at NeurIPS 2025, it is the first benchmark designed specifically for agent-based recommendation.
Interactive Simulator: A textual environment with rich user/item metadata (profiles, reviews, interaction histories) that supports autonomous information retrieval. Agents navigate user-item networks to gather evidence before making recommendations.
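This evidence-gathering step can be pictured as a short walk over the user-item interaction graph. The sketch below is illustrative only (not AgentRecBench code), assuming a simple dict-based adjacency representation:

```python
# Illustrative sketch: an agent gathering evidence by walking a
# user-item interaction network before making a recommendation.
from collections import deque

def gather_evidence(graph, start_user, max_hops=2):
    """Breadth-first walk over the user-item network, collecting visited nodes."""
    seen, frontier, evidence = {start_user}, deque([(start_user, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        evidence.append(node)
        if depth < max_hops:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return evidence

# Toy graph: user_A interacted with item_1 and item_2; item_1 was
# also bought by user_B, whose history includes item_3.
graph = {
    "user_A": ["item_1", "item_2"],
    "item_1": ["user_B"],
    "user_B": ["item_3"],
}
print(gather_evidence(graph, "user_A"))  # ['user_A', 'item_1', 'item_2', 'user_B']
```

In the real simulator the nodes carry rich metadata (profiles, reviews, histories) that the agent reads at each hop; the traversal structure is the same.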
Three Evaluation Scenarios:
Modular Agent Framework with four core components:
| Component | Function |
|---|---|
| Dynamic Planning | Task decomposition and strategy selection |
| Complex Reasoning | Multi-step decision-making about user preferences |
| Tool Utilization | Environment interaction (fetching user data, item metadata) |
| Memory Management | Retaining experiences for self-improvement across sessions |
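How the four components interact can be sketched as a simple control loop. Everything below is a hypothetical illustration, not the AgentRecBench API; the function names (`plan`, `call_tool`, `reason`) and the dict-based environment are assumptions:

```python
# Hypothetical sketch of the four-component agent loop from the table above.
class Memory:
    """Memory Management: retain experiences for reuse across sessions."""
    def __init__(self):
        self.episodes = []
    def store(self, episode):
        self.episodes.append(episode)
    def recall(self, task):
        return [e for e in self.episodes if e["task"] == task]

def plan(task):
    """Dynamic Planning: decompose the task into retrieval sub-steps."""
    return ["fetch_user_profile", "fetch_item_metadata", "rank_candidates"]

def call_tool(step, env):
    """Tool Utilization: interact with the environment to fetch data."""
    return env.get(step)

def reason(evidence, memory_hits):
    """Complex Reasoning: combine fresh evidence with past experience."""
    return {"evidence": evidence, "prior_episodes": len(memory_hits)}

def run_agent(task, env, memory):
    steps = plan(task)
    evidence = [call_tool(s, env) for s in steps]
    decision = reason(evidence, memory.recall(task))
    memory.store({"task": task, "decision": decision})
    return decision

env = {"fetch_user_profile": {"likes": ["sci-fi"]},
       "fetch_item_metadata": {"genre": "sci-fi"},
       "rank_candidates": ["item_7"]}
mem = Memory()
first = run_agent("recommend", env, mem)   # prior_episodes == 0
second = run_agent("recommend", env, mem)  # prior_episodes == 1 (self-improvement)
```

The point of the sketch is the data flow: planning produces steps, tools turn steps into evidence, reasoning turns evidence plus recalled memory into a decision, and the decision is stored so later sessions start with more experience.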
The benchmark evaluates over 10 methods, including classical approaches (handcrafted features), ranking-oriented agents (RecMind), simulation-oriented agents (Agent4Rec), and conversational approaches. Metrics focus on Top-N ranking accuracy with 20 candidates per test instance.
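For concreteness, the standard Top-N metrics used in this setting (and reported by ARAG below) can be computed as follows for the common single-ground-truth case; the 20-item slate here mirrors the benchmark's candidate count:

```python
import math

def hit_at_k(ranked_ids, relevant_id, k=5):
    """Hit@k: 1.0 if the ground-truth item appears in the top-k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=5):
    """Binary-relevance NDCG@k: ideal DCG is 1/log2(2) = 1,
    so NDCG reduces to 1/log2(rank + 1) if the item is found."""
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# A 20-candidate slate with the target item ranked third:
ranking = [f"item_{i}" for i in range(20)]
print(hit_at_k(ranking, "item_2", k=5))              # 1.0
print(round(ndcg_at_k(ranking, "item_2", k=5), 4))   # 1/log2(4) = 0.5
```

Averaging these per-instance scores over the test set gives the reported Hit@5 and NDCG@5 figures.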
ARAG (Walmart Global Tech, 2025) integrates a multi-agent collaboration mechanism into the RAG pipeline for personalized recommendation. It addresses the failure of static RAG heuristics to capture nuanced user preferences in dynamic scenarios.
Four Specialized Agents (each corresponds to one transformation in the scoring function below):

| Agent | Role |
|---|---|
| User Understanding | Summarizes long-term history and session context into user intent |
| NLI (Natural Language Inference) | Evaluates semantic alignment between candidate items and the inferred intent |
| Context Summary | Condenses the NLI-filtered evidence into a compact context |
| Item Ranker | Produces the final ranked list from the summarized context |
The recommendation score for an item $i$ given user context $u$ is computed through the agent pipeline:
$$\text{score}(i | u) = f_{\text{rank}}\left( f_{\text{summary}}\left( f_{\text{NLI}}(i, f_{\text{user}}(u)) \right) \right)$$
where each $f$ represents a specialized agent's transformation of the information.
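The composition can be sketched with placeholder functions. Each `f_*` below is a stand-in for an LLM-backed agent (the real agents operate on free text, not dicts), so the bodies are purely illustrative:

```python
# Illustrative composition of the four agent transformations in the
# scoring equation; each f_* is a toy stand-in, not real ARAG code.

def f_user(u):
    """User Understanding: distill session context into an intent."""
    return {"intent": u["session"][-1]}

def f_nli(item, profile):
    """NLI agent: alignment between item description and user intent."""
    return 1.0 if profile["intent"] in item["description"] else 0.0

def f_summary(nli_score):
    """Context Summary agent: condense alignment evidence (identity here)."""
    return nli_score

def f_rank(summary):
    """Item Ranker: turn summarized evidence into a final score."""
    return summary

def score(item, u):
    # Mirrors score(i | u) = f_rank(f_summary(f_nli(i, f_user(u))))
    return f_rank(f_summary(f_nli(item, f_user(u))))

u = {"session": ["wireless earbuds"]}
item = {"description": "noise-cancelling wireless earbuds"}
print(score(item, u))  # 1.0
```

The toy version makes the nesting order explicit: user understanding runs first, its output conditions the NLI check, and the summary and ranking stages consume the result.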
```python
class ARAGPipeline:
    def __init__(self, retriever, user_agent, nli_agent, summary_agent, ranker_agent):
        self.retriever = retriever
        self.user_agent = user_agent
        self.nli_agent = nli_agent
        self.summary_agent = summary_agent
        self.ranker_agent = ranker_agent

    def recommend(self, user_id, session_context, k=5):
        # User Understanding: fuse long-term history with the current session
        user_profile = self.user_agent.summarize_preferences(
            long_term=self.retriever.get_user_history(user_id),
            session=session_context,
        )
        candidates = self.retriever.retrieve_candidates(user_profile)

        # NLI agent: score each candidate's alignment with the inferred intent
        nli_scores = []
        for item in candidates:
            alignment = self.nli_agent.evaluate_alignment(
                item_description=item.metadata,
                user_intent=user_profile.intent,
            )
            nli_scores.append((item, alignment))

        # Context Summary + Item Ranker: condense evidence, then rank top-k
        context_summary = self.summary_agent.summarize(nli_scores)
        ranked_items = self.ranker_agent.rank(
            candidates=nli_scores,
            context=context_summary,
            top_k=k,
        )
        return ranked_items
```
ARAG achieves significant improvements over standard RAG and recency-based baselines across three datasets:
| Metric | Improvement |
|---|---|
| NDCG@5 | up to 42.1% |
| Hit@5 | up to 35.5% |
Ablation studies confirm that each agent contributes meaningfully, with the NLI agent and User Understanding agent showing the largest individual contributions.