Mixture of Agents (MoA) is a methodology for leveraging the collective strengths of multiple large language models through a layered architecture where LLMs iteratively refine each other's outputs. Proposed by Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou, MoA achieves state-of-the-art performance on major benchmarks using only open-source LLMs, surpassing GPT-4 Omni. The paper was accepted as a Spotlight paper at ICLR 2025.
The core insight behind MoA is that LLMs tend to generate better responses when provided with outputs from other models as reference, even if those reference outputs are of lower quality individually. MoA exploits this “collaborativeness” property by organizing models into layers where each layer's agents refine the outputs of the previous layer.
This approach demonstrates that orchestrating multiple smaller, open-source models can exceed the performance of the largest proprietary models, offering a practical path to high-quality LLM outputs without dependence on any single provider.
MoA organizes LLMs into *l* layers, each containing *n* agents. Agents in the first layer respond to the prompt independently; each agent in every subsequent layer receives the original prompt together with all *n* outputs from the previous layer as auxiliary context, and produces its own refined response.
The process repeats for multiple cycles (typically 2-3 layers), with each iteration yielding more robust outputs by addressing individual model weaknesses through diverse perspectives.
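The layer recurrence described above can be sketched as pure data flow, independent of any LLM API. The `Agent` type and the toy agents below are illustrative stand-ins, not part of the MoA paper:

```python
from typing import Callable, List

# An "agent" is any function mapping (prompt, reference_outputs) -> response.
Agent = Callable[[str, List[str]], str]

def moa_pass(prompt: str, agents: List[Agent], num_layers: int) -> List[str]:
    """Run the MoA layer recurrence: layer 1 answers independently;
    each later layer sees the previous layer's outputs as references."""
    responses = [agent(prompt, []) for agent in agents]  # layer 1
    for _ in range(num_layers - 1):
        responses = [agent(prompt, responses) for agent in agents]
    return responses

# Toy agents: each annotates the prompt with how many references it saw.
toy_agents: List[Agent] = [
    lambda p, refs, name=f"agent{i}": f"{name}({p}|{len(refs)} refs)"
    for i in range(3)
]
outputs = moa_pass("2+2?", toy_agents, num_layers=3)
```

After three layers, each final response was produced with all three previous-layer outputs in context, which is exactly the information flow the full implementation below reproduces with real API calls.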
MoA distinguishes two functional roles: proposers, which generate diverse candidate responses that serve as reference material, and aggregators, which synthesize those candidates into a single high-quality output. Many models can serve in both roles, though individual models differ in how well they aggregate versus propose.
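The aggregator role is typically realized as a synthesis prompt that presents the candidates and asks for a critical merge. A sketch of such a template follows; the wording is illustrative, not the paper's exact prompt:

```python
def build_aggregator_prompt(candidates: list[str]) -> str:
    """Build a system prompt asking an aggregator model to synthesize
    candidate responses. Wording is illustrative; the MoA paper uses
    its own aggregate-and-synthesize prompt."""
    refs = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        "You have been provided with responses from several models to the "
        "user's query. Synthesize them into a single, accurate, "
        "well-reasoned answer. Critically evaluate the responses: some may "
        "be biased or incorrect, so do not simply copy any one of them.\n\n"
        f"Responses:\n{refs}"
    )
```

Instructing the aggregator to evaluate critically, rather than vote or concatenate, is what lets a strong aggregator exceed the quality of every individual proposal.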
Adding more aggregation layers iteratively improves quality. On MATH reasoning tasks, accuracy rises from 0.428 at Layer 1 (with Qwen1.5-72B-Chat as aggregator) to 0.552 by Layer 3.
MoA achieves leading results across major LLM evaluation benchmarks:
| Benchmark | MoA Score | GPT-4 Omni | Margin |
|---|---|---|---|
| AlpacaEval 2.0 | 65.1% | 57.5% | +7.6 pts |
| MT-Bench | 9.25 avg | 9.19 avg | +0.06 |
| Arena-Hard | SOTA | — | — |
| FLASK | Outperforms on correctness and factuality | — | — |
Notably, the MoA configuration achieving 65.1% on AlpacaEval 2.0 uses only open-source LLMs, demonstrating that collective inference with smaller models can surpass the best proprietary models.
A simplified MoA implementation with proposers and an aggregator:
```python
from openai import OpenAI

PROPOSER_MODELS = ["mistral-7b", "llama-3-70b", "qwen-72b"]
AGGREGATOR_MODEL = "qwen-72b"

def moa_inference(prompt: str, num_layers: int = 3) -> str:
    client = OpenAI(base_url="https://api.together.xyz/v1")

    # Layer 1: independent proposals
    proposals = []
    for model in PROPOSER_MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        proposals.append(response.choices[0].message.content)

    # Layers 2+: each proposer refines using the previous layer's outputs
    for layer in range(1, num_layers):
        context = "\n\n".join(
            f"[Model {i + 1}]: {p}" for i, p in enumerate(proposals)
        )
        refined = []
        for model in PROPOSER_MODELS:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": f"Previous responses:\n{context}"},
                    {"role": "user", "content": f"Synthesize and improve: {prompt}"},
                ],
            )
            refined.append(response.choices[0].message.content)
        proposals = refined

    # Final aggregation: one aggregator synthesizes the last layer's outputs
    final_context = "\n\n".join(
        f"[Response {i + 1}]: {p}" for i, p in enumerate(proposals)
    )
    final = client.chat.completions.create(
        model=AGGREGATOR_MODEL,
        messages=[
            {"role": "system", "content": f"Synthesize the best answer:\n{final_context}"},
            {"role": "user", "content": prompt},
        ],
    )
    return final.choices[0].message.content
```