HuggingGPT (also known as JARVIS) is a framework proposed by Shen et al. (2023) that uses ChatGPT as a controller to orchestrate hundreds of specialized AI models hosted on Hugging Face to solve complex multimodal tasks. The system demonstrated that an LLM can serve as a universal task planner and coordinator, leveraging the vast ecosystem of community models rather than building a monolithic multimodal system.
HuggingGPT operates through a four-stage pipeline:
1. Task Planning: ChatGPT analyzes user requests, understands intent, and decomposes complex requests into a sequence of solvable subtasks. Each subtask is described with its task type, dependencies on other subtasks, and required inputs/outputs. The LLM uses structured prompting to produce a task dependency graph.
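The planner's output can be thought of as a small JSON task graph. Below is a minimal sketch of parsing and validating such a plan; the task names, the `args` field, and the `<resource-0>` placeholder convention are illustrative, not the exact schema used in the paper:

```python
import json

# Hypothetical example of the structured plan an LLM planner might emit:
# each subtask names its type, its arguments, and the subtasks it depends on.
plan_json = """
[
  {"id": 0, "task": "image-to-text", "deps": [], "args": {"image": "photo.jpg"}},
  {"id": 1, "task": "text-to-speech", "deps": [0], "args": {"text": "<resource-0>"}}
]
"""

def parse_plan(raw: str) -> list:
    """Parse the planner's JSON output and check that every dependency
    refers to a known subtask id (i.e. the graph is well-formed)."""
    tasks = json.loads(raw)
    ids = {t["id"] for t in tasks}
    for t in tasks:
        if not all(d in ids for d in t["deps"]):
            raise ValueError(f"unknown dependency in task {t['id']}")
    return tasks

tasks = parse_plan(plan_json)
print([t["task"] for t in tasks])  # ['image-to-text', 'text-to-speech']
```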
2. Model Selection: For each planned subtask, ChatGPT selects the most appropriate AI model from Hugging Face by matching the model descriptions against the subtask's task type, with candidates filtered and ranked (e.g., by download count) to fit within the context window. The system can access hundreds of models spanning NLP, computer vision, audio processing, and multimodal tasks.
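The selection step can be sketched as a simple filter-then-rank over a candidate registry. The registry below is hypothetical; in HuggingGPT the filtering and ranking are done via model metadata and in-context prompting of the LLM:

```python
# Hypothetical candidate registry: each entry lists the tasks a model
# advertises and a popularity signal (download count).
CANDIDATES = [
    {"name": "model-a", "tasks": ["image-to-text"], "downloads": 1200},
    {"name": "model-b", "tasks": ["image-to-text", "vqa"], "downloads": 98000},
    {"name": "model-c", "tasks": ["text-to-speech"], "downloads": 50000},
]

def select_model(task_type: str, candidates: list):
    """Keep models whose advertised tasks cover task_type,
    then pick the most-downloaded one; None if no model fits."""
    eligible = [m for m in candidates if task_type in m["tasks"]]
    return max(eligible, key=lambda m: m["downloads"]) if eligible else None

best = select_model("image-to-text", CANDIDATES)
print(best["name"])  # model-b
```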
3. Task Execution: Selected models are invoked with their designated inputs. Tasks are executed respecting the dependency graph: independent tasks can run in parallel, while dependent tasks wait for their prerequisites. The execution engine handles model loading, input formatting, and output collection.
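Dependency-respecting execution can be sketched as a wave scheduler: each round runs every subtask whose prerequisites are complete, in parallel. `run_fn` here is a hypothetical stand-in for invoking the selected model:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_plan(tasks: list, run_fn) -> dict:
    """Execute subtasks in dependency order.
    Subtasks in the same 'wave' run in parallel; run_fn(task, done)
    receives the results of already-completed subtasks."""
    done = {}
    pending = {t["id"]: t for t in tasks}
    while pending:
        # A subtask is ready once all of its dependencies have results.
        ready = [t for t in pending.values() if all(d in done for d in t["deps"])]
        if not ready:
            raise ValueError("cyclic dependency in task graph")
        with ThreadPoolExecutor() as pool:
            futures = {t["id"]: pool.submit(run_fn, t, done) for t in ready}
            for tid, fut in futures.items():
                done[tid] = fut.result()
                del pending[tid]
    return done

# Usage with a dummy run_fn: tasks 0 and 2 run in the first wave, task 1 waits.
demo = [{"id": 0, "deps": []}, {"id": 1, "deps": [0]}, {"id": 2, "deps": []}]
results = execute_plan(demo, lambda t, done: t["id"] * 10)
print(results)  # {0: 0, 2: 20, 1: 10}
```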
4. Response Generation: ChatGPT receives execution results from all subtasks and generates a coherent, human-readable summary that integrates outputs across modalities (text, images, audio) into a unified response.
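The final stage largely reduces to prompt assembly: collected subtask results are serialized into a prompt for the LLM to summarize. A minimal sketch, with hypothetical formatting (the paper's actual prompt template differs):

```python
def build_summary_prompt(user_request: str, results: dict) -> str:
    """Assemble subtask results into a response-generation prompt.
    results maps subtask id -> that subtask's output (or a file reference)."""
    lines = [f"User request: {user_request}", "Subtask results:"]
    for tid, output in sorted(results.items()):
        lines.append(f"- task {tid}: {output}")
    lines.append("Write a single coherent answer integrating these results.")
    return "\n".join(lines)

prompt = build_summary_prompt(
    "Describe the photo and read it aloud",
    {0: "a dog running on grass", 1: "audio.wav"},
)
print(prompt)
```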
HuggingGPT can handle tasks spanning multiple modalities, including text, images, audio, and video.
A simplified, single-step version of the routing idea, using local Hugging Face pipelines:

```python
from transformers import pipeline

# Registry mapping coarse intents to Hugging Face pipeline tasks and models.
TASK_REGISTRY = {
    "sentiment": {"task": "sentiment-analysis",
                  "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english"},
    "summarize": {"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6"},
    "translate": {"task": "translation_en_to_fr", "model": "Helsinki-NLP/opus-mt-en-fr"},
    "generate": {"task": "text-generation", "model": "gpt2"},
    "ner": {"task": "ner", "model": "dslim/bert-base-NER"},
}

def classify_intent(user_query: str) -> str:
    """Simple keyword-based intent classifier for task routing."""
    query_lower = user_query.lower()
    if any(w in query_lower for w in ["feel", "sentiment", "positive", "negative"]):
        return "sentiment"
    if any(w in query_lower for w in ["summarize", "summary", "shorten"]):
        return "summarize"
    if any(w in query_lower for w in ["translate", "french", "translation"]):
        return "translate"
    if any(w in query_lower for w in ["name", "entity", "person", "organization"]):
        return "ner"
    return "generate"

def route_and_execute(user_query: str, text_input: str) -> dict:
    """Route a user query to the appropriate Hugging Face pipeline and execute."""
    intent = classify_intent(user_query)
    config = TASK_REGISTRY[intent]
    print(f"Routing to: {intent} -> {config['model']}")
    pipe = pipeline(config["task"], model=config["model"])
    # Length/truncation arguments only apply to text-generating pipelines.
    if intent in ("summarize", "translate", "generate"):
        result = pipe(text_input, max_length=100, truncation=True)
    else:
        result = pipe(text_input)
    return {"task": intent, "model": config["model"], "result": result}

queries = [
    ("What's the sentiment?", "This movie was absolutely wonderful and heartwarming!"),
    ("Summarize this text",
     "Large language models have transformed AI by enabling "
     "natural language understanding at scale. They power chatbots, code assistants, "
     "and autonomous agents across many industries."),
    ("Find named entities", "Elon Musk founded SpaceX in Hawthorne, California."),
]

for query, text in queries:
    output = route_and_execute(query, text)
    print(f"  Result: {output['result']}\n")
```
HuggingGPT influenced subsequent work on LLM-as-controller architectures.