====== HuggingGPT ======

**HuggingGPT** (also known as **JARVIS**) is a framework proposed by [[https://arxiv.org/abs/2303.17580|Shen et al., 2023]] that uses ChatGPT as a controller to orchestrate hundreds of specialized AI models hosted on Hugging Face to solve complex multimodal tasks. The system demonstrated that an LLM can serve as a universal task planner and coordinator, leveraging the vast ecosystem of community models rather than relying on a single monolithic multimodal system.(([[https://arxiv.org/abs/2303.17580|arXiv:2303.17580]]))

  * **Paper:** [[https://arxiv.org/abs/2303.17580|arXiv:2303.17580]] (March 2023)
  * **Authors:** Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang
  * **Institutions:** Zhejiang University, [[microsoft|Microsoft]] Research Asia
  * **[[github|GitHub]]:** [[https://github.com/microsoft/JARVIS|github.com/microsoft/JARVIS]]

===== Architecture =====

HuggingGPT operates through a four-stage pipeline:

=== 1. Task Planning ===

ChatGPT analyzes the user request, understands its intent, and decomposes it into a sequence of solvable subtasks. Each subtask is described with its task type, its dependencies on other subtasks, and its required inputs and outputs. The LLM uses structured prompting to produce a task dependency graph.

=== 2. Model Selection ===

For each planned subtask, ChatGPT selects the most appropriate AI model from [[hugging_face|Hugging Face]] based on:

  * Model function descriptions (task-type matching)
  * Download counts (popularity as a quality proxy)
  * Model performance metrics, when available

The system can access hundreds of models spanning NLP, computer vision, audio processing, and multimodal tasks.

=== 3. Task Execution ===

Selected models are invoked with their designated inputs. Tasks are executed in an order that respects the dependency graph: independent tasks can run in parallel, while dependent tasks wait for their prerequisites.
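This dependency-driven scheduling can be sketched in a few lines of Python. The plan format, task names, and the ``run_task`` stub below are hypothetical illustrations for this article, not the paper's actual schema or interfaces:

<code python>
# Illustrative sketch of dependency-aware execution (stages 1 and 3).
# Assumed plan format: each subtask lists the ids it depends on;
# tasks whose dependencies are all satisfied run in parallel.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage-1 output from the LLM planner.
task_plan = [
    {"id": 0, "task": "image-to-text", "deps": []},
    {"id": 1, "task": "object-detection", "deps": []},
    {"id": 2, "task": "text-generation", "deps": [0, 1]},
]

def run_task(task, upstream):
    """Stand-in for invoking a selected Hugging Face model (stage 3)."""
    inputs = [upstream[d] for d in task["deps"]]
    return f"{task['task']} output (from {len(inputs)} upstream results)"

def execute_plan(plan):
    """Run subtasks in dependency order; independent tasks run in parallel."""
    results, remaining = {}, {t["id"]: t for t in plan}
    with ThreadPoolExecutor() as pool:
        while remaining:
            # A task is ready once all of its dependencies have results.
            ready = [t for t in remaining.values()
                     if all(d in results for d in t["deps"])]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            futures = {t["id"]: pool.submit(run_task, t, results) for t in ready}
            for tid, fut in futures.items():
                results[tid] = fut.result()
                del remaining[tid]
    return results

print(execute_plan(task_plan))
</code>

Here tasks 0 and 1 run concurrently in the first round, and task 2 runs in the second round once both of its prerequisites have produced results.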
The execution engine handles model loading, input formatting, and output collection.

=== 4. Response Generation ===

ChatGPT receives the execution results from all subtasks and generates a coherent, human-readable summary that integrates outputs across modalities (text, images, audio) into a unified response.

===== Capabilities =====

HuggingGPT can handle tasks spanning multiple modalities:

  * **Language:** Summarization, translation, question answering, sentiment analysis
  * **Vision:** Image classification, object detection, image generation, visual QA
  * **Audio:** Speech recognition, text-to-speech, audio classification
  * **Multimodal:** Image captioning, visual grounding, cross-[[modal|modal]] generation

===== Key Contributions =====

  * **Language as universal interface:** Demonstrated that natural language can serve as the communication protocol between an LLM controller and specialized AI models
  * **Model ecosystem leverage:** Rather than building one large multimodal model, orchestrate existing specialist models
  * **[[task_decomposition|Task decomposition]]:** Showed effective automatic decomposition of complex requests into model-appropriate subtasks
  * **Toward AGI:** Presented a practical architecture for combining diverse AI capabilities under unified control

===== Code Example: HuggingFace Pipeline Task Routing =====

<code python>
from transformers import pipeline

TASK_REGISTRY = {
    "sentiment": {"task": "sentiment-analysis",
                  "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english"},
    "summarize": {"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6"},
    "translate": {"task": "translation_en_to_fr", "model": "Helsinki-NLP/opus-mt-en-fr"},
    "generate": {"task": "text-generation", "model": "gpt2"},
    "ner": {"task": "ner", "model": "dslim/bert-base-NER"},
}

def classify_intent(user_query: str) -> str:
    """Simple keyword-based intent classifier for task routing."""
    query_lower = user_query.lower()
    if any(w in query_lower for w in ["feel", "sentiment", "positive", "negative"]):
        return "sentiment"
    if any(w in query_lower for w in ["summarize", "summary", "shorten"]):
        return "summarize"
    if any(w in query_lower for w in ["translate", "french", "translation"]):
        return "translate"
    if any(w in query_lower for w in ["name", "entity", "person", "organization"]):
        return "ner"
    return "generate"

def route_and_execute(user_query: str, text_input: str) -> dict:
    """Route a user query to the appropriate HuggingFace pipeline and execute."""
    intent = classify_intent(user_query)
    config = TASK_REGISTRY[intent]
    print(f"Routing to: {intent} -> {config['model']}")
    pipe = pipeline(config["task"], model=config["model"])
    result = pipe(text_input, max_length=100, truncation=True)
    return {"task": intent, "model": config["model"], "result": result}

queries = [
    ("What's the sentiment?", "This movie was absolutely wonderful and heartwarming!"),
    ("Summarize this text", "Large language models have transformed AI by enabling "
     "natural language understanding at scale. They power chatbots, code assistants, "
     "and autonomous agents across many industries."),
    ("Find named entities", "Elon Musk founded SpaceX in Hawthorne, California."),
]

for query, text in queries:
    output = route_and_execute(query, text)
    print(f"  Result: {output['result']}\n")
</code>

===== Limitations =====

  * **Latency:** The multi-stage pipeline with external model calls introduces significant latency
  * **Context window:** Complex task plans with many models can exceed LLM context limits
  * **Model availability:** Dependent on [[hugging_face|Hugging Face]] model hosting and API availability
  * **Error cascading:** Failures in early pipeline stages propagate to downstream tasks
  * **Cost:** Multiple LLM calls for planning, plus inference costs for the specialist models

===== Influence =====

HuggingGPT influenced subsequent work on LLM-as-controller architectures:

  * Demonstrated the viability of LLMs orchestrating external AI models
  * Inspired [[tool_integration_patterns|tool integration patterns]] in which models dynamically select from registries of capabilities
  * Informed the design of [[anthropic_context_protocol|MCP]] and other protocol-based tool-access systems
  * Related to [[mrkl_systems|MRKL]] ([[https://arxiv.org/abs/2205.00445|Karpas et al., 2022]]) in its [[modular|modular]] routing approach, but extended to multimodal AI models(([[https://arxiv.org/abs/2205.00445|arXiv:2205.00445]]))

===== See Also =====

  * [[chatgpt|ChatGPT]]
  * [[hugging_face|Hugging Face]]
  * [[chatdev|ChatDev]]
  * [[ai_wrappers|AI Wrappers]]

===== References =====

===== Related Pages =====

  * [[mrkl_systems|MRKL Systems]]
  * [[toolformer|Toolformer]]
  * [[tool_integration_patterns|Tool Integration Patterns]]
  * [[multi_agent_systems|Multi-Agent Systems]]
  * [[tool_augmented_language_models|Tool-Augmented Language Models]]