HuggingGPT

HuggingGPT (also known as JARVIS) is a framework proposed by Shen et al. (2023) that uses ChatGPT as a controller to orchestrate hundreds of specialized AI models hosted on Hugging Face to solve complex multimodal tasks. The system demonstrated that an LLM can serve as a universal task planner and coordinator, leveraging the existing ecosystem of community models rather than building a monolithic multimodal system.1)

Architecture

HuggingGPT operates through a four-stage pipeline:

1. Task Planning

ChatGPT analyzes the user's request, infers intent, and decomposes it into a sequence of solvable subtasks. Each subtask is described with its task type, its dependencies on other subtasks, and its required inputs and outputs. The LLM uses structured prompting to produce a task dependency graph.
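The plan is emitted as JSON. The sketch below follows the field layout reported in the paper (task, id, dep, args, with "<resource>-id" placeholders referring to outputs of prerequisite tasks); the two-step plan itself is only illustrative.

import json

# Illustrative two-step plan: "dep": [-1] means no prerequisite, and
# "<resource>-0" refers to the output of the subtask with id 0.
plan_json = """
[
    {"task": "image-to-text", "id": 0, "dep": [-1],
     "args": {"image": "example.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0],
     "args": {"text": "<resource>-0"}}
]
"""

def build_dependency_graph(plan: list[dict]) -> dict[int, list[int]]:
    """Map each subtask id to the ids it depends on."""
    return {t["id"]: [d for d in t["dep"] if d != -1] for t in plan}

plan = json.loads(plan_json)
print(build_dependency_graph(plan))  # {0: [], 1: [0]}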

2. Model Selection

For each planned subtask, ChatGPT selects the most appropriate model from Hugging Face via in-context learning over model descriptions: candidates are filtered by the subtask's task type, ranked by download count as a proxy for quality, and the top few descriptions are placed in a prompt from which the LLM picks one.

The system can access hundreds of models spanning NLP, computer vision, audio processing, and multimodal tasks.
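A hedged sketch of that selection step follows. The ModelCard record and the shortlist and selection_prompt helpers are hypothetical stand-ins for illustration, not the repository's actual code.

from dataclasses import dataclass

@dataclass
class ModelCard:
    model_id: str
    task: str
    downloads: int
    description: str

def shortlist(candidates: list[ModelCard], task: str, k: int = 5) -> list[ModelCard]:
    """Keep only models matching the subtask type, most downloaded first."""
    matching = [m for m in candidates if m.task == task]
    return sorted(matching, key=lambda m: m.downloads, reverse=True)[:k]

def selection_prompt(task: str, picks: list[ModelCard]) -> str:
    """Build the in-context prompt from which the controller LLM picks one model."""
    lines = [f"Select the best model for the task '{task}':"]
    lines += [f"- {m.model_id}: {m.description}" for m in picks]
    return "\n".join(lines)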

3. Task Execution

Selected models are invoked with their designated inputs. Tasks are executed respecting the dependency graph: independent tasks can run in parallel, while dependent tasks wait for their prerequisites. The execution engine handles model loading, input formatting, and output collection.
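A minimal scheduler sketch of this behavior, assuming a hypothetical run_subtask stand-in for the real model invocation: subtasks whose prerequisites have completed are dispatched together on a thread pool, so independent subtasks overlap while dependent ones wait.

from concurrent.futures import ThreadPoolExecutor

def run_subtask(task_id: int) -> str:
    """Hypothetical stand-in for invoking the selected Hugging Face model."""
    return f"output-of-{task_id}"

def execute(graph: dict[int, list[int]]) -> dict[int, str]:
    """graph maps each subtask id to the ids of its prerequisites."""
    results: dict[int, str] = {}
    with ThreadPoolExecutor() as pool:
        while len(results) < len(graph):
            # Ready = unfinished subtasks whose prerequisites are all done.
            ready = [t for t, deps in graph.items()
                     if t not in results and all(d in results for d in deps)]
            if not ready:
                raise ValueError("dependency graph contains a cycle")
            for task_id, output in zip(ready, pool.map(run_subtask, ready)):
                results[task_id] = output
    return results

print(execute({0: [], 1: [], 2: [0, 1]}))  # 0 and 1 run in parallel, then 2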

4. Response Generation

ChatGPT receives execution results from all subtasks and generates a coherent, human-readable summary that integrates outputs across modalities (text, images, audio) into a unified response.
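One plausible way to assemble that aggregation prompt is sketched below with an illustrative two-task result set; the model names are examples, and a real system would send the resulting prompt to the ChatGPT API.

def response_prompt(user_request: str, results: dict[int, dict]) -> str:
    """Collect subtask outputs into a prompt for the controller LLM to summarize."""
    lines = [f"User request: {user_request}", "Subtask results:"]
    for task_id, r in sorted(results.items()):
        lines.append(f"  {task_id}. {r['task']} ({r['model']}): {r['output']}")
    lines.append("Write one coherent answer that integrates these results.")
    return "\n".join(lines)

print(response_prompt(
    "Describe the attached photo out loud.",
    {0: {"task": "image-to-text", "model": "nlpconnect/vit-gpt2-image-captioning",
         "output": "a dog playing in the snow"},
     1: {"task": "text-to-speech", "model": "espnet/kan-bayashi_ljspeech_vits",
         "output": "speech.wav"}},
))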

Capabilities

HuggingGPT can handle tasks spanning multiple modalities: language (e.g., classification, summarization, translation, question answering), vision (e.g., object detection, image captioning, text-to-image generation), speech (e.g., speech recognition, text-to-speech), and video (e.g., text-to-video generation).

Key Contributions

The paper's main contributions are: using natural language as a generic interface through which an LLM controller cooperates with external expert models; the four-stage pipeline of task planning, model selection, task execution, and response generation; and the demonstration that connecting an LLM to the Hugging Face hub yields a system capable of multimodal tasks that no single component could solve alone.

Code Example: HuggingFace Pipeline Task Routing
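The following simplified, self-contained sketch illustrates the routing idea with the transformers pipeline API: a registry maps intents to task/model pairs, a keyword classifier stands in for HuggingGPT's LLM-based planner, and each query is dispatched to the matching pipeline. The registry's entries are ordinary Hub checkpoints chosen for illustration.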

from transformers import pipeline
 
 
TASK_REGISTRY = {
    "sentiment": {"task": "sentiment-analysis", "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english"},
    "summarize": {"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6"},
    "translate": {"task": "translation_en_to_fr", "model": "Helsinki-NLP/opus-mt-en-fr"},
    "generate": {"task": "text-generation", "model": "gpt2"},
    "ner": {"task": "ner", "model": "dslim/bert-base-NER"},
}
 
 
def classify_intent(user_query: str) -> str:
    """Simple keyword-based intent classifier for task routing."""
    query_lower = user_query.lower()
    if any(w in query_lower for w in ["feel", "sentiment", "positive", "negative"]):
        return "sentiment"
    if any(w in query_lower for w in ["summarize", "summary", "shorten"]):
        return "summarize"
    if any(w in query_lower for w in ["translate", "french", "translation"]):
        return "translate"
    if any(w in query_lower for w in ["name", "entity", "person", "organization"]):
        return "ner"
    return "generate"
 
 
def route_and_execute(user_query: str, text_input: str) -> dict:
    """Route a user query to the appropriate Hugging Face pipeline and execute."""
    intent = classify_intent(user_query)
    config = TASK_REGISTRY[intent]
    print(f"Routing to: {intent} -> {config['model']}")

    pipe = pipeline(config["task"], model=config["model"])
    # Only generation-style tasks take a max_length cap; classification and
    # NER pipelines are called without extra kwargs.
    if config["task"] in {"summarization", "translation_en_to_fr", "text-generation"}:
        result = pipe(text_input, max_length=100, truncation=True)
    else:
        result = pipe(text_input)
    return {"task": intent, "model": config["model"], "result": result}
 
 
queries = [
    ("What's the sentiment?", "This movie was absolutely wonderful and heartwarming!"),
    ("Summarize this text", "Large language models have transformed AI by enabling "
     "natural language understanding at scale. They power chatbots, code assistants, "
     "and [[autonomous_agents|autonomous agents]] across many industries."),
    ("Find named entities", "Elon Musk founded SpaceX in Hawthorne, California."),
]
 
for query, text in queries:
    output = route_and_execute(query, text)
    print(f"  Result: {output['result']}\n")

Limitations

The authors note several limitations: efficiency suffers because every stage requires at least one round of LLM inference; the controller's maximum context length caps how many subtasks and model descriptions can be considered at once; and overall stability depends on the LLM reliably emitting parseable plans and on the availability of the remote expert models.

Influence

HuggingGPT influenced subsequent work on LLM-as-controller architectures, in which a central language model plans tasks and delegates execution to specialized tools, APIs, or expert models.

See Also

[[autonomous_agents|Autonomous Agents]]

References

1) Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2303.17580.