====== Together AI ======

**Together AI** is a full-stack AI cloud platform specializing in fast inference, fine-tuning, pre-training, and GPU cluster management for open-source models. Founded as an inference-focused startup, Together AI reached a $3.3 billion valuation and $300 million annualized revenue by September 2025, serving companies including Cursor, Decagon, and Cartesia.((source [[https://www.together.ai|Together AI official site]]))

===== Overview =====

Together AI provides developers and researchers with a unified API to run, train, fine-tune, and deploy open-source AI models across text, image, video, code, and voice modalities. The platform emphasizes end-to-end workflows from training to production, supporting over 200 open-source models behind an OpenAI-compatible API for seamless integration; a usage sketch appears under Examples below.((source [[https://www.together.ai/serverless-inference|Together AI Serverless Inference]]))

===== Supported Models =====

The platform supports a broad range of open-source models, including:

  * **Llama** variants (Meta)
  * **Mixtral** and **Mistral** models
  * **Qwen** series (including Qwen-3-235B-Instruct and Qwen-3-Coder-480B)
  * **DeepSeek** models (DeepSeek-R1, DeepSeek-V3.1)
  * **GPT-OSS** (20B and 120B variants)
  * Image and video generation models (40+ via a Runware partnership)
  * NVIDIA Parakeet for voice ASR

===== Inference =====

Together AI claims up to 2.75x faster serverless inference than competitors, achieved through GPU optimizations, low-bit quantization (FP4/FP8), and **ATLAS**, an adaptive speculative decoding system that provides up to 4x acceleration via runtime learning.((source [[https://www.together.ai/blog/fastest-inference-for-the-top-open-source-models|Together AI Fastest Inference]]))

The platform offers two inference tiers:

  * **Serverless Inference**: on-demand model execution without infrastructure management, with batch inference available at 50% lower cost
  * **Dedicated Endpoints**: instantly provisioned GPU capacity, from self-serve endpoints up to clusters of thousands of GPUs, for low-latency, high-throughput workloads, reaching up to 110 tokens/sec on reasoning clusters

===== Fine-Tuning =====

Together AI provides a full fine-tuning platform supporting **LoRA** and **DPO** methods for task-specific model customization on proprietary data. The platform also supports pre-training from scratch on GPU clusters, with a seamless transition from training to inference endpoints; a job-launch sketch appears under Examples below.((source [[https://aiagentslist.com/agents/together-ai|Together AI Profile]]))

===== Pricing =====

Pricing ranges from **$0.10 to $3.50 per million tokens** depending on model size and optimization level. Batch inference is available at 50% lower cost, and the platform claims roughly 60% overall cost reduction through quantization and inference optimizations (a worked cost calculation appears under Examples).((source [[https://www.together.ai|Together AI]]))

===== Recent Developments =====

  * Achieved top benchmarks for inference speed on demanding models (2x faster on GPT-OSS, Qwen, and DeepSeek)
  * Launched Dedicated Container Inference for custom media models, with 1.4x-2.6x speedups
  * Hit $300 million ARR by September 2025
  * Showcased at NVIDIA GTC 2026 with NemoClaw integration for 150+ models((source [[https://www.together.ai/blog/together-ai-at-nvidia-gtc-2026|Together AI at NVIDIA GTC 2026]]))

===== See Also =====

  * [[replicate|Replicate]]
  * [[fireworks_ai|Fireworks AI]]
  * [[groq_inference|Groq Inference]]

===== References =====
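===== Examples =====

Because the platform exposes an OpenAI-compatible API, the standard ''openai'' Python client can be pointed at Together AI's endpoint. A minimal sketch, assuming the documented base URL ''https://api.together.xyz/v1'' and an illustrative model ID (verify both against the current model catalog):

<code python>
# Minimal sketch: chat completion through Together AI's OpenAI-compatible API.
# Assumptions: the base URL and model ID match the current docs, and
# TOGETHER_API_KEY is set in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # Together AI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model ID
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
</code>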
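The fine-tuning workflow can be sketched with the official ''together'' Python SDK. The dataset path, base-model ID, and hyperparameter names (''lora'', ''n_epochs'') below are assumptions to check against the current fine-tuning reference, not confirmed signatures:

<code python>
# Sketch of launching a LoRA fine-tuning job with the `together` SDK.
# File path, model name, and hyperparameter names are illustrative
# assumptions; consult the current fine-tuning docs before use.
import os

from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Upload a JSONL training set, then start a LoRA job on it.
training_file = client.files.upload(file="my_dataset.jsonl")  # hypothetical local path
job = client.fine_tuning.create(
    training_file=training_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # example base model
    lora=True,   # assumed flag selecting LoRA rather than full fine-tuning
    n_epochs=3,  # assumed hyperparameter name
)
print(job.id, job.status)
</code>

Once the job finishes, the resulting model can be served through the platform's inference endpoints, which is the training-to-inference transition described above.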
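To make the pricing band concrete, here is a back-of-the-envelope cost calculation including the 50% batch discount. The rates are illustrative points inside the quoted $0.10-$3.50 range, not published prices for any specific model:

<code python>
# Back-of-the-envelope token cost at per-million-token pricing.
# Rates below are illustrative points within the quoted $0.10-$3.50 band.
def token_cost(tokens: int, usd_per_million: float, batch: bool = False) -> float:
    """Cost in USD; batch inference is billed at a 50% discount."""
    rate = usd_per_million * (0.5 if batch else 1.0)
    return tokens / 1_000_000 * rate

# 10M tokens on a cheap model vs. a premium model, and the batch discount:
print(token_cost(10_000_000, 0.10))              # $1.00
print(token_cost(10_000_000, 3.50))              # $35.00
print(token_cost(10_000_000, 3.50, batch=True))  # $17.50
</code>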