====== Together AI ======

**Together AI** is a full-stack AI cloud platform specializing in fast inference, fine-tuning, pre-training, and GPU cluster management for open-source models. Founded as an inference-focused startup, Together AI reached a $3.3 billion valuation and $300 million annualized revenue by September 2025, serving companies including Cursor, Decagon, and Cartesia.((source [[https://www.together.ai|Together AI official site]]))

===== Overview =====

Together AI provides developers and researchers with a unified API to run, train, fine-tune, and deploy open-source AI models across text, image, video, code, and voice modalities. The platform emphasizes end-to-end workflows from training to production, supporting over 200 open-source models behind an OpenAI-compatible API for seamless integration; a usage sketch appears under Examples below.((source [[https://www.together.ai/serverless-inference|Together AI Serverless Inference]]))

===== Supported Models =====

The platform supports a broad range of open-source models, including:

  * **Llama** variants (Meta)
  * **Mixtral** and **Mistral** models
  * **Qwen** series (including Qwen-3-235B-Instruct and Qwen-3-Coder-480B)
  * **DeepSeek** models (DeepSeek-R1, DeepSeek-V3.1)
  * **GPT-OSS** (20B and 120B variants)
  * Image and video generation models (40+ via a Runware partnership)
  * NVIDIA Parakeet for voice ASR

===== Inference =====

Together AI claims up to 2.75x faster serverless inference than competitors, achieved through GPU optimizations, low-bit quantization (FP4/FP8), and **ATLAS**, an adaptive speculative decoding system that provides up to 4x acceleration via runtime learning.((source [[https://www.together.ai/blog/fastest-inference-for-the-top-open-source-models|Together AI Fastest Inference]]))

The platform offers two inference tiers:

  * **Serverless Inference**: on-demand model execution without infrastructure management, with batch inference available at 50% lower cost
  * **Dedicated Endpoints**: instantly provisioned GPU capacity, from self-serve endpoints up to clusters of thousands of GPUs, for low-latency, high-throughput workloads, reaching up to 110 tokens/sec on reasoning clusters

===== Fine-Tuning =====

Together AI provides a full fine-tuning platform supporting **LoRA** and **DPO** methods for task-specific model customization on proprietary data. The platform also supports pre-training from scratch on GPU clusters, with a seamless transition from training to inference endpoints; a job-launch sketch appears under Examples below.((source [[https://aiagentslist.com/agents/together-ai|Together AI Profile]]))

===== Pricing =====

Pricing ranges from **$0.10 to $3.50 per million tokens** depending on model size and optimization level. Batch inference is available at 50% lower cost, and the platform claims roughly 60% overall cost reduction through quantization and inference optimizations (a worked cost calculation appears under Examples).((source [[https://www.together.ai|Together AI]]))

===== Recent Developments =====

  * Achieved top benchmarks for inference speed on demanding models (2x faster on GPT-OSS, Qwen, and DeepSeek)
  * Launched Dedicated Container Inference for custom media models, with 1.4x-2.6x speedups
  * Hit $300 million ARR by September 2025
  * Showcased at NVIDIA GTC 2026 with NemoClaw integration for 150+ models((source [[https://www.together.ai/blog/together-ai-at-nvidia-gtc-2026|Together AI at NVIDIA GTC 2026]]))

===== See Also =====

  * [[replicate|Replicate]]
  * [[fireworks_ai|Fireworks AI]]
  * [[groq_inference|Groq Inference]]

===== References =====
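===== Examples =====

Because the platform exposes an OpenAI-compatible API, the standard ''openai'' Python client can be pointed at Together AI's endpoint. A minimal sketch, assuming the documented base URL ''https://api.together.xyz/v1'' and an illustrative model ID (verify both against the current model catalog):

<code python>
# Minimal sketch: chat completion through Together AI's OpenAI-compatible API.
# Assumptions: the base URL and model ID match the current docs, and
# TOGETHER_API_KEY is set in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # Together AI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model ID
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
</code>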
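The fine-tuning workflow can be sketched with the official ''together'' Python SDK. The dataset path, base-model ID, and hyperparameter names (''lora'', ''n_epochs'') below are assumptions to check against the current fine-tuning reference, not confirmed signatures:

<code python>
# Sketch of launching a LoRA fine-tuning job with the `together` SDK.
# File path, model name, and hyperparameter names are illustrative
# assumptions; consult the current fine-tuning docs before use.
import os

from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Upload a JSONL training set, then start a LoRA job on it.
training_file = client.files.upload(file="my_dataset.jsonl")  # hypothetical local path
job = client.fine_tuning.create(
    training_file=training_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # example base model
    lora=True,   # assumed flag selecting LoRA rather than full fine-tuning
    n_epochs=3,  # assumed hyperparameter name
)
print(job.id, job.status)
</code>

Once the job finishes, the resulting model can be served through the platform's inference endpoints, which is the training-to-inference transition described above.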
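To make the pricing band concrete, here is a back-of-the-envelope cost calculation including the 50% batch discount. The rates are illustrative points inside the quoted $0.10-$3.50 range, not published prices for any specific model:

<code python>
# Back-of-the-envelope token cost at per-million-token pricing.
# Rates below are illustrative points within the quoted $0.10-$3.50 band.
def token_cost(tokens: int, usd_per_million: float, batch: bool = False) -> float:
    """Cost in USD; batch inference is billed at a 50% discount."""
    rate = usd_per_million * (0.5 if batch else 1.0)
    return tokens / 1_000_000 * rate

# 10M tokens on a cheap model vs. a premium model, and the batch discount:
print(token_cost(10_000_000, 0.10))              # $1.00
print(token_cost(10_000_000, 3.50))              # $35.00
print(token_cost(10_000_000, 3.50, batch=True))  # $17.50
</code>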