====== Groq Inference ======

**Groq** is an AI inference company founded in 2016 by Jonathan Ross, the original designer of Google's Tensor Processing Unit (TPU). Groq develops and operates the **Language Processing Unit (LPU)**, a custom ASIC purpose-built for ultra-fast, low-latency inference of large language models. The company's GroqCloud platform serves over 2.8 million developers worldwide.((source [[https://groq.com|Groq official site]]))

===== LPU Hardware Architecture =====

The LPU takes a fundamentally different approach to AI inference than GPUs. It uses a **Tensor Streaming Processor (TSP)** design that prioritizes sequential token generation over general-purpose parallel computation.((source [[https://crazyrouter.com/en/blog/groq-api-complete-guide-fastest-inference-2026|Groq API Guide 2026]]))

Key architectural features:

  * **On-chip SRAM** for deterministic, low-latency data access (versus the external HBM used by GPUs)
  * Minimal batching requirements for efficient operation
  * Approximately one-third the power consumption of comparable GPU solutions
  * Designed from the ground up for inference workloads

===== Speed Benchmarks =====

Groq LPUs deliver dramatically faster inference than GPU-based solutions:

^ Metric ^ GPU (e.g., NVIDIA H100) ^ Groq LPU ^
| Token generation speed | 50-100 tokens/sec | 500-1,000+ tokens/sec |
| Relative performance | Baseline | 5-10x faster |
| Latency characteristics | Variable | Predictable, low |
| Power efficiency | Moderate | High (approx. one-third of GPU power) |

Groq has claimed that ChatGPT could run 13x faster on LPU infrastructure. The LPU sidesteps GPU bottlenecks in memory bandwidth and sequential processing, enabling real-time applications such as conversational AI and interactive agents.((source [[https://www.voiceflow.com/blog/groq|Groq Overview]]))

===== GroqCloud API =====

**GroqCloud** provides an OpenAI-compatible API supporting text, audio, and vision models with scalable, predictable pricing.
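Because the API follows the OpenAI chat-completions schema, existing OpenAI client code can typically be pointed at GroqCloud by swapping the base URL. The sketch below only constructs a request body in that schema; the base URL and model id shown are illustrative assumptions, so check the provider's documentation for current values before use.

```python
import json

# Assumed OpenAI-compatible base URL for GroqCloud (illustrative).
GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def build_chat_request(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build a request body in the OpenAI chat-completions format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

# Hypothetical model id for illustration only.
payload = build_chat_request("llama-3-70b", "Explain the LPU in one sentence.")
print(json.dumps(payload, indent=2))
```

In practice this payload would be POSTed to the provider's ''/chat/completions'' endpoint with an API key in the ''Authorization'' header, which is why OpenAI-based tooling usually works with only a configuration change.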
The platform supports open-source models exclusively, including:

  * **Llama 4** variants
  * **Qwen 3 32B**
  * **Mixtral 8x7B**
  * **Llama 3 70B**

A **free tier**, available since January 2024, supports experimentation and development.((source [[https://groq.com/groqcloud|GroqCloud]]))

===== Pricing =====

Groq emphasizes low-cost inference enabled by LPU efficiency. Users have reported cost reductions of up to 89% compared to GPU-based alternatives. The platform offers plan-based pricing that scales with usage volume.

===== Recent Developments =====

  * GroqCloud optimizations boosted chat speed 7.41x while cutting costs by 89%
  * **NVIDIA licensing deal** ($20 billion): Groq 3 LPX technology licensed for integration into NVIDIA Rubin GPUs, targeting 35x higher throughput per megawatt on trillion-parameter models((source [[https://www.networkworld.com/article/4146684/nvidia-targets-inference-as-ais-next-battleground-with-groq-3-lpx.html|NVIDIA and Groq 3 LPX]]))
  * Gen 4 LPU co-designed with TSMC on the Feynman platform
  * Partnership with IBM watsonx for enterprise integration (October 2025)
  * Global expansion, including European data centers

===== See Also =====

  * [[cerebras_inference|Cerebras Inference]]
  * [[together_ai|Together AI]]
  * [[fireworks_ai|Fireworks AI]]

===== References =====