Amazon SageMaker
Amazon SageMaker is AWS's fully managed machine learning platform that provides tools for building, training, deploying, and managing ML models at scale. Since its launch, SageMaker has evolved into a comprehensive AI/ML development environment spanning from data preparation to production inference, with recent additions focused on foundation model training and unified data analytics. 1)
Key Components
SageMaker Unified Studio
Launched in March 2025, SageMaker Unified Studio is a collaborative workspace built on top of Amazon DataZone that unifies data and AI asset management across an organization:
Asset catalog — Publish and discover tables, models, agents, and reports in a central catalog with metadata enrichment
Serverless notebooks — Launch notebooks in seconds with built-in AI agents for code generation from natural language
One-click onboarding — Simplified setup for new team members
Multi-engine support — SQL queries, Python code, and Spark jobs in a single workspace
Data connectivity — Automatic connections to S3, Redshift, and third-party databases
Unified Studio has been adopted by companies including Bayer, NatWest, and Carrier, and over 200 improvements to it were delivered in 2025. 2)
SageMaker HyperPod
HyperPod provisions resilient clusters for training large-scale AI models including LLMs, diffusion models, and foundation models:
Automatic fault recovery — Detects and replaces faulty nodes without manual intervention
Observability — Unified Grafana dashboards for GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and Kubernetes pod lifecycle 3)
Restricted Instance Groups (RIG) — Dedicated compute partitions for foundation model training with Nova Forge
Quota validation — Automatic pre-creation checks for instance limits, EBS volumes, and VPC quotas 4)
EKS integration — Kubernetes-based orchestration of training clusters
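As a rough sketch of how a HyperPod cluster is described programmatically, the payload below targets boto3's `sagemaker.create_cluster` operation. The cluster name, S3 lifecycle path, and role ARN are placeholders, not real resources, and the field set shown is a minimal subset of what the API accepts:

```python
# Sketch: a minimal HyperPod cluster definition for boto3's
# sagemaker.create_cluster. All names, paths, and ARNs are placeholders.
hyperpod_request = {
    "ClusterName": "llm-training-cluster",  # hypothetical name
    "InstanceGroups": [
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # H100 GPU nodes
            "InstanceCount": 4,
            # Lifecycle scripts run on node setup; HyperPod's fault
            # recovery replaces failed nodes and re-runs them.
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",  # placeholder
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        }
    ],
}

# With credentials configured, the call would be:
# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_cluster(**hyperpod_request)
print(hyperpod_request["InstanceGroups"][0]["InstanceType"])
```

Because node replacement is automatic, the request carries no retry configuration; resilience is a property of the cluster itself rather than of each job.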
Additional Capabilities
SageMaker Training — Managed training jobs with distributed training, spot instances, and automatic model tuning
SageMaker Endpoints — Real-time and batch inference endpoints with auto-scaling
SageMaker Pipelines — CI/CD for ML workflows
SageMaker Feature Store — Centralized feature repository for training and inference
SageMaker Ground Truth — Data labeling service with human and automated workflows
SageMaker Model Monitor — Continuous monitoring for data drift and model quality
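To make the training-job capability concrete, the payload below sketches a request for boto3's `sagemaker.create_training_job`, combining managed spot instances with a checkpoint path so interrupted jobs can resume. The bucket names, image URI, and role ARN are placeholders:

```python
# Sketch: a managed training job using spot capacity. All S3 paths,
# the ECR image URI, and the role ARN are placeholders.
training_request = {
    "TrainingJobName": "xgboost-demo-job",  # hypothetical name
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    # With spot training, MaxWaitTimeInSeconds (time spent waiting for
    # capacity plus running) must be >= MaxRuntimeInSeconds.
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,
        "MaxWaitTimeInSeconds": 7200,
    },
    "EnableManagedSpotTraining": True,
    # Checkpoints let a job resume after a spot interruption.
    "CheckpointConfig": {"S3Uri": "s3://my-bucket/checkpoints/"},
}

# With credentials configured, the call would be:
# import boto3
# boto3.client("sagemaker").create_training_job(**training_request)
print(training_request["TrainingJobName"])
```

The same request shape is what SageMaker Pipelines emits under the hood when a pipeline's training step executes.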
AWS Integration
SageMaker integrates deeply with the AWS ecosystem:
S3 for data storage and model artifacts
EC2 (including P5, Trn1, Inf2 instances) for compute
FSx for Lustre for high-performance training data access
CloudWatch and Managed Grafana for monitoring
IAM for access control and security
Lambda for serverless inference triggers
Step Functions for workflow orchestration
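One common integration pattern from the list above is a Lambda function fronting a real-time endpoint. The sketch below shows such a handler; the endpoint name is a placeholder, and the actual `sagemaker-runtime` call is shown commented out so the function stays runnable without AWS access:

```python
import json

# Sketch: a Lambda handler that forwards an event payload to a
# SageMaker real-time endpoint. The endpoint name is hypothetical.
ENDPOINT_NAME = "churn-model-endpoint"

def handler(event, context):
    payload = json.dumps(event["features"])
    # In the Lambda runtime, the call would be:
    # runtime = boto3.client("sagemaker-runtime")
    # response = runtime.invoke_endpoint(
    #     EndpointName=ENDPOINT_NAME,
    #     ContentType="application/json",
    #     Body=payload,
    # )
    # return json.loads(response["Body"].read())
    return {"endpoint": ENDPOINT_NAME, "body": payload}

print(handler({"features": [0.2, 1.5, 3.0]}, None))
```

Keeping serialization in the Lambda layer lets API Gateway or Step Functions invoke the model without knowing the endpoint's content type.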
See Also
References