Amazon SageMaker
Amazon SageMaker is AWS's fully managed machine learning platform that provides tools for building, training, deploying, and managing ML models at scale. Since its launch, SageMaker has evolved into a comprehensive AI/ML development environment spanning from data preparation to production inference, with recent additions focused on foundation model training and unified data analytics. 1)
Key Components
SageMaker Unified Studio
Launched in March 2025, SageMaker Unified Studio is a collaborative workspace built on top of Amazon DataZone that unifies data and AI asset management across an organization:
Asset catalog — Publish and discover tables, models, agents, and reports in a central catalog with metadata enrichment
Serverless notebooks — Launch notebooks in seconds with built-in AI agents for code generation from natural language
One-click onboarding — Simplified setup for new team members
Multi-engine support — SQL queries, Python code, and Spark jobs in a single workspace
Data connectivity — Automatic connections to S3, Redshift, and third-party databases
Unified Studio has been adopted by companies including Bayer, NatWest, and Carrier, and over 200 improvements to it were delivered in 2025. 2)
SageMaker HyperPod
HyperPod provisions resilient clusters for training large-scale AI models including LLMs, diffusion models, and foundation models:
Automatic fault recovery — Detects and replaces faulty nodes without manual intervention
Observability — Unified Grafana dashboards for GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and Kubernetes pod lifecycle 3)
Restricted Instance Groups (RIG) — Dedicated compute partitions for foundation model training with Nova Forge
Quota validation — Automatic pre-creation checks for instance limits, EBS volumes, and VPC quotas 4)
EKS integration — Kubernetes-based orchestration of training clusters
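As a rough sketch of how a HyperPod cluster is described programmatically, the payload below targets boto3's `sagemaker.create_cluster` operation. The cluster name, S3 lifecycle path, and role ARN are placeholders, not real resources, and the field set shown is a minimal subset of what the API accepts:

```python
# Sketch: a minimal HyperPod cluster definition for boto3's
# sagemaker.create_cluster. All names, paths, and ARNs are placeholders.
hyperpod_request = {
    "ClusterName": "llm-training-cluster",  # hypothetical name
    "InstanceGroups": [
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # H100 GPU nodes
            "InstanceCount": 4,
            # Lifecycle scripts run on node setup; HyperPod's fault
            # recovery replaces failed nodes and re-runs them.
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",  # placeholder
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        }
    ],
}

# With credentials configured, the call would be:
# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_cluster(**hyperpod_request)
print(hyperpod_request["InstanceGroups"][0]["InstanceType"])
```

Because node replacement is automatic, the request carries no retry configuration; resilience is a property of the cluster itself rather than of each job.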
Additional Capabilities
SageMaker Training — Managed training jobs with distributed training, spot instances, and automatic model tuning
SageMaker Endpoints — Real-time and batch inference endpoints with auto-scaling
SageMaker Pipelines — CI/CD for ML workflows
SageMaker Feature Store — Centralized feature repository for training and inference
SageMaker Ground Truth — Data labeling service with human and automated workflows
SageMaker Model Monitor — Continuous monitoring for data drift and model quality
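To make the training-job capability concrete, the payload below sketches a request for boto3's `sagemaker.create_training_job`, combining managed spot instances with a checkpoint path so interrupted jobs can resume. The bucket names, image URI, and role ARN are placeholders:

```python
# Sketch: a managed training job using spot capacity. All S3 paths,
# the ECR image URI, and the role ARN are placeholders.
training_request = {
    "TrainingJobName": "xgboost-demo-job",  # hypothetical name
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    # With spot training, MaxWaitTimeInSeconds (time spent waiting for
    # capacity plus running) must be >= MaxRuntimeInSeconds.
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,
        "MaxWaitTimeInSeconds": 7200,
    },
    "EnableManagedSpotTraining": True,
    # Checkpoints let a job resume after a spot interruption.
    "CheckpointConfig": {"S3Uri": "s3://my-bucket/checkpoints/"},
}

# With credentials configured, the call would be:
# import boto3
# boto3.client("sagemaker").create_training_job(**training_request)
print(training_request["TrainingJobName"])
```

The same request shape is what SageMaker Pipelines emits under the hood when a pipeline's training step executes.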
AWS Integration
SageMaker integrates deeply with the AWS ecosystem:
S3 for data storage and model artifacts
EC2 (including P5, Trn1, Inf2 instances) for compute
FSx for Lustre for high-performance training data access
CloudWatch and Managed Grafana for monitoring
IAM for access control and security
Lambda for serverless inference triggers
Step Functions for workflow orchestration
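One common integration pattern from the list above is a Lambda function fronting a real-time endpoint. The sketch below shows such a handler; the endpoint name is a placeholder, and the actual `sagemaker-runtime` call is shown commented out so the function stays runnable without AWS access:

```python
import json

# Sketch: a Lambda handler that forwards an event payload to a
# SageMaker real-time endpoint. The endpoint name is hypothetical.
ENDPOINT_NAME = "churn-model-endpoint"

def handler(event, context):
    payload = json.dumps(event["features"])
    # In the Lambda runtime, the call would be:
    # runtime = boto3.client("sagemaker-runtime")
    # response = runtime.invoke_endpoint(
    #     EndpointName=ENDPOINT_NAME,
    #     ContentType="application/json",
    #     Body=payload,
    # )
    # return json.loads(response["Body"].read())
    return {"endpoint": ENDPOINT_NAME, "body": payload}

print(handler({"features": [0.2, 1.5, 3.0]}, None))
```

Keeping serialization in the Lambda layer lets API Gateway or Step Functions invoke the model without knowing the endpoint's content type.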
See Also
References