====== AWS SageMaker ======

**Amazon SageMaker** is AWS's fully managed machine learning platform that provides tools for building, training, deploying, and managing ML models at scale. Since its launch, SageMaker has evolved into a comprehensive AI/ML development environment spanning data preparation through production inference, with recent additions focused on foundation model training and unified data analytics. ((Source: [[https://aws.amazon.com/blogs/big-data/get-started-faster-with-one-click-onboarding-serverless-notebooks-and-ai-agents-in-amazon-sagemaker-unified-studio/|AWS — SageMaker Unified Studio]]))

===== Key Components =====

==== SageMaker Unified Studio ====

Launched in March 2025, **SageMaker Unified Studio** is a collaborative workspace built on top of Amazon DataZone that unifies data and AI asset management across an organization:

  * **Asset catalog** — Publish and discover tables, models, agents, and reports in a central catalog with metadata enrichment
  * **Serverless notebooks** — Launch notebooks in seconds, with built-in AI agents for code generation from natural language
  * **One-click onboarding** — Simplified setup for new team members
  * **Multi-engine support** — SQL queries, Python code, and Spark jobs in a single workspace
  * **Data connectivity** — Automatic connections to S3, Redshift, and third-party databases

It has been adopted by companies including Bayer, NatWest, and Carrier, and over 200 improvements were delivered in 2025.
((Source: [[https://medium.com/@marccampora/sagemaker-unified-studio-new-features-in-2026-5592ad0e8e41|Marc Campora — SageMaker Unified Studio 2026]]))

==== SageMaker HyperPod ====

**HyperPod** provisions resilient clusters for training large-scale AI models, including LLMs, diffusion models, and foundation models:

  * **Automatic fault recovery** — Detects and replaces faulty nodes without manual intervention
  * **Observability** — Unified Grafana dashboards for GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and Kubernetes pod lifecycle ((Source: [[https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-sagemaker-hyperpod-observability-rig/|AWS — HyperPod Observability for RIG]]))
  * **Restricted Instance Groups (RIG)** — Dedicated compute partitions for foundation model training with Nova Forge
  * **Quota validation** — Automatic pre-creation checks for instance limits, EBS volumes, and VPC quotas ((Source: [[https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-sagemaker-hyperpod-validates-service-quotas/|AWS — HyperPod Quota Validation]]))
  * **EKS integration** — Kubernetes-based orchestration of training clusters

==== Additional Capabilities ====

  * **SageMaker Training** — Managed training jobs with distributed training, spot instances, and automatic model tuning
  * **SageMaker Endpoints** — Real-time and batch inference endpoints with auto-scaling
  * **SageMaker Pipelines** — CI/CD for ML workflows
  * **SageMaker Feature Store** — Centralized feature repository for training and inference
  * **SageMaker Ground Truth** — Data labeling service with human and automated workflows
  * **SageMaker Model Monitor** — Continuous monitoring for data drift and model quality

===== AWS Integration =====

SageMaker integrates deeply with the AWS ecosystem:

  * **S3** for data storage and model artifacts
  * **EC2** (including P5, Trn1, and Inf2 instances) for compute
  * **FSx for Lustre** for high-performance training data access
  * **CloudWatch** and **Managed Grafana** for monitoring
  * **IAM** for access control and security
  * **Lambda** for serverless inference triggers
  * **Step Functions** for workflow orchestration

===== See Also =====

  * [[unified_data_fabric|Unified Data Fabric for AI]]
  * [[ai_superfactory|AI Superfactory]]

===== References =====
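
===== Example: Managed Spot Training Job =====

As a rough sketch of the SageMaker Training capability described above (a managed training job using spot instances), the following Python snippet builds a ''CreateTrainingJob'' request and submits it with boto3. The job name, container image URI, S3 paths, and IAM role ARN are illustrative placeholders, not values taken from this article.

```python
"""Sketch: launch a managed spot training job via the SageMaker API.

All resource names below (job name, image URI, S3 buckets, role ARN)
are hypothetical placeholders; substitute your own account's values.
"""


def build_spot_training_request(
    job_name="demo-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-train:latest",
    role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    train_s3_uri="s3://my-bucket/train/",
    output_s3_uri="s3://my-bucket/output/",
):
    """Build the request dict for sagemaker:CreateTrainingJob with
    managed spot training and checkpointing enabled."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # Managed spot training requires a checkpoint location, and
        # MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds.
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {"S3Uri": output_s3_uri + "checkpoints/"},
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 3600,
            "MaxWaitTimeInSeconds": 7200,
        },
    }


request = build_spot_training_request()

if __name__ == "__main__":
    import boto3  # requires AWS credentials with sagemaker:CreateTrainingJob

    sagemaker = boto3.client("sagemaker")
    response = sagemaker.create_training_job(**request)
    print(response["TrainingJobArn"])
```

Separating request construction from submission keeps the dict easy to inspect or unit-test before any AWS call is made; the same pattern extends to ''create_endpoint_config'' and ''create_endpoint'' when deploying the trained model.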