====== AWS SageMaker ======

**Amazon SageMaker** is AWS's fully managed machine learning platform that provides tools for building, training, deploying, and managing ML models at scale. Since its launch, SageMaker has evolved into a comprehensive AI/ML development environment spanning data preparation through production inference, with recent additions focused on foundation model training and unified data analytics. ((Source: [[https://aws.amazon.com/blogs/big-data/get-started-faster-with-one-click-onboarding-serverless-notebooks-and-ai-agents-in-amazon-sagemaker-unified-studio/|AWS — SageMaker Unified Studio]]))

===== Key Components =====

==== SageMaker Unified Studio ====

Launched in March 2025, **SageMaker Unified Studio** is a collaborative workspace built on top of Amazon DataZone that unifies data and AI asset management across an organization:

  * **Asset catalog** — Publish and discover tables, models, agents, and reports in a central catalog with metadata enrichment
  * **Serverless notebooks** — Launch notebooks in seconds, with built-in AI agents for code generation from natural language
  * **One-click onboarding** — Simplified setup for new team members
  * **Multi-engine support** — SQL queries, Python code, and Spark jobs in a single workspace
  * **Data connectivity** — Automatic connections to S3, Redshift, and third-party databases

It has been adopted by companies including Bayer, NatWest, and Carrier, and over 200 improvements were delivered in 2025.
((Source: [[https://medium.com/@marccampora/sagemaker-unified-studio-new-features-in-2026-5592ad0e8e41|Marc Campora — SageMaker Unified Studio 2026]]))

==== SageMaker HyperPod ====

**HyperPod** provisions resilient clusters for training large-scale AI models, including LLMs, diffusion models, and foundation models:

  * **Automatic fault recovery** — Detects and replaces faulty nodes without manual intervention
  * **Observability** — Unified Grafana dashboards for GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and Kubernetes pod lifecycle ((Source: [[https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-sagemaker-hyperpod-observability-rig/|AWS — HyperPod Observability for RIG]]))
  * **Restricted Instance Groups (RIG)** — Dedicated compute partitions for foundation model training with Nova Forge
  * **Quota validation** — Automatic pre-creation checks for instance limits, EBS volumes, and VPC quotas ((Source: [[https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-sagemaker-hyperpod-validates-service-quotas/|AWS — HyperPod Quota Validation]]))
  * **EKS integration** — Kubernetes-based orchestration of training clusters

==== Additional Capabilities ====

  * **SageMaker Training** — Managed training jobs with distributed training, spot instances, and automatic model tuning
  * **SageMaker Endpoints** — Real-time and batch inference endpoints with auto-scaling
  * **SageMaker Pipelines** — CI/CD for ML workflows
  * **SageMaker Feature Store** — Centralized feature repository for training and inference
  * **SageMaker Ground Truth** — Data labeling service with human and automated workflows
  * **SageMaker Model Monitor** — Continuous monitoring for data drift and model quality

===== AWS Integration =====

SageMaker integrates deeply with the AWS ecosystem:

  * **S3** for data storage and model artifacts
  * **EC2** (including P5, Trn1, and Inf2 instances) for compute
  * **FSx for Lustre** for high-performance training data access
  * **CloudWatch** and **Managed Grafana** for monitoring
  * **IAM** for access control and security
  * **Lambda** for serverless inference triggers
  * **Step Functions** for workflow orchestration

===== See Also =====

  * [[unified_data_fabric|Unified Data Fabric for AI]]
  * [[ai_superfactory|AI Superfactory]]

===== References =====
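
===== Example: Managed Spot Training Job =====

As a rough sketch of the SageMaker Training capability described above (a managed training job using spot instances), the following Python snippet builds a ''CreateTrainingJob'' request and submits it with boto3. The job name, container image URI, S3 paths, and IAM role ARN are illustrative placeholders, not values taken from this article.

```python
"""Sketch: launch a managed spot training job via the SageMaker API.

All resource names below (job name, image URI, S3 buckets, role ARN)
are hypothetical placeholders; substitute your own account's values.
"""


def build_spot_training_request(
    job_name="demo-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-train:latest",
    role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    train_s3_uri="s3://my-bucket/train/",
    output_s3_uri="s3://my-bucket/output/",
):
    """Build the request dict for sagemaker:CreateTrainingJob with
    managed spot training and checkpointing enabled."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # Managed spot training requires a checkpoint location, and
        # MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds.
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {"S3Uri": output_s3_uri + "checkpoints/"},
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 3600,
            "MaxWaitTimeInSeconds": 7200,
        },
    }


request = build_spot_training_request()

if __name__ == "__main__":
    import boto3  # requires AWS credentials with sagemaker:CreateTrainingJob

    sagemaker = boto3.client("sagemaker")
    response = sagemaker.create_training_job(**request)
    print(response["TrainingJobArn"])
```

Separating request construction from submission keeps the dict easy to inspect or unit-test before any AWS call is made; the same pattern extends to ''create_endpoint_config'' and ''create_endpoint'' when deploying the trained model.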