Primus-SaFE is AMD's comprehensive, full-stack platform designed to address the critical challenges of multi-node AI training on AMD GPU clusters. Training large AI models demands unwavering stability and robust debugging capabilities at cluster scale, yet today's ROCm-based multi-node GPU deployments often rely on brittle scripts and disjointed tools to launch distributed jobs, monitor performance, and recover from failures.
Primus-SaFE transforms a collection of AMD GPU servers into a resilient, self-monitoring environment for next-generation model training by automating everything from cluster provisioning and intelligent job scheduling to real-time monitoring and hardware health validation. Running atop Kubernetes and integrated with the ROCm software stack, it provides:
- Automated Cluster Deployment: Production-grade Kubernetes setup with optimized infrastructure
- Intelligent Job Scheduling: Multi-priority queues, automatic failover, and topology-aware placement
- Full-Stack Observability: Real-time metrics, interactive dashboards, and comprehensive telemetry
- Preflight Validation: Rigorous health checks and performance benchmarking before workload deployment
- Fault Tolerance: Automatic recovery mechanisms to minimize downtime and maximize goodput
| Module | Role | Key Features | Dependencies / Integration |
|---|---|---|---|
| Primus-LM | End-to-end training framework | Supports multiple training backends (Megatron, TorchTitan, etc.); provides high-performance, scalable distributed training; deeply integrates with Turbo and SaFE | Can invoke Primus-Turbo kernels and modules; runs on top of Primus-SaFE for stable scheduling |
| Primus-Turbo | High-performance operators & modules | Provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGemm, etc.); modular design, directly pluggable into Primus-LM; optimized for different architectures and precisions | Built on AITER, CK, hipBLASLt, Triton, and other operator libraries; can be enabled via configuration inside Primus-LM |
| Primus-SaFE | Stability & platform layer | Cluster sanity checks and benchmarking; Kubernetes scheduling with topology awareness; fault tolerance; stability enhancements | Builds a training platform on the Kubernetes and Slurm ecosystems |
- Rapid Deployment: Automated Kubernetes cluster provisioning on bare-metal servers
- Unified Storage: JuiceFS distributed file system with high throughput and client-side caching
- Secure Registry: Harbor for trusted container image management with RBAC
- Scalable Gateway: Higress API gateway for high-concurrency user-facing services
- Multi-Priority Queuing: Preemption support for urgent experiments (see the PriorityClass sketch after this list)
- Automatic Failover: Resume training from checkpoints when nodes or GPUs fail
- Topology-Aware Placement: Network-locality optimization for distributed training
- Gang Scheduling: Co-schedule interdependent pods for distributed workloads
- Health Validation: Preflight checks before scheduling production workloads
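
The multi-priority queues build on the standard Kubernetes PriorityClass API. The sketch below shows a hypothetical high-priority class that can preempt lower-priority experiments; the name and value are placeholders, not Primus-SaFE defaults.

```yaml
# Hypothetical example: the class name and value are placeholders,
# not Primus-SaFE defaults.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: urgent-experiment
value: 1000000                          # higher values win when capacity is contended
preemptionPolicy: PreemptLowerPriority  # allow evicting lower-priority pods
globalDefault: false
description: "Queue for urgent training experiments"
```

Jobs opt in by setting `priorityClassName: urgent-experiment` in their pod spec.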
- Cluster-Wide Metrics: Real-time GPU, CPU, memory, network, and I/O telemetry
- Job-Level Tracking: Training progress, throughput, loss curves, and checkpoint events
- Interactive Dashboards: Grafana-based visualization with custom alerts
- Root Cause Analysis: Correlated metrics and logs for quick issue diagnosis
- Hardware Diagnostics: Verify GPUs, drivers, and network interfaces
- Micro-Benchmarks: Standard AI operations to detect underperforming nodes
- Trial Runs: Small-scale training jobs to validate end-to-end functionality
- Performance Baselines: Ensure all nodes meet minimum performance standards
- Tested on clusters from a few GPUs to tens of thousands of GPUs
- Architecture designed to handle 100,000+ GPU accelerators
- Scales from small lab setups to massive enterprise deployments
Primus-SaFE's functionality is organized into five core modules (Bootstrap, SaFE, Lens, Bench, and Scheduler-Plugins) that work together to provide a complete training platform; each is described in detail below.

Follow these steps to deploy the complete Primus-SaFE platform on your AMD GPU cluster:
```bash
git clone https://github.com/AMD-AGI/Primus-SaFE.git
cd Primus-SaFE
cd Bootstrap

# Edit hosts.ini to specify your cluster nodes and roles
vim hosts.ini

# Run the bootstrap script to deploy Kubernetes
bash bootstrap.sh
```

This deploys a production-grade Kubernetes cluster with:
- High-availability control plane and etcd
- JuiceFS distributed storage
- Harbor container registry
- Higress API gateway
See Bootstrap/README.md for detailed configuration options.
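
Because the bootstrap is built on Kubespray, hosts.ini is an Ansible-style inventory. A hypothetical sketch is shown below; the group names and host layout are assumptions, so consult Bootstrap/README.md for the authoritative format.

```ini
# Hypothetical inventory sketch; group names are assumptions.
[all]
node1 ansible_host=10.0.0.11
node2 ansible_host=10.0.0.12
node3 ansible_host=10.0.0.13
node4 ansible_host=10.0.0.14

# Redundant control plane and etcd for high availability
[kube_control_plane]
node1
node2
node3

[etcd]
node1
node2
node3

# GPU worker nodes
[kube_node]
node3
node4
```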
```bash
cd ../Lens/bootstrap

# Install Primus-Lens monitoring and logging components
bash install.sh
```

This installs:
- VictoriaMetrics for time-series metrics storage
- Grafana for dashboards and visualization
- OpenSearch for log aggregation
- Custom exporters for GPU, network, and workload metrics
See Lens/README.md for configuration details.
```bash
cd ../../SaFE/bootstrap

# Deploy Primus-SaFE stability and scheduling components
bash install.sh
```

This installs:
- API Server for unified management interface
- Job Manager for workload lifecycle management
- Resource Manager for cluster and node management
- Node Agent for health monitoring
- Webhooks for request validation and modification
See SaFE/README.md for detailed installation guide.
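
To sanity-check the installation, you can look for the services above in the cluster; the name patterns below are assumptions derived from the component list, so adjust them to your deployment:

```bash
# Name patterns are assumptions derived from the component list above.
kubectl get pods --all-namespaces | grep -Ei 'api-server|job-manager|resource-manager|node-agent|webhook'
```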
```bash
cd ../../Bench

# Configure benchmark settings
vim config.sh

# Edit hosts.ini to specify nodes to benchmark
vim hosts.ini

# Run comprehensive health checks and benchmarks
bash run_bare_metal.sh
```

This runs:
- SSH connectivity and configuration validation
- Node hardware and system health checks
- Network performance and reliability tests (with automatic retry and unhealthy node filtering)
- I/O performance benchmarks (optional)
- Computation-communication overlap measurements
- Kernel launch overhead analysis
- Detailed health report generation with a pass/fail node inventory
See Bench/README.md for different execution modes (bare-metal, SLURM, local, Kubernetes).
Location: Bootstrap/
Primus-Bootstrap automates the deployment of a production-grade Kubernetes cluster on bare-metal servers, provisioning key infrastructure components optimized for AI workloads.
Key Components:
- Kubernetes Cluster: High-availability setup using Kubespray with redundant control-plane and etcd
- JuiceFS Storage: Distributed file system with metadata/data separation and client-side caching
- Harbor Registry: Secure container image management with RBAC and image scanning
- Higress Gateway: High-concurrency API gateway with WebAssembly plugin support
Usage:
```bash
cd Bootstrap
vim hosts.ini     # Configure your cluster nodes
bash bootstrap.sh
```

Learn More: Bootstrap/README.md
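
Once the cluster is up, workloads typically consume JuiceFS through a PersistentVolumeClaim. A minimal sketch follows, assuming the bootstrap exposes a JuiceFS-backed StorageClass (the class name juicefs-sc is an assumption):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany              # shared across distributed training pods
  storageClassName: juicefs-sc   # assumption: check the actual class name
  resources:
    requests:
      storage: 1Ti
```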
Location: SaFE/
The Primus-SaFE Core extends Kubernetes with AI-specific scheduling and fault tolerance capabilities to maximize throughput and reliability for long-running training jobs.
Core Services:
- API Server: Unified interface for resource and workload management, user authentication, and SSH access
- Job Manager: Full lifecycle management of PyTorchJob, Job, and Deployment workloads, with intelligent scheduling, queuing, and automatic retry (a PyTorchJob sketch appears after the Key Features list below)
- Resource Manager: Centralized management of clusters, nodes, workspaces, storage, and operations
- Webhooks: Kubernetes admission controller for request validation and resource modification
- Node Agent: Node-level monitoring, fault detection, and self-healing capabilities
Key Features:
- Multi-priority queues with preemption
- Automatic failover and checkpoint resume
- Topology-aware placement (via custom scheduler plugins)
- Preflight validation with Primus-Bench integration
- Multi-tenant workspace isolation
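
The Job Manager drives standard Kubeflow resources, so a workload it manages might look like the minimal PyTorchJob sketch below. The image is a placeholder; amd.com/gpu is the resource name exposed by the AMD GPU device plugin.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-pretrain
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                             # required container name
              image: registry.example.com/train:latest  # placeholder image
              resources:
                limits:
                  amd.com/gpu: 8                        # AMD device-plugin resource
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train:latest
              resources:
                limits:
                  amd.com/gpu: 8
```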
Usage:
```bash
cd SaFE/bootstrap
bash install.sh
```

Learn More: SaFE/README.md
Location: Lens/
Primus-Lens provides comprehensive observability across infrastructure and training workloads with real-time metrics, logs, and interactive dashboards.
Components:
- Metrics Stack:
  - VictoriaMetrics cluster for high-performance time-series storage
  - Custom exporters: GPU resources, network statistics, node hardware, workloads
  - VMAgent for metrics collection and aggregation
- Logging Stack:
  - OpenSearch for distributed log storage and search
  - Fluent Bit for log collection and forwarding
  - Structured logging with correlation to metrics
- Visualization:
  - Grafana with pre-built dashboards for cluster, node, GPU, and job metrics
  - Custom alerts and notifications
  - Training toolkit integration for job-level telemetry
- Storage:
  - PostgreSQL for metadata and configuration storage
Key Metrics:
- GPU utilization, memory, temperature, power
- Network throughput, latency, packet loss, RDMA statistics
- Storage I/O, filesystem performance
- Training progress, throughput, loss curves, checkpoint timing
- Pod lifecycle, resource allocation, scheduling latency
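
All of these series are queryable through VictoriaMetrics' Prometheus-compatible HTTP API. A sketch, assuming the cluster-mode vmselect endpoint and a hypothetical GPU utilization metric name (check your exporter's /metrics output for the real one):

```bash
# vmselect's Prometheus-compatible query endpoint (cluster mode);
# the service address and the metric name are assumptions.
curl -s 'http://vmselect:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=avg(gpu_utilization) by (node)'
```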
Usage:
```bash
cd Lens/bootstrap
bash install.sh
```

Configuration: Edit the bootstrap/manifests/*.yaml.tpl templates before installation.
Location: Bench/
Primus-Bench provides comprehensive node validation and performance benchmarking to ensure every node in the cluster meets baseline standards before running production training workloads. It intelligently filters out unhealthy nodes and generates detailed health reports.
Test Workflow:
- SSH Preflight:
  - Establishes SSH connectivity across all nodes
  - Synchronizes SSH keys and configurations
  - Validates node accessibility
- Node Health Checks:
  - Hardware validation (GPUs, ROCm drivers, memory)
  - System configuration verification
  - Environment variable validation
  - Network interface verification
  - RDMA device availability checks
- Network Diagnostics:
  - Multi-node all-reduce performance tests
  - All-to-all communication tests
  - InfiniBand bandwidth validation (ib_write_bw)
  - Automatic retry with unhealthy node removal
  - Network topology validation
- Performance Benchmarks:
  - I/O benchmarks (optional): FIO and IOR tests for storage performance
  - Computation-communication overlap: measures the ability to overlap compute and communication
  - Kernel launch overhead: evaluates GPU kernel dispatch latency
  - Results exported as JSON for analysis
- Health Report Generation:
  - Comprehensive node status summary
  - Failed node categorization (node check vs. network check)
  - Healthy node inventory for subsequent operations
  - Detailed logs for troubleshooting
Execution Modes:
- Bare Metal Mode: Direct execution on bare-metal servers with automatic Docker installation
- Local Mode: Container-based execution for single or multi-node setups
- SLURM Mode: Integration with SLURM job scheduler for HPC environments
- Kubernetes Mode: PyTorchJob-based execution in Kubernetes clusters
Usage:
```bash
cd Bench

# Configure your environment (edit as needed)
vim config.sh

# Bare metal mode with automatic Docker installation
vim hosts.ini     # Configure target nodes
bash run_bare_metal.sh

# SLURM mode (with automatic resource allocation)
NNODES=4 bash run_slurm.sh

# SLURM mode (within existing allocation)
bash run_slurm.sh --no-allocate

# Local mode (for testing on current node)
bash run_local.sh

# Kubernetes mode
kubectl apply -f kubernetes/pytorchjob.yaml
```

Output Structure:

```
outputs/
└── YYYY-MM-DD_HH-MM-SS/
    ├── primusbench.log               # Main execution log
    ├── preflight_node.log            # Node check details
    ├── preflight_network.log         # Network check results
    ├── bench_report.txt              # Summary health report
    ├── overlap_results.json          # CCO benchmark results
    ├── kernel_overhead_results.json  # Kernel launch metrics
    └── <node_name>/                  # Per-node logs and results
        ├── cco.log
        └── kernel_launch.log
```
Configuration: All settings are centralized in config.sh, including container images, network interfaces, GPU settings, and benchmark parameters.
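
As a rough sketch of the kind of settings it centralizes (every variable name below is hypothetical; use the names documented in Bench/README.md and in config.sh itself):

```bash
# Hypothetical variable names, for illustration only.
BENCH_IMAGE="registry.example.com/primus-bench:latest"  # benchmark container image
NETWORK_IFACE="eth0"          # NIC used for collective communication tests
RDMA_DEVICES="mlx5_0,mlx5_1"  # RDMA HCAs validated by ib_write_bw
GPUS_PER_NODE=8               # GPUs exercised on each node
RUN_IO_BENCH=0                # opt in to the FIO/IOR storage benchmarks
```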
Learn More: Bench/README.md
Location: Scheduler-Plugins/
Custom Kubernetes scheduler plugins that extend the default scheduler with advanced capabilities tailored for AI workloads.
Available Plugins:
- TopologyIPSort:
  - IP-based node sorting for network locality
  - Pod group support for gang scheduling (see the sketch after this list)
  - Priority-based queue sorting
  - Co-scheduling for distributed workloads
  - Kubeflow training workload integration
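
If the plugins follow the upstream scheduler-plugins conventions, gang scheduling pairs a PodGroup object with a label on each member pod. A minimal sketch, where the group name, member count, and image are placeholders:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: train-group
spec:
  minMember: 4    # admit pods only when all 4 can be scheduled together
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    scheduling.x-k8s.io/pod-group: train-group   # joins the gang
spec:
  schedulerName: custom-scheduler                # see Usage below
  containers:
    - name: trainer
      image: registry.example.com/train:latest   # placeholder image
```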
Extension Points:
- Score Plugin: Node evaluation and scoring
- Queue Sort Plugin: Custom pod ordering
- Permit Plugin: Co-scheduling and admission control
- Post Filter Plugin: Diagnostics and failure handling
Usage:
```bash
cd Scheduler-Plugins

# Install via Helm
helm install scheduler-plugins manifests/charts/scheduler-plugins/
```

Then reference the scheduler in your pod specifications:

```yaml
spec:
  schedulerName: custom-scheduler
```

Learn More: Scheduler-Plugins/README.md
Primus-SaFE is under active development. Planned enhancements include:
- Enhanced AMD Hardware Support:
  - MI450 series GPU integration
  - Latest ROCm stack compatibility
  - 400 Gbps AI NIC optimization
- Advanced Fault Tolerance:
  - Process-level failover mechanisms
  - Asynchronous checkpointing
  - Redundant training processes
  - Graceful degradation strategies
- Agentic Platform Automation:
  - Multi-agent systems (LangGraph, CrewAI) for cluster operations
  - Natural-language cluster management
  - Automated deployment and optimization
  - Self-healing and self-tuning capabilities
- AI-Powered Operations:
  - Predictive failure detection
  - Automated performance optimization
  - Intelligent resource allocation
This project is licensed under the Apache License 2.0. See the LICENSE file for full details.