
Primus-SaFE

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training



Overview

Primus-SaFE (Stability and Fault Endurance) is AMD's comprehensive, full-stack platform designed to address the critical challenges of multi-node AI training on AMD GPU clusters. Training large AI models demands unwavering stability and robust debugging capabilities at cluster scale, yet today's ROCm-based multi-node GPU deployments often rely on brittle scripts and disjointed tools to launch distributed jobs, monitor performance, and recover from failures.

Primus-SaFE transforms a collection of AMD GPU servers into a resilient, self-monitoring environment for next-generation model training by automating everything from cluster provisioning and intelligent job scheduling to real-time monitoring and hardware health validation. Running atop Kubernetes and integrated with the ROCm software stack, it provides:

  • Automated Cluster Deployment: Production-grade Kubernetes setup with optimized infrastructure
  • Intelligent Job Scheduling: Multi-priority queues, automatic failover, and topology-aware placement
  • Full-Stack Observability: Real-time metrics, interactive dashboards, and comprehensive telemetry
  • Preflight Validation: Rigorous health checks and performance benchmarking before workload deployment
  • Fault Tolerance: Automatic recovery mechanisms to minimize downtime and maximize goodput

🧩 Primus Product Matrix

  • Primus-LM (role: end-to-end training framework)
    • Key features: supports multiple training backends (Megatron, TorchTitan, etc.); provides high-performance, scalable distributed training; deeply integrates with Turbo and SaFE
    • Dependencies / integration: can invoke Primus-Turbo kernels and modules; runs on top of Primus-SaFE for stable scheduling
  • Primus-Turbo (role: high-performance operators & modules)
    • Key features: provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGemm, etc.); modular design, directly pluggable into Primus-LM; optimized for different architectures and precisions
    • Dependencies / integration: built on AITER, CK, hipBLASLt, Triton, and other operator libraries; can be enabled via configuration inside Primus-LM
  • Primus-SaFE (role: stability & platform layer)
    • Key features: cluster sanity checks and benchmarking; Kubernetes scheduling with topology awareness; fault tolerance; stability enhancements
    • Dependencies / integration: builds a training platform on the Kubernetes and Slurm ecosystems


Key Features

🚀 High-Availability, High-Performance Infrastructure

  • Rapid Deployment: Automated Kubernetes cluster provisioning on bare-metal servers
  • Unified Storage: JuiceFS for a high-throughput distributed file system with client-side caching
  • Secure Registry: Harbor for trusted container image management with RBAC
  • Scalable Gateway: Higress API gateway for high-concurrency user-facing services

🧠 Intelligent Scheduling and Resilience

  • Multi-Priority Queuing: Preemption support for urgent experiments
  • Automatic Failover: Resume training from checkpoints when nodes or GPUs fail
  • Topology-Aware Placement: Network-locality optimization for distributed training
  • Gang Scheduling: Co-schedule interdependent pods for distributed workloads
  • Health Validation: Preflight checks before scheduling production workloads

📊 Comprehensive Monitoring and Insight

  • Cluster-Wide Metrics: Real-time GPU, CPU, memory, network, and I/O telemetry
  • Job-Level Tracking: Training progress, throughput, loss curves, and checkpoint events
  • Interactive Dashboards: Grafana-based visualization with custom alerts
  • Root Cause Analysis: Correlated metrics and logs for quick issue diagnosis

Preflight Validation and Performance Consistency

  • Hardware Diagnostics: Verify GPUs, drivers, and network interfaces
  • Micro-Benchmarks: Standard AI operations to detect underperforming nodes
  • Trial Runs: Small-scale training jobs to validate end-to-end functionality
  • Performance Baselines: Ensure all nodes meet minimum performance standards

📈 Extreme Scalability

  • Tested on clusters from a few GPUs to tens of thousands of GPUs
  • Architecture designed to handle 100,000+ GPU accelerators
  • Scales from small lab setups to massive enterprise deployments

Architecture

Primus-SaFE's functionality is organized into four core modules that work together to provide a complete training platform.

Figure 1. Primus-SaFE Full-Stack Architecture

Quick Start

Follow these steps to deploy the complete Primus-SaFE platform on your AMD GPU cluster:

1. Clone the Repository

git clone https://github.com/AMD-AGI/Primus-SaFE.git
cd Primus-SaFE

2. Bootstrap the Kubernetes Cluster

cd Bootstrap

# Edit hosts.ini to specify your cluster nodes and roles
vim hosts.ini

# Run the bootstrap script to deploy Kubernetes
bash bootstrap.sh
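
As a point of reference, hosts.ini follows a Kubespray-style Ansible inventory, since the bootstrap is built on Kubespray. The sketch below is illustrative only; host names and IPs are placeholders, and the exact group names bootstrap.sh expects may differ (see Bootstrap/README.md):

[all]
node1 ansible_host=10.0.1.11
node2 ansible_host=10.0.1.12
node3 ansible_host=10.0.1.13

[kube_control_plane]
node1

[etcd]
node1

[kube_node]
node2
node3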

This deploys a production-grade Kubernetes cluster with:

  • High-availability control plane and etcd
  • JuiceFS distributed storage
  • Harbor container registry
  • Higress API gateway

See Bootstrap/README.md for detailed configuration options.

3. Deploy Observability with Primus-Lens

cd ../Lens/bootstrap

# Install Primus-Lens monitoring and logging components
bash install.sh

This installs:

  • VictoriaMetrics for time-series metrics storage
  • Grafana for dashboards and visualization
  • OpenSearch for log aggregation
  • Custom exporters for GPU, network, and workload metrics

See Lens/README.md for configuration details.

4. Install the Primus-SaFE Platform Layer

cd ../../SaFE/bootstrap

# Deploy Primus-SaFE stability and scheduling components
bash install.sh

This installs:

  • API Server for unified management interface
  • Job Manager for workload lifecycle management
  • Resource Manager for cluster and node management
  • Node Agent for health monitoring
  • Webhooks for request validation and modification
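
A quick way to confirm these components came up is to list the pods and admission webhooks. The namespace below is an assumption; use whatever install.sh reports:

# List platform pods (namespace is an assumption)
kubectl get pods -n primus-safe

# Confirm the admission webhooks are registered
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i primus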

See SaFE/README.md for detailed installation guide.

5. (Optional) Run Health Checks with Primus-Bench

cd ../../Bench

# Configure benchmark settings
vim config.sh

# Edit hosts.ini to specify nodes to benchmark
vim hosts.ini

# Run comprehensive health checks and benchmarks
bash run_bare_metal.sh

This runs:

  • SSH connectivity and configuration validation
  • Node hardware and system health checks
  • Network performance and reliability tests (with automatic retry and unhealthy node filtering)
  • I/O performance benchmarks (optional)
  • Computation-communication overlap measurements
  • Kernel launch overhead analysis
  • Detailed health report generation with a pass/fail node inventory

See Bench/README.md for different execution modes (bare-metal, SLURM, local, Kubernetes).


Core Modules

Primus-Bootstrap: Rapid Cluster Deployment

Location: Bootstrap/

Primus-Bootstrap automates the deployment of a production-grade Kubernetes cluster on bare-metal servers, provisioning key infrastructure components optimized for AI workloads.

Key Components:

  • Kubernetes Cluster: High-availability setup using Kubespray with redundant control-plane and etcd
  • JuiceFS Storage: Distributed file system with metadata/data separation and client-side caching
  • Harbor Registry: Secure container image management with RBAC and image scanning
  • Higress Gateway: High-concurrency API gateway with WebAssembly plugin support

Usage:

cd Bootstrap
vim hosts.ini  # Configure your cluster nodes
bash bootstrap.sh
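
After the playbooks finish, a minimal sanity check from a control-plane node might look like the following; the component pod name patterns are assumptions used only to filter the output:

# All nodes should report Ready
kubectl get nodes -o wide

# Infrastructure components deployed by the bootstrap (name patterns are assumptions)
kubectl get pods -A | grep -Ei 'juicefs|harbor|higress'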

Learn More: Bootstrap/README.md


Primus-SaFE Core: Intelligent Job Scheduling and Fault Tolerance

Location: SaFE/

The Primus-SaFE Core extends Kubernetes with AI-specific scheduling and fault tolerance capabilities to maximize throughput and reliability for long-running training jobs.

Core Services:

  1. API Server: Unified interface for resource and workload management, user authentication, and SSH access
  2. Job Manager: Full lifecycle management of PyTorchJob, Job, and Deployment workloads, with intelligent scheduling, queuing, and automatic retry
  3. Resource Manager: Centralized management of clusters, nodes, workspaces, storage, and operations
  4. Webhooks: Kubernetes admission controller for request validation and resource modification
  5. Node Agent: Node-level monitoring, fault detection, and self-healing capabilities

Key Features:

  • Multi-priority queues with preemption
  • Automatic failover and checkpoint resume
  • Topology-aware placement (via custom scheduler plugins)
  • Preflight validation with Primus-Bench integration
  • Multi-tenant workspace isolation

Usage:

cd SaFE/bootstrap
bash install.sh
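
For illustration, the Job Manager operates on standard Kubeflow PyTorchJob resources. The sketch below is a minimal example of such a workload; the job name, image, GPU count, and schedulerName wiring are assumptions for illustration, not Primus-SaFE-specific syntax:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-pretrain                            # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: custom-scheduler        # assumption: scheduler deployed via Scheduler-Plugins
          containers:
            - name: pytorch
              image: registry.example.com/primus-lm:latest   # hypothetical image
              resources:
                limits:
                  amd.com/gpu: 8                 # AMD GPU device-plugin resource
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: custom-scheduler
          containers:
            - name: pytorch
              image: registry.example.com/primus-lm:latest
              resources:
                limits:
                  amd.com/gpu: 8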

Learn More: SaFE/README.md


Primus-Lens: Full-Stack Observability & Visualization

Location: Lens/

Primus-Lens provides comprehensive observability across infrastructure and training workloads with real-time metrics, logs, and interactive dashboards.

Components:

  1. Metrics Stack:

    • VictoriaMetrics cluster for high-performance time-series storage
    • Custom exporters: GPU resources, network statistics, node hardware, workloads
    • VMAgent for metrics collection and aggregation
  2. Logging Stack:

    • OpenSearch for distributed log storage and search
    • Fluent Bit for log collection and forwarding
    • Structured logging with correlation to metrics
  3. Visualization:

    • Grafana with pre-built dashboards for cluster, node, GPU, and job metrics
    • Custom alerts and notifications
    • Training toolkit integration for job-level telemetry
  4. Storage:

    • PostgreSQL for metadata and configuration storage

Key Metrics:

  • GPU utilization, memory, temperature, power
  • Network throughput, latency, packet loss, RDMA statistics
  • Storage I/O, filesystem performance
  • Training progress, throughput, loss curves, checkpoint timing
  • Pod lifecycle, resource allocation, scheduling latency

Usage:

cd Lens/bootstrap
bash install.sh
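
Once the stack is up, the Grafana dashboards can be reached with a port-forward. The namespace and service name below are assumptions; adjust them to what install.sh actually creates:

# Check the monitoring pods (namespace is an assumption)
kubectl get pods -n primus-lens

# Forward the Grafana UI to your workstation, then open http://localhost:3000
kubectl port-forward -n primus-lens svc/grafana 3000:3000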

Configuration: Edit bootstrap/manifests/*.yaml.tpl templates before installation.


Primus-Bench: Node Health Checks and Performance Benchmarking

Location: Bench/

Primus-Bench provides comprehensive node validation and performance benchmarking to ensure every node in the cluster meets baseline standards before running production training workloads. It intelligently filters out unhealthy nodes and generates detailed health reports.

Test Workflow:

  1. SSH Preflight:

    • Establishes SSH connectivity across all nodes
    • Synchronizes SSH keys and configurations
    • Validates node accessibility
  2. Node Health Checks:

    • Hardware validation (GPUs, ROCm drivers, memory)
    • System configuration verification
    • Environment variable validation
    • Network interface verification
    • RDMA device availability checks
  3. Network Diagnostics:

    • Multi-node all-reduce performance tests
    • All-to-all communication tests
    • InfiniBand bandwidth validation (ib_write_bw)
    • Automatic retry with unhealthy node removal
    • Network topology validation
  4. Performance Benchmarks:

    • I/O Benchmarks (optional): FIO and IOR tests for storage performance
    • Computation-Communication Overlap: Measures ability to overlap compute and communication
    • Kernel Launch Overhead: Evaluates GPU kernel dispatch latency
    • Results exported as JSON for analysis
  5. Health Report Generation:

    • Comprehensive node status summary
    • Failed nodes categorization (node check vs network check)
    • Healthy nodes inventory for subsequent operations
    • Detailed logs for troubleshooting

Execution Modes:

  • Bare Metal Mode: Direct execution on bare-metal servers with Docker installation
  • Local Mode: Container-based execution for single or multi-node setups
  • SLURM Mode: Integration with SLURM job scheduler for HPC environments
  • Kubernetes Mode: PyTorchJob-based execution in Kubernetes clusters

Usage:

cd Bench

# Configure your environment (edit as needed)
vim config.sh

# Bare metal mode with automatic Docker installation
vim hosts.ini  # Configure target nodes
bash run_bare_metal.sh

# SLURM mode (with automatic resource allocation)
NNODES=4 bash run_slurm.sh

# SLURM mode (within existing allocation)
bash run_slurm.sh --no-allocate

# Local mode (for testing on current node)
bash run_local.sh

# Kubernetes mode
kubectl apply -f kubernetes/pytorchjob.yaml

Output Structure:

outputs/
└── YYYY-MM-DD_HH-MM-SS/
    ├── primusbench.log          # Main execution log
    ├── preflight_node.log        # Node check details
    ├── preflight_network.log     # Network check results
    ├── bench_report.txt          # Summary health report
    ├── overlap_results.json      # CCO benchmark results
    ├── kernel_overhead_results.json  # Kernel launch metrics
    └── <node_name>/              # Per-node logs and results
        ├── cco.log
        └── kernel_launch.log
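
To spot-check a run, the summary report and JSON results can be inspected directly. The JSON keys depend on the benchmark version, so the commands below only pretty-print the files named in the output tree:

# Substitute <timestamp> with the YYYY-MM-DD_HH-MM-SS directory of your run
cat outputs/<timestamp>/bench_report.txt
python3 -m json.tool outputs/<timestamp>/overlap_results.json
python3 -m json.tool outputs/<timestamp>/kernel_overhead_results.json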

Configuration: All settings are centralized in config.sh, including container images, network interfaces, GPU settings, and benchmark parameters.

Learn More: Bench/README.md


Scheduler Plugins: Advanced Scheduling Capabilities

Location: Scheduler-Plugins/

Custom Kubernetes scheduler plugins that extend the default scheduler with advanced capabilities tailored for AI workloads.

Available Plugins:

  1. TopologyIPSort:
    • IP-based node sorting for network locality
    • Pod group support for gang scheduling
    • Priority-based queue sorting
    • Co-scheduling for distributed workloads
    • Kubeflow training workload integration

Extension Points:

  • Score Plugin: Node evaluation and scoring
  • Queue Sort Plugin: Custom pod ordering
  • Permit Plugin: Co-scheduling and admission control
  • Post Filter Plugin: Diagnostics and failure handling

Usage:

cd Scheduler-Plugins

# Install via Helm
helm install scheduler-plugins manifests/charts/scheduler-plugins/

# Use in pod specifications
spec:
  schedulerName: custom-scheduler
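
For gang scheduling, the upstream scheduler-plugins project pairs the scheduler with a PodGroup resource and a pod label. The sketch below follows those upstream conventions; whether this fork uses the same CRD group and label key is an assumption, so check Scheduler-Plugins/README.md:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-training-pg          # hypothetical group name
spec:
  minMember: 4                    # schedule only when all 4 pods can be placed together

Pods (or job pod templates) then opt in via the scheduler name and a group label, e.g.:

metadata:
  labels:
    scheduling.x-k8s.io/pod-group: demo-training-pg
spec:
  schedulerName: custom-scheduler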

Learn More: Scheduler-Plugins/README.md


Roadmap

Primus-SaFE is under active development. Planned enhancements include:

  • Enhanced AMD Hardware Support:

    • MI450 series GPU integration
    • Latest ROCm stack compatibility
    • 400 Gbps AI NIC optimization
  • Advanced Fault Tolerance:

    • Process-level failover mechanisms
    • Asynchronous checkpointing
    • Redundant training processes
    • Graceful degradation strategies
  • Agentic Platform Automation:

    • Multi-agent systems (LangGraph, CrewAI) for cluster operations
    • Natural-language cluster management
    • Automated deployment and optimization
    • Self-healing and self-tuning capabilities
  • AI-Powered Operations:

    • Predictive failure detection
    • Automated performance optimization
    • Intelligent resource allocation

License

This project is licensed under the Apache License 2.0. See the LICENSE file for full details.
