
Primus-SaFE

Stability at Scale: AMD’s Full‑Stack Platform for Large‑Model Training



Overview

Primus-SaFE (Stability and Fault Endurance) is AMD's comprehensive, full-stack platform designed to address the critical challenges of multi-node AI training on AMD GPU clusters. Training large AI models demands unwavering stability and robust debugging capabilities at cluster scale, yet today's ROCm-based multi-node GPU deployments often rely on brittle scripts and disjointed tools to launch distributed jobs, monitor performance, and recover from failures.

Primus-SaFE transforms a collection of AMD GPU servers into a resilient, self-monitoring environment for next-generation model training by automating everything from cluster provisioning and intelligent job scheduling to real-time monitoring and hardware health validation. Running atop Kubernetes and integrated with the ROCm software stack, it provides:

  • Automated Cluster Deployment: Production-grade Kubernetes setup with optimized infrastructure
  • Intelligent Job Scheduling: Multi-priority queues, automatic failover, and topology-aware placement
  • Full-Stack Observability: Real-time metrics, interactive dashboards, and comprehensive telemetry
  • Preflight Validation: Rigorous health checks and performance benchmarking before workload deployment
  • Fault Tolerance: Automatic recovery mechanisms to minimize downtime and maximize goodput

🧩 Primus Product Matrix

  • Primus-LM (role: end-to-end training framework)
    • Key features: supports multiple training backends (Megatron, TorchTitan, etc.); provides high-performance, scalable distributed training; deeply integrates with Turbo and SaFE
    • Dependencies / integration: can invoke Primus-Turbo kernels and modules; runs on top of Primus-SaFE for stable scheduling
  • Primus-Turbo (role: high-performance operators & modules)
    • Key features: provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGemm, etc.); modular design, directly pluggable into Primus-LM; optimized for different architectures and precisions
    • Dependencies / integration: built on AITER, CK, hipBLASLt, Triton, and other operator libraries; can be enabled via configuration inside Primus-LM
  • Primus-SaFE (role: stability & platform layer)
    • Key features: cluster sanity checks and benchmarking; Kubernetes scheduling with topology awareness; fault tolerance; stability enhancements
    • Dependencies / integration: builds a training platform on the Kubernetes and Slurm ecosystems


Key Features

🚀 High-Availability, High-Performance Infrastructure

  • Rapid Deployment: Automated Kubernetes cluster provisioning on bare-metal servers
  • Unified Storage: JuiceFS for a high-throughput distributed file system with client-side caching
  • Secure Registry: Harbor for trusted container image management with RBAC
  • Scalable Gateway: Higress API gateway for high-concurrency user-facing services

🧠 Intelligent Scheduling and Resilience

  • Multi-Priority Queuing: Preemption support for urgent experiments
  • Automatic Failover: Resume training from checkpoints when nodes or GPUs fail
  • Topology-Aware Placement: Network-locality optimization for distributed training
  • Gang Scheduling: Co-schedule interdependent pods for distributed workloads
  • Health Validation: Preflight checks before scheduling production workloads

📊 Comprehensive Monitoring and Insight

  • Cluster-Wide Metrics: Real-time GPU, CPU, memory, network, and I/O telemetry
  • Job-Level Tracking: Training progress, throughput, loss curves, and checkpoint events
  • Interactive Dashboards: Grafana-based visualization with custom alerts
  • Root Cause Analysis: Correlated metrics and logs for quick issue diagnosis

Preflight Validation and Performance Consistency

  • Hardware Diagnostics: Verify GPUs, drivers, and network interfaces
  • Micro-Benchmarks: Standard AI operations to detect underperforming nodes
  • Trial Runs: Small-scale training jobs to validate end-to-end functionality
  • Performance Baselines: Ensure all nodes meet minimum performance standards

📈 Extreme Scalability

  • Tested on clusters from a few GPUs to tens of thousands of GPUs
  • Architecture designed to handle 100,000+ GPU accelerators
  • Scales from small lab setups to massive enterprise deployments

Architecture

Primus-SaFE's functionality is organized into four core modules that work together to provide a complete training platform.

Figure 1. Primus-SaFE Full-Stack Architecture

Quick Start

Follow these steps to deploy the complete Primus-SaFE platform on your AMD GPU cluster:

1. Clone the Repository

git clone https://github.com/AMD-AGI/Primus-SaFE.git
cd Primus-SaFE

2. Bootstrap the Kubernetes Cluster

cd Bootstrap

# Edit hosts.ini to specify your cluster nodes and roles
vim hosts.ini

# Run the bootstrap script to deploy Kubernetes
bash bootstrap.sh
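
As a point of reference, hosts.ini follows a Kubespray-style Ansible inventory, since the bootstrap is built on Kubespray. The sketch below is illustrative only; host names and IPs are placeholders, and the exact group names bootstrap.sh expects may differ (see Bootstrap/README.md):

[all]
node1 ansible_host=10.0.1.11
node2 ansible_host=10.0.1.12
node3 ansible_host=10.0.1.13

[kube_control_plane]
node1

[etcd]
node1

[kube_node]
node2
node3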

This deploys a production-grade Kubernetes cluster with:

  • High-availability control plane and etcd
  • JuiceFS distributed storage
  • Harbor container registry
  • Higress API gateway

See Bootstrap/README.md for detailed configuration options.

3. Deploy Observability with Primus-Lens

cd ../Lens/bootstrap

# Install Primus-Lens monitoring and logging components
bash install.sh

This installs:

  • VictoriaMetrics for time-series metrics storage
  • Grafana for dashboards and visualization
  • OpenSearch for log aggregation
  • Custom exporters for GPU, network, and workload metrics

See Lens/README.md for configuration details.

4. Install the Primus-SaFE Platform Layer

cd ../../SaFE/bootstrap

# Deploy Primus-SaFE stability and scheduling components
bash install.sh

This installs:

  • API Server for unified management interface
  • Job Manager for workload lifecycle management
  • Resource Manager for cluster and node management
  • Node Agent for health monitoring
  • Webhooks for request validation and modification
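
A quick way to confirm these components came up is to list the pods and admission webhooks. The namespace below is an assumption; use whatever install.sh reports:

# List platform pods (namespace is an assumption)
kubectl get pods -n primus-safe

# Confirm the admission webhooks are registered
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i primus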

See SaFE/README.md for detailed installation guide.

5. (Optional) Run Health Checks with Primus-Bench

cd ../../Bench

# Configure benchmark settings
vim config.sh

# Edit hosts.ini to specify nodes to benchmark
vim hosts.ini

# Run comprehensive health checks and benchmarks
bash run_bare_metal.sh

This runs:

  • SSH connectivity and configuration validation
  • Node hardware and system health checks
  • Network performance and reliability tests (with automatic retry and unhealthy node filtering)
  • I/O performance benchmarks (optional)
  • Computation-communication overlap measurements
  • Kernel launch overhead analysis
  • Detailed health report generation with a pass/fail node inventory

See Bench/README.md for different execution modes (bare-metal, SLURM, local, Kubernetes).


Core Modules

Primus-Bootstrap: Rapid Cluster Deployment

Location: Bootstrap/

Primus-Bootstrap automates the deployment of a production-grade Kubernetes cluster on bare-metal servers, provisioning key infrastructure components optimized for AI workloads.

Key Components:

  • Kubernetes Cluster: High-availability setup using Kubespray with redundant control-plane and etcd
  • JuiceFS Storage: Distributed file system with metadata/data separation and client-side caching
  • Harbor Registry: Secure container image management with RBAC and image scanning
  • Higress Gateway: High-concurrency API gateway with WebAssembly plugin support

Usage:

cd Bootstrap
vim hosts.ini  # Configure your cluster nodes
bash bootstrap.sh
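
After the playbooks finish, a minimal sanity check from a control-plane node might look like the following; the component pod name patterns are assumptions used only to filter the output:

# All nodes should report Ready
kubectl get nodes -o wide

# Infrastructure components deployed by the bootstrap (name patterns are assumptions)
kubectl get pods -A | grep -Ei 'juicefs|harbor|higress'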

Learn More: Bootstrap/README.md


Primus-SaFE Core: Intelligent Job Scheduling and Fault Tolerance

Location: SaFE/

The Primus-SaFE Core extends Kubernetes with AI-specific scheduling and fault tolerance capabilities to maximize throughput and reliability for long-running training jobs.

Core Services:

  1. API Server: Unified interface for resource and workload management, user authentication, and SSH access
  2. Job Manager: Full lifecycle management of PyTorchJob, Job, and Deployment workloads, with intelligent scheduling, queuing, and automatic retry
  3. Resource Manager: Centralized management of clusters, nodes, workspaces, storage, and operations
  4. Webhooks: Kubernetes admission controller for request validation and resource modification
  5. Node Agent: Node-level monitoring, fault detection, and self-healing capabilities

Key Features:

  • Multi-priority queues with preemption
  • Automatic failover and checkpoint resume
  • Topology-aware placement (via custom scheduler plugins)
  • Preflight validation with Primus-Bench integration
  • Multi-tenant workspace isolation

Usage:

cd SaFE/bootstrap
bash install.sh
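
For illustration, the Job Manager operates on standard Kubeflow PyTorchJob resources. The sketch below is a minimal example of such a workload; the job name, image, GPU count, and schedulerName wiring are assumptions for illustration, not Primus-SaFE-specific syntax:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-pretrain                            # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: custom-scheduler        # assumption: scheduler deployed via Scheduler-Plugins
          containers:
            - name: pytorch
              image: registry.example.com/primus-lm:latest   # hypothetical image
              resources:
                limits:
                  amd.com/gpu: 8                 # AMD GPU device-plugin resource
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: custom-scheduler
          containers:
            - name: pytorch
              image: registry.example.com/primus-lm:latest
              resources:
                limits:
                  amd.com/gpu: 8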

Learn More: SaFE/README.md


Primus-Lens: Full-Stack Observability & Visualization

Location: Lens/

Primus-Lens provides comprehensive observability across infrastructure and training workloads with real-time metrics, logs, and interactive dashboards.

Components:

  1. Metrics Stack:

    • VictoriaMetrics cluster for high-performance time-series storage
    • Custom exporters: GPU resources, network statistics, node hardware, workloads
    • VMAgent for metrics collection and aggregation
  2. Logging Stack:

    • OpenSearch for distributed log storage and search
    • Fluent Bit for log collection and forwarding
    • Structured logging with correlation to metrics
  3. Visualization:

    • Grafana with pre-built dashboards for cluster, node, GPU, and job metrics
    • Custom alerts and notifications
    • Training toolkit integration for job-level telemetry
  4. Storage:

    • PostgreSQL for metadata and configuration storage

Key Metrics:

  • GPU utilization, memory, temperature, power
  • Network throughput, latency, packet loss, RDMA statistics
  • Storage I/O, filesystem performance
  • Training progress, throughput, loss curves, checkpoint timing
  • Pod lifecycle, resource allocation, scheduling latency

Usage:

cd Lens/bootstrap
bash install.sh
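
Once the stack is up, the Grafana dashboards can be reached with a port-forward. The namespace and service name below are assumptions; adjust them to what install.sh actually creates:

# Check the monitoring pods (namespace is an assumption)
kubectl get pods -n primus-lens

# Forward the Grafana UI to your workstation, then open http://localhost:3000
kubectl port-forward -n primus-lens svc/grafana 3000:3000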

Configuration: Edit bootstrap/manifests/*.yaml.tpl templates before installation.


Primus-Bench: Node Health Checks and Performance Benchmarking

Location: Bench/

Primus-Bench provides comprehensive node validation and performance benchmarking to ensure every node in the cluster meets baseline standards before running production training workloads. It intelligently filters out unhealthy nodes and generates detailed health reports.

Test Workflow:

  1. SSH Preflight:

    • Establishes SSH connectivity across all nodes
    • Synchronizes SSH keys and configurations
    • Validates node accessibility
  2. Node Health Checks:

    • Hardware validation (GPUs, ROCm drivers, memory)
    • System configuration verification
    • Environment variable validation
    • Network interface verification
    • RDMA device availability checks
  3. Network Diagnostics:

    • Multi-node all-reduce performance tests
    • All-to-all communication tests
    • InfiniBand bandwidth validation (ib_write_bw)
    • Automatic retry with unhealthy node removal
    • Network topology validation
  4. Performance Benchmarks:

    • I/O Benchmarks (optional): FIO and IOR tests for storage performance
    • Computation-Communication Overlap: Measures ability to overlap compute and communication
    • Kernel Launch Overhead: Evaluates GPU kernel dispatch latency
    • Results exported as JSON for analysis
  5. Health Report Generation:

    • Comprehensive node status summary
    • Failed nodes categorization (node check vs network check)
    • Healthy nodes inventory for subsequent operations
    • Detailed logs for troubleshooting

Execution Modes:

  • Bare Metal Mode: Direct execution on bare-metal servers with Docker installation
  • Local Mode: Container-based execution for single or multi-node setups
  • SLURM Mode: Integration with SLURM job scheduler for HPC environments
  • Kubernetes Mode: PyTorchJob-based execution in Kubernetes clusters

Usage:

cd Bench

# Configure your environment (edit as needed)
vim config.sh

# Bare metal mode with automatic Docker installation
vim hosts.ini  # Configure target nodes
bash run_bare_metal.sh

# SLURM mode (with automatic resource allocation)
NNODES=4 bash run_slurm.sh

# SLURM mode (within existing allocation)
bash run_slurm.sh --no-allocate

# Local mode (for testing on current node)
bash run_local.sh

# Kubernetes mode
kubectl apply -f kubernetes/pytorchjob.yaml

Output Structure:

outputs/
└── YYYY-MM-DD_HH-MM-SS/
    ├── primusbench.log          # Main execution log
    ├── preflight_node.log        # Node check details
    ├── preflight_network.log     # Network check results
    ├── bench_report.txt          # Summary health report
    ├── overlap_results.json      # CCO benchmark results
    ├── kernel_overhead_results.json  # Kernel launch metrics
    └── <node_name>/              # Per-node logs and results
        ├── cco.log
        └── kernel_launch.log
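
To spot-check a run, the summary report and JSON results can be inspected directly. The JSON keys depend on the benchmark version, so the commands below only pretty-print the files named in the output tree:

# Substitute <timestamp> with the YYYY-MM-DD_HH-MM-SS directory of your run
cat outputs/<timestamp>/bench_report.txt
python3 -m json.tool outputs/<timestamp>/overlap_results.json
python3 -m json.tool outputs/<timestamp>/kernel_overhead_results.json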

Configuration: All settings are centralized in config.sh, including container images, network interfaces, GPU settings, and benchmark parameters.

Learn More: Bench/README.md


Scheduler Plugins: Advanced Scheduling Capabilities

Location: Scheduler-Plugins/

Custom Kubernetes scheduler plugins that extend the default scheduler with advanced capabilities tailored for AI workloads.

Available Plugins:

  1. TopologyIPSort:
    • IP-based node sorting for network locality
    • Pod group support for gang scheduling
    • Priority-based queue sorting
    • Co-scheduling for distributed workloads
    • Kubeflow training workload integration

Extension Points:

  • Score Plugin: Node evaluation and scoring
  • Queue Sort Plugin: Custom pod ordering
  • Permit Plugin: Co-scheduling and admission control
  • Post Filter Plugin: Diagnostics and failure handling

Usage:

cd Scheduler-Plugins

# Install via Helm
helm install scheduler-plugins manifests/charts/scheduler-plugins/

# Use in pod specifications
spec:
  schedulerName: custom-scheduler
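
For gang scheduling, the upstream scheduler-plugins project pairs the scheduler with a PodGroup resource and a pod label. The sketch below follows those upstream conventions; whether this fork uses the same CRD group and label key is an assumption, so check Scheduler-Plugins/README.md:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-training-pg          # hypothetical group name
spec:
  minMember: 4                    # schedule only when all 4 pods can be placed together

Pods (or job pod templates) then opt in via the scheduler name and a group label, e.g.:

metadata:
  labels:
    scheduling.x-k8s.io/pod-group: demo-training-pg
spec:
  schedulerName: custom-scheduler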

Learn More: Scheduler-Plugins/README.md


Roadmap

Primus-SaFE is under active development. Planned enhancements include:

  • Enhanced AMD Hardware Support:

    • MI450 series GPU integration
    • Latest ROCm stack compatibility
    • 400 Gbps AI NIC optimization
  • Advanced Fault Tolerance:

    • Process-level failover mechanisms
    • Asynchronous checkpointing
    • Redundant training processes
    • Graceful degradation strategies
  • Agentic Platform Automation:

    • Multi-agent systems (LangGraph, CrewAI) for cluster operations
    • Natural-language cluster management
    • Automated deployment and optimization
    • Self-healing and self-tuning capabilities
  • AI-Powered Operations:

    • Predictive failure detection
    • Automated performance optimization
    • Intelligent resource allocation

License

This project is licensed under the Apache License 2.0. See the LICENSE file for full details.
