feat: Add MLflow artifact upload for traces and logs #440

gphuang · 2025-12-18T09:10:45Z

feat: Add MLflow artifact upload for traces and logs

Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.

Features

Upload PyTorch profiler trace files to MLflow artifacts/traces/
Upload training log files to MLflow artifacts/logs/
Unique timestamp-based output directories for multi-node consistency
Pass MLflow environment variables through Docker container

Config Options

mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow

Files Changed

primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
examples/run_pretrain.sh - Add timestamp-based output directories
examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage

When MLflow is enabled, artifacts are automatically uploaded at the end of training:

Trace files from tensorboard_dir → MLflow artifacts/traces/
Log files from exp_root_path/logs/ → MLflow artifacts/logs/

- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container

Copilot

Pull request overview

This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.

Key changes:

New artifact upload module with functions to collect and upload trace/log files to MLflow
Integration of artifact uploads before MLflow run completion in the trainer
Configuration options to control trace and log uploads (defaulting to enabled)
Shell script improvements for timestamp-based output directories with multi-node consistency

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.


File	Description
primus/backends/megatron/training/mlflow_artifacts.py	New module implementing trace/log file discovery and MLflow artifact upload functionality
primus/backends/megatron/training/global_vars.py	Adds global variable for exp_root_path and wrapper function for artifact uploads
primus/modules/trainer/megatron/trainer.py	Integrates artifact upload calls before MLflow run termination in two exit paths
primus/configs/modules/megatron/primus_megatron_module.yaml	Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true)
examples/run_slurm_pretrain.sh	Implements timestamp-based output directory naming and exports timestamp for multi-node consistency
examples/run_pretrain.sh	Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message
examples/run_local_pretrain.sh	Adds MLflow environment variables and Primus path variables to Docker container environment


primus/backends/megatron/training/mlflow_artifacts.py

examples/run_slurm_pretrain.sh

primus/backends/megatron/training/global_vars.py

primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

examples/run_pretrain.sh

primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

Copilot · 2025-12-18T10:20:26Z

@gphuang I've opened a new pull request, #441, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <[email protected]>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


Co-authored-by: Copilot <[email protected]>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


primus/modules/trainer/megatron/trainer.py

The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
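The failure mode described above can be reproduced with a short, self-contained sketch. The directory and file names below are hypothetical, modeled on the experiment name quoted in the comment; this is not the PR's actual code, only an illustration of why `glob.escape()` is needed on the directory part of the pattern.

```python
import glob
import os
import tempfile

# Hypothetical directory name containing square brackets, modeled on the
# experiment name quoted above; the file name is made up for illustration.
root = tempfile.mkdtemp()
exp_dir = os.path.join(root, "[deepseek_v2_lite-pretrain]-rank[0]")
os.makedirs(exp_dir)
trace_file = os.path.join(exp_dir, "step_1.pt.trace.json")
with open(trace_file, "w") as f:
    f.write("{}")

# Unescaped: each "[...]" is parsed as a character class, so the pattern
# does not match the literal directory name and nothing is found.
unescaped = glob.glob(os.path.join(exp_dir, "*.pt.trace.json"))

# Escaped: glob.escape() makes the directory part literal. The wildcard is
# appended afterwards so that it still acts as a pattern.
escaped = glob.glob(os.path.join(glob.escape(exp_dir), "*.pt.trace.json"))

print("unescaped matches:", len(unescaped))
print("escaped matches:", len(escaped))
```

Only the directory part is escaped; escaping the whole pattern would also turn `*` into a literal character.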

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.


primus/backends/megatron/training/mlflow_artifacts.py

gphuang · 2026-01-14T11:26:44Z

@wenxie-amd Could you review? Thanks.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


primus/backends/megatron/training/global_vars.py

primus/modules/trainer/megatron/trainer.py

primus/backends/megatron/training/mlflow_artifacts.py

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


primus/modules/trainer/megatron/trainer.py

primus/backends/megatron/training/global_vars.py

primus/backends/megatron/training/mlflow_artifacts.py

examples/run_slurm_pretrain.sh

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

primus/backends/megatron/training/mlflow_artifacts.py

examples/run_pretrain.sh

primus/backends/megatron/training/mlflow_artifacts.py

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Copilot · 2026-01-22T08:33:10Z

examples/run_pretrain.sh

+export PRIMUS_USER=""
+
+mkdir -p "$LOG_DIR"
+TRAIN_LOG="${LOG_DIR}/log_mp_pretrain.txt"


TRAIN_LOG is now always set to ${LOG_DIR}/log_mp_pretrain.txt, removing the previous ability to override the log location via the TRAIN_LOG environment variable. To avoid breaking existing workflows, keep the override pattern (e.g., only set a default when TRAIN_LOG is unset).

Suggested change

-TRAIN_LOG="${LOG_DIR}/log_mp_pretrain.txt"
+TRAIN_LOG="${TRAIN_LOG:-${LOG_DIR}/log_mp_pretrain.txt}"

Copilot · 2026-01-22T08:33:10Z

examples/run_slurm_pretrain.sh

+# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
+MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)


MODEL_NAME falls back to unknown when EXP is unset, but run_local_pretrain.sh provides a default EXP. This can lead to confusing output directories (e.g., unknown_<ts>) for users relying on defaults. Consider defaulting EXP here as well (or deriving MODEL_NAME after applying the same default).

Suggested change

-# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
-MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)
+# Set a default EXP if not provided, to align with run_local_pretrain.sh and avoid 'unknown_<ts>' names
+if [[ -z "${EXP:-}" ]]; then
+    export EXP="${SCRIPT_DIR}/megatron/exp_pretrain.yaml"
+fi
+# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
+MODEL_NAME=$(basename "${EXP}" .yaml)

Copilot · 2026-01-22T08:33:11Z

examples/run_local_pretrain.sh

+    --env PRIMUS_WORKSPACE \
+    --env PRIMUS_EXP_NAME \
+    --env TIMESTAMP \
+    --env LOG_DIR \
+    --env PRIMUS_TEAM \
+    --env PRIMUS_USER \


ENV_ARGS already forwards all PRIMUS_ variables into the container (env | grep "^PRIMUS_"), so explicitly passing --env PRIMUS_WORKSPACE/PRIMUS_EXP_NAME/PRIMUS_TEAM/PRIMUS_USER again is redundant and can be confusing to maintain. Prefer relying on the PRIMUS_ pass-through and keep explicit --env only for non-PRIMUS variables like TIMESTAMP/LOG_DIR.

Suggested change

-    --env PRIMUS_WORKSPACE \
-    --env PRIMUS_EXP_NAME \
-    --env TIMESTAMP \
-    --env LOG_DIR \
-    --env PRIMUS_TEAM \
-    --env PRIMUS_USER \
+    --env TIMESTAMP \
+    --env LOG_DIR \

Copilot · 2026-01-22T08:33:11Z

primus/backends/megatron/training/mlflow_artifacts.py

+import os
+from typing import Optional
+
+from primus.modules.module_utils import log_rank_0, warning_rank_0


mlflow_artifacts.py logs via log_rank_0/warning_rank_0, but MLflow is initialized on rank world_size - 1 (see global_vars._set_mlflow_writer), so these messages (including upload failures) will be suppressed in typical distributed runs. Use a rank filter that matches the MLflow rank (e.g., log_rank_last), or add/route warnings to a warning_rank_last/log_rank_all path so upload failures are visible.

Suggested change

-from primus.modules.module_utils import log_rank_0, warning_rank_0
+from primus.modules.module_utils import log_rank_last as log_rank_0, warning_rank_last as warning_rank_0
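To illustrate why a rank-0 filter hides messages produced on the MLflow rank, here is a minimal sketch of rank-filtered logging. The helpers `is_logging_rank` and `warn_on` are hypothetical and are not part of `primus.modules.module_utils`; reading `RANK` and `WORLD_SIZE` from the environment (as set by torchrun) is likewise an assumption made for the example.

```python
import os


def is_logging_rank(rank: int, world_size: int, target: str) -> bool:
    """Return True if `rank` should emit a message under the given filter.

    `target` is a hypothetical label: "first" mimics a rank-0 filter,
    "last" mimics a filter on rank world_size - 1.
    """
    if target == "first":
        return rank == 0
    if target == "last":
        return rank == world_size - 1
    raise ValueError(f"unknown target: {target!r}")


def warn_on(target: str, message: str) -> bool:
    """Print `message` only on the selected rank; return whether it printed."""
    # RANK and WORLD_SIZE are the variables torchrun exports; using them here
    # is an illustration, not necessarily how Primus resolves ranks.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if is_logging_rank(rank, world_size, target):
        print(f"[rank {rank}] WARNING: {message}")
        return True
    return False


# Simulate the process that owns the MLflow writer in an 8-rank job.
os.environ["RANK"] = "7"
os.environ["WORLD_SIZE"] = "8"
suppressed = warn_on("first", "artifact upload failed")  # rank-0 filter: silent
visible = warn_on("last", "artifact upload failed")      # last-rank filter: printed
```

On rank 7 of 8, the "first" filter drops the warning while the "last" filter prints it, which is the mismatch the review comment points out.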

Copilot · 2026-01-22T08:33:11Z

primus/backends/megatron/training/mlflow_artifacts.py

+def upload_artifacts_to_mlflow(
+    mlflow_writer,
+    tensorboard_dir: Optional[str] = None,
+    exp_root_path: Optional[str] = None,
+    upload_traces: bool = True,
+    upload_logs: bool = True,
+) -> dict:
+    """


Artifact upload behavior is new but currently has no unit tests. Consider adding tests that create a temp tensorboard_dir/exp_root_path with sample *.pt.trace.json(.gz) and *.log files and verify upload_artifacts_to_mlflow() calls mlflow_writer.log_artifact with the expected artifact_path subdirectories.
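A test along the lines suggested above might look like the following sketch. Because the PR's function body is not shown here, `upload_artifacts_sketch` is a hypothetical stand-in written only for this example; its return value (a dict of counts) and the `logs/` subdirectory under `exp_root_path` are assumptions based on the PR description. The mock replaces the MLflow writer, so the example runs without MLflow installed.

```python
import os
import tempfile
from unittest.mock import MagicMock


def upload_artifacts_sketch(mlflow_writer, tensorboard_dir, exp_root_path):
    """Hypothetical stand-in for upload_artifacts_to_mlflow(); not the PR's code."""
    uploaded = {"traces": 0, "logs": 0}
    for name in sorted(os.listdir(tensorboard_dir)):
        if name.endswith((".pt.trace.json", ".pt.trace.json.gz")):
            mlflow_writer.log_artifact(
                os.path.join(tensorboard_dir, name), artifact_path="traces"
            )
            uploaded["traces"] += 1
    log_dir = os.path.join(exp_root_path, "logs")
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".log"):
            mlflow_writer.log_artifact(
                os.path.join(log_dir, name), artifact_path="logs"
            )
            uploaded["logs"] += 1
    return uploaded


def run_sketch_test():
    """Build a temporary layout, run the stand-in, and return what was uploaded."""
    with tempfile.TemporaryDirectory() as root:
        tb_dir = os.path.join(root, "tensorboard")
        log_dir = os.path.join(root, "logs")
        os.makedirs(tb_dir)
        os.makedirs(log_dir)
        sample_files = [
            os.path.join(tb_dir, "a.pt.trace.json"),
            os.path.join(tb_dir, "b.pt.trace.json.gz"),
            os.path.join(tb_dir, "events.out.tfevents.1"),  # should be ignored
            os.path.join(log_dir, "train.log"),
        ]
        for path in sample_files:
            open(path, "w").close()

        writer = MagicMock()  # records every log_artifact call
        result = upload_artifacts_sketch(writer, tb_dir, root)
        calls = sorted(
            (os.path.basename(c.args[0]), c.kwargs["artifact_path"])
            for c in writer.log_artifact.call_args_list
        )
        return result, calls


result, calls = run_sketch_test()
print(result, calls)
```

In a real test the stand-in would be replaced by an import of `upload_artifacts_to_mlflow`, with the assertions kept on the recorded `log_artifact` calls.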

Copilot AI review requested due to automatic review settings December 18, 2025 09:10

Copilot started reviewing on behalf of gphuang December 18, 2025 09:11

Copilot AI reviewed Dec 18, 2025

gphuang requested a review from Copilot December 18, 2025 10:10

Copilot started reviewing on behalf of gphuang December 18, 2025 10:11

Copilot AI reviewed Dec 18, 2025

primus/backends/megatron/training/mlflow_artifacts.py (resolved)

primus/modules/trainer/megatron/trainer.py (resolved)

Copilot AI mentioned this pull request Dec 18, 2025

Move MLflow import to function scope to avoid import-time dependencies #441

Closed

docs: Clarify MLflow upload defaults are opt-out when MLflow enabled

13dfa81

Copilot AI review requested due to automatic review settings December 18, 2025 10:30

Copilot started reviewing on behalf of gphuang December 18, 2025 10:31

gphuang force-pushed the feat/6-enable-mlflow-uploading branch from 3c149be to 13dfa81 Compare December 18, 2025 10:33

Update primus/modules/trainer/megatron/trainer.py

1f2e136

Co-authored-by: Copilot <[email protected]>

Copilot AI reviewed Dec 18, 2025

Update examples/run_pretrain.sh

d30b920

Co-authored-by: Copilot <[email protected]>

Copilot AI review requested due to automatic review settings December 18, 2025 10:37

Copilot started reviewing on behalf of gphuang December 18, 2025 10:38

Update primus/backends/megatron/training/mlflow_artifacts.py

b2da61b

Co-authored-by: Copilot <[email protected]>

Copilot AI reviewed Dec 18, 2025

primus/modules/trainer/megatron/trainer.py (resolved)

gphuang mentioned this pull request Dec 18, 2025

feat: Add TraceLens integration for trace analysis with MLflow upload #439

Open

gphuang and others added 2 commits December 18, 2025 15:15

Merge branch 'main' into feat/6-enable-mlflow-uploading

476c05d

Copilot AI review requested due to automatic review settings December 19, 2025 08:26

gphuang marked this pull request as ready for review December 19, 2025 08:26

gphuang requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners December 19, 2025 08:26

Copilot started reviewing on behalf of gphuang December 19, 2025 08:27

Copilot AI reviewed Dec 19, 2025

Merge branch 'main' into feat/6-enable-mlflow-uploading

b04cf26

Minor fix: lint format

2b413d0

Copilot AI review requested due to automatic review settings January 15, 2026 08:51

Copilot started reviewing on behalf of gphuang January 15, 2026 08:52

Copilot AI reviewed Jan 15, 2026

gphuang added 2 commits January 15, 2026 08:59

Merge branch 'main' into feat/6-enable-mlflow-uploading

c23c754

minor fix

d7417d8

Copilot AI review requested due to automatic review settings January 15, 2026 10:24

Copilot started reviewing on behalf of gphuang January 15, 2026 10:25

Copilot AI reviewed Jan 15, 2026

primus/modules/trainer/megatron/trainer.py (resolved)

primus/backends/megatron/training/global_vars.py (resolved)

primus/backends/megatron/training/mlflow_artifacts.py (resolved)

examples/run_slurm_pretrain.sh (resolved)

gphuang added 2 commits January 16, 2026 12:34

Merge branch 'main' into feat/6-enable-mlflow-uploading

5e01c59

Merge branch 'main' into feat/6-enable-mlflow-uploading

7488ccd

Copilot AI review requested due to automatic review settings January 19, 2026 07:51

Copilot started reviewing on behalf of gphuang January 19, 2026 07:52

Copilot AI reviewed Jan 19, 2026

gphuang added 2 commits January 20, 2026 09:56

Merge branch 'main' into feat/6-enable-mlflow-uploading

f5b2a1c

Merge branch 'main' into feat/6-enable-mlflow-uploading

c2999a9

Copilot AI review requested due to automatic review settings January 22, 2026 08:20

Copilot started reviewing on behalf of gphuang January 22, 2026 08:21

Copilot AI reviewed Jan 22, 2026

Merge branch 'main' into feat/6-enable-mlflow-uploading

e4c516c

