feat: Add MLflow artifact upload for traces and logs #440
Conversation
- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Pull request overview
This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.
Key changes:
- New artifact upload module with functions to collect and upload trace/log files to MLflow
- Integration of artifact uploads before MLflow run completion in the trainer
- Configuration options to control trace and log uploads (defaulting to enabled)
- Shell script improvements for timestamp-based output directories with multi-node consistency
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New module implementing trace/log file discovery and MLflow artifact upload functionality |
| primus/backends/megatron/training/global_vars.py | Adds global variable for exp_root_path and wrapper function for artifact uploads |
| primus/modules/trainer/megatron/trainer.py | Integrates artifact upload calls before MLflow run termination in two exit paths |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true) |
| examples/run_slurm_pretrain.sh | Implements timestamp-based output directory naming and exports timestamp for multi-node consistency |
| examples/run_pretrain.sh | Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message |
| examples/run_local_pretrain.sh | Adds MLflow environment variables and Primus path variables to Docker container environment |
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Force-pushed from 3c149be to 13dfa81.
Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
The experiment name contains square brackets, e.g. [deepseek_v2_lite-pretrain_...]-rank[0], which glob.glob interprets as character classes, so it returned empty results even though the files existed. Fixed by applying glob.escape() to the directory paths before passing them to glob.glob().
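The failure mode and the fix can be reproduced in a few lines. The bracketed directory name below is a shortened stand-in for the real experiment path, not the exact name from this PR:

```python
import glob
import os
import tempfile

# A directory whose name contains square brackets, as produced by the
# experiment naming scheme. To glob, brackets are character classes.
root = tempfile.mkdtemp()
trace_dir = os.path.join(root, "[deepseek_v2_lite-pretrain]-rank[0]")
os.makedirs(trace_dir)
open(os.path.join(trace_dir, "step1.pt.trace.json"), "w").close()

# Broken: the brackets in the path act as glob pattern syntax, so the
# literal directory never matches and nothing is found.
broken = glob.glob(os.path.join(trace_dir, "*.pt.trace.json"))

# Fixed: escape the literal directory part, keep the wildcard pattern.
fixed = glob.glob(os.path.join(glob.escape(trace_dir), "*.pt.trace.json"))

print(len(broken), len(fixed))  # 0 1
```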
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
@wenxie-amd Could you review? Thanks.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Context:

```shell
export PRIMUS_USER=""

mkdir -p "$LOG_DIR"
TRAIN_LOG="${LOG_DIR}/log_mp_pretrain.txt"
```
Copilot AI, Jan 22, 2026
TRAIN_LOG is now always set to ${LOG_DIR}/log_mp_pretrain.txt, removing the previous ability to override the log location via the TRAIN_LOG environment variable. To avoid breaking existing workflows, keep the override pattern (e.g., only set a default when TRAIN_LOG is unset).
Suggested change:

```diff
-TRAIN_LOG="${LOG_DIR}/log_mp_pretrain.txt"
+TRAIN_LOG="${TRAIN_LOG:-${LOG_DIR}/log_mp_pretrain.txt}"
```
Context:

```shell
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)
```
Copilot AI, Jan 22, 2026
MODEL_NAME falls back to unknown when EXP is unset, but run_local_pretrain.sh provides a default EXP. This can lead to confusing output directories (e.g., unknown_<ts>) for users relying on defaults. Consider defaulting EXP here as well (or deriving MODEL_NAME after applying the same default).
Suggested change:

```diff
-# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
-MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)
+# Set a default EXP if not provided, to align with run_local_pretrain.sh and avoid 'unknown_<ts>' names
+if [[ -z "${EXP:-}" ]]; then
+    export EXP="${SCRIPT_DIR}/megatron/exp_pretrain.yaml"
+fi
+# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
+MODEL_NAME=$(basename "${EXP}" .yaml)
```
Context:

```shell
--env PRIMUS_WORKSPACE \
--env PRIMUS_EXP_NAME \
--env TIMESTAMP \
--env LOG_DIR \
--env PRIMUS_TEAM \
--env PRIMUS_USER \
```
Copilot AI, Jan 22, 2026
ENV_ARGS already forwards all PRIMUS_ variables into the container (env | grep "^PRIMUS_"), so explicitly passing --env PRIMUS_WORKSPACE/PRIMUS_EXP_NAME/PRIMUS_TEAM/PRIMUS_USER again is redundant and can be confusing to maintain. Prefer relying on the PRIMUS_ pass-through and keep explicit --env only for non-PRIMUS variables like TIMESTAMP/LOG_DIR.
Suggested change:

```diff
---env PRIMUS_WORKSPACE \
---env PRIMUS_EXP_NAME \
---env TIMESTAMP \
---env LOG_DIR \
---env PRIMUS_TEAM \
---env PRIMUS_USER \
+--env TIMESTAMP \
+--env LOG_DIR \
```
Context:

```python
import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0
```
Copilot AI, Jan 22, 2026
mlflow_artifacts.py logs via log_rank_0/warning_rank_0, but MLflow is initialized on rank world_size - 1 (see global_vars._set_mlflow_writer), so these messages (including upload failures) will be suppressed in typical distributed runs. Use a rank filter that matches the MLflow rank (e.g., log_rank_last), or add/route warnings to a warning_rank_last/log_rank_all path so upload failures are visible.
Suggested change:

```diff
-from primus.modules.module_utils import log_rank_0, warning_rank_0
+from primus.modules.module_utils import log_rank_last as log_rank_0, warning_rank_last as warning_rank_0
```
Context:

```python
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
```
Copilot AI, Jan 22, 2026
Artifact upload behavior is new but currently has no unit tests. Consider adding tests that create a temp tensorboard_dir/exp_root_path with sample *.pt.trace.json(.gz) and *.log files and verify upload_artifacts_to_mlflow() calls mlflow_writer.log_artifact with the expected artifact_path subdirectories.
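A test along the suggested lines could look like the sketch below. The collection logic here is a simplified stand-in mirroring the PR's described behavior (traces from tensorboard_dir go to a traces artifact path, logs from exp_root_path/logs go to a logs artifact path); the real upload_artifacts_to_mlflow() in mlflow_artifacts.py may differ in details:

```python
import glob
import os
import tempfile


class FakeMlflowWriter:
    """Records log_artifact calls instead of talking to an MLflow server."""

    def __init__(self):
        self.calls = []

    def log_artifact(self, local_path, artifact_path=None):
        self.calls.append((local_path, artifact_path))


def upload_artifacts(writer, tensorboard_dir, exp_root_path):
    # Illustrative stand-in for upload_artifacts_to_mlflow().
    # glob.escape() guards against bracketed experiment directory names.
    for pattern in ("*.pt.trace.json", "*.pt.trace.json.gz"):
        for f in glob.glob(os.path.join(glob.escape(tensorboard_dir), pattern)):
            writer.log_artifact(f, artifact_path="traces")
    for f in glob.glob(os.path.join(glob.escape(exp_root_path), "logs", "*.log")):
        writer.log_artifact(f, artifact_path="logs")


def test_upload_artifacts():
    # Arrange: temp dirs with one sample trace file and one sample log file.
    tb = tempfile.mkdtemp()
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "logs"))
    open(os.path.join(tb, "step.pt.trace.json"), "w").close()
    open(os.path.join(root, "logs", "train.log"), "w").close()

    # Act: upload via the fake writer.
    writer = FakeMlflowWriter()
    upload_artifacts(writer, tb, root)

    # Assert: each file went to the expected artifact subdirectory.
    assert sorted(ap for _, ap in writer.calls) == ["logs", "traces"]


test_upload_artifacts()
```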
feat: Add MLflow artifact upload for traces and logs
Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.
Features
- Profiler trace files are uploaded to artifacts/traces/
- Training log files are uploaded to artifacts/logs/

Config Options

```yaml
mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow
```

Files Changed
- primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
- primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
- primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
- primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
- examples/run_pretrain.sh - Add timestamp-based output directories
- examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
- examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage

When MLflow is enabled, artifacts are automatically uploaded at the end of training:
- tensorboard_dir → MLflow artifacts/traces/
- exp_root_path/logs/ → MLflow artifacts/logs/
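The "before the MLflow run ends" ordering that the trainer integration enforces can be illustrated with a toy sketch. All names here are stand-ins, not the PR's actual code; the point is only that the upload must run first, and the run must end even if the upload fails:

```python
# Toy sketch of the shutdown ordering: artifacts logged after end_run would
# not be attached to the finished MLflow run, so upload comes first.
events = []


def upload_mlflow_artifacts():
    # Stand-in for the wrapper this PR adds in global_vars.py.
    events.append("upload")


class FakeRun:
    def end_run(self):
        events.append("end_run")


def on_training_exit(run, upload_traces=True, upload_logs=True):
    try:
        if upload_traces or upload_logs:
            upload_mlflow_artifacts()
    finally:
        run.end_run()  # always end the run, even if the upload raised


on_training_exit(FakeRun())
print(events)  # ['upload', 'end_run']
```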