Describe the bug:
There are two related bugs:
- When I start multinode training of a Llama 70B model from a HF checkpoint, it fails with an NCCL 600-second timeout while the checkpoint is being loaded.
- When multinode training of Llama 70B finishes, a similar timeout happens while saving the HF checkpoint.
An example config and logs are attached; a sketch of a possible timeout workaround is included below.
ws_64.6ca44755_logs.zip
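As a possible mitigation while the underlying issue is investigated (my assumption, not something fairseq2 documents): the 600-second figure matches PyTorch's default NCCL process-group timeout, so raising that timeout when the process group is created may let the long checkpoint load/save phases finish. A minimal sketch using plain `torch.distributed`, assuming the process group were created outside fairseq2's own gang setup:

```python
# Workaround sketch (assumption): raise the NCCL collective timeout above the
# 600 s default so slow checkpoint load/save phases do not trip the watchdog.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),  # PyTorch's default NCCL timeout is 10 minutes (600 s)
)
```

In my runs fairseq2 sets up the process group itself, so this knob would presumably need to be exposed through its configuration; the sketch only shows the PyTorch-level setting.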
Describe how to reproduce:
I encountered this during OMT training. A minimal reproduction example still needs to be worked out.
Describe the expected behavior:
No errors; both checkpoint loading and saving should complete without hitting the NCCL timeout.
Environment:
RSC or AWS-SC, Python 3.12, fairseq2==0.6, torch 2.8.0, CUDA 12.8
Additional Context: