How to continue fine-tuning a DPA-3 model from an intermediate checkpoint? #5163
Has anyone tried hot-restarting the fine-tuning process of a DPA-3 model? What would be the correct command? My original training command was:

```bash
dp --pt train input.json --finetune DPA-3.1-3M.pt --model-branch Omat24
```

Now, since my fine-tuning task aborted due to a hardware malfunction, I need to hot-restart from the model-402000.ckpt.pt checkpoint. To resume, should I modify my command as follows?

```bash
dp --pt train input.json --finetune --restart model-xxxx.ckpt.pt --model-branch Omat24
```

Thanks!
Hi @qchempku2017! I'm Dosu and I'm helping the deepmd-kit team. To resume fine-tuning your DPA-3 model from an intermediate checkpoint, use only the `--restart` flag with your checkpoint file. Do not combine `--finetune` and `--restart`: they are mutually exclusive and will cause an error if used together. If your fine-tuning run generated an out.json file, use it as your input instead of the original input.json to ensure all model parameters match the checkpoint (see the example commands below).

This is the standard and recommended way to hot-restart fine-tuning after an interruption. The `--restart` flag continues training from the checkpoint, preserving all learned weights and fine-tuning progress so far. You do not need to specify `--finetune` again; that flag is only for starting a new fine-tuning run from a pretrained model, not for resuming an interrupted one. For more details, see this discussion and the documentation.

If you run into errors about model parameter mismatches, make sure your input JSON matches the model structure in the checkpoint. For multi-task fine-tuning, the model section must be consistent with the pretrained model; if needed, you can extract the model parameters from the pretrained .pt file and update your input JSON accordingly (details here). Let me know if you need troubleshooting tips for specific errors!

To reply, just mention @dosu.
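For example, using the checkpoint file named in the question (adjust the file names to match your own run):

```bash
# Resume from the intermediate checkpoint -- --restart alone, no --finetune
dp --pt train input.json --restart model-402000.ckpt.pt

# If the interrupted run wrote an out.json, use it as the input instead,
# so that all model parameters match the checkpoint
dp --pt train out.json --restart model-402000.ckpt.pt
```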
@qchempku2017 You do not need to add `--model-branch Omat24` while restarting, since the fine-tuned checkpoint is already a single-task model with just one model branch. The correct command is shown below; `--skip-neighbor-stat` is recommended here when recomputing neighbor statistics is unnecessary.
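For example, again using the checkpoint name from the question:

```bash
# Restart from the fine-tuned checkpoint; --model-branch is no longer needed
# because the checkpoint is already single-task. --skip-neighbor-stat skips
# the (unnecessary) neighbor-statistics pass.
dp --pt train input.json --restart model-402000.ckpt.pt --skip-neighbor-stat
```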