diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 00000000..e130a32e
Binary files /dev/null and b/.DS_Store differ
diff --git a/README.md b/README.md
index a0588a5a..1d3366b6 100644
--- a/README.md
+++ b/README.md
@@ -519,7 +519,8 @@ use_cpu: false
 ```bash
 deepspeed --num_gpus 8 src/train_bash.py \
-    --deepspeed ds_config.json \
+    --deepspeed ds_config.json \
+    --ddp_timeout 180000000 \ # If the training data is too large, it is recommended to add the ddp_timeout command line option to prevent NCCL errors.
     ... # arguments (same as above)
 ```
diff --git a/README_zh.md b/README_zh.md
index 24ba3e12..594dc651 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -519,7 +519,9 @@ use_cpu: false
 ```bash
 deepspeed --num_gpus 8 src/train_bash.py \
     --deepspeed ds_config.json \
+    --ddp_timeout 180000000 \ # 如训练数据过大,建议加上ddp_timeout命令行,防止nccl报错
     ... # 参数同上
+
 ```
使用 DeepSpeed ZeRO-2 进行全参数训练的 ds_config.json 示例