From 832a7841af93a517b2af42298eb39eacfc93c6cd Mon Sep 17 00:00:00 2001 From: p04896573 Date: Sat, 11 May 2024 17:54:49 +0800 Subject: [PATCH 01/28] Update README_LORA.md --- quick_start_clean/readmes/README_LORA.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/quick_start_clean/readmes/README_LORA.md b/quick_start_clean/readmes/README_LORA.md index 0465c13..3efbcee 100644 --- a/quick_start_clean/readmes/README_LORA.md +++ b/quick_start_clean/readmes/README_LORA.md @@ -67,12 +67,10 @@ $CMD ``` ## 合并模型 -训练好的lora delta model一般有两种方式 -- 在直接含有lora的推理代码进行推理 - 将lora delta model参数和original model merge在一起 作为新的模型,但是模型的参数数量并没有增多 - +```python python merge_lora_delta.py --base_path cpm9g-8b-sft.pt --delta_path cpm9g-lora.pt --merge_path cpm9g-8b-sft_with_lora.pt - +``` # lora 推理 From 8c13a9e4659a64a76324ebbddd529ad30e6fca34 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Sat, 11 May 2024 17:55:31 +0800 Subject: [PATCH 02/28] Update README_DISTRIBUTED.md --- quick_start_clean/readmes/README_DISTRIBUTED.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_DISTRIBUTED.md b/quick_start_clean/readmes/README_DISTRIBUTED.md index e5eef76..0361481 100644 --- a/quick_start_clean/readmes/README_DISTRIBUTED.md +++ b/quick_start_clean/readmes/README_DISTRIBUTED.md @@ -20,7 +20,7 @@ CMD="torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rd 接下来,在这两个机器中都执行bash sft_cpm9g_8b.sh,这样就完成一次最简单的多机训练 不过机器多了之后不推荐这种方式 -### slurm 集群多机任务提交 +## slurm 集群多机任务提交 算力平台使用Slurm调度,常用Slurm命令包括: ``` shell From 3129921b553f262c92f9c30616fd3582f7c4d09b Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:33:11 +0800 Subject: [PATCH 03/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index f45b7f7..70b8d02 100644 --- a/quick_start_clean/readmes/README_ALL.md 
+++ b/quick_start_clean/readmes/README_ALL.md @@ -1,4 +1,6 @@ # 九格大模型使用文档 +目录 +- [环境配置](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#环境配置) ## 环境配置: From 1ba3d00e25beb1a184db703e3915aa3bba623a72 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:43:01 +0800 Subject: [PATCH 04/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 26 ++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index 70b8d02..f51e275 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -1,7 +1,15 @@ # 九格大模型使用文档 目录 - [环境配置](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#环境配置) +- [开源模型](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#开源模型) +- [数据](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#数据) +- [模型训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型训练) +- [模型推理](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型推理) +- [开源模型](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#开源模型) +- [分布式多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#分布式多机训练) +- [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) + ## 环境配置: [环境配置、硬件信息](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) @@ -107,7 +115,12 @@ def transform(data, num_sample: int, r: random.Random): - 
我们在此文件中指定了数据文件的路径、转换脚本路径等信息,后续训练仅需要系统该文件的路径即可。 ## 模型训练 +模型训练列举了三种训练 +- [pretrain 训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#pretrain 训练) +- [SFT全参数微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#SFT全参数微调训练) +- [LoRA微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#LoRA微调训练) +### pretrain 训练: 模型训练代码的位置:9G-Train/apps/cpm9g/pretrain_cpm9g.py 需要将代码中环境变量设置为您的代码路径: ``` python @@ -115,7 +128,6 @@ def transform(data, num_sample: int, r: random.Random): sys.path.insert(0, "/data/public/CPM-9G/9G-Train") ``` -### pretrain shell脚本: ```shell #! /bin/bash @@ -172,9 +184,8 @@ echo "${CMD}" $CMD ``` -### sft 训练shell 脚本 +### SFT全参数微调训练 ``` shell - export MASTER_ADDR=`hostname` export MASTER_PORT=12345 @@ -220,11 +231,9 @@ OPTS+=" $@" CMD="torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} ${CPM_PATH}/apps/cpm9g/sft_cpm9g.py ${OPTS}" echo "${CMD}" $CMD - ``` -### lora 训练 -[lora 训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_LORA.md) +### [LoRA微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_LORA.md) ## 模型推理 ```python @@ -286,7 +295,7 @@ if __name__ == "__main__": ``` 5 微调训练中,train_iters如何计算? 
``` - 回答:因为模型上下文是4096的token数目,通常情况存在训练数据不足4096的长度,所以会对多条数据进行merge,送入模型的数据量会少于1000条 + 回答:因为模型上下文是4096的token数目,通常情况存在训练数据不足4096的长度,所以会对多条数据进行merge,因此送入模型条数要少于实际的数据条数 ``` 6 打印出来的Iter信息有缺失 ``` @@ -310,5 +319,4 @@ datas = [ ## TODO -1 发布最新训练的80B SFT模型 -2 Lora相关的代码更新 \ No newline at end of file +1 发布最新训练的80B SFT模型 \ No newline at end of file From 7045e59e9ce984b70e94be0f90ff1ae8b830bd56 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:45:06 +0800 Subject: [PATCH 05/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index f51e275..6b928fe 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -1,15 +1,13 @@ # 九格大模型使用文档 -目录 +## 目录 - [环境配置](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#环境配置) - [开源模型](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#开源模型) -- [数据](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#数据) +- [数据构建](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#数据构建) - [模型训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型训练) - [模型推理](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型推理) -- [开源模型](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#开源模型) -- [分布式多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#分布式多机训练) +- 
[多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#多机训练) - [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) - - + ## 环境配置: [环境配置、硬件信息](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) @@ -17,12 +15,10 @@ ## 开源模型 1 目前启元开源了80B的百亿SFT模型,模型的路径:[百亿SFT开源模型](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz) - ## 使用教程 - 为了帮助您快速了解CPM-9G的使用,我们准备了一个快速入门教程,目标是基于CPM-9G基座模型通过指令微调的方式构建一个Chat模型。 -## 数据 +## 数据构建 本教程使用的数据是Alpaca Zh,一个开源中文指令微调数据集。[数据集](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data_zh.json) @@ -277,8 +273,7 @@ if __name__ == "__main__": main() ``` -## 分布式多机训练 -[分布式多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_DISTRIBUTED.md) +## [多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_DISTRIBUTED.md) ## FAQs From 3b07629bcc6a228a22d7140aa52ad8c5214a1195 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:48:31 +0800 Subject: [PATCH 06/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index 6b928fe..ac4e056 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -5,7 +5,7 @@ - [数据构建](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#数据构建) - [模型训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型训练) - [模型推理](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型推理) -- 
[多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#多机训练) +- [多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_DISTRIBUTED.md) - [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) ## 环境配置: @@ -112,11 +112,11 @@ def transform(data, num_sample: int, r: random.Random): ## 模型训练 模型训练列举了三种训练 -- [pretrain 训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#pretrain 训练) +- [pretrain训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#pretrain训练) - [SFT全参数微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#SFT全参数微调训练) -- [LoRA微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#LoRA微调训练) +- [LoRA微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_LORA.md) -### pretrain 训练: +### pretrain训练: 模型训练代码的位置:9G-Train/apps/cpm9g/pretrain_cpm9g.py 需要将代码中环境变量设置为您的代码路径: ``` python @@ -229,8 +229,6 @@ echo "${CMD}" $CMD ``` -### [LoRA微调训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_LORA.md) - ## 模型推理 ```python import os @@ -273,8 +271,6 @@ if __name__ == "__main__": main() ``` -## [多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_DISTRIBUTED.md) - ## FAQs 常见问题汇总,持续补充ing From 27578564cbb5b6e32437dc46f2fe8e5437587306 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:55:56 +0800 Subject: [PATCH 07/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md 
index ac4e056..e559635 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -6,8 +6,7 @@ - [模型训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型训练) - [模型推理](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型推理) - [多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_DISTRIBUTED.md) -- [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) - +- [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) ## 环境配置: [环境配置、硬件信息](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) @@ -179,7 +178,7 @@ CMD="torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rd echo "${CMD}" $CMD ``` - +``` ### SFT全参数微调训练 ``` shell export MASTER_ADDR=`hostname` From 599b82fc89fe4f484dd18704cbeef7b0bcfa249a Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:57:05 +0800 Subject: [PATCH 08/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index e559635..ac3e562 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -178,7 +178,7 @@ CMD="torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rd echo "${CMD}" $CMD ``` -``` + ### SFT全参数微调训练 ``` shell export MASTER_ADDR=`hostname` From ca4b5a9067e3e071007032fd559181b474d47416 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 13 May 2024 10:58:17 +0800 Subject: [PATCH 09/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 
deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index ac3e562..6328c18 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -6,16 +6,16 @@ - [模型训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型训练) - [模型推理](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#模型推理) - [多机训练](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_DISTRIBUTED.md) -- [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) -## 环境配置: +- [FAQs](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ALL.md?tab=readme-ov-file#FAQs) +帮助您快速了解CPM-9G的使用,我们准备了一个快速入门教程,目标是基于CPM-9G基座模型通过指令微调的方式构建一个Chat模型。 +## 环境配置: [环境配置、硬件信息](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) ## 开源模型 1 目前启元开源了80B的百亿SFT模型,模型的路径:[百亿SFT开源模型](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz) -## 使用教程 -为了帮助您快速了解CPM-9G的使用,我们准备了一个快速入门教程,目标是基于CPM-9G基座模型通过指令微调的方式构建一个Chat模型。 + ## 数据构建 From f0d3e87e16326d342a930f01d9c3c6f3543b38d8 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Tue, 14 May 2024 17:57:58 +0800 Subject: [PATCH 10/28] Update README_DISTRIBUTED.md --- quick_start_clean/readmes/README_DISTRIBUTED.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_DISTRIBUTED.md b/quick_start_clean/readmes/README_DISTRIBUTED.md index 0361481..4043184 100644 --- a/quick_start_clean/readmes/README_DISTRIBUTED.md +++ b/quick_start_clean/readmes/README_DISTRIBUTED.md @@ -114,6 +114,11 @@ for i in {1..3};do done ``` +## dockers上的多机提交任务 +dockers 容器上的多机任务和在主机上是相同的,只需要再其基础上满足两个要求 +- 在每个机器上拉取同样的docker和激活同样的训练环境,在docker共享的路径、数据、代码都一致 +- 在docker启动的时候保障 
--network=host,和主机共享网络通信,只要机器之间能通信,在dockers中也可以通信和训练 + #### TODOs -1 完善dockers、K8s集群的分布式多机任务训练 \ No newline at end of file +1 完善K8s集群的分布式多机任务训练 \ No newline at end of file From 977a56b9b06486bcfb09cc01ad254f3ec05b9a92 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Wed, 15 May 2024 15:55:54 +0800 Subject: [PATCH 11/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index 6328c18..b7de2d1 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -280,8 +280,8 @@ if __name__ == "__main__": 3 尽量避免在window机器下修改脚本,window中的编码和格式linux是有差别的,容易在脚本执行中报错 4 SFT如何调参训练 ``` - 回答:如果数据量少于10w条,多训练几个epoch,把学习率调低一些,比如说5e-6等; - 数据量很多呢,训练最多2个epoch足够,注意过拟合的问题 + 回答:如果数据量少于10w条,全参数微调的时候多训练几个epoch,把学习率调低一些,比如说5e-6等;更建议使用lora 微调的方式 + 数据量很多呢,比如说达到百万级别,那可以选择全参数微调,但训练最多2个epoch足够,注意过拟合的问题 ``` 5 微调训练中,train_iters如何计算? 
``` @@ -295,10 +295,12 @@ if __name__ == "__main__": ``` 回答:不需要,参数中出现的val_datasets忽略即可 ``` -8 Lora 推理:需要进行merge 模型后预测,五一后release该代码 -9 加载模型遇到:invalid header or archive is carrupted,这种一般是模型没有下载完导致的,目前红山上的模型确定是完整的,首先自查自己的模型是否下载成功。 -10 存储模型的时候遇到failed write file data ,一般先检查下文件路径和权限、磁盘空间吧,存储模型基本不会报错 - +8 加载模型遇到:invalid header or archive is carrupted,这种一般是模型没有下载完导致的,目前红山上的模型确定是完整的,首先自查自己的模型是否下载成功。 +9 存储模型的时候遇到failed write file data ,一般先检查下文件路径和权限、磁盘空间吧,存储模型基本不会报错 +10 是否支持图像模态: +``` + 回答:不支持图像模态,仅支持文本模态 +``` ### 数据相关 1 历史对话的传入: ``` json From 115581146a5ccc2eb6af69d4c9566afbf0f9676e Mon Sep 17 00:00:00 2001 From: p04896573 Date: Wed, 15 May 2024 15:57:03 +0800 Subject: [PATCH 12/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index 881bade..764a8e7 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -53,6 +53,10 @@ docker stop [CONTAINER_ID] ```shell docker ps ``` +### 环境安装 +```shell +pip install tensorboardX +``` ## Conda环境配置 ### 训练环境配置 From edc7c0aa927ebb962c440606c1f4c2c3f44237b4 Mon Sep 17 00:00:00 2001 From: p80239751 Date: Fri, 24 May 2024 20:09:33 +0800 Subject: [PATCH 13/28] Update README.md --- README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index bb3fbf1..b5c386e 100644 --- a/README.md +++ b/README.md @@ -50,4 +50,8 @@ ## 大语言模型驱动的多智能体协作与演化 -[大语言模型驱动的多智能体协作与演化-PPT](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/%E8%AF%BE%E7%A8%8B%E8%A7%86%E9%A2%91/9.%E5%A4%A7%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E9%A9%B1%E5%8A%A8%E7%9A%84%E5%A4%9A%E6%99%BA%E8%83%BD%E4%BD%93%E5%8D%8F%E4%BD%9C%E4%B8%8E%E6%BC%94%E5%8C%96-PPT.pdf) \ No newline at end of file 
+[大语言模型驱动的多智能体协作与演化-PPT](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/%E8%AF%BE%E7%A8%8B%E8%A7%86%E9%A2%91/9.%E5%A4%A7%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E9%A9%B1%E5%8A%A8%E7%9A%84%E5%A4%9A%E6%99%BA%E8%83%BD%E4%BD%93%E5%8D%8F%E4%BD%9C%E4%B8%8E%E6%BC%94%E5%8C%96-PPT.pdf) + +## 多模态大模型的发展与未来 + \ No newline at end of file From 0d81f5b1dd27bfaac0fd4179ae7e0c59ce298f15 Mon Sep 17 00:00:00 2001 From: p80239751 Date: Fri, 24 May 2024 20:15:34 +0800 Subject: [PATCH 14/28] Update README.md From db8c0fbb7ab1035581de05e1c7100369704a9bff Mon Sep 17 00:00:00 2001 From: p80239751 Date: Fri, 24 May 2024 20:17:44 +0800 Subject: [PATCH 15/28] Update README.md From c8cbca2fec38f4d6a8b8fa4130a0515f4c5b0f26 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 27 May 2024 14:13:03 +0800 Subject: [PATCH 16/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index 764a8e7..d6b5aff 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -95,6 +95,16 @@ pip installlibcpm-1.0.0-cp38-cp38-linux_x86_64.whl ``` # 硬件资源 +## 推荐配置: +### 千亿大模型 + - 预训练、全参数微调:8 * 512G以上内存,64 * 80G以上显存 + - 高效微调(LoRA)与推理: 512G 以上内存,8 * 80G以上显存 +### 百亿大模型 + - 预训练、全参数微调:2 * 512G以上内存,16 * 80G以上显存 + - 高效微调(LoRA)与推理: 128G 以上内存,2 * 80G以上显存 + +## 极限配置 +最极限的资源配置,仅供参考,在大模型训练中其实并不推荐,因为其效果一般不佳,训练时长也比较久 | 模型 | 资源 | 最小算力 | | :-------- | :----- | :----: | | 百亿模型 |内存 |训练:140G, 推理:1G| @@ -102,7 +112,6 @@ pip installlibcpm-1.0.0-cp38-cp38-linux_x86_64.whl | 千亿模型 |内存 |训练: 200G, 推理:2G| | 千亿模型 |显存 |训练: 8*80G , 推理:4 * 50G| - 另外 - 该表格是百亿、千亿模型需要的最小的资源,batch size为1. 
- 百亿模型是在单卡A100上测试 From 7698ff9e55a9d75be64bbc91a04d3d0324a936a8 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 27 May 2024 14:14:11 +0800 Subject: [PATCH 17/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 1 + 1 file changed, 1 insertion(+) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index d6b5aff..6cbe9fb 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -105,6 +105,7 @@ pip installlibcpm-1.0.0-cp38-cp38-linux_x86_64.whl ## 极限配置 最极限的资源配置,仅供参考,在大模型训练中其实并不推荐,因为其效果一般不佳,训练时长也比较久 + | 模型 | 资源 | 最小算力 | | :-------- | :----- | :----: | | 百亿模型 |内存 |训练:140G, 推理:1G| From 233ae74aba4e9d5c2117b02cefde257f694912c7 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 27 May 2024 14:14:59 +0800 Subject: [PATCH 18/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 1 + 1 file changed, 1 insertion(+) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index 6cbe9fb..87f4831 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -99,6 +99,7 @@ pip installlibcpm-1.0.0-cp38-cp38-linux_x86_64.whl ### 千亿大模型 - 预训练、全参数微调:8 * 512G以上内存,64 * 80G以上显存 - 高效微调(LoRA)与推理: 512G 以上内存,8 * 80G以上显存 + ### 百亿大模型 - 预训练、全参数微调:2 * 512G以上内存,16 * 80G以上显存 - 高效微调(LoRA)与推理: 128G 以上内存,2 * 80G以上显存 From 614feeeb34b355b711e6cfe22107b8bdcb36dfea Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 27 May 2024 14:15:48 +0800 Subject: [PATCH 19/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index 87f4831..726dacd 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -94,7 +94,7 @@ echo $LD_LIBRARY_PATH2. 
安装LibCPM pip installlibcpm-1.0.0-cp38-cp38-linux_x86_64.whl ``` -# 硬件资源 +# 算力资源 ## 推荐配置: ### 千亿大模型 - 预训练、全参数微调:8 * 512G以上内存,64 * 80G以上显存 From edc231fa42a8d7534477fd3bcae8bfa491d53373 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 27 May 2024 14:16:35 +0800 Subject: [PATCH 20/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index b7de2d1..3430678 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -10,7 +10,7 @@ 帮助您快速了解CPM-9G的使用,我们准备了一个快速入门教程,目标是基于CPM-9G基座模型通过指令微调的方式构建一个Chat模型。 ## 环境配置: -[环境配置、硬件信息](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) +[环境配置、算力资源](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) ## 开源模型 1 目前启元开源了80B的百亿SFT模型,模型的路径:[百亿SFT开源模型](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz) From 06f62b1833c9eb1d52c971315248a461f922d947 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Mon, 27 May 2024 14:18:42 +0800 Subject: [PATCH 21/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index 726dacd..d7b286c 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -1,3 +1,6 @@ +#环境配置、算力资源 + + # Docker使用 我们提供可以运行模型训练和推理的docker,便于在新环境下快速使用九格大模型。您也可以使用Conda配置运行环境。Conda配置方式请见下一节。 #### [docker 路径](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/cpmlive-flash-0.0.4.tar) From 366ba76f3178a2bcfbf4f8c2f8f45fd2bdb2eed3 Mon Sep 17 00:00:00 2001 From: p80239751 Date: Wed, 29 May 2024 16:29:26 +0800 Subject: [PATCH 22/28] Update README.md --- README.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/README.md 
b/README.md index b5c386e..1f410c8 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,4 @@ [大语言模型驱动的多智能体协作与演化-PPT](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/%E8%AF%BE%E7%A8%8B%E8%A7%86%E9%A2%91/9.%E5%A4%A7%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E9%A9%B1%E5%8A%A8%E7%9A%84%E5%A4%9A%E6%99%BA%E8%83%BD%E4%BD%93%E5%8D%8F%E4%BD%9C%E4%B8%8E%E6%BC%94%E5%8C%96-PPT.pdf) - -## 多模态大模型的发展与未来 - \ No newline at end of file + \ No newline at end of file From 35a27616de62a68cb237367e70000843bff94a35 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Tue, 4 Jun 2024 19:08:12 +0800 Subject: [PATCH 23/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index 3430678..51ada42 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -13,7 +13,7 @@ [环境配置、算力资源](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) ## 开源模型 -1 目前启元开源了80B的百亿SFT模型,模型的路径:[百亿SFT开源模型](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz) +1 目前启元开源了80B的百亿SFT模型,模型的路径:[百亿SFT开源模型](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/sft_8b_v2.zip) From 11709c86180d6262e22cf4de46156ccf026569c9 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Fri, 28 Jun 2024 17:02:57 +0800 Subject: [PATCH 24/28] Update README_LORA.md --- quick_start_clean/readmes/README_LORA.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_LORA.md b/quick_start_clean/readmes/README_LORA.md index 3efbcee..7e228a2 100644 --- a/quick_start_clean/readmes/README_LORA.md +++ b/quick_start_clean/readmes/README_LORA.md @@ -60,7 +60,7 @@ OPTS+=" --save-origin-model" OPTS+=" $@" -CMD="torchrun --nnodes=1 --nproc_per_node=2 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} ${CPM_PATH}/apps/cpm9g/sft_cpm9g.py 
${OPTS}" +CMD="torchrun --nnodes=1 --nproc_per_node=2 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} ${CPM_PATH}/apps/cpm9g/sft_cpm9g_delta.py ${OPTS}" echo "${CMD}" $CMD From 82fef6ad8badef580718a10d9408c04e02f4db75 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Tue, 2 Jul 2024 19:49:19 +0800 Subject: [PATCH 25/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index d7b286c..00457e4 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -10,10 +10,11 @@ ```shell srun -p gpu1 --nodelist=g2001 -N 1 -n 8 -c 8 --gres=gpu:8 --pty bash module load rootless-docker/default +start_rootless_docker.sh ``` **注意使用bash(不能用zsh)** -start_rootless_docker.sh运行成功的话,此时执行docker ps可以看到当前没有正在运行的容器,如果有正在运行的容器,说明rootless模式没有启动成功,请联系管理员。 +运行成功的话,此时执行docker ps可以看到当前没有正在运行的容器,如果有正在运行的容器,说明rootless模式没有启动成功,请联系管理员。 ### 加载镜像 ```shell From cd6173f382c36e853a5fd48669e6e5ad71f79e01 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Tue, 9 Jul 2024 15:05:54 +0800 Subject: [PATCH 26/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index 51ada42..f09736d 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -13,9 +13,10 @@ [环境配置、算力资源](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) ## 开源模型 -1 目前启元开源了80B的百亿SFT模型,模型的路径:[百亿SFT开源模型](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/sft_8b_v2.zip) - - +1 目前启元开源了80B的百亿SFT模型: + v2版本主要是进行精度指标的优化和对话能力的提升 + [8b_v1](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz) + 
[8b_v2](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/sft_8b_v2.zip) ## 数据构建 @@ -311,4 +312,4 @@ datas = [ ## TODO -1 发布最新训练的80B SFT模型 \ No newline at end of file +1 发布8B-32k上下文的模型 \ No newline at end of file From 03c55e1fee99a456e077dca8dd03516113951e24 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Tue, 9 Jul 2024 15:07:35 +0800 Subject: [PATCH 27/28] Update README_ALL.md --- quick_start_clean/readmes/README_ALL.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/quick_start_clean/readmes/README_ALL.md b/quick_start_clean/readmes/README_ALL.md index f09736d..7c70899 100644 --- a/quick_start_clean/readmes/README_ALL.md +++ b/quick_start_clean/readmes/README_ALL.md @@ -13,10 +13,8 @@ [环境配置、算力资源](https://www.osredm.com/jiuyuan/CPM-9G-8B/tree/master/quick_start_clean/readmes/README_ENV.md) ## 开源模型 -1 目前启元开源了80B的百亿SFT模型: - v2版本主要是进行精度指标的优化和对话能力的提升 - [8b_v1](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz) - [8b_v2](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/sft_8b_v2.zip) +1 目前启元开源了80B的百亿SFT模型,v2版本是在v1基础上精度和对话能力的优化模型,下载链接: + [8b_v1版本](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/checkpoints-epoch-1.tar.gz), [8b_v2版本](https://qy-obs-6d58.obs.cn-north-4.myhuaweicloud.com/sft_8b_v2.zip) ## 数据构建 From 521489c5e199f9103482d52612ab8416e3195b70 Mon Sep 17 00:00:00 2001 From: p04896573 Date: Fri, 23 Aug 2024 10:51:01 +0800 Subject: [PATCH 28/28] Update README_ENV.md --- quick_start_clean/readmes/README_ENV.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quick_start_clean/readmes/README_ENV.md b/quick_start_clean/readmes/README_ENV.md index 00457e4..ed66da4 100644 --- a/quick_start_clean/readmes/README_ENV.md +++ b/quick_start_clean/readmes/README_ENV.md @@ -26,7 +26,7 @@ docker tag [IMAGE_ID] cpmlive-flash:0.0.4 ### 启动容器 ``` -docker run -it -d -v [HOST_PATH1]:[DOCKER_PATH1] -v [HOST_PATH2]:[DOCKER_PATH2] --gpus all --shm-size=4g --sh cpmlive-flash:0.0.4 bash +docker run -it -d 
-v [HOST_PATH1]:[DOCKER_PATH1] -v [HOST_PATH2]:[DOCKER_PATH2] --gpus all --shm-size=4g cpmlive-flash:0.0.4 bash
 ```
 If you have docker permissions and the rootless launch fails, you can try a non-rootless launch instead.
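
Note on the `train_iters` FAQ answer changed in PATCH 04/28: it says that samples shorter than the 4096-token context are merged, so the model receives fewer sequences than there are raw samples. The sketch below is one way to estimate the resulting iteration count. It assumes a simple greedy, in-order packing and truncation of over-long samples, which may differ from what the CPM-9G trainer actually does. The names `pack_lengths`, `estimate_train_iters`, and `global_batch_size` are hypothetical and do not come from the repository.

```python
import math
from typing import Iterable, List


def pack_lengths(token_lengths: Iterable[int], max_length: int = 4096) -> List[int]:
    """Greedily pack consecutive samples into sequences of at most max_length tokens.

    Returns the token count of each packed sequence. Samples longer than
    max_length are truncated to max_length (an assumption of this sketch,
    not confirmed behaviour of the real trainer).
    """
    packed: List[int] = []
    current = 0
    for length in token_lengths:
        length = min(length, max_length)
        # Start a new packed sequence when the next sample would overflow.
        if current and current + length > max_length:
            packed.append(current)
            current = 0
        current += length
    if current:
        packed.append(current)
    return packed


def estimate_train_iters(
    token_lengths: Iterable[int],
    global_batch_size: int,
    epochs: int = 1,
    max_length: int = 4096,
) -> int:
    """Estimate train_iters from the number of packed sequences, not raw samples."""
    num_packed = len(pack_lengths(token_lengths, max_length))
    iters_per_epoch = math.ceil(num_packed / global_batch_size)
    return iters_per_epoch * epochs


if __name__ == "__main__":
    # 1000 raw samples of 1000 tokens each pack four to a sequence.
    lengths = [1000] * 1000
    print(len(pack_lengths(lengths)))
    print(estimate_train_iters(lengths, global_batch_size=8, epochs=2))
```

Under these assumptions, an estimate based on the raw sample count would overshoot, which is the point the FAQ answer makes. The real value depends on the tokenizer and on the trainer's own merging logic.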