update readme and examples

2024-04-22 00:37:32 +08:00
parent ddbd29d777
commit a1f1fac33b
12 changed files with 266 additions and 13 deletions
--- a/README.md
+++ b/README.md
@@ -329,7 +329,7 @@ To enable FlashAttention-2 on the Windows platform, you need to install the prec

 </details>

-### LLaMA Board GUI
+### Train with LLaMA Board GUI

 > [!IMPORTANT]
 > LLaMA Board GUI only supports training on a single GPU, please use [CLI](#command-line-interface) for distributed training.
@@ -381,7 +381,7 @@ docker compose -f ./docker-compose.yml up -d

 </details>

-### Command Line Interface
+### Train with Command Line Interface

 See [examples/README.md](examples/README.md) for usage.

@@ -397,7 +397,7 @@ CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
    --vllm_enforce_eager
 ```

-### Use ModelScope Hub
+### Download from ModelScope Hub

 If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.

@@ -405,7 +405,7 @@ If you have trouble with downloading models and datasets from Hugging Face, you
 export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows
 ```

-Train the model by specifying a model ID of the ModelScope Hub as the `--model_name_or_path`. You can find a full list of model IDs at [ModelScope Hub](https://modelscope.cn/models), e.g., `modelscope/Llama-2-7b-ms`.
+Train the model by specifying a model ID of the ModelScope Hub as the `--model_name_or_path`. You can find a full list of model IDs at [ModelScope Hub](https://modelscope.cn/models), e.g., `LLM-Research/Meta-Llama-3-8B-Instruct`.

 ## Projects using LLaMA Factory

--- a/README_zh.md
+++ b/README_zh.md
@@ -329,10 +329,10 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl

 </details>

-### LLaMA Board 可视化界面
+### 利用 LLaMA Board 可视化界面训练

 > [!IMPORTANT]
-> LLaMA Board 可视化界面目前仅支持单 GPU 训练，请使用[命令行接口](#命令行接口)来进行分布式训练。
+> LLaMA Board 可视化界面目前仅支持单 GPU 训练，请使用[命令行接口](#命令行接口)来进行多 GPU 分布式训练。

 #### 使用本地环境

@@ -381,13 +381,13 @@ docker compose -f ./docker-compose.yml up -d

 </details>

-### 命令行接口
+### 利用命令行接口训练

 使用方法请参考 [examples/README_zh.md](examples/README_zh.md)。

-使用 `python src/train_bash.py -h` 查看参数文档。
+您可以执行 `python src/train_bash.py -h` 来查看参数文档。

-### 使用 OpenAI 风格 API 和 vLLM 部署
+### 利用 vLLM 部署 OpenAI API

 ```bash
 CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
@@ -397,7 +397,7 @@ CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
    --vllm_enforce_eager
 ```

-### 使用魔搭社区
+### 从魔搭社区下载

 如果您在 Hugging Face 模型和数据集的下载中遇到了问题，可以通过下述方法使用魔搭社区。

@@ -405,7 +405,7 @@ CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
 export USE_MODELSCOPE_HUB=1 # Windows 使用 `set USE_MODELSCOPE_HUB=1`
 ```

-将 `--model_name_or_path` 设置为模型 ID 来加载对应的模型。在[魔搭社区](https://modelscope.cn/models)查看所有可用的模型，例如 `modelscope/Llama-2-7b-ms`。
+将 `--model_name_or_path` 设置为模型 ID 来加载对应的模型。在[魔搭社区](https://modelscope.cn/models)查看所有可用的模型，例如 `LLM-Research/Meta-Llama-3-8B-Instruct`。

 ## 使用了 LLaMA Factory 的项目

--- a/examples/README.md
+++ b/examples/README.md
@@ -18,7 +18,8 @@ examples/
 │   └── aqlm.sh: Fine-tune 2-bit AQLM models using QLoRA
 ├── lora_multi_gpu/
 │   ├── single_node.sh: Fine-tune model with Accelerate on single node using LoRA
-│   └── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA
+│   ├── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA
+│   └── ds_zero3.sh: Fine-tune model with DeepSpeed ZeRO-3 using LoRA
 ├── full_multi_gpu/
 │   ├── single_node.sh: Full fine-tune model with DeepSpeed on single node
 │   ├── multi_node.sh: Full fine-tune model with DeepSpeed on multiple nodes
--- a/examples/README_zh.md
+++ b/examples/README_zh.md
@@ -18,7 +18,8 @@ examples/
 │   └── aqlm.sh: 基于 QLoRA 微调 2 比特 AQLM 模型
 ├── lora_multi_gpu/
 │   ├── single_node.sh: 使用 Accelerate 进行单节点 LoRA 训练
-│   └── multi_node.sh: 使用 Accelerate 进行多节点 LoRA 训练
+│   ├── multi_node.sh: 使用 Accelerate 进行多节点 LoRA 训练
+│   └── ds_zero3.sh: 使用 DeepSpeed ZeRO-3 进行 LoRA 训练
 ├── full_multi_gpu/
 │   ├── single_node.sh: 使用 DeepSpeed 进行单节点全量训练
 │   ├── multi_node.sh: 使用 DeepSpeed 进行多节点全量训练
--- a/examples/extras/badam/sft.sh
+++ b/examples/extras/badam/sft.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type full \
+    --use_badam \
+    --badam_switch_mode descending \
+    --badam_switch_block_every 50 \
+    --badam_verbose 2 \
+    --output_dir ../../../saves/LLaMA2-7B/badam/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --pure_bf16
--- a/examples/extras/fsdp_qlora/sft.sh
+++ b/examples/extras/fsdp_qlora/sft.sh
@@ -0,0 +1,41 @@
+#!/bin/bash
+# DO NOT use GPTQ/AWQ model in FSDP+QLoRA
+
+pip install "transformers>=4.39.1"
+pip install "accelerate>=0.28.0"
+pip install "bitsandbytes>=0.43.0"
+
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
+    --config_file ../../accelerate/fsdp_config.yaml \
+    ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-70b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --output_dir ../../../saves/LLaMA2-70B/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 4 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --ddp_timeout 180000000 \
+    --quantization_bit 4 \
+    --plot_loss \
+    --fp16
--- a/examples/extras/galore/sft.sh
+++ b/examples/extras/galore/sft.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type full \
+    --use_galore \
+    --galore_layerwise \
+    --galore_target mlp,self_attn \
+    --galore_rank 128 \
+    --galore_scale 2.0 \
+    --output_dir ../../../saves/LLaMA2-7B/galore/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --pure_bf16
--- a/examples/extras/llama_pro/expand.sh
+++ b/examples/extras/llama_pro/expand.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+python ../../../scripts/llama_pro.py \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --output_dir ../../../models/llama2-7b-pro \
+    --num_expand 8
--- a/examples/extras/llama_pro/sft.sh
+++ b/examples/extras/llama_pro/sft.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path ../../../models/llama2-7b-pro \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type freeze \
+    --name_module_trainable all \
+    --num_layer_trainable 8 \
+    --use_llama_pro \
+    --output_dir ../../../saves/LLaMA2-7B-Pro/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --fp16
--- a/examples/extras/loraplus/sft.sh
+++ b/examples/extras/loraplus/sft.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --loraplus_lr_ratio 16.0 \
+    --output_dir ../../../saves/LLaMA2-7B/loraplus/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --fp16
--- a/examples/extras/mod/sft.sh
+++ b/examples/extras/mod/sft.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../../data \
+    --template default \
+    --finetuning_type full \
+    --mixture_of_depths convert \
+    --output_dir ../../../saves/LLaMA2-7B/mod/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --optim paged_adamw_8bit \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --plot_loss \
+    --pure_bf16
--- a/examples/lora_multi_gpu/ds_zero3.sh
+++ b/examples/lora_multi_gpu/ds_zero3.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+deepspeed --num_gpus 4 ../../src/train_bash.py \
+    --deepspeed ../deepspeed/ds_z3_config.json \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --output_dir ../../saves/LLaMA2-7B/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --preprocessing_num_workers 16 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 2 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --warmup_steps 20 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --ddp_timeout 180000000 \
+    --plot_loss \
+    --fp16
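
Note on running the added scripts: every script in this commit reaches `src/train_bash.py`, `data`, and `saves` through relative paths (for example `../../../src/train_bash.py`), which suggests each one is meant to be launched from its own directory. Below is a minimal sketch of a wrapper that does this. `run_example` is a hypothetical helper name, not something defined in the repository, and it assumes the script path is given relative to the current directory.

```shell
#!/bin/bash

# Hypothetical helper (not part of the repository): run an example script
# from the directory it lives in, so that relative paths inside the script
# resolve against that directory rather than the caller's.
run_example() {
    local script="$1"
    # The subshell keeps the caller's working directory unchanged.
    (cd "$(dirname "$script")" && bash "$(basename "$script")")
}

# Usage, assuming the repository root is the current directory:
# run_example examples/extras/galore/sft.sh
```

Running the `cd` inside a subshell means a failed or interrupted training run still leaves the calling shell in its original directory.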