support fsdp + qlora

parent 3271af2afc
commit 8408225162

@@ -70,17 +70,19 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 ## Changelog
 
+[24/03/20] We supported **FSDP + QLoRA** that fine-tunes a 70B model on 2x24GB GPUs. See `examples/fsdp_qlora` for usage.
+
 [24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. Try `loraplus_lr_ratio=16.0` to enable the LoRA+ algorithm.
 
 [24/03/07] We supported the gradient low-rank projection (**[GaLore](https://arxiv.org/abs/2403.03507)**) algorithm. Try `--use_galore` to use the memory-efficient optimizer.
 
 [24/03/07] We integrated **[vLLM](https://github.com/vllm-project/vllm)** for faster and concurrent inference. Try `--infer_backend vllm` to enjoy **270%** inference speed. (LoRA is not yet supported; merge the weights first.)
 
-<details><summary>Full Changelog</summary>
-
 [24/02/28] We supported weight-decomposed LoRA (**[DoRA](https://arxiv.org/abs/2402.09353)**). Try `--use_dora` to activate DoRA training.
 
-[24/02/15] We supported **block expansion** proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `scripts/llama_pro.py` for usage.
+[24/02/15] We supported **block expansion** proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `examples/extras/llama_pro` for usage.
+
+<details><summary>Full Changelog</summary>
 
 [24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this [blog post](https://qwenlm.github.io/blog/qwen1.5/) for details.
 
@@ -238,6 +240,7 @@ You also can add a custom chat template to [template.py](src/llmtuner/data/templ
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
+- [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)

@@ -70,17 +70,19 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 ## Changelog
 
+[24/03/20] We supported **FSDP + QLoRA**, which fine-tunes a 70B model on 2x24GB GPUs. See `examples/fsdp_qlora` for usage.
+
 [24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. Use the `loraplus_lr_ratio=16.0` argument to enable the LoRA+ method.
 
 [24/03/07] We supported the gradient low-rank projection (**[GaLore](https://arxiv.org/abs/2403.03507)**) algorithm. Use the `--use_galore` argument to switch to the memory-efficient optimizer.
 
 [24/03/07] We integrated **[vLLM](https://github.com/vllm-project/vllm)** for fast concurrent inference. Use `--infer_backend vllm` to achieve **270%** inference speed. (LoRA is not yet supported; merge the weights first.)
 
-<details><summary>Full Changelog</summary>
-
 [24/02/28] We supported **[DoRA](https://arxiv.org/abs/2402.09353)** fine-tuning. Use the `--use_dora` argument to fine-tune with DoRA.
 
-[24/02/15] We supported the **block expansion** method proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `scripts/llama_pro.py` for usage.
+[24/02/15] We supported the **block expansion** method proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `examples/extras/llama_pro` for usage.
+
+<details><summary>Full Changelog</summary>
 
 [24/02/05] The Qwen1.5 (Qwen2 beta) model series is now supported for fine-tuning in LLaMA-Factory. See this [blog post](https://qwenlm.github.io/zh/blog/qwen1.5/) for details.
 
@@ -238,6 +240,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
+- [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)

@@ -0,0 +1,25 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch: BACKWARD_PRE
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_forward_prefetch: false
+  fsdp_offload_params: true
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sync_module_states: true
+  fsdp_use_orig_params: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: fp16
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false

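This is the new accelerate config for FSDP (the training script below references it as `../accelerate/fsdp_config.yaml`, so presumably `examples/accelerate/fsdp_config.yaml`). It fully shards parameters, gradients, and optimizer states across two local processes (`FULL_SHARD` with `num_processes: 2`, matching the 2x24GB claim in the changelog), offloads parameters to CPU (`fsdp_offload_params: true`), loads the checkpoint RAM-efficiently on one rank and broadcasts it (`fsdp_cpu_ram_efficient_loading` with `fsdp_sync_module_states`), and saves sharded state dicts rather than a full one.
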
@@ -0,0 +1,5 @@
+```bash
+pip install git+https://github.com/huggingface/transformers.git
+pip install "accelerate>=0.28.0"
+pip install "bitsandbytes>=0.43.0"
+```

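These pins line up with what the commit touches: `bitsandbytes>=0.43.0` is the first release with the FSDP-compatible `quant_storage` path used in the patcher below, `accelerate>=0.28.0` understands the FSDP + QLoRA combination, and transformers is installed from source, presumably because the matching `bnb_4bit_quant_storage` support had not yet shipped in a release at the time of this commit.
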
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
+    --config_file ../accelerate/fsdp_config.yaml \
+    ../../src/train_bash.py \
+    --stage sft \
+    --do_train \
+    --model_name_or_path meta-llama/Llama-2-70b-hf \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
+    --dataset_dir ../../data \
+    --template default \
+    --finetuning_type lora \
+    --lora_target q_proj,v_proj \
+    --output_dir ../../saves/LLaMA2-70B/lora/sft \
+    --overwrite_cache \
+    --overwrite_output_dir \
+    --cutoff_len 1024 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --save_steps 100 \
+    --eval_steps 100 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --max_samples 3000 \
+    --val_size 0.1 \
+    --quantization_bit 4 \
+    --plot_loss \
+    --fp16

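The QLoRA part of the recipe lives in the flags rather than in the launcher: `--quantization_bit 4` loads the 70B base model in 4-bit while `--finetuning_type lora` trains adapters on `q_proj,v_proj`, and `--fp16` keeps compute (and, via the patcher change below, the quantized storage) in float16. With a per-device batch size of 1, 8 accumulation steps, and 2 processes, the effective batch size is 16.
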
@@ -7,11 +7,11 @@ python -m torch.distributed.run \
     --master_addr $MASTER_ADDR \
     --master_port $MASTER_PORT \
     ../../src/train_bash.py \
-    --deepspeed ds_z3_config.json \
+    --deepspeed ../deepspeed/ds_z3_config.json \
     --stage sft \
     --do_train \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
-    --dataset alpaca_gpt4_en \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
     --dataset_dir ../../data \
     --template default \
     --finetuning_type full \

@@ -1,11 +1,11 @@
 #!/bin/bash
 
 deepspeed --num_gpus 4 ../../src/train_bash.py \
-    --deepspeed ds_z3_config.json \
+    --deepspeed ../deepspeed/ds_z3_config.json \
     --stage sft \
     --do_train \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
-    --dataset alpaca_gpt4_en \
+    --dataset alpaca_gpt4_en,glaive_toolcall \
     --dataset_dir ../../data \
     --template default \
     --finetuning_type full \

@@ -1,7 +1,7 @@
 #!/bin/bash
 
 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
-    --config_file master_config.yaml \
+    --config_file ../accelerate/master_config.yaml \
     ../../src/train_bash.py \
     --stage sft \
     --do_train \

@@ -1,7 +1,7 @@
 #!/bin/bash
 
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
-    --config_file single_config.yaml \
+    --config_file ../accelerate/single_config.yaml \
     ../../src/train_bash.py \
     --stage sft \
     --do_train \

@@ -3,7 +3,7 @@ transformers>=4.37.2
 datasets>=2.14.3
 accelerate>=0.27.2
 peft>=0.9.0
-trl>=0.7.11
+trl>=0.8.1
 gradio>=3.38.0,<4.0.0
 scipy
 einops

@@ -65,7 +65,7 @@ def check_dependencies() -> None:
     require_version("datasets>=2.14.3", "To fix: pip install datasets>=2.14.3")
     require_version("accelerate>=0.27.2", "To fix: pip install accelerate>=0.27.2")
     require_version("peft>=0.9.0", "To fix: pip install peft>=0.9.0")
-    require_version("trl>=0.7.11", "To fix: pip install trl>=0.7.11")
+    require_version("trl>=0.8.1", "To fix: pip install trl>=0.8.1")
 
 
 def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:

@@ -81,7 +81,8 @@ def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:
 
         # Due to the design of 4bit linear layers from bitsandbytes, multiply the number of parameters by 2
         if param.__class__.__name__ == "Params4bit":
-            num_params = num_params * 2
+            num_bytes = param.quant_storage.itemsize if hasattr(param, "quant_storage") else 1
+            num_params = num_params * 2 * num_bytes
 
         all_param += num_params
         if param.requires_grad:

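The multiplier change matters because bitsandbytes keeps 4-bit weights packed inside a storage tensor, so `numel()` counts storage elements rather than weights: two 4-bit values fit in every byte, and once `bnb_4bit_quant_storage` is set to a wider dtype (added below for FSDP), each storage element spans several bytes. A minimal sketch of the arithmetic on hypothetical packed tensors, not the repo's own code:

```python
import torch

# 4096 four-bit weights packed into uint8 storage: two weights per byte.
packed_u8 = torch.empty(2048, dtype=torch.uint8)
print(packed_u8.numel() * 2 * packed_u8.element_size())    # 2048 * 2 * 1 = 4096

# The same weights with quant storage in float16 (the FSDP + QLoRA setting):
# each 2-byte element now holds four 4-bit weights, hence the 2 * itemsize factor.
packed_f16 = torch.empty(1024, dtype=torch.float16)
print(packed_f16.numel() * 2 * packed_f16.element_size())  # 1024 * 2 * 2 = 4096
```
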
@@ -210,9 +210,6 @@ def _configure_quantization(
         logger.info("Quantizing model to {} bit.".format(model_args.export_quantization_bit))
 
     elif model_args.quantization_bit is not None:  # bnb
-        if is_deepspeed_zero3_enabled():
-            require_version("bitsandbytes>=0.43.0", "To fix: pip install bitsandbytes>=0.43.0")
-
         if model_args.quantization_bit == 8:
             require_version("bitsandbytes>=0.37.0", "To fix: pip install bitsandbytes>=0.37.0")
             init_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)

@@ -224,6 +221,7 @@
                 bnb_4bit_compute_dtype=model_args.compute_dtype,
                 bnb_4bit_use_double_quant=model_args.double_quantization,
                 bnb_4bit_quant_type=model_args.quantization_type,
+                bnb_4bit_quant_storage=model_args.compute_dtype,  # crucial for fsdp qlora
             )
 
         init_kwargs["device_map"] = {"": get_current_device()}

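This one-line addition is the crux of the commit: FSDP flattens every parameter of a wrapped block into a single flat buffer, which must have a uniform dtype, so the packed 4-bit storage has to match the dtype of the unquantized weights. A standalone sketch of the resulting quantization config, with concrete values assumed where the patcher actually reads them from `model_args`:

```python
import torch
from transformers import BitsAndBytesConfig

# Assumed values for illustration; LLaMA-Factory derives these from model_args.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_quant_storage=torch.float16,  # match compute dtype so FSDP can flatten and shard
)
```
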
@@ -300,7 +298,7 @@ def patch_config(
     init_kwargs["torch_dtype"] = model_args.compute_dtype
     if not is_deepspeed_zero3_enabled():
         init_kwargs["low_cpu_mem_usage"] = model_args.low_cpu_mem_usage
-        if model_args.low_cpu_mem_usage:
+        if init_kwargs["low_cpu_mem_usage"]:
             if "device_map" not in init_kwargs:  # quant models cannot use auto device map
                 init_kwargs["device_map"] = model_args.device_map or {"": get_current_device()}
 