fix mixture-of-depths (MoD) integration
commit f58425ab45
parent d0273787be
@@ -46,7 +46,7 @@ Choose your path:
 - **Various models**: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
 - **Integrated methods**: (Continuous) pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO.
 - **Scalable resources**: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA and 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
-- **Advanced algorithms**: GaLore, Mixture of Depths, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning.
+- **Advanced algorithms**: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and Agent tuning.
 - **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
 - **Faster inference**: OpenAI-style API, Gradio UI and CLI with vLLM worker.
@@ -68,16 +68,16 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/

 ## Changelog

-[24/04/19] We integrated **[Mixture of Depths](https://github.com/astramind-ai/Mixture-of-depths)**. see `examples/extras/MoD` for usage.
+[24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** according to [AstraMindAI's implementation](https://github.com/astramind-ai/Mixture-of-depths). See `examples/extras/mod` for usage.

 [24/04/19] We supported **Meta Llama 3** model series.

 [24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See `examples/extras/badam` for usage.

-<details><summary>Full Changelog</summary>
-
 [24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k within 24GB). It achieves **117%** speed and **50%** memory compared with FlashAttention-2, more benchmarks can be found in [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).

+<details><summary>Full Changelog</summary>
+
 [24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.

 [24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
@@ -251,6 +251,7 @@ You also can add a custom chat template to [template.py](src/llmtuner/data/templ
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
 - [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
+- [DPO mix (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)

 </details>
@@ -46,7 +46,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - **Various models**: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
 - **Integrated methods**: (continual) pre-training, supervised fine-tuning, reward model training, PPO training, DPO training and ORPO training.
 - **Various precisions**: 32-bit full-parameter tuning, 16-bit freeze-tuning, 16-bit LoRA tuning and 2/4/8-bit QLoRA tuning via AQLM/AWQ/GPTQ/LLM.int8.
-- **Advanced algorithms**: GaLore, Mixture of Depths, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and agent tuning.
+- **Advanced algorithms**: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and agent tuning.
 - **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
 - **Fast inference**: OpenAI-style API, web UI and CLI based on vLLM.
@@ -68,16 +68,16 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd

 ## Changelog

-[24/04/19] We integrated **[Mixture of Depths](https://github.com/astramind-ai/Mixture-of-depths)**. See `examples/extras/MoD` for usage.
+[24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** based on [AstraMindAI's repository](https://github.com/astramind-ai/Mixture-of-depths). See `examples/extras/mod` for usage.

 [24/04/19] We supported the **Meta Llama 3** model series.

 [24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See `examples/extras/badam` for usage.

-<details><summary>Full Changelog</summary>
-
 [24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k trainable within 24GB). Compared with FlashAttention-2, it provides **117%** training speed and **50%** memory savings. More benchmarks can be found on [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).

+<details><summary>Full Changelog</summary>
+
 [24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.

 [24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available on arXiv!
@@ -251,6 +251,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
 - [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
+- [DPO mix (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)

 </details>
@@ -38,12 +38,11 @@ examples/
 │   └── sft.sh: Fine-tune model with BAdam
 ├── loraplus/
 │   └── sft.sh: Fine-tune model using LoRA+
+├── mod/
+│   └── sft.sh: Fine-tune model using Mixture-of-Depths
 ├── llama_pro/
 │   ├── expand.sh: Expand layers in the model
 │   └── sft.sh: Fine-tune the expanded model
-├── MoD/
-│   ├── freeze_sft.sh: Freeze finetune a model, updating only the MoD router
-│   └── sft.sh: Fine-tune the MoD model
 └── fsdp_qlora/
     └── sft.sh: Fine-tune quantized model with FSDP+QLoRA
 ```
@@ -38,12 +38,11 @@ examples/
 │   └── sft.sh: Train model with BAdam
 ├── loraplus/
 │   └── sft.sh: Train model with LoRA+
+├── mod/
+│   └── sft.sh: Train model with Mixture-of-Depths
 ├── llama_pro/
 │   ├── expand.sh: Expand layers in the model
 │   └── sft.sh: Train the expanded model
-├── MoD/
-│   ├── freeze_sft.sh: Freeze fine-tune the model, updating only the MoD router
-│   └── sft.sh: Fine-tune the MoD model
 └── fsdp_qlora/
     └── sft.sh: Fine-tune quantized model with FSDP+QLoRA
 ```
@@ -1,33 +0,0 @@
-#!/bin/bash
-
-CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
-    --stage sft \
-    --do_train \
-    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-    --dataset alpaca_gpt4_en,glaive_toolcall \
-    --dataset_dir ../../../data \
-    --template default \
-    --finetuning_type freeze \
-    --name_module_trainable router \
-    --output_dir ../../../saves/TinyLlama/TinyLlama-1.1B-Chat-v1.0/sft \
-    --mixture_of_depths convert \
-    --overwrite_cache \
-    --overwrite_output_dir \
-    --cutoff_len 1024 \
-    --preprocessing_num_workers 16 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 1 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --warmup_steps 20 \
-    --save_steps 100 \
-    --eval_steps 100 \
-    --evaluation_strategy steps \
-    --load_best_model_at_end \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
-    --val_size 0.1 \
-    --plot_loss \
-    --pure_bf16
@@ -3,20 +3,21 @@
 CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
     --stage sft \
     --do_train \
-    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+    --model_name_or_path meta-llama/Llama-2-7b-hf \
     --dataset alpaca_gpt4_en,glaive_toolcall \
     --dataset_dir ../../../data \
     --template default \
     --finetuning_type full \
-    --output_dir ../../../saves/TinyLlama/TinyLlama-1.1B-Chat-v1.0/sft \
     --mixture_of_depths convert \
+    --output_dir ../../../saves/LLaMA2-7B/mod/sft \
     --overwrite_cache \
     --overwrite_output_dir \
     --cutoff_len 1024 \
     --preprocessing_num_workers 16 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 1 \
+    --gradient_accumulation_steps 8 \
+    --optim paged_adamw_8bit \
     --lr_scheduler_type cosine \
     --logging_steps 10 \
     --warmup_steps 20 \
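For reference, the same run can be launched from Python. This is a minimal sketch, assuming `llmtuner.run_exp` accepts the argument dict that `src/train_bash.py` normally builds from CLI flags; paths and dataset names mirror the script above:

```python
# Hypothetical programmatic launch of the MoD SFT example above.
from llmtuner import run_exp

run_exp(dict(
    stage="sft",
    do_train=True,
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    dataset="alpaca_gpt4_en,glaive_toolcall",
    template="default",
    finetuning_type="full",
    mixture_of_depths="convert",
    output_dir="saves/LLaMA2-7B/mod/sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size: 1 x 8 = 8 per device
    optim="paged_adamw_8bit",  # paged 8-bit AdamW trims optimizer memory for full tuning
    lr_scheduler_type="cosine",
    learning_rate=5e-5,
    num_train_epochs=3.0,
    pure_bf16=True,
))
```

Raising gradient accumulation from 1 to 8 keeps per-step memory flat for the 7B full-parameter run while restoring a usable effective batch size.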
@@ -11,6 +11,7 @@ CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
     --use_galore \
     --galore_layerwise \
     --galore_target mlp,self_attn \
+    --galore_scale 2.0 \
     --galore_rank 128 \
     --output_dir ../../../saves/LLaMA2-7B/galore/sft \
     --overwrite_cache \
@@ -28,8 +29,8 @@ CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
     --evaluation_strategy steps \
     --load_best_model_at_end \
     --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
+    --num_train_epochs 30.0 \
-    --max_samples 3000 \
+    --max_samples 300 \
     --val_size 0.1 \
     --plot_loss \
     --pure_bf16
@@ -3,7 +3,7 @@
 CUDA_VISIBLE_DEVICES=0 python ../../src/evaluate.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
-    --template vanilla \
+    --template fewshot \
     --finetuning_type lora \
     --task mmlu \
     --split test \
@@ -343,7 +343,7 @@ def get_template_and_fix_tokenizer(
     name: Optional[str] = None,
 ) -> Template:
     if name is None:
-        template = templates["vanilla"]  # placeholder
+        template = templates["empty"]  # placeholder
     else:
         template = templates.get(name, None)
         if template is None:
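Both halves of the old `default_system` in the next hunk are joined by Python's implicit string-literal concatenation, so the visible text is unchanged; the functional difference is the trailing blank line. A quick check of that equivalence:

```python
# The adjacent string literals concatenate implicitly; only "\n\n" is new.
old = "Below is an instruction that describes a task. " "Write a response that appropriately completes the request."
new = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)
assert new == old + "\n\n"
```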
@@ -385,7 +385,8 @@ _register_template(
     format_user=StringFormatter(slots=["### Instruction:\n{{content}}\n\n### Response:\n"]),
     format_separator=EmptyFormatter(slots=["\n\n"]),
     default_system=(
-        "Below is an instruction that describes a task. " "Write a response that appropriately completes the request."
+        "Below is an instruction that describes a task. "
+        "Write a response that appropriately completes the request.\n\n"
     ),
 )

@@ -596,6 +597,13 @@ _register_template(
 )


+_register_template(
+    name="fewshot",
+    format_separator=EmptyFormatter(slots=["\n\n"]),
+    efficient_eos=True,
+)
+
+
 _register_template(
     name="gemma",
     format_user=StringFormatter(slots=["<start_of_turn>user\n{{content}}<end_of_turn>\n<start_of_turn>model\n"]),
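The new `fewshot` template carries no chat-role markup: demonstrations are simply joined by the `"\n\n"` separator, which suits few-shot benchmarks such as the MMLU run that now passes `--template fewshot`. A rough illustration of the resulting prompt shape, with made-up demo data:

```python
# Illustration only: approximates what the "fewshot" template produces.
demos = [
    "Question: What is 2 + 2?\nAnswer: 4",
    "Question: What is 3 * 3?\nAnswer: 9",
]
query = "Question: What is 5 - 1?\nAnswer:"
prompt = "\n\n".join(demos + [query])  # format_separator slots=["\n\n"]
print(prompt)
```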
@@ -740,13 +748,6 @@ _register_template(
 )


-_register_template(
-    name="vanilla",
-    format_separator=EmptyFormatter(slots=["\n"]),
-    efficient_eos=True,
-)
-
-
 _register_template(
     name="vicuna",
     format_user=StringFormatter(slots=["USER: {{content}} ASSISTANT:"]),
@@ -28,6 +28,8 @@ LOG_FILE_NAME = "trainer_log.jsonl"

 METHODS = ["full", "freeze", "lora"]

+MOD_SUPPORTED_MODELS = ["bloom", "falcon", "gemma", "llama", "mistral", "mixtral", "phi", "starcoder2"]
+
 PEFT_METHODS = ["lora"]

 SUBJECTS = ["Average", "STEM", "Social Sciences", "Humanities", "Other"]
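A minimal sketch of gating on the new constant before conversion, mirroring the check added to `load_model` later in this commit; the import path assumes the installed `llmtuner` package layout:

```python
# Hypothetical pre-flight check using the new constant.
from transformers import AutoConfig

from llmtuner.extras.constants import MOD_SUPPORTED_MODELS

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
if getattr(config, "model_type", None) not in MOD_SUPPORTED_MODELS:
    raise ValueError("Current model is not supported by mixture-of-depth.")
```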
@@ -83,6 +83,8 @@ def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:
         if param.__class__.__name__ == "Params4bit":
             if hasattr(param, "quant_storage") and hasattr(param.quant_storage, "itemsize"):
                 num_bytes = param.quant_storage.itemsize
+            elif hasattr(param, "element_size"):  # for older pytorch version
+                num_bytes = param.element_size()
             else:
                 num_bytes = 1

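The fallback chain reads as a standalone helper; a sketch of the same logic:

```python
# Sketch of the byte-size resolution above: prefer quant_storage.itemsize,
# fall back to element_size() on older PyTorch, else assume one byte.
def quant_param_num_bytes(param) -> int:
    if hasattr(param, "quant_storage") and hasattr(param.quant_storage, "itemsize"):
        return param.quant_storage.itemsize
    elif hasattr(param, "element_size"):  # for older pytorch version
        return param.element_size()
    else:
        return 1
```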
@@ -63,15 +63,15 @@ class ModelArguments:
     )
     flash_attn: bool = field(
         default=False,
-        metadata={"help": "Enable FlashAttention-2 for faster training."},
+        metadata={"help": "Enable FlashAttention for faster training."},
     )
     shift_attn: bool = field(
         default=False,
         metadata={"help": "Enable shift short attention (S^2-Attn) proposed by LongLoRA."},
     )
-    mixture_of_depths: Optional[Literal["convert", "continue"]] = field(
+    mixture_of_depths: Optional[Literal["convert", "load"]] = field(
         default=None,
-        metadata={"help": "Whether or not to use MoD in the model."},
+        metadata={"help": "Convert the model to mixture-of-depths (MoD) or load the MoD model."},
     )
     use_unsloth: bool = field(
         default=False,
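The rename pins down the intended two-step workflow: `convert` patches a vanilla checkpoint into MoD at load time for the first fine-tuning run, and `load` reads back a checkpoint that was already saved as a MoD model. A minimal mirror of the field, as a hypothetical stand-in rather than the repo's class:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class MoDArgs:  # hypothetical stand-in for ModelArguments
    mixture_of_depths: Optional[Literal["convert", "load"]] = field(
        default=None,
        metadata={"help": "Convert the model to mixture-of-depths (MoD) or load the MoD model."},
    )

train_args = MoDArgs(mixture_of_depths="convert")  # first MoD fine-tuning run
infer_args = MoDArgs(mixture_of_depths="load")     # reuse the saved MoD checkpoint
```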
@@ -82,8 +82,8 @@ def _check_extra_dependencies(
     if model_args.use_unsloth:
         require_version("unsloth", "Please install unsloth: https://github.com/unslothai/unsloth")

-    if model_args.mixture_of_depths == 'convert' or model_args.mixture_of_depths == 'continue':
-        require_version("mixture-of-depth", "To fix: pip install mixture-of-depth")
+    if model_args.mixture_of_depths is not None:
+        require_version("mixture-of-depth>=1.1.6", "To fix: pip install mixture-of-depth>=1.1.6")

     if model_args.infer_backend == "vllm":
         require_version("vllm>=0.3.3", "To fix: pip install vllm>=0.3.3")
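`require_version` comes from transformers and fails with the given hint when the installed distribution does not satisfy the specifier, so the `>=1.1.6` pin now surfaces as an actionable error instead of silently accepting an old package. A usage sketch:

```python
# Fails fast (with the hint message) if mixture-of-depth is missing or too old.
from transformers.utils.versions import require_version

require_version("mixture-of-depth>=1.1.6", "To fix: pip install mixture-of-depth>=1.1.6")
```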
@@ -69,7 +69,7 @@ def init_adapter(
         for name, _ in model.named_modules():
             if ".0." in name:
                 freeze_modules.add(name.split(".0.")[-1].split(".")[0])
-            elif ".1." in name:  # here since MoD starts from layer 1
+            elif ".1." in name:  # MoD starts from layer 1
                 freeze_modules.add(name.split(".1.")[-1].split(".")[0])

         trainable_layers = []
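The extra `.1.` branch exists because a MoD-converted network keeps its first full decoder block at index 1, so freeze-tuning must discover module names there too. A toy demonstration of the name-splitting (the model here is hypothetical):

```python
import torch.nn as nn

def block() -> nn.Module:  # stand-in for a decoder layer
    return nn.ModuleDict({"self_attn": nn.Linear(4, 4), "mlp": nn.Linear(4, 4)})

toy = nn.ModuleDict({"layers": nn.ModuleList([block(), block()])})

freeze_modules = set()
for name, _ in toy.named_modules():
    if ".0." in name:
        freeze_modules.add(name.split(".0.")[-1].split(".")[0])
    elif ".1." in name:  # MoD starts from layer 1
        freeze_modules.add(name.split(".1.")[-1].split(".")[0])

print(freeze_modules)  # {'self_attn', 'mlp'}
```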
@@ -3,6 +3,7 @@ from typing import TYPE_CHECKING, Any, Dict
 from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
 from trl import AutoModelForCausalLMWithValueHead

+from ..extras.constants import MOD_SUPPORTED_MODELS
 from ..extras.logging import get_logger
 from ..extras.misc import count_parameters, get_current_device, try_download_model_from_ms
 from .adapter import init_adapter
@@ -44,7 +45,7 @@ def load_tokenizer(model_args: "ModelArguments") -> "PreTrainedTokenizer":
             padding_side="right",
             **init_kwargs,
         )
-    except Exception:  # try the fast one
+    except ValueError:  # try the fast one
         tokenizer = AutoTokenizer.from_pretrained(
             model_args.model_name_or_path,
             use_fast=True,
@@ -71,12 +72,6 @@ def load_model(
     patch_config(config, tokenizer, model_args, init_kwargs, is_trainable)

     model = None
-    if model_args.mixture_of_depths == 'continue':
-        from MoD import AutoMoDModelForCausalLM
-        model = AutoMoDModelForCausalLM.from_pretrained(model_args.model_name_or_path, config=config)
-        if model.config.model_type == 'qwen2':
-            RuntimeError("Qwen models are not supported for MoD training.")
-
     if is_trainable and model_args.use_unsloth:
         from unsloth import FastLanguageModel  # type: ignore

@@ -104,14 +99,22 @@ def load_model(
     if model is None:
         init_kwargs["config"] = config
         init_kwargs["pretrained_model_name_or_path"] = model_args.model_name_or_path
-        model: "PreTrainedModel" = AutoModelForCausalLM.from_pretrained(**init_kwargs)

-        if model_args.mixture_of_depths == 'convert':
-            from MoD import convert_hf_model
-            if model.config.model_type == 'qwen2':
-                RuntimeError("Qwen models are not supported for MoD training.")
-            model = convert_hf_model(model)
+        if model_args.mixture_of_depths == "load":
+            from MoD import AutoMoDModelForCausalLM

+            model = AutoMoDModelForCausalLM.from_pretrained(**init_kwargs)
+        else:
+            model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
+
+        if model_args.mixture_of_depths == "convert":
+            from MoD import apply_mod_to_hf
+
+            if getattr(config, "model_type", None) not in MOD_SUPPORTED_MODELS:
+                raise ValueError("Current model is not supported by mixture-of-depth.")
+
+            model = apply_mod_to_hf(model)
+            model = model.to(model_args.compute_dtype)

     patch_model(model, tokenizer, model_args, is_trainable)
     register_autoclass(config, model, tokenizer)
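Taken together, the branches give a convert-then-reload workflow: convert and fine-tune once, then reopen the saved checkpoint with `mixture_of_depths="load"`. A condensed sketch of the convert path, assuming the MoD package entry points used above; model name and dtype are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM

from MoD import apply_mod_to_hf  # mixture-of-depth package

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = apply_mod_to_hf(model)    # insert MoD routing into supported decoder layers
model = model.to(torch.bfloat16)  # match the compute dtype, as load_model does
```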
@@ -119,7 +122,7 @@ def load_model(
     model = init_adapter(model, model_args, finetuning_args, is_trainable)

     if add_valuehead:
-        model: "AutoModelForCausalLMWithValueHead" = AutoModelForCausalLMWithValueHead.from_pretrained(model)
+        model = AutoModelForCausalLMWithValueHead.from_pretrained(model)
         patch_valuehead_model(model)

     if model_args.adapter_name_or_path is not None:
@@ -61,9 +61,7 @@ def _get_quantization_dataset(tokenizer: "PreTrainedTokenizer", model_args: "Mod
     return samples


-def _configure_attn_implementation(
-    config: "PretrainedConfig", model_args: "ModelArguments", init_kwargs: Dict[str, Any]
-) -> None:
+def _configure_attn_implementation(config: "PretrainedConfig", model_args: "ModelArguments") -> None:
     if model_args.flash_attn:
         if not is_flash_attn2_available():
             logger.warning("FlashAttention2 is not installed.")
@@ -73,9 +71,9 @@ def _configure_attn_implementation(
         if getattr(config, "model_type", None) == "internlm2":  # special case for custom models
             setattr(config, "attn_implementation", "flash_attention_2")
         else:
-            init_kwargs["attn_implementation"] = "flash_attention_2"
+            setattr(config, "_attn_implementation", "flash_attention_2")
     else:
-        init_kwargs["attn_implementation"] = "eager"
+        setattr(config, "_attn_implementation", "eager")


 def _configure_rope(config: "PretrainedConfig", model_args: "ModelArguments", is_trainable: bool) -> None:
@@ -295,7 +293,7 @@ def patch_config(
     if model_args.compute_dtype is None:  # priority: bf16 > fp16 > fp32
         model_args.compute_dtype = infer_optim_dtype(model_dtype=getattr(config, "torch_dtype", None))

-    _configure_attn_implementation(config, model_args, init_kwargs)
+    _configure_attn_implementation(config, model_args)
     _configure_rope(config, model_args, is_trainable)
     _configure_longlora(config, model_args, is_trainable)
     _configure_quantization(config, tokenizer, model_args, init_kwargs)