Merge branch 'main' of https://github.com/BUAADreamer/LLaMA-Factory
Commit 3f4556454c

README.md (25 changes)
@@ -70,14 +70,16 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/

 ## Changelog

+[24/05/14] We supported training and inference on Ascend NPU devices. Check the [installation](#installation) section for details.
+
 [24/05/13] We supported fine-tuning the **Yi-1.5** series models.

 [24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See [examples](examples/README.md) for usage.

-[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available on Hugging Face; see [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
-
 <details><summary>Full Changelog</summary>

+[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available on Hugging Face; see [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
+
 [24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** according to [AstraMindAI's implementation](https://github.com/astramind-ai/Mixture-of-depths). See [examples](examples/README.md) for usage.

 [24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See [examples](examples/README.md) for usage.

@@ -328,7 +330,7 @@ Extra dependencies available: torch, metrics, deepspeed, bitsandbytes, vllm, gal

 <details><summary>For Windows users</summary>

-If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of `bitsandbytes` library, which supports CUDA 11.1 to 12.2, please select the appropriate [release version](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) based on your CUDA version.
+If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of the `bitsandbytes` library that supports CUDA 11.1 to 12.2. Please select the appropriate [release version](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) based on your CUDA version.

 ```bash
 pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
@@ -338,6 +340,23 @@ To enable FlashAttention-2 on the Windows platform, you need to install the prec

 </details>

+<details><summary>For Ascend NPU users</summary>
+
+To utilize Ascend NPU devices for (distributed) training and inference, you need to install the **[torch-npu](https://gitee.com/ascend/pytorch)** package and the **[Ascend CANN Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**.
+
+| Requirement | Minimum | Recommended |
+| ----------- | ------- | ----------- |
+| CANN        | 8.0.RC1 | 8.0.RC1     |
+| torch       | 2.2.0   | 2.2.0       |
+| torch-npu   | 2.2.0   | 2.2.0       |
+| deepspeed   | 0.13.2  | 0.13.2      |
+
+Remember to use `ASCEND_RT_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES` to specify the device to use.
+
+If you cannot run inference on the NPU devices, try setting `do_sample: false` in the configurations.
+
+</details>
+
 ### Data Preparation

 Please refer to [data/README.md](data/README.md) for details about the format of the dataset files. You can either use datasets on the HuggingFace / ModelScope hub or load a dataset from local disk.
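The new Ascend NPU section above amounts to: install CANN and torch-npu, select devices with `ASCEND_RT_VISIBLE_DEVICES`, and fall back to greedy decoding if sampling misbehaves. A quick pre-flight check along those lines might look as follows (a hypothetical snippet, not part of this commit; it only assumes torch and transformers are installed as described):

```python
# Hypothetical pre-flight check for the Ascend NPU setup described above.
import os

# NPU analogue of CUDA_VISIBLE_DEVICES, as the README section above notes.
os.environ.setdefault("ASCEND_RT_VISIBLE_DEVICES", "0")

import torch
from transformers import is_torch_npu_available

if is_torch_npu_available():
    print("visible NPUs:", torch.npu.device_count())
else:
    print("torch-npu / CANN not detected; revisit the installation table above.")
```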
README_zh.md (23 changes)
@@ -70,14 +70,16 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd

 ## Changelog

+[24/05/14] We supported training and inference on Ascend NPU devices. For details, see the [installation](#安装-llama-factory) section.
+
 [24/05/13] We supported fine-tuning the Yi-1.5 series models.

 [24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal model. See [examples](examples/README_zh.md) for detailed usage.

-[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. The Hugging Face community has released two Llama-3 models fine-tuned with LLaMA Factory; see [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
-
 <details><summary>Expand Changelog</summary>

+[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. The Hugging Face community has released two Llama-3 models fine-tuned with LLaMA Factory; see [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
+
 [24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** training based on [AstraMindAI's repository](https://github.com/astramind-ai/Mixture-of-depths). See [examples](examples/README_zh.md) for detailed usage.

 [24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See [examples](examples/README_zh.md) for detailed usage.

@@ -338,6 +340,23 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl

 </details>

+<details><summary>Guide for Ascend NPU users</summary>
+
+To use Ascend NPU devices for (distributed) training or inference, you need to install the **[torch-npu](https://gitee.com/ascend/pytorch)** package and the **[Ascend CANN Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**.
+
+| Requirement | Minimum | Recommended |
+| ----------- | ------- | ----------- |
+| CANN        | 8.0.RC1 | 8.0.RC1     |
+| torch       | 2.2.0   | 2.2.0       |
+| torch-npu   | 2.2.0   | 2.2.0       |
+| deepspeed   | 0.13.2  | 0.13.2      |
+
+Remember to use `ASCEND_RT_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES` to specify the devices to use.
+
+If inference does not work properly, try setting `do_sample: false`.
+
+</details>
+
 ### Data Preparation

 For the format of the dataset files, please refer to [data/README_zh.md](data/README_zh.md). You can either use datasets on the HuggingFace / ModelScope hub or load a local dataset.
A binary image file was also changed (not shown): 186 KiB before, 145 KiB after.
@@ -7,6 +7,7 @@ Make sure to execute these commands in the `LLaMA-Factory` directory.
 - [LoRA Fine-Tuning on A Single GPU](#lora-fine-tuning-on-a-single-gpu)
 - [QLoRA Fine-Tuning on a Single GPU](#qlora-fine-tuning-on-a-single-gpu)
 - [LoRA Fine-Tuning on Multiple GPUs](#lora-fine-tuning-on-multiple-gpus)
+- [LoRA Fine-Tuning on Multiple NPUs](#lora-fine-tuning-on-multiple-npus)
 - [Full-Parameter Fine-Tuning on Multiple GPUs](#full-parameter-fine-tuning-on-multiple-gpus)
 - [Merging LoRA Adapters and Quantization](#merging-lora-adapters-and-quantization)
 - [Inferring LoRA Fine-Tuned Models](#inferring-lora-fine-tuned-models)

@@ -124,6 +125,14 @@ bash examples/lora_multi_gpu/multi_node.sh
 bash examples/lora_multi_gpu/ds_zero3.sh
 ```

+### LoRA Fine-Tuning on Multiple NPUs
+
+#### Supervised Fine-Tuning with DeepSpeed ZeRO-0
+
+```bash
+bash examples/lora_multi_npu/ds_zero0.sh
+```
+
 ### Full-Parameter Fine-Tuning on Multiple GPUs

 #### Supervised Fine-Tuning with Accelerate on Single Node
@@ -7,6 +7,7 @@
 - [Single-GPU LoRA Fine-Tuning](#单-gpu-lora-微调)
 - [Single-GPU QLoRA Fine-Tuning](#单-gpu-qlora-微调)
 - [Multi-GPU LoRA Fine-Tuning](#多-gpu-lora-微调)
+- [Multi-NPU LoRA Fine-Tuning](#多-npu-lora-微调)
 - [Multi-GPU Full-Parameter Fine-Tuning](#多-gpu-全参数微调)
 - [Merging LoRA Adapters and Model Quantization](#合并-lora-适配器与模型量化)
 - [Inferring LoRA Models](#推理-lora-模型)

@@ -124,6 +125,14 @@ bash examples/lora_multi_gpu/multi_node.sh
 bash examples/lora_multi_gpu/ds_zero3.sh
 ```

+### Multi-NPU LoRA Fine-Tuning
+
+#### Training with DeepSpeed ZeRO-0
+
+```bash
+bash examples/lora_multi_npu/ds_zero0.sh
+```
+
 ### Multi-GPU Full-Parameter Fine-Tuning

 #### Single-Node Training with DeepSpeed
@@ -0,0 +1,28 @@
+{
+  "train_batch_size": "auto",
+  "train_micro_batch_size_per_gpu": "auto",
+  "gradient_accumulation_steps": "auto",
+  "gradient_clipping": "auto",
+  "zero_allow_untested_optimizer": true,
+  "fp16": {
+    "enabled": "auto",
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "initial_scale_power": 16,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "bf16": {
+    "enabled": "auto"
+  },
+  "zero_optimization": {
+    "stage": 0,
+    "allgather_partitions": true,
+    "allgather_bucket_size": 5e8,
+    "overlap_comm": true,
+    "reduce_scatter": true,
+    "reduce_bucket_size": 5e8,
+    "contiguous_gradients": true,
+    "round_robin_gradients": true
+  }
+}
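For reference, the `"auto"` entries in this new DeepSpeed ZeRO-0 config are placeholders that the HF Trainer's DeepSpeed integration is expected to fill in from the training arguments at launch time. A minimal sketch for inspecting which fields are left to that mechanism (hypothetical helper, not part of the commit; it assumes the config exists at the path referenced by the example configs below):

```python
# Hypothetical helper: list the fields a DeepSpeed config leaves as "auto",
# which are resolved from the training arguments at runtime.
import json

def auto_fields(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    found = []

    def walk(node, prefix=""):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, prefix + key + ".")
        elif node == "auto":
            found.append(prefix.rstrip("."))

    walk(cfg)
    return found

# e.g. auto_fields("examples/deepspeed/ds_z0_config.json")
# -> ['train_batch_size', 'train_micro_batch_size_per_gpu',
#     'gradient_accumulation_steps', 'gradient_clipping',
#     'fp16.enabled', 'bf16.enabled']
```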
@@ -6,7 +6,7 @@ RANK=0
 MASTER_ADDR=192.168.0.1
 MASTER_PORT=29500

-CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run \
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
     --nproc_per_node $NPROC_PER_NODE \
     --nnodes $NNODES \
     --node_rank $RANK \
@@ -1,9 +1,15 @@
 #!/bin/bash

 NPROC_PER_NODE=4
+NNODES=1
+RANK=0
+MASTER_ADDR=127.0.0.1
+MASTER_PORT=29500

-CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run \
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
     --nproc_per_node $NPROC_PER_NODE \
-    --nnodes 1 \
-    --standalone \
+    --nnodes $NNODES \
+    --node_rank $RANK \
+    --master_addr $MASTER_ADDR \
+    --master_port $MASTER_PORT \
     src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
@@ -1,9 +1,15 @@
 #!/bin/bash

 NPROC_PER_NODE=4
+NNODES=1
+RANK=0
+MASTER_ADDR=127.0.0.1
+MASTER_PORT=29500

-CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run \
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
     --nproc_per_node $NPROC_PER_NODE \
-    --nnodes 1 \
-    --standalone \
+    --nnodes $NNODES \
+    --node_rank $RANK \
+    --master_addr $MASTER_ADDR \
+    --master_port $MASTER_PORT \
     src/train.py examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+NPROC_PER_NODE=4
+NNODES=1
+RANK=0
+MASTER_ADDR=127.0.0.1
+MASTER_PORT=29500
+
+ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 torchrun \
+    --nproc_per_node $NPROC_PER_NODE \
+    --nnodes $NNODES \
+    --node_rank $RANK \
+    --master_addr $MASTER_ADDR \
+    --master_port $MASTER_PORT \
+    src/train.py examples/lora_multi_npu/llama3_lora_sft_ds.yaml
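One assumption baked into launch scripts like the one above is that `NPROC_PER_NODE` matches the number of devices made visible, since each torchrun worker binds to one accelerator. A tiny, hypothetical sanity check (not part of the commit) could be:

```python
# Hypothetical sanity check: one torchrun worker per visible NPU.
import os

nproc_per_node = 4  # NPROC_PER_NODE in the launch script above
visible = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "0,1,2,3").split(",")

assert nproc_per_node == len(visible), (
    f"torchrun would spawn {nproc_per_node} workers but {len(visible)} NPUs are visible"
)
```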
@@ -0,0 +1,42 @@
+# model
+model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
+
+# method
+stage: sft
+do_train: true
+finetuning_type: lora
+lora_target: q_proj,v_proj
+
+# ddp
+ddp_timeout: 180000000
+deepspeed: examples/deepspeed/ds_z0_config.json
+
+# dataset
+dataset: identity,alpaca_gpt4_en
+template: llama3
+cutoff_len: 1024
+max_samples: 1000
+overwrite_cache: true
+preprocessing_num_workers: 16
+
+# output
+output_dir: saves/llama3-8b/lora/sft
+logging_steps: 10
+save_steps: 500
+plot_loss: true
+overwrite_output_dir: true
+
+# train
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 2
+learning_rate: 0.0001
+num_train_epochs: 3.0
+lr_scheduler_type: cosine
+warmup_steps: 0.1
+fp16: true
+
+# eval
+val_size: 0.1
+per_device_eval_batch_size: 1
+evaluation_strategy: steps
+eval_steps: 500
@@ -51,7 +51,7 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":
         allow_methods=["*"],
         allow_headers=["*"],
     )
-    api_key = os.environ.get("API_KEY", None)
+    api_key = os.environ.get("API_KEY")
     security = HTTPBearer(auto_error=False)

     async def verify_api_key(auth: Annotated[Optional[HTTPAuthorizationCredentials], Depends(security)]):
@@ -65,12 +65,13 @@ class HuggingfaceEngine(BaseEngine):
         prompt_length = len(prompt_ids)
         inputs = torch.tensor([prompt_ids], device=model.device)

-        do_sample = input_kwargs.pop("do_sample", None)
-        temperature = input_kwargs.pop("temperature", None)
-        top_p = input_kwargs.pop("top_p", None)
-        top_k = input_kwargs.pop("top_k", None)
-        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
-        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
+        do_sample = input_kwargs.pop("do_sample", generating_args["do_sample"])
+        temperature = input_kwargs.pop("temperature", generating_args["temperature"])
+        top_p = input_kwargs.pop("top_p", generating_args["top_p"])
+        top_k = input_kwargs.pop("top_k", generating_args["top_k"])
+        num_return_sequences = input_kwargs.pop("num_return_sequences", 1)
+        repetition_penalty = input_kwargs.pop("repetition_penalty", generating_args["repetition_penalty"])
+        length_penalty = input_kwargs.pop("length_penalty", generating_args["length_penalty"])
         max_length = input_kwargs.pop("max_length", None)
         max_new_tokens = input_kwargs.pop("max_new_tokens", None)
         stop = input_kwargs.pop("stop", None)

@@ -78,14 +79,16 @@ class HuggingfaceEngine(BaseEngine):
         if stop is not None:
             raise ValueError("Stop parameter is not supported in Huggingface engine yet.")

+        generating_args = generating_args.copy()
         generating_args.update(
             dict(
-                do_sample=do_sample if do_sample is not None else generating_args["do_sample"],
-                temperature=temperature or generating_args["temperature"],
-                top_p=top_p or generating_args["top_p"],
-                top_k=top_k or generating_args["top_k"],
-                num_return_sequences=num_return_sequences or 1,
-                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
+                do_sample=do_sample,
+                temperature=temperature,
+                top_p=top_p,
+                top_k=top_k,
+                num_return_sequences=num_return_sequences,
+                repetition_penalty=repetition_penalty,
+                length_penalty=length_penalty,
                 eos_token_id=[tokenizer.eos_token_id] + tokenizer.additional_special_tokens_ids,
                 pad_token_id=tokenizer.pad_token_id,
             )

@@ -94,6 +97,10 @@ class HuggingfaceEngine(BaseEngine):
         if isinstance(num_return_sequences, int) and num_return_sequences > 1:
             generating_args["do_sample"] = True

+        if not generating_args["do_sample"]:
+            generating_args.pop("temperature", None)
+            generating_args.pop("top_p", None)
+
         if max_length:
             generating_args.pop("max_new_tokens", None)
             generating_args["max_length"] = max_length
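The net effect of the three hunks above: per-request overrides now fall back to the engine's default `generating_args`, `length_penalty` joins the overridable set, and sampling-only knobs are dropped when decoding is greedy. A standalone sketch of that fallback pattern, with hypothetical values (not LLaMA-Factory's actual objects):

```python
# Standalone sketch of the pop-with-default pattern used above (hypothetical values).
generating_args = {"do_sample": True, "temperature": 0.95, "top_p": 0.7, "top_k": 50}
input_kwargs = {"temperature": 0.2, "do_sample": False}  # one request's overrides

resolved = dict(generating_args)
resolved.update(
    do_sample=input_kwargs.pop("do_sample", generating_args["do_sample"]),
    temperature=input_kwargs.pop("temperature", generating_args["temperature"]),
    top_p=input_kwargs.pop("top_p", generating_args["top_p"]),
    top_k=input_kwargs.pop("top_k", generating_args["top_k"]),
)

# Greedy decoding: drop sampling-only parameters instead of passing them along,
# which avoids warnings about unused temperature / top_p.
if not resolved["do_sample"]:
    resolved.pop("temperature", None)
    resolved.pop("top_p", None)

print(resolved)  # {'do_sample': False, 'top_k': 50}
```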
@@ -89,43 +89,35 @@ class VllmEngine(BaseEngine):
         )
         prompt_length = len(prompt_ids)

-        temperature = input_kwargs.pop("temperature", None)
-        top_p = input_kwargs.pop("top_p", None)
-        top_k = input_kwargs.pop("top_k", None)
-        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
-        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
+        use_beam_search = self.generating_args["num_beams"] > 1
+        temperature = input_kwargs.pop("temperature", self.generating_args["temperature"])
+        top_p = input_kwargs.pop("top_p", self.generating_args["top_p"])
+        top_k = input_kwargs.pop("top_k", self.generating_args["top_k"])
+        num_return_sequences = input_kwargs.pop("num_return_sequences", 1)
+        repetition_penalty = input_kwargs.pop("repetition_penalty", self.generating_args["repetition_penalty"])
+        length_penalty = input_kwargs.pop("length_penalty", self.generating_args["length_penalty"])
         max_length = input_kwargs.pop("max_length", None)
         max_new_tokens = input_kwargs.pop("max_new_tokens", None)
         stop = input_kwargs.pop("stop", None)

-        generating_args = self.generating_args.copy()
-        generating_args.update(
-            dict(
-                temperature=temperature or generating_args["temperature"],
-                top_p=top_p or generating_args["top_p"],
-                top_k=top_k or generating_args["top_k"],
-                num_return_sequences=num_return_sequences or 1,
-                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
-            )
-        )
+        max_tokens = self.generating_args["max_new_tokens"] or self.generating_args["max_length"]

         if max_length:
-            generating_args["max_new_tokens"] = max_length - prompt_length
+            max_tokens = max_length - prompt_length if max_length > prompt_length else 1

         if max_new_tokens:
-            generating_args["max_new_tokens"] = max_new_tokens
+            max_tokens = max_new_tokens

         sampling_params = SamplingParams(
-            n=generating_args["num_return_sequences"],
-            repetition_penalty=generating_args["repetition_penalty"],
-            temperature=generating_args["temperature"],
-            top_p=generating_args["top_p"],
-            top_k=generating_args["top_k"],
-            use_beam_search=generating_args["num_beams"] > 1,
-            length_penalty=generating_args["length_penalty"],
+            n=num_return_sequences,
+            repetition_penalty=repetition_penalty,
+            temperature=temperature,
+            top_p=top_p,
+            top_k=top_k,
+            use_beam_search=use_beam_search,
+            length_penalty=length_penalty,
             stop=stop,
             stop_token_ids=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
-            max_tokens=generating_args["max_new_tokens"],
+            max_tokens=max_tokens,
             skip_special_tokens=True,
         )
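In the vLLM path the defaults dictionary is no longer mutated; the request values and the engine defaults are folded into a single `max_tokens` value that is handed to `SamplingParams`. A minimal sketch of that resolution order, using a hypothetical helper name (not part of the commit):

```python
# Hypothetical helper mirroring the max_tokens resolution order above:
# defaults first, then max_length (relative to the prompt), then max_new_tokens.
from typing import Optional

def resolve_max_tokens(
    prompt_length: int,
    default_max_new_tokens: Optional[int],
    default_max_length: Optional[int],
    max_length: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
) -> Optional[int]:
    max_tokens = default_max_new_tokens or default_max_length
    if max_length:
        # keep at least one new token even when the prompt already fills max_length
        max_tokens = max_length - prompt_length if max_length > prompt_length else 1
    if max_new_tokens:
        max_tokens = max_new_tokens
    return max_tokens

print(resolve_max_tokens(prompt_length=100, default_max_new_tokens=512,
                         default_max_length=None, max_length=256))  # 156
```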
@@ -53,7 +53,7 @@ class LogCallback(TrainerCallback):
         self.aborted = False
         self.do_train = False
         """ Web UI """
-        self.webui_mode = bool(int(os.environ.get("LLAMABOARD_ENABLED", "0")))
+        self.webui_mode = os.environ.get("LLAMABOARD_ENABLED", "0").lower() in ["true", "1"]
         if self.webui_mode:
             signal.signal(signal.SIGABRT, self._set_abort)
             self.logger_handler = LoggerHandler(output_dir)
@@ -58,7 +58,7 @@ class AverageMeter:


 def check_dependencies() -> None:
-    if int(os.environ.get("DISABLE_VERSION_CHECK", "0")):
+    if os.environ.get("DISABLE_VERSION_CHECK", "0").lower() in ["true", "1"]:
         logger.warning("Version checking has been disabled, may lead to unexpected behaviors.")
     else:
         require_version("transformers>=4.37.2", "To fix: pip install transformers>=4.37.2")
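Several environment flags (LLAMABOARD_ENABLED, DISABLE_VERSION_CHECK, GRADIO_SHARE, and the new JIT_COMPILE) now use a lowercase membership test instead of `bool(int(...))`, so values like `true`/`True` are accepted and non-numeric strings no longer raise. A short standalone illustration of the difference (not part of the commit):

```python
import os

os.environ["DISABLE_VERSION_CHECK"] = "true"

# Old parsing: only numeric strings work; "true" raises ValueError.
try:
    old_flag = bool(int(os.environ.get("DISABLE_VERSION_CHECK", "0")))
except ValueError as err:
    old_flag = None
    print("old parsing failed:", err)

# New parsing: accepts "1" or "true" in any letter case, anything else is False.
new_flag = os.environ.get("DISABLE_VERSION_CHECK", "0").lower() in ["true", "1"]
print("new parsing:", new_flag)  # True
```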
@@ -21,6 +21,9 @@ def smooth(scalars: List[float]) -> List[float]:
     r"""
     EMA implementation according to TensorBoard.
     """
+    if len(scalars) == 0:
+        return []
+
     last = scalars[0]
     smoothed = []
     weight = 1.8 * (1 / (1 + math.exp(-0.05 * len(scalars))) - 0.5)  # a sigmoid function

@@ -32,6 +35,9 @@ def smooth(scalars: List[float]) -> List[float]:


 def gen_loss_plot(trainer_log: List[Dict[str, Any]]) -> "matplotlib.figure.Figure":
+    r"""
+    Plots loss curves in LlamaBoard.
+    """
     plt.close("all")
     plt.switch_backend("agg")
     fig = plt.figure()

@@ -51,6 +57,9 @@ def gen_loss_plot(trainer_log: List[Dict[str, Any]]) -> "matplotlib.figure.Figur


 def plot_loss(save_dictionary: os.PathLike, keys: List[str] = ["loss"]) -> None:
+    r"""
+    Plots loss curves and saves the image.
+    """
     plt.switch_backend("agg")
     with open(os.path.join(save_dictionary, TRAINER_STATE_NAME), "r", encoding="utf-8") as f:
         data = json.load(f)
@@ -1,9 +1,10 @@
+import os
 from types import MethodType
 from typing import TYPE_CHECKING, Any, Dict

 import torch
 from peft import PeftModel
-from transformers import PreTrainedModel, PreTrainedTokenizerBase
+from transformers import PreTrainedModel, PreTrainedTokenizerBase, is_torch_npu_available
 from transformers.integrations import is_deepspeed_zero3_enabled

 from ..extras.logging import get_logger

@@ -44,6 +45,10 @@ def patch_config(
     if model_args.compute_dtype is None:  # priority: bf16 > fp16 > fp32
         model_args.compute_dtype = infer_optim_dtype(model_dtype=getattr(config, "torch_dtype", None))

+    if is_torch_npu_available():
+        use_jit_compile = os.environ.get("JIT_COMPILE", "0").lower() in ["true", "1"]
+        torch.npu.set_compile_mode(jit_compile=use_jit_compile)
+
     configure_attn_implementation(config, model_args)
     configure_rope(config, model_args, is_trainable)
     configure_longlora(config, model_args, is_trainable)

@@ -57,7 +62,7 @@ def patch_config(
         logger.info("Using KV cache for faster generation.")

     if getattr(config, "model_type", None) == "qwen":
-        setattr(config, "use_flash_attn", model_args.flash_attn)
+        setattr(config, "use_flash_attn", model_args.flash_attn == "fa2")
         for dtype_name, dtype in [("fp16", torch.float16), ("bf16", torch.bfloat16), ("fp32", torch.float32)]:
             setattr(config, dtype_name, model_args.compute_dtype == dtype)
@@ -22,7 +22,7 @@ def configure_attn_implementation(config: "PretrainedConfig", model_args: "Model

     elif model_args.flash_attn == "sdpa":
         if not is_sdpa_available():
-            logger.warning("Torch>=2.1.1 is required for SDPA attention.")
+            logger.warning("torch>=2.1.1 is required for SDPA attention.")
             return

         requested_attn_implementation = "sdpa"

@@ -52,4 +52,4 @@ def print_attn_implementation(config: "PretrainedConfig") -> None:
     elif attn_implementation == "sdpa":
         logger.info("Using torch SDPA for faster training and inference.")
     else:
-        logger.info("Using vanilla Attention implementation.")
+        logger.info("Using vanilla attention implementation.")
@@ -71,12 +71,12 @@ def create_web_demo() -> gr.Blocks:


 def run_web_ui() -> None:
-    gradio_share = bool(int(os.environ.get("GRADIO_SHARE", "0")))
+    gradio_share = os.environ.get("GRADIO_SHARE", "0").lower() in ["true", "1"]
     server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
     create_ui().queue().launch(share=gradio_share, server_name=server_name)


 def run_web_demo() -> None:
-    gradio_share = bool(int(os.environ.get("GRADIO_SHARE", "0")))
+    gradio_share = os.environ.get("GRADIO_SHARE", "0").lower() in ["true", "1"]
     server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
     create_web_demo().queue().launch(share=gradio_share, server_name=server_name)
@@ -4,7 +4,7 @@ from llmtuner.webui.interface import create_ui


 def main():
-    gradio_share = bool(int(os.environ.get("GRADIO_SHARE", "0")))
+    gradio_share = os.environ.get("GRADIO_SHARE", "0").lower() in ["true", "1"]
     server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
     create_ui().queue().launch(share=gradio_share, server_name=server_name)