wql faa909dcc3 add: add mindie file		2024-09-10 15:38:33 +08:00

README

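LLaMA (Large Language Model Meta AI) and LLaMA2 (Large Language Model Meta AI 2) are open and efficient large foundation language models released by Meta AI. Through natural-language interaction they handle tasks such as knowledge answering, text generation, translation, language understanding, and writing and explaining code.
This repository implements a set of LLaMa inference models for NPU hardware. Used together with the acceleration library, it aims for the best possible inference performance on NPUs.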

Feature matrix

This matrix lists the features supported by each LLaMa model.

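| Model and parameter count | 800I A2 Tensor Parallelism | 300I DUO Tensor Parallelism | FP16 | BF16 (800I A2 only) | Flash Attention | Paged Attention | W8A8 quantization | W8A16 quantization | KV cache quantization | Sparse quantization (300I DUO only) | MOE | MindIE | TGI | Long sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMa-7B | world size 1,2,4,8 | world size 2,4 | √ | √ | √ | √ | × | × | × | × | × | √ | √ | × |
| LLaMa-13B | world size 1,2,4,8 | world size 2,4 | √ | √ | √ | √ | × | × | × | × | × | √ | √ | × |
| LLaMa-33B | world size 4,8 | world size 2,4 | √ | √ | √ | √ | × | × | × | √ | × | × | × | × |
| LLaMa-65B | world size 8 | × | √ | √ | √ | √ | × | √ | √ | × | × | √ | × | × |
| LLaMa2-7B | world size 1,2,4,8 | world size 2,4 | √ | √ | √ | √ | √ | × | × | √ | × | √ | √ | × |
| LLaMa2-13B | world size 1,2,4,8 | world size 2,4 | √ | √ | √ | √ | √ | × | × | √ | × | √ | √ | × |
| LLaMa2-70B | world size 8 | × | √ | √ | √ | √ | √ | √ | × | × | × | √ | √ | × |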
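
The tensor-parallel columns of the feature matrix can be encoded as a lookup table, so a launch configuration can be checked before any script runs. This is a minimal sketch: `SUPPORTED_WORLD_SIZES` and `is_supported` are hypothetical names, not part of the repository, and the values are copied from the matrix.

```python
# Hypothetical lookup table; the world sizes are copied from the feature matrix.
SUPPORTED_WORLD_SIZES = {
    "LLaMa-7B": {"800I A2": (1, 2, 4, 8), "300I DUO": (2, 4)},
    "LLaMa-13B": {"800I A2": (1, 2, 4, 8), "300I DUO": (2, 4)},
    "LLaMa-33B": {"800I A2": (4, 8), "300I DUO": (2, 4)},
    "LLaMa-65B": {"800I A2": (8,), "300I DUO": ()},
    "LLaMa2-7B": {"800I A2": (1, 2, 4, 8), "300I DUO": (2, 4)},
    "LLaMa2-13B": {"800I A2": (1, 2, 4, 8), "300I DUO": (2, 4)},
    "LLaMa2-70B": {"800I A2": (8,), "300I DUO": ()},
}


def is_supported(model: str, hardware: str, world_size: int) -> bool:
    """Return True if the matrix lists this world size for the model on this hardware."""
    return world_size in SUPPORTED_WORLD_SIZES.get(model, {}).get(hardware, ())
```

For example, `is_supported("LLaMa2-70B", "300I DUO", 2)` is false because the matrix marks that combination with ×.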

Model versions adapted by this repository
- LLaMa series
- LLaMa2 series

Usage

Path variables

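| Variable | Meaning |
|---|---|
| working_dir | Directory where the acceleration library and the model repository are placed after download |
| llm_path | Path to the model repository. With the prebuilt package, the path is `${working_dir}/MindIE-LLM/`; with code downloaded from gitee, the path is `${working_dir}/MindIE-LLM/examples/atb_models` |
| script_path | Path to the scripts; the working scripts for LLaMa and LLaMa2 are in `${llm_path}/examples/models/llama` |
| weight_path | Path to the model weights |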
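
The relationship between these variables can be sketched in a few lines. The helper name `resolve_paths` and its `from_source` flag are assumptions made for illustration; only the directory layout comes from the table.

```python
from pathlib import PurePosixPath


def resolve_paths(working_dir: str, from_source: bool) -> dict:
    """Derive llm_path and script_path from working_dir (hypothetical helper)."""
    llm_path = PurePosixPath(working_dir) / "MindIE-LLM"
    if from_source:
        # Code downloaded from gitee keeps the model repository under examples/atb_models.
        llm_path = llm_path / "examples" / "atb_models"
    script_path = llm_path / "examples" / "models" / "llama"
    return {"llm_path": llm_path, "script_path": script_path}
```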

Weights

Weight download

Weight conversion

See this README file

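Generating quantized weights

Starting from the original floating-point weights, a quantization tool converts high-bit floating-point numbers into low-bit fixed-point numbers.
Notes:
- Do not use the same folder for model_path and save_directory, so that floating-point and quantized weights do not get mixed up
- For notes and environment requirements on multi-card NPU quantization, see the "NPU multi-card quantization" section of this README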
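
The first note can be enforced with a small pre-flight check before the quantization script is called. This is only a sketch; `check_quant_dirs` is a hypothetical name and is not provided by the repository.

```python
import os


def check_quant_dirs(model_path: str, save_directory: str) -> None:
    """Raise ValueError if both arguments resolve to the same folder."""
    src = os.path.realpath(model_path)
    dst = os.path.realpath(save_directory)
    if src == dst:
        raise ValueError(
            f"model_path and save_directory both point to {src}; "
            "use separate folders for floating-point and quantized weights"
        )
```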

Use the following commands to generate W8A8 Antioutlier quantized weights

W8A8 Antioutlier quantization is recommended for LLaMa2-7B/13B
Run the quantization script

# Set the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cd ${llm_path}
# Generate llama2-7b quantized weights; antioutlier uses the m1 algorithm
python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A8 quantized weights} --w_bit 8 --a_bit 8 --disable_level L0 --device_type cpu --anti_method m1 --act_method 1 --calib_file ${llm_path}/examples/convert/model_slim/boolq.jsonl
# Generate llama2-13b quantized weights; antioutlier uses the m2 algorithm
python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A8 quantized weights} --w_bit 8 --a_bit 8 --disable_level L0 --device_type cpu --anti_method m2 --act_method 1 --calib_file ${llm_path}/examples/convert/model_slim/boolq.jsonl
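
The two commands differ only in `--anti_method`, so they can be generated from one template. The sketch below assumes a hypothetical helper `build_antioutlier_cmd`; every flag is taken from the commands above and nothing else about the script's interface is assumed.

```python
# Hypothetical helper; the flags mirror the two commands above.
ANTI_METHOD = {"llama2-7b": "m1", "llama2-13b": "m2"}


def build_antioutlier_cmd(model: str, model_path: str, save_directory: str, llm_path: str) -> list:
    """Build the argv list for W8A8 Antioutlier quantization of the given model."""
    if model not in ANTI_METHOD:
        raise ValueError(f"no anti_method recorded for {model!r}")
    return [
        "python", "examples/models/llama/convert_quant_weights.py",
        "--model_path", model_path,
        "--save_directory", save_directory,
        "--w_bit", "8", "--a_bit", "8",
        "--disable_level", "L0",
        "--device_type", "cpu",
        "--anti_method", ANTI_METHOD[model],
        "--act_method", "1",
        "--calib_file", f"{llm_path}/examples/convert/model_slim/boolq.jsonl",
    ]
```

The resulting list could be passed to `subprocess.run(cmd, cwd=llm_path, check=True)` from a shell where the CANN environment has already been sourced.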

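Use the following commands to generate W8A8 quantized weights

Multi-card NPU W8A8 quantization is recommended for the large LLaMa2-70B model

Run the quantization script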

# Specify the logical NPU cores available on this machine
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
cd ${llm_path}
python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A8 quantized weights} --w_bit 8 --a_bit 8 --disable_level L5 --device_type npu --calib_file ${llm_path}/examples/convert/model_slim/boolq.jsonl

Use the following commands to generate W8A16 quantized weights

Currently only LLaMa-65B and LLaMa2-70B support W8A16 quantization

# Set the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cd ${llm_path}
python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A16 quantized weights} --w_bit 8 --a_bit 16 --act_method 3 --calib_file ""

Use the following commands to generate sparse-quantized weights

Step 1

# Set the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cd ${llm_path}
python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A8S quantized weights} --w_bit 4 --a_bit 8 --calib_file ${llm_path}/examples/convert/model_slim/teacher_qualification.jsonl --fraction 0.011 --co_sparse True

Use the following command to generate sparse-quantized weights for LLaMa-33B

python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A8S quantized weights; defaults to the current directory} --calib_file ${llm_path}/examples/convert/model_slim/boolq.jsonl --act_method 2 --do_smooth True --use_sigma True --is_lowbit True --co_sparse True --w_bit 4

Step 2: Split and compress the quantized weights

Make sure the compression tool has been built before running

cd /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/msmodelslim/pytorch/weight_compression/compress_graph

bash build.sh /usr/local/Ascend/ascend-toolkit/latest

torchrun --nproc_per_node {TP size} -m examples.convert.model_slim.sparse_compressor --model_path {path to W8A8S quantized weights} --save_directory {path to W8A8SC quantized weights}

{TP size} is the tensor-parallel degree
Note: if the weights were split with TP=4 when they were generated, inference must also run with TP=4

Example

  torchrun --nproc_per_node 2 -m examples.convert.model_slim.sparse_compressor --model_path /data1/weights/model_slim/llama2-7b_w8a8s --save_directory /data1/weights/model_slim/llama2-7b_w8a8sc
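
Because the TP size used for compression must match the TP size used at run time, it can help to build the command and the check from the same value. The sketch below is an illustration under that assumption; `build_compress_cmd` and `check_runtime_tp` are hypothetical names.

```python
def build_compress_cmd(tp: int, model_path: str, save_directory: str) -> list:
    """Build the torchrun argv list for the sparse compressor step."""
    return [
        "torchrun", "--nproc_per_node", str(tp),
        "-m", "examples.convert.model_slim.sparse_compressor",
        "--model_path", model_path,
        "--save_directory", save_directory,
    ]


def check_runtime_tp(compress_tp: int, runtime_tp: int) -> None:
    """Raise ValueError if inference would use a different TP size than compression did."""
    if compress_tp != runtime_tp:
        raise ValueError(
            f"weights were compressed with TP={compress_tp} but inference uses TP={runtime_tp}"
        )
```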

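Use the following commands to generate KV cache quantized weights

Currently only LLaMa-65B W8A8 quantization can be combined with KV cache quantization

Run the quantization script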

# Specify the logical NPU cores available on this machine
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
cd ${llm_path}
python examples/models/llama/convert_quant_weights.py --model_path {path to float weights} --save_directory {path to W8A8 quantized weights} --w_bit 8 --a_bit 8 --disable_level L5 --device_type npu --calib_file ${llm_path}/examples/convert/model_slim/boolq.jsonl --use_kvcache_quant True

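Compared with plain W8A8 quantization, the only addition is the use_kvcache_quant parameter

Adding special tokens to the LLaMa 33B weights

The tokenizer shipped with LLaMa 33B has empty special tokens, so the special_tokens_map.json file in the weight directory must be replaced manually with the following content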

{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 2048,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
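
The replacement can be scripted so that the original file is kept as a backup. This is a minimal sketch, assuming the JSON above has already been loaded into a Python dict; `replace_special_tokens_map` and the `.bak` suffix are choices made for illustration, not repository conventions.

```python
import json
import shutil
from pathlib import Path


def replace_special_tokens_map(weight_dir: str, new_map: dict) -> Path:
    """Back up special_tokens_map.json if it exists, then write new_map in its place."""
    target = Path(weight_dir) / "special_tokens_map.json"
    if target.exists():
        # Keep the original next to the new file so the change can be undone.
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    target.write_text(json.dumps(new_map, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return target
```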

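Basic environment variables

See this README file

Inference

Chat test

Running Flash Attention FP16

The remaining LLaMa models are run as follows
- Run the launch script
  - Execute the following command in the ${llm_path} directory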
```
bash ${script_path}/run_fa.sh ${weight_path}
```
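- Environment variables
  - export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    - Specifies the logical NPU cores available on this machine; separate multiple cores with commas
    - To look up core IDs, see the "Environment variables for launch scripts" section of this README file
    - On 300I DUO cards, specify at least two visible cores to use one card with two chips, and at least four visible cores to use two cards with four chips
    - For the number of cores each model supports, see "Feature matrix"
  - export MASTER_PORT=20030
    - Sets the inter-card communication port
    - Port 20030 is used by default
    - This avoids communication conflicts when several multi-card models run on the same machine at the same time
    - The suggested port range is 20000-20050
  - The following environment variables relate to performance and memory optimization and normally do not need to be changed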
```
export ATB_LAYER_INTERNAL_TENSOR_REUSE=1
export INF_NAN_MODE_ENABLE=0
export LCCL_ENABLE_FALLBACK=1
export INT8_FORMAT_NZ_ENABLE=1
```
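
When the launch script is started from Python rather than an interactive shell, the two variables above can be assembled into an environment mapping first. The helper below is a hypothetical sketch; it only warns when the port is outside the suggested range, since the range is a recommendation.

```python
import os
import warnings


def build_launch_env(devices, master_port=20030, base_env=None):
    """Return a copy of the environment with the two launch variables set."""
    if not 20000 <= master_port <= 20050:
        warnings.warn(f"MASTER_PORT {master_port} is outside the suggested 20000-20050 range")
    env = dict(os.environ if base_env is None else base_env)
    # Multiple cores are joined with commas, as in the export line above.
    env["ASCEND_RT_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    env["MASTER_PORT"] = str(master_port)
    return env
```

The result could then be passed as the `env` argument of `subprocess.run` when invoking `run_fa.sh`, with `cwd` set to the ${llm_path} directory.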

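Running Flash Attention BF16

Not supported yet

Running Flash Attention W8A8

Run the launch script
- Same launch procedure as "Running Flash Attention FP16"
- ${weight_path} is the path to the W8A8 quantized weights
Environment variables
- See the environment variables under "Running Flash Attention FP16"
Compared with FP16, running quantized weights requires setting the quantize field in ${weight_path}/config.json of the W8A8 quantized weights to w8a8
- If config.json has no such field, add it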
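Running Flash Attention W8A16

Run the launch script
- Same launch procedure as "Running Flash Attention FP16"
- ${weight_path} is the path to the W8A16 quantized weights
Environment variables
- See the environment variables under "Running Flash Attention FP16"

Running Paged Attention FP16

Run the launch script
- Execute the following command in the ${llm_path} directory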
```
bash ${script_path}/run_pa.sh ${weight_path}
```
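Environment variables
- export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - Specifies the logical NPU cores available on this machine; separate multiple cores with commas
  - To look up core IDs, see the "Environment variables for launch scripts" section of this README file
  - On 300I DUO cards, specify at least two visible cores to use one card with two chips, and at least four visible cores to use two cards with four chips
  - For the number of cores each model supports, see "Feature matrix"
- export MASTER_PORT=20030
  - Sets the inter-card communication port
  - Port 20030 is used by default
  - This avoids communication conflicts when several multi-card models run on the same machine at the same time
  - The suggested port range is 20000-20050
- The following environment variables relate to performance and memory optimization and normally do not need to be changed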
```
export ATB_LAYER_INTERNAL_TENSOR_REUSE=1
export INF_NAN_MODE_ENABLE=0
export LCCL_ENABLE_FALLBACK=1
export INT8_FORMAT_NZ_ENABLE=1
```

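Running Paged Attention BF16

Run the launch script
- Same launch procedure as "Running Paged Attention FP16"
Environment variables
- See the environment variables under "Running Paged Attention FP16"
Compared with FP16, running BF16 requires setting the torch_dtype field in ${weight_path}/config.json to bfloat16
The 300I DUO card does not support BF16 yet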

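Running Paged Attention W8A8

Run the launch script
- Same launch procedure as "Running Paged Attention FP16"
- ${weight_path} is the path to the W8A8 quantized weights
Environment variables
- See the environment variables under "Running Paged Attention FP16"
Compared with FP16, running quantized weights requires setting the quantize field in ${weight_path}/config.json of the W8A8 quantized weights to w8a8
- If config.json has no such field, add it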

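Running Paged Attention W8A16

Run the launch script
- Same launch procedure as "Running Paged Attention FP16"
- ${weight_path} is the path to the W8A16 quantized weights
Environment variables
- See the environment variables under "Running Paged Attention FP16"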

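Running KV cache quantization

Run the launch script
- Same launch procedure as "Running Paged Attention FP16"
- ${weight_path} is the path to the KV cache quantized weights
Environment variables
- See the environment variables under "Running Paged Attention FP16"
Compared with FP16, running quantized weights requires setting the kv_quant field in ${weight_path}/config.json of the KV cache quantized weights to C8
- If config.json has no such field, add it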

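Running sparse quantization

Run the launch script
- Same launch procedure as "Running Paged Attention FP16"
- ${weight_path} is the path to the compressed sparse-quantized (W8A8SC) weights
Environment variables
- See the environment variables under "Running Paged Attention FP16"
Compared with FP16, running quantized weights requires setting the quantize field in ${weight_path}/config.json of the sparse-quantized weights to w8a8sc
- If config.json has no such field, add it
Note: the compression algorithm is tightly coupled to the hardware; currently only the 300I DUO card supports sparse quantization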

Accuracy test

See this README file

Example

cd ${llm_path}/tests/modeltest
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
bash run.sh pa_fp16 full_BoolQ 1 llama ${llama2_7b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 llama ${llama2_13b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 llama ${llama2_70b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 llama ${llama_7b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 llama ${llama_13b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 llama ${llama_65b_weight_path} 8

When running quantized weights or BF16, check that the quantize and torch_dtype fields in ${weight_path}/config.json match the weights; see this README file

Performance test
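
The field values named in this README can be collected into one table and compared against a loaded config before a test run. This is a sketch only: the mode labels and the function `config_mismatches` are hypothetical, and W8A16 is left out because this README does not state a config value for it.

```python
# Expected config.json fields per run mode, as stated in this README.
# The mode labels on the left are hypothetical names chosen for this sketch.
EXPECTED_FIELDS = {
    "bf16": {"torch_dtype": "bfloat16"},
    "w8a8": {"quantize": "w8a8"},
    "w8a8sc": {"quantize": "w8a8sc"},
    "kv_cache": {"kv_quant": "C8"},
}


def config_mismatches(config: dict, mode: str) -> dict:
    """Return {field: (expected, actual)} for every field that does not match the mode."""
    return {
        key: (expected, config.get(key))
        for key, expected in EXPECTED_FIELDS[mode].items()
        if config.get(key) != expected
    }
```

An empty result means the config agrees with the chosen mode; any entry shows the expected value next to the value found (or `None` if the field is missing).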