InfiniTensor/examples/distributed/README.md

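# Distributed Scripts

## Running on NVIDIA Platforms

#### 1. Run the PyTorch model to generate inputs and reference outputs, optionally exporting ONNX

Use `--export_onnx` to set the directory for the exported ONNX model; the default is the current path `./`. If the flag is omitted, the script only runs the model and generates the input and output files.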

```bash
python run_pytorch.py --model gpt2 --batch_size 1 --length 1 --export_onnx ./
```

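This writes the input and output files `test_inputs.npy` and `test_results.npy` to the current directory. Only a single input and a single output are currently supported.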
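To sanity-check the generated files before launching the distributed run, it can help to look at their dtype and shape. The following is a minimal sketch, assuming the files use the standard NPY layout (as written by `numpy.save`); `describe_npy` is a hypothetical helper name, not part of this repository. It reads only the file header with the standard library; with NumPy installed, `numpy.load` is the simpler route.

```python
import ast
import struct


def describe_npy(path):
    """Return (dtype string, shape) from an NPY header without loading the data."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not an NPY file")
        major, _minor = f.read(2)
        # Format 1.x stores the header length in 2 bytes, 2.x and 3.x in 4 bytes.
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        # The header is a Python dict literal with the keys
        # 'descr', 'fortran_order' and 'shape'.
        header = ast.literal_eval(f.read(header_len).decode("latin1").strip())
    return header["descr"], header["shape"]


# Usage, once step 1 has produced the files:
# for name in ("test_inputs.npy", "test_results.npy"):
#     print(name, describe_npy(name))
```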

#### 2. Run the InfiniTensor distributed script

```bash
python cuda_launch.py --model "/XXX/XXX.onnx" --nproc_per_node 4
```

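## Running on Cambricon Platforms

**The scripts `run_pytorch.py` and `cuda_launch.py` described above have been adapted for the Cambricon platform; see `run_pytorch_mlu.py` and `bang_launch.py`.**

#### 1. Run the PyTorch model to generate inputs and reference outputs, optionally exporting ONNX

Use `--export_onnx` to set the directory for the exported ONNX model; the default is the current path `./`. If the flag is omitted, the script only runs the model and generates the input and output files.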

```bash
python run_pytorch_mlu.py --model gpt2 --batch_size 1 --length 1 --export_onnx ./
```

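This writes the input and output files `test_inputs.npy` and `test_results.npy` to the current directory. Only a single input and a single output are currently supported.

#### 2. Run the InfiniTensor distributed script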

```bash
python bang_launch.py --model "/XXX/XXX.onnx" --nproc_per_node 4
```
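
Because the two platforms differ only in script names, the two steps can be driven from one small wrapper. The sketch below is illustrative, not part of the repository: `build_commands`, `run_pipeline`, and the platform keys `"nvidia"` and `"cambricon"` are hypothetical names, and only the flags shown in this README are used. The ONNX path is left for the caller to supply, since the exported file name depends on the model.

```python
import subprocess
import sys

# Script names come from this README; the dictionary keys are labels
# chosen for this sketch only.
SCRIPTS = {
    "nvidia": ("run_pytorch.py", "cuda_launch.py"),
    "cambricon": ("run_pytorch_mlu.py", "bang_launch.py"),
}


def build_commands(platform, onnx_path, model="gpt2", batch_size=1,
                   length=1, export_dir="./", nproc_per_node=4):
    """Return the argv lists for step 1 and step 2 on the given platform."""
    if platform not in SCRIPTS:
        raise ValueError(f"unknown platform: {platform!r}")
    pytorch_script, launch_script = SCRIPTS[platform]
    step1 = [
        sys.executable, pytorch_script,
        "--model", model,
        "--batch_size", str(batch_size),
        "--length", str(length),
        "--export_onnx", export_dir,
    ]
    step2 = [
        sys.executable, launch_script,
        "--model", onnx_path,
        "--nproc_per_node", str(nproc_per_node),
    ]
    return step1, step2


def run_pipeline(platform, onnx_path, **kwargs):
    """Run step 1, then step 2, stopping at the first failing command."""
    for cmd in build_commands(platform, onnx_path, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("cambricon", "/XXX/XXX.onnx")` from this directory would issue the same two commands as the Cambricon section, with the placeholder path replaced by the actual exported model.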