Compare commits

...

66 Commits

Author SHA1 Message Date
Hardy eafbff6cf9
Support kunlun new toolkit (#224)
Co-authored-by: wanghailu <wanghailu0717@163.com>
2024-04-03 09:56:52 +08:00
PanZezhong1725 7f6aec6c17
Optimize distributed inference for BERT and GPT-2 models (#221)
* fix(dist): improve the distributed scripts to print only the absolute error

* feat(dist): add a PyTorch run script that can export ONNX

* feat(front): add a graph optimization for Where operators whose Y is -inf

* feat(kernel): special-case Pow and Div operators whose b is a constant

* fix(front): remove the frontend's dependency on global output shape info; drop unnecessary shape inference from the distributed scripts

* feat(kernel): specialize the expand operation for a row-vector bias in matmul

* fix(kernel): remove unnecessary synchronization in div/pow-const

* Update expand.cu

* fix: fix comments

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Derui Yang <ydrml@hotmail.com>
2024-04-01 14:04:28 +08:00
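The constant-`b` specialization mentioned in the PR above can be illustrated with a short Python sketch. This is purely illustrative: the function name and logic are assumptions for exposition, not the PR's actual CUDA kernels. When the exponent is a fixed non-negative integer, a general `pow` call can be replaced by exponentiation by squaring:

```python
def pow_const(x: float, b: int) -> float:
    """Power with a constant non-negative integer exponent.

    Hypothetical sketch of the special-case idea: since `b` is known
    at kernel-selection time, a general pow() is replaced by repeated
    multiplication (exponentiation by squaring).
    """
    result = 1.0
    base = x
    while b > 0:
        if b & 1:          # multiply in the current bit's contribution
            result *= base
        base *= base       # square the base for the next bit
        b >>= 1
    return result

assert pow_const(3.0, 2) == 9.0
```

The same reasoning applies to Div with a constant divisor, which can be rewritten as multiplication by a precomputed reciprocal.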
xiaonans a98573990b
Accelerate llama (#219)
* [feature] add cudagraph support

* modify code to pass the cuda_all_reduce test

* modify rope op

* support rmsnorm

* add fp16 support to silu cuda op

* fix bugs in rmsnorm op

* uncomment simplify in onnx.py

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2024-04-01 08:46:05 +08:00
Chenjie Duan 54a35772fb
feature: add parameter to config matmul compute type (#218)
* feature: add parameter to config matmul compute type

* fix format
2024-03-26 09:00:45 +08:00
zhangyue 00e6cc2587
XCCL support (#171)
* add reduce_mean and gather

* fix format

* add kunlun allreduce and cmakefile

* add kunlun allreduce and cmakefile

* delete cmake opt

* fix format

* fix makefile

* add DIST option in Makefile

* add xpu allgather

* delete xpu_wait()

* add xpu allgather

* delete specific compiler

* fix format

* fix gather

* add broadcast

* fix format

* fix

* fix xpu, add where operation, fix element-wise operation

* fix softmax

* fix softmax

* log internal input and output

* fix kunlun gather bugs

* update CMakeList.txt and Makefile

* fix some kunlun kernels

* fix Makefile

* fix Makefile

* set cmake version 3.12

* format

* fix where, gather and support gpt2

* "fix format"

* fix format

* copy onnx.py from master

* use KUNLUN_HOME instead of absolute path

* fix torchvision models

* support torchvision model-zoo

* fix format

* format fix, CMakeList fix

* fix review

* fix vecToString return value

* fix format

* delete empty file

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2024-02-29 11:48:35 +08:00
baominghelly b51ccae3b2
fix broken link in docs (#216)
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2024-02-21 14:03:20 +08:00
xiaonans 1c08ba200c
[feature] add cudagraph support (#215)
* [feature] add cudagraph support

* modify code to pass the cuda_all_reduce test
2024-02-21 14:00:25 +08:00
xiaonans 900d8e58e3
Rope and silu (#214)
Add support for the SiLU and rotary embedding operators.
2024-02-04 11:05:27 +08:00
xiaonans b0876a13ce
Merge branch 'master' into rope_and_silu 2024-02-04 10:57:36 +08:00
xiaonans ae9f61de5a add comment for rope operator 2024-02-04 10:57:01 +08:00
xiaonans 9a3c0f11f6 add test for rotary embedding cuda kernel 2024-02-04 10:24:20 +08:00
zhangyunze 67b2bcb7d5
fix mlu some kernel registration & gather op (#210)
* fix: fix bang build/kernel registration | test_onnx

* delete assert float

* fix gather

* fix CMakeLists and Reshape

* fix cncl ops

* add hardsigmoid/hardswish

* fix

* add invalid datatype exception

* fix gather

* fix gather indices type

* fix gather/prelu/hardsigmoid on mlu

* fix format

* fix

---------

Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Zhang Bolun <Chamberlain0w0@gmail.com>
2024-02-01 15:02:02 +08:00
xiaonans 956ce37458 add unittest of silu kernel 2024-01-30 10:40:13 +08:00
zhangyunze 4813204a36
feat: add reshape/identity/squeeze/flatten/unsqueeze op cpu kernel (#213) 2024-01-30 10:29:59 +08:00
xiaonans 030e5ca9c1 Merge branch 'master' of github.com:InfiniTensor/InfiniTensor into rope_and_silu 2024-01-26 10:16:18 +08:00
xiaonans e8d111ef5d add rope and silu support 2024-01-26 10:01:27 +08:00
xiaonans d1a90ba3e2
[feature] support kvcache with static graph (#209)
* [feature] support kvcache with static graph

* use workspace to optimize kvcache attention

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2024-01-25 14:20:43 +08:00
xiaonans afed5d3c3d use workspace to optimize kvcache attention 2024-01-25 10:33:01 +08:00
Haojie Wang a5062f3f89
Update README.md 2024-01-24 22:16:48 +08:00
Hardy 09b2ecf98a
support more data type on mlu (#211)
* support more data type

* clang format

* fix little bug

* fix cncl datatype

* fix format

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Zhang Bolun <Chamberlain0w0@gmail.com>
2024-01-24 13:33:33 +08:00
xiaonans 6a1bfd6c45 [feature] support kvcache with static graph 2024-01-17 11:38:44 +08:00
Chenjie Duan 51086d2b8d
Modify kernel registration & support fp16 (#205)
* - Remove dataType from the kernel registration.

* - support fp16 for conv

* - cpu kernel: adapt the new registration mechanism

* modified all register kernel

* add where fp16

* add layernorm fp16

* add split_concat fp16

* - element_wise support fp16

* feat: support transpose fp16

* feat: support sliceOp fp16

* - unary support fp16

* - feat: support reduceOp fp16

* feat: support matmulOp/expandOp fp16

* feat: support powOp int8

* add cuda cast & support half-precision for gather

* style: fix style

* feat:support int8 for gather

* style:fix style

* modified test_cuda_conv_transposed

* fix: fix dist code to support fp16

* fix(graph.cc): fix topo_sort

* fix: fix recv and send kernel registration

* feat: add field tensors for stub

* refactor(frontend): sort before building the graph

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: provide a tensor-to-node mapping for intermediate results

* fix (slice): add guard for area out of range

* fix: fix matmul fp16

* fix: fix re-dataMalloc for weight tensor and use of naive allocator

* feat: add dataType filter for cuda kernel

* feat: bang kernel adapt the new registration mechanism

* fix: fix some error on mlu

* feat: intelcpu kernel adapt the new registration mechanism

* feat: modify kernel registration on kunlun

* fix intelcpu compiler bug

* feat: bang reshape support all dataType

* fix: fix bang reduce

* fix(all_reduce.cc): fix as reviewer suggested

* fix: fix style and restore unary test codes

---------

Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: xgqdut2016 <140036308+xgqdut2016@users.noreply.github.com>
Co-authored-by: zhangyunze <z13785159769@163.com>
Co-authored-by: OdinaryWord <sx-hz@163.com>
Co-authored-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
2024-01-15 11:02:13 +08:00
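The registration change in the PR above (removing dataType from kernel registration and adding a dataType filter at dispatch) can be sketched as follows. All names here are hypothetical illustrations of the mechanism, not the repository's actual C++ API:

```python
# Hypothetical registry keyed by (device, op) only; the data type is
# checked at dispatch time via a filter instead of being part of the key.
KERNELS = {}

def register_kernel(device, op, supported_dtypes):
    def wrap(fn):
        KERNELS[(device, op)] = (fn, set(supported_dtypes))
        return fn
    return wrap

def dispatch(device, op, dtype, *args):
    fn, dtypes = KERNELS[(device, op)]
    if dtype not in dtypes:  # the "dataType filter"
        raise TypeError(f"{op} on {device} does not support {dtype}")
    return fn(*args)

@register_kernel("cuda", "Add", {"fp32", "fp16"})
def add_kernel(a, b):
    return a + b

assert dispatch("cuda", "Add", "fp16", 1, 2) == 3
```

One kernel entry can then serve several data types (fp16, int8, ...) without multiplying registry entries, which matches the spirit of the commit messages above.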
zhangyunze 58993d4339
Remove the frontend's dependency on ONNX infershape (#206)
* feat: SqueezeOp lift the dependency of onnx infershape.

* feat: UnsqueezeOp lift the dependency of onnx infershape.

* feat: lift the dependency of onnx infershape

* fix: fix Makefile off nccl
2024-01-12 14:54:27 +08:00
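Lifting the dependency on onnx's `infer_shapes`, as in the PR above, means computing output shapes locally. A minimal sketch of local shape inference for Squeeze/Unsqueeze (illustrative only; not the repository's code):

```python
def infer_squeeze(shape, axes):
    """Local shape inference for Squeeze: drop the listed unit axes."""
    axes = {a % len(shape) for a in axes}  # normalize negative axes
    return [d for i, d in enumerate(shape) if i not in axes]

def infer_unsqueeze(shape, axes):
    """Local shape inference for Unsqueeze: insert size-1 axes."""
    rank = len(shape) + len(axes)
    out = list(shape)
    for a in sorted(a % rank for a in axes):
        out.insert(a, 1)
    return out

assert infer_squeeze([1, 3, 1, 5], [0, 2]) == [3, 5]
assert infer_unsqueeze([3, 5], [0, 3]) == [1, 3, 5, 1]
```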
PanZezhong1725 46e61a5bd4
Fix out-of-bounds memory access in Slice (#204)
fix (slice): add guard for area out of range

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2024-01-05 09:19:50 +08:00
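The guard added by the fix above amounts to clamping slice bounds into the valid range before computing the area. A sketch of that idea with ONNX-style clamping semantics (illustrative; negative indices omitted for brevity, and this is not the repository's implementation):

```python
def clamp_slice(starts, ends, dims):
    """Clamp per-axis starts/ends into [0, dim] so the sliced area can
    never run past the tensor; an end before its start yields an empty
    range instead of reading out of bounds."""
    out = []
    for s, e, d in zip(starts, ends, dims):
        s = min(max(s, 0), d)
        e = min(max(e, 0), d)
        out.append((s, max(e, s)))
    return out

# A huge `end` (e.g. INT_MAX, as ONNX exporters often emit) is clamped:
assert clamp_slice([0], [2**31 - 1], [10]) == [(0, 10)]
```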
zhangyunze b15c4979fa
fix Issue-189 question 1-15 (#195)
* fix: fix nativecpu elementwise only support 4d tensor

* fix format

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2024-01-05 08:40:18 +08:00
Hardy 42032356fb
Bang cncl (#163)
* MLU CNCL base

* add FindCNCL.cmake, not find -lcncl

* bangPrintFloat not find

* docker: make successful, test error

* delete net file and onnxtest.py

* init

* fix cncl

* format

* fix

* format

* fix cncl

* run dist gpt2 on mlu

* format

* fix import error on mlu docker

* run llama single card

* run distributed llama2

* add test for slice/reduce on mlu

* fix cncl related test

* fix format

* format

* delete comments

* change GPU to MLU

* modify launch script

* fix name

* fix format

* fix gather

* format python script

---------

Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
2024-01-03 13:28:03 +08:00
Chenjie Duan 83f1de93d0
add frontend resize kernel (#194)
* - add frontend resize kernel

* - fix resize test

* - fix bug
- add onnx test for resize

* fix: modify codes as reviewer suggested

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-12-29 13:32:56 +08:00
zhangyunze 3967b437c8
fix Issue 187 split infershape wrong (#197)
* fix: fix splitOp to support unequal portions

* fix: fix as review comment

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-12-28 21:39:24 +08:00
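The Split infershape fix above concerns unequal portions: when the split dimension is not evenly divisible, chunks are ceil-sized and the last one is smaller. A hedged sketch of that sizing rule (illustrative, not the repository's code):

```python
def split_sizes(dim, num_outputs):
    """Portion sizes when splitting `dim` into `num_outputs` chunks:
    each chunk is ceil(dim / num_outputs) long, and the final chunk
    takes whatever remains (so it may be smaller)."""
    chunk = -(-dim // num_outputs)  # ceil division
    sizes = []
    remaining = dim
    for _ in range(num_outputs):
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes

assert split_sizes(10, 3) == [4, 4, 2]   # unequal portions
assert split_sizes(9, 3) == [3, 3, 3]    # even split unchanged
```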
Chenjie Duan 6e7bd6ca0c
fix(perf.py): change NNmodel commit to fix perf.py (#203) 2023-12-28 21:31:39 +08:00
Hardy 5ac0ab442f
Fix bang (#198)
* fix bang batchnorm

* fix pooling test bang

* add test batchnorm

* HIGH PRECISION ACTIVATION

* fix pooling

* fix matmul

* fix test

* add layernorm

* fix softmax

* fix

* better code

* fix

* fix workflow

* fix workflow

* fix

* fix

* fix matmul

* add LRN

* fix lrn

* fix lrn

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Baoming Li <1508269885@qq.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-12-28 13:44:10 +08:00
Chenjie Duan 3f34372012
- modify error info when kernel not found (#191)
* - modify error info when kernel not found

* - modify code as reviewer suggested

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-12-27 09:43:57 +08:00
learner2468 9a9587556c
Add examples: inference of Paddle models (#192)
* Add paddle model and infer with InfiniTensor

* Remove unused import

---------

Co-authored-by: kilinchange <44265800+kilinchange@users.noreply.github.com>

[Hackathon No.106] Add paddle model and infer with InfiniTensor
2023-12-14 19:42:43 +08:00
xgqdut2016 a3929c25f8
Add send and recv operators based on NCCL (#182)
* baseline sendrecv, bug

* success sendrecv

* get rank from comm

* set output shape

* successful:set output shape equal to input shape

* shape as attribute

* success:shape as attribute

* success send recv, output 0

* add onnx test

* split send and recv

* success split send and recv

* test-onnx bug

* success test-onnx

* modified onnx.py

* solve review
2023-12-14 16:38:03 +08:00
Derui Yang c143eebdf7
Store models without depending on onnx models (#196)
Signed-off-by: YdrMaster <ydrml@hotmail.com>
2023-12-11 10:44:06 +08:00
Hardy 67974aee8a
Fix https://github.com/InfiniTensor/InfiniTensor/pull/160 (#185)
Co-authored-by: wanghailu <wanghailu0717@163.com>
2023-11-27 14:18:12 +08:00
Hardy 3ead20a23a
Fix workspace & bang conv (#183)
* fix bang workspace

* fix convbpdata

* fix code

* add code

* fix

* fix

* fix conv

* fix test conv

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 15:16:25 +08:00
xgqdut2016 a7293c12ba
Add layer normalization (#181)
* - add layernorm kernel

* success:add layernorm kernel and test

* fix: remove unusable comments

* fix: modify code as reviewer suggested

* debug,modified .cu and test

* optional bias support

* overloading function

* fix bug after merging; remove time constrain in conv test

---------

Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 15:15:14 +08:00
PanZezhong1725 6ece3f4a77
Add ReduceSum op and kernel (#160)
* Add reduceSum op and kernel

* fix merge and format

* Reduce: reuse cat macro, add doc string

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 09:29:58 +08:00
xgqdut2016 595a9906d2
add infer index function (#175)
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 09:24:25 +08:00
zhangyunze 331f7ab2b8
support Dynamic tensor infer shape and fix memory pool (#176)
* feat: support dynamic tensor part1

* feat: support dynamic-tensor part2

* feat: support dynamic tensor part 3

* fix: fix some ..

* - add kvcache example

* feat: support concat to identity kernel

* add a simple memory pool for the allocator

* fix: rebase to master

* fix bug after merging

* - remove outdated script

* fix: fix as review

---------

Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-23 13:11:50 +08:00
xiaonans 965df4e294
[feature] add fused attention_kvcache operator support (#179)
* [feature] add fused attention_kvcache operator support

* add test to attention_kvcache op

* Add space line at EOF

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-14 23:44:22 +08:00
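The fused attention_kvcache operator above combines cache update and attention in one step. A minimal Python sketch of the underlying idea, with all names and the single-head, single-token structure being illustrative assumptions rather than the operator's real signature:

```python
import math

def attention_kvcache(cache_k, cache_v, k, v, q):
    """Append this step's key/value to the cache, then attend the query
    over all cached positions (scaled dot-product attention)."""
    cache_k.append(k)
    cache_v.append(v)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
              for key in cache_k]
    m = max(scores)                       # stable softmax over positions
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(v)
    return [sum(w * val[i] for w, val in zip(weights, cache_v))
            for i in range(dim)]

ck, cv = [], []
out = attention_kvcache(ck, cv, [1.0, 0.0], [2.0, 3.0], [1.0, 0.0])
assert out == [2.0, 3.0]  # one cached position -> weight 1.0
```

Fusing these steps avoids materializing the concatenated key/value tensors each decoding step, which is the motivation the commit titles suggest.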
Hardy f22fa2766e
add reduce_mean and gather on bang (#167)
* add code

* fix reduce_mean

* add softmax on BANG

* fix gather

* fix broadcast in the element-wise kernel when dim size is zero

* add where kernel and fix softmax kernel

* fix convbpdata bug

* fix format

---------

Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 18:02:44 +08:00
Hardy 50862df765
[Kunlun & CUDA & BANG] add depth2space operator (#178)
* add depth2space operator

* fix format

* add depth2space on cambricon bang

* add depth2space on gpu

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 17:58:26 +08:00
Hardy 1ea450882b
add reduce_mean and gather on kunlun (#169)
* add reduce_mean and gather

* fix format

* fix gather

* fix

* fix xpu, add where operation, fix element-wise operation

* fix format

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 17:52:09 +08:00
xgqdut2016 d3e7543291
Cuda softmax (#129)
* "add softmax.cu,.cc,.h"

* Modify cuda softmax

* "modified the introduction of softmax.cu"

* "add format of cuda_softmax.h"

* "modified where.cc(.cu,.h) and softmax.cu"

* "modified format"

* Fix cpu softmax kernel

* "modified the // introduction of softmax.cu"

* "modified softmax.cu and use 1D block"

* "modified softmax.cu,format, and use 1D block"

* "introduce share mem to speed softmax"

* "reduce the input of function"

* modified the format

* remodify 2D block softmax

* remodify 1D block softmax

* modified the share memory

* add warp reduce

* conflict solve two

* remove extra space line

* solve comment

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
2023-11-06 08:56:23 +08:00
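The softmax kernel evolution above (shared memory, 1D blocks, warp reduce) is all in service of two reductions: a max and a sum. The numerics it must preserve can be stated in a plain Python sketch (not the CUDA kernel itself):

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before
    exponentiating, mirroring the max-then-sum reductions a CUDA
    kernel performs with shared memory / warp shuffles."""
    m = max(xs)                       # reduction 1: max
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)                     # reduction 2: sum
    return [e / z for e in exps]

out = softmax([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-12
```

Without the max subtraction, large inputs overflow `exp`; the GPU versions differ only in how the two reductions are parallelized.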
Derui Yang 1a6fccccbe
test: support building the einnet unit tests, though not all of them pass (#174)
* test: support building the einnet unit tests, though not all of them pass

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* Fix: locating resource files and skip codegen

- Change the path parameters in `matchExprResult` and `checkExprLogSame` to paths relative to the project home
- Skip NNetMemboundOp tests as they require codegen

---------

Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2023-11-03 13:21:49 +08:00
xgqdut2016 ec3adf6fa7
support 8D tensor, add test example (#170)
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-31 10:47:36 +08:00
Bolun Zhang 23b825efc4
Xpu task4 support: add softmax (#172)
* add softmax on kunlun

* format

---------

Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-30 16:01:05 +08:00
constroy Li feccd4f318
fix tensor parallel for llama (#159)
* fix Slice

* change default rounds of timeit to 10 to reduce time

* fix slice with large ends

* Reshape support Int64

* support position_ids as input

* skip last MatMul in Llama

* skip infer_shapes to parse large model

* update launch.py

* fix split_concat_kernel

* print more message in launch.py

* Reshape supports both Int32 and Int64

* try infer_shapes and warn about failure

* fix format

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-30 15:04:16 +08:00
Haojie Wang 7f5188bedd
remove dimension limit of elementwise operators on xpu (#168) 2023-10-25 14:38:47 +08:00
baominghelly 07ef587c65
Change onnx-simplifier to onnxsim to resolve build issue on xpu (#164) 2023-10-21 02:58:32 +08:00
Derui Yang d0f9792613
Fix: add building option for NNet (#162)
Signed-off-by: YdrMaster <ydrml@hotmail.com>
2023-10-16 19:53:28 +08:00
Hardy 1184fa131f
Xpu (#82)
* support kunlun xpu and add an operator named Add

* add sub, mul, div, pow, maximum, minimum

* add code

* add xpu code

* add code

* add matmul

* add transpose

* add unary operator

* add unary operator

* add some operator

* add code

* support run resnet18 on xpu

* add code

* add max pool2d

* fix xpu code, let it can run.

* Add XPU operators (#120)

* add floordiv for xpu

* add batchnorm for xpu

* add more cast types for xpu

* add conv_trans for xpu

* add pad for xpu

* add logical ops for xpu

* fix format for xpu src and include

* fix format for xpu test

* fix format for xpu src

---------

Co-authored-by: Bolun <bolunz@u.nus.edu>

* Xpu abs (#121)

* add: unary kernel for xpu

* formatting

* format

* format

* format

* fix: pointer jump

* fix optype comments

* fix bug introduced while resolving conflict

* change cmake option for kunlunxin xpu from 'xpu' to 'kunlun'; fix bug after merging distributed infrastructure

* Add doc support for xpu (#141)

* fix

* fix

* fix pooling test

* format

* format

* fix

* fix

* set cmake version requirement

* fix cmakelists

* rename xpu to kunlun

* fix

* fix format

* fix format

* fix format

* fix change name to kunlun

* format

* fix format

* clang format

* fix format

---------

Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: zhangyue207 <138768300+zhangyue207@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: baominghelly <41820386+baominghelly@users.noreply.github.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
2023-10-16 10:57:08 +08:00
Haojie Wang 8e4d88fb9f
add transpose, concat and split for native cpu (#158) 2023-10-12 10:14:28 +08:00
PanZezhong1725 36ae7b7fb6
Add GatherElements op and cuda kernel (#149)
* Add GatherElements op and cuda kernel

* fix format

* remove print

* remove unused var

* fix spacing

* fix format

---------

Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-12 09:18:12 +08:00
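GatherElements, added above, produces an output with the shape of `indices`, picking elements of `data` along one axis. A 2-D Python sketch of the ONNX semantics (illustrative, not the CUDA kernel):

```python
def gather_elements(data, indices, axis=0):
    """2-D GatherElements: out[i][j] = data[indices[i][j]][j] for axis=0,
    or data[i][indices[i][j]] for axis=1."""
    out = []
    for i, row in enumerate(indices):
        out_row = []
        for j, idx in enumerate(row):
            if axis == 0:
                out_row.append(data[idx][j])
            else:
                out_row.append(data[i][idx])
        out.append(out_row)
    return out

data = [[1, 2], [3, 4]]
assert gather_elements(data, [[0, 0], [1, 0]], axis=1) == [[1, 1], [4, 3]]
```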
PanZezhong1725 ed3034f878
Add HardSigmoid and HardSwish (#156)
* Add HardSigmoid and HardSwish

* fix format
2023-10-10 22:41:06 +08:00
kilinchange 1151101fb9
add naive allocator for debugging (#140)
* add naive allocator only for debugging

* merge redundant api

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-10 16:42:23 +08:00
Haojie Wang 90b9a80f72
add onnx simplify (#153)
* add onnx simplify

* fix test bug

* update ci policy

* fix onnx simplify bug

* update ci workflow
2023-10-10 15:45:27 +08:00
ChengXiang Qi 7f16fa353e
[Hackathon No.108] Add Gelu operator, ffi, kernel for cpu and gpu. (#148)
feat: Add Gelu kernel, operator, ffi.
2023-10-10 15:21:13 +08:00
PanZezhong1725 7600fe688c
Add Neg operator and kernel (#152)
* Add Neg operator and kernel

* handle neg in to_onnx

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-10 10:54:56 +08:00
Haojie Wang 7a9fcd93b2
Pooling ceil mode (#155)
* add ceil mode for pooling

* do not print debug info for allocator by default

* fix test bugs after introducing pooling ceil mode

* fix onnx import bug
2023-10-09 20:51:39 +08:00
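The ceil mode added above changes how a pooling output length is rounded. The standard formula (a sketch of the convention, not the repository's code) is floor or ceil of `(in + 2*pad - kernel) / stride`, plus one:

```python
import math

def pool_out_size(in_size, kernel, stride, pad, ceil_mode=False):
    """Pooling output length; ceil_mode lets a final partial window
    contribute an extra output element."""
    x = (in_size + 2 * pad - kernel) / stride
    return (math.ceil(x) if ceil_mode else math.floor(x)) + 1

assert pool_out_size(7, 2, 2, 0, ceil_mode=False) == 3
assert pool_out_size(7, 2, 2, 0, ceil_mode=True) == 4
```

The difference only appears when the window does not tile the input evenly, which is exactly where the follow-up test fixes in the PR were needed.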
PanZezhong1725 785853b0a3
Add erf kernel for cpu and gpu (#147)
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
2023-10-09 09:36:55 +08:00
Haojie Wang c0ff584e04
add constant op; fix concat bug (#151) 2023-10-08 21:42:41 +08:00
Haojie Wang f25bcca076
add python examples (#143)
* add python examples

* use copy*_numpy instead of copy*_float
2023-09-28 10:40:45 +08:00
kilinchange 877db21021
Fix support kvcache (#142)
* - fix onnx.py

* - fix shard_concat
2023-09-27 11:08:44 +08:00
PanZezhong1725 62be816f53
Fix incorrect split/concat results when dim=0 (#138)
Fix split_concat kernel not supporting dim=0

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-25 10:25:54 +08:00
368 changed files with 16879 additions and 3033 deletions

@@ -1,12 +1,11 @@
 name: Build and test cpu
 on:
   push:
-    branch: 'master'
     paths-ignore:
       - '**.md'
       - 'LICENSE'
   pull_request:
-    paths-ignore:
+    paths:
       - '**.md'
       - 'LICENSE'
@@ -15,10 +14,10 @@ env:
   protobuf-version: "3.21.12"
   python-version: "3.10"
-  resnet-download: https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet18-v2-7.onnx
+  resnet-download: https://github.com/InfiniTensor/InfiniTensor/releases/download/test-models/resnet18-v2-7.onnx
-  inception-download: https://media.githubusercontent.com/media/onnx/models/main/vision/classification/inception_and_googlenet/inception_v2/model/inception-v2-9.onnx
+  inception-download: https://github.com/InfiniTensor/InfiniTensor/releases/download/test-models/inception-v2-9.onnx
-  densenet-download: https://github.com/onnx/models/raw/main/vision/classification/densenet-121/model/densenet-12.onnx
+  densenet-download: https://github.com/InfiniTensor/InfiniTensor/releases/download/test-models/densenet-12.onnx
-  efficientnet-download: https://github.com/onnx/models/raw/main/vision/classification/efficientnet-lite4/model/efficientnet-lite4-11.onnx
+  efficientnet-download: https://github.com/InfiniTensor/InfiniTensor/releases/download/test-models/efficientnet-lite4-11.onnx
 jobs:
   build:
@@ -1,12 +1,11 @@
 name: clang-format Check
 on:
   push:
-    branch: 'master'
     paths-ignore:
       - '**.md'
       - 'LICENSE'
   pull_request:
-    paths-ignore:
+    paths:
       - '**.md'
       - 'LICENSE'

View File

@ -1,19 +1,26 @@
cmake_minimum_required(VERSION 3.17) # FindCUDAToolkit
include(CMakeDependentOption)
project(InfiniTensor C CXX)
# Do not change these options in this file. Use cmake.config, cmake -DOPTION=VALUE, or ccmake to specify them. # Do not change these options in this file. Use cmake.config, cmake -DOPTION=VALUE, or ccmake to specify them.
option(USE_CUDA "Support CUDA GPU" OFF) option(USE_CUDA "Support CUDA GPU" OFF)
option(USE_BANG "Support BANG MLU" OFF) option(USE_BANG "Support BANG MLU" OFF)
option(USE_KUNLUN "Support KUNLUN XPU" OFF)
option(USE_INTELCPU "Support INTELCPU" OFF) option(USE_INTELCPU "Support INTELCPU" OFF)
option(USE_BACKTRACE "Print backtrace on exception and segmentation fault" ON) option(USE_BACKTRACE "Print backtrace on exception and segmentation fault" ON)
option(USE_PROTOBUF "Serialize and deserialize tensors" OFF) option(USE_PROTOBUF "Serialize and deserialize tensors" OFF)
option(BUILD_NNET "Build nnet" OFF)
option(BUILD_DIST "Build project for distributed running" OFF) option(BUILD_DIST "Build project for distributed running" OFF)
option(BUILD_TEST "Build tests" OFF) option(BUILD_TEST "Build tests" OFF)
if(USE_CUDA)
message("CMake 3.18 or higher is required for setting CUDAToolkit")
cmake_minimum_required(VERSION 3.18) # FindCUDAToolkit
else()
cmake_minimum_required(VERSION 3.17)
endif()
include(CMakeDependentOption)
project(InfiniTensor C CXX)
cmake_dependent_option(BUILD_TEST_CORE "Build tests for core components" ON BUILD_TEST OFF) cmake_dependent_option(BUILD_TEST_CORE "Build tests for core components" ON BUILD_TEST OFF)
cmake_dependent_option(BUILD_TEST_PET "Build tests for PET" OFF BUILD_TEST OFF) cmake_dependent_option(BUILD_TEST_PET "Build tests for PET" OFF BUILD_TEST OFF)
cmake_dependent_option(BUILD_TEST_EINNET "Build tests for EINNET" OFF BUILD_TEST OFF)
set(DEFAULT_BUILD_TYPE "RelWithDebInfo") set(DEFAULT_BUILD_TYPE "RelWithDebInfo")
# Build Type # Build Type
@ -23,14 +30,14 @@ if(CMAKE_BUILD_TYPE STREQUAL "Debug")
add_compile_definitions(DEBUG_MODE) add_compile_definitions(DEBUG_MODE)
elseif(CMAKE_BUILD_TYPE STREQUAL "Release") elseif(CMAKE_BUILD_TYPE STREQUAL "Release")
message("Configuring for Release build.") message("Configuring for Release build.")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
add_compile_definitions(NDEBUG) add_compile_definitions(NDEBUG)
elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo") elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo")
message("Configuring for RelWithDebInfo build.") message("Configuring for RelWithDebInfo build.")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -O2") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -O2")
else() else()
message("Build type not specified. Configuring for RelWithDebInfo build.") message("Build type not specified. Configuring for RelWithDebInfo build.")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -O2") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -O2")
endif() endif()
@ -46,11 +53,13 @@ endif()
set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_EXTENSIONS OFF) # -std=gnu++11 when on, -std=c++11 when off set(CMAKE_CXX_EXTENSIONS OFF) # -std=gnu++11 when on, -std=c++11 when off
add_compile_options(-Wno-error=unused-variable)
find_package( find_package(
Python Python
COMPONENTS Interpreter Development COMPONENTS Interpreter Development
REQUIRED) REQUIRED)
# OpenMP # OpenMP
find_package(OpenMP) find_package(OpenMP)
if(OpenMP_C_FOUND) if(OpenMP_C_FOUND)
@ -87,16 +96,17 @@ add_subdirectory(3rd-party/nlohmann_json_cmake_fetchcontent)
include_directories(3rd-party/nlohmann_json_cmake_fetchcontent/single_include) include_directories(3rd-party/nlohmann_json_cmake_fetchcontent/single_include)
# TVM backend # TVM backend
if(BUILD_TEST_EINNET) if(BUILD_NNET AND BUILD_TEST)
if (NOT TVM_INCLUDE_DIR OR NOT DMLC_INCLUDE_DIR OR NOT DLPACK_INCLUDE_DIR OR NOT DLPACK_INCLUDE_DIR)
message(FATAL_ERROR "TVM_INCLUDE_DIR, DMLC_INCLUDE_DIR, and DLPACK_INCLUDE_DIR must be set when BUILD_TEST_EINNET is ON")
endif()
# TVM and DMLC for invoking TVM packed functions # TVM and DMLC for invoking TVM packed functions
include_directories(${TVM_INCLUDE_DIR}) include_directories(${TVM_INCLUDE_DIR})
include_directories(${DMLC_INCLUDE_DIR}) include_directories(${DMLC_INCLUDE_DIR})
include_directories(${DLPACK_INCLUDE_DIR}) include_directories(${DLPACK_INCLUDE_DIR})
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DDMLC_USE_LOGGING_LIBRARY=\\\<${TVM_INCLUDE_DIR}/tvm/runtime/logging.h\\\> ") if (TVM_INCLUDE_DIR AND DMLC_INCLUDE_DIR AND DLPACK_INCLUDE_DIR AND DLPACK_INCLUDE_DIR)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DINFINI_USE_TVM=1") # Enable TVM codegen kernels set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DDMLC_USE_LOGGING_LIBRARY=\\\<${TVM_INCLUDE_DIR}/tvm/runtime/logging.h\\\> ")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DINFINI_USE_TVM=1") # Enable TVM codegen kernels
else()
# message(FATAL_ERROR "TVM_INCLUDE_DIR, DMLC_INCLUDE_DIR, and DLPACK_INCLUDE_DIR must be set when BUILD_NNET AND BUILD_TEST is ON")
endif()
endif() endif()
if(BUILD_TEST) if(BUILD_TEST)
@ -110,13 +120,21 @@ if(BUILD_TEST)
include_directories(3rd-party/googletest/googletest/include) include_directories(3rd-party/googletest/googletest/include)
endif() endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -Wall -Werror -Wno-error=deprecated-declarations") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -Wall -Werror -Wno-error=deprecated-declarations -Wno-error=pointer-arith")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -UNDEBUG") # Enable assertion set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -UNDEBUG") # Enable assertion
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -UNDEBUG") # Enable assertion set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -UNDEBUG") # Enable assertion
# Source files # Source files
file(GLOB_RECURSE SRC src/ffi/*.cc src/core/*.cc src/kernels/cpu/*.cc src/nnet/*.cc src/operators/*.cc src/utils/*.cc) file(GLOB_RECURSE SRC src/ffi/*.cc src/core/*.cc src/kernels/cpu/*.cc src/operators/*.cc src/utils/*.cc)
if(BUILD_NNET)
add_compile_definitions(BUILD_NNET=1)
file(GLOB_RECURSE SRC_NNET src/nnet/*.cc)
list (APPEND SRC ${SRC_NNET})
# For locating resource files
set_source_files_properties(src/nnet/test.cc PROPERTIES COMPILE_OPTIONS "-DINFINI_PROJECT_HOME=${CMAKE_CURRENT_SOURCE_DIR}")
endif()
if(USE_CUDA) if(USE_CUDA)
file(GLOB_RECURSE SRC_CUDA src/cuda/*.cc src/cuda/*.cu src/kernels/cuda/*.cc src/kernels/cuda/*.cu) file(GLOB_RECURSE SRC_CUDA src/cuda/*.cc src/cuda/*.cu src/kernels/cuda/*.cc src/kernels/cuda/*.cu)
@ -128,6 +146,11 @@ if(USE_BANG)
list (APPEND SRC ${SRC_BANG}) list (APPEND SRC ${SRC_BANG})
endif() endif()
if(USE_KUNLUN)
file(GLOB_RECURSE SRC_KUNLUN src/kunlun/*.cc src/kernels/kunlun/*.cc )
list (APPEND SRC ${SRC_KUNLUN})
endif()
if(USE_INTELCPU) if(USE_INTELCPU)
file(GLOB_RECURSE SRC_INTELCPU src/intelcpu/*.cc src/kernels/intelcpu/*.cc ) file(GLOB_RECURSE SRC_INTELCPU src/intelcpu/*.cc src/kernels/intelcpu/*.cc )
list (APPEND SRC ${SRC_INTELCPU}) list (APPEND SRC ${SRC_INTELCPU})
@ -142,7 +165,7 @@ endif()
target_link_libraries(InfiniTensor pybind11::embed) target_link_libraries(InfiniTensor pybind11::embed)
# TVM backend # TVM backend
if(BUILD_TEST_EINNET) if(BUILD_NNET AND BUILD_TEST AND TVM_LIB_DIR)
target_link_libraries(InfiniTensor ${TVM_LIB_DIR}/libtvm.so) target_link_libraries(InfiniTensor ${TVM_LIB_DIR}/libtvm.so)
endif() endif()
@ -240,7 +263,50 @@ if(USE_BANG)
# BangC Kernels # BangC Kernels
################################################################################ ################################################################################
target_link_libraries(InfiniTensor ${CAMBRICON_CNNL} ${CAMBRICON_CNRT} ${CAMBRICON_CNDRV} stdc++) if (BUILD_DIST)
find_library(CAMBRICON_CNCL libcncl.so "${NEUWARE_HOME}/lib64")
target_link_libraries(InfiniTensor ${CAMBRICON_CNCL} ${CAMBRICON_CNNL} ${CAMBRICON_CNRT} ${CAMBRICON_CNDRV} stdc++)
message(STATUS "Add BUILD_DIST, use CNCL with BANG")
add_compile_definitions(INFINI_USE_CNCL=1)
else()
target_link_libraries(InfiniTensor ${CAMBRICON_CNNL} ${CAMBRICON_CNRT} ${CAMBRICON_CNDRV} stdc++)
endif()
endif()
if(USE_KUNLUN)
add_compile_definitions(USE_KUNLUN=1)
if ((NOT DEFINED KUNLUN_HOME) AND (NOT DEFINED ENV{KUNLUN_HOME}))
message(FATAL_ERROR "KUNLUN_HOME is not defined from cmake or env")
elseif (DEFINED KUNLUN_HOME)
set(KUNLUN_HOME ${KUNLUN_HOME} CACHE STRING "KUNLUN_HOME directory for Kunlun development")
else()
set(KUNLUN_HOME $ENV{KUNLUN_HOME} CACHE STRING "KUNLUN_HOME directory for Kunlun development")
endif()
message(STATUS "KUNLUN_HOME: ${KUNLUN_HOME}")
include_directories("${KUNLUN_HOME}/include/")
find_library(KUNLUN_RT libxpurt.so "${KUNLUN_HOME}/lib64/")
find_library(KUNLUN_DNN libxpuapi.so "${KUNLUN_HOME}/lib64/")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lstdc++ -Wall -Werror")
if ((NOT DEFINED TARGET_CPU_ARCH) AND (NOT DEFINED ENV{TARGET_CPU_ARCH}))
execute_process(COMMAND uname -m OUTPUT_VARIABLE _uname_m OUTPUT_STRIP_TRAILING_WHITESPACE)
set(TARGET_CPU_ARCH "${_uname_m}" CACHE STRING "Target CPU ARCH")
elseif(DEFINED TARGET_CPU_ARCH)
set(TARGET_CPU_ARCH ${TARGET_CPU_ARCH} CACHE STRING "Target CPU ARCH")
else()
set(TARGET_CPU_ARCH $ENV{TARGET_CPU_ARCH} CACHE STRING "Target CPU ARCH")
endif()
message(STATUS "TARGET_CPU_ARCH: ${TARGET_CPU_ARCH}")
if (BUILD_DIST)
message(STATUS "Add BUILD_DIST, use XCCL with KUNLUN XPU")
list(APPEND CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/cmake)
find_package(XCCL REQUIRED)
add_compile_definitions(INFINI_USE_XCCL=1)
target_link_libraries(InfiniTensor ${XCCL_LIBRARIES})
endif()
target_link_libraries(InfiniTensor ${KUNLUN_RT} ${KUNLUN_DNN} stdc++)
endif()
# # Python bindings
@@ -267,12 +333,18 @@ if(BUILD_TEST)
if(BUILD_TEST_CORE)
build_test(test/core/*.cc)
build_test(test/operators/*.cc)
build_test(test/kernels/nativecpu/*.cc)
if (USE_CUDA)
build_test(test/kernels/cuda/*.cc)
build_test(test/cuda/*.cc)
endif()
if (USE_BANG)
build_test(test/kernels/bang/*.cc)
build_test(test/bang/*.cc)
endif()
if (USE_KUNLUN)
build_test(test/kernels/kunlun/*.cc)
build_test(test/kunlun/*.cc)
endif()
if (USE_INTELCPU)
build_test(test/kernels/intelcpu/*.cc)
@@ -281,7 +353,7 @@ if(BUILD_TEST)
if(BUILD_TEST_PET)
build_test(test/pet/*.cc)
endif()
if(BUILD_NNET AND BUILD_TEST)
build_test(test/nnet/test_*.cc)
# Build expression reader


@@ -3,15 +3,19 @@
TYPE ?= Release
CUDA ?= OFF
BANG ?= OFF
KUNLUN ?= OFF
INTELCPU ?= off
BACKTRACE ?= ON
TEST ?= ON
DIST ?= OFF
NNET ?= OFF
FORMAT_ORIGIN ?=
# Docker build options
DOCKER_NAME ?= infinitensor
DOCKER_IMAGE_NAME ?= infinitensor
DOCKER_FILE ?= infinitensor_ubuntu_22.04.dockerfile
DOCKER_RUN_OPTION ?=
# CUDA option.
ifeq ($(CUDA), ON)
@@ -21,12 +25,14 @@ ifeq ($(CUDA), ON)
DOCKER_RUN_OPTION += --gpus all -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v `pwd`:`pwd` -w `pwd`
endif
CMAKE_OPT = -DCMAKE_BUILD_TYPE=$(TYPE)
CMAKE_OPT += -DUSE_CUDA=$(CUDA)
CMAKE_OPT += -DUSE_BANG=$(BANG)
CMAKE_OPT += -DUSE_KUNLUN=$(KUNLUN)
CMAKE_OPT += -DUSE_BACKTRACE=$(BACKTRACE)
CMAKE_OPT += -DBUILD_TEST=$(TEST)
CMAKE_OPT += -DBUILD_DIST=$(DIST)
CMAKE_OPT += -DBUILD_NNET=$(NNET)
ifeq ($(INTELCPU), ON)
CMAKE_OPT += -DUSE_INTELCPU=ON -DCMAKE_CXX_COMPILER=dpcpp
@@ -58,7 +64,7 @@ test-api:
@echo
python3 pyinfinitensor/tests/test_api.py
docker-build:
docker build -f scripts/dockerfile/$(DOCKER_FILE) -t $(DOCKER_NAME) .
docker-run:
@@ -69,5 +75,3 @@ docker-start:
docker-exec:
docker exec -it $(DOCKER_IMAGE_NAME) bash


@@ -33,13 +33,14 @@ There are several configurable CMake options, see the [CMakeLists.txt](/CMakeLis
## Roadmap
- [RefactorGraph](https://github.com/InfiniTensor/RefactorGraph) is a newly designed AI framework that is set to replace the current main branch.
- [EinNet](https://github.com/InfiniTensor/InfiniTensor/tree/NNET_e2e) is going to be merged into the main branch.
- Integration of [PET](https://github.com/thu-pacman/PET), a tensor program optimizer supporting partially equivalent transformations.
- Supported hardware
- ✔ NVIDIA GPU
- ✔ Cambricon MLU
- ✔ Kunlunxin XPU
- ⬜ Ascend NPU
## Contributor Guide

cmake/FindCNCL.cmake Normal file

@@ -0,0 +1,76 @@
SET(CNCL_LIB_SEARCH_PATHS $ENV{NEUWARE_HOME}/lib64)
SET(CNCL_INCLUDE_SEARCH_PATHS $ENV{NEUWARE_HOME}/include)
set(CNCL_INCLUDE_DIR $ENV{NEUWARE_HOME}/include)
set(CNCL_LIB_DIR $ENV{NEUWARE_HOME}/lib64)
set(CNCL_VERSION $ENV{CNCL_VERSION} CACHE STRING "Version of CNCL to build with")
if ($ENV{CNCL_ROOT_DIR})
message(WARNING "CNCL_ROOT_DIR is deprecated. Please set CNCL_ROOT instead.")
endif()
list(APPEND CNCL_ROOT $ENV{CNCL_ROOT_DIR} ${MLU_TOOLKIT_ROOT_DIR})
# Compatible layer for CMake <3.12. CNCL_ROOT will be accounted in for searching paths and libraries for CMake >=3.12.
list(APPEND CMAKE_PREFIX_PATH ${CNCL_ROOT})
find_path(CNCL_INCLUDE_DIRS
NAMES cncl.h
HINTS ${CNCL_INCLUDE_DIR})
if (USE_STATIC_CNCL)
MESSAGE(STATUS "USE_STATIC_CNCL is set. Linking with static CNCL library.")
SET(CNCL_LIBNAME "CNCL_static")
if (CNCL_VERSION) # Prefer the versioned library if a specific CNCL version is specified
set(CMAKE_FIND_LIBRARY_SUFFIXES ".a.${CNCL_VERSION}" ${CMAKE_FIND_LIBRARY_SUFFIXES})
endif()
else()
SET(CNCL_LIBNAME "cncl")
if (CNCL_VERSION) # Prefer the versioned library if a specific CNCL version is specified
set(CMAKE_FIND_LIBRARY_SUFFIXES ".so.${CNCL_VERSION}" ${CMAKE_FIND_LIBRARY_SUFFIXES})
endif()
endif()
find_library(CNCL_LIBRARIES
NAMES ${CNCL_LIBNAME}
HINTS ${CNCL_LIB_DIR})
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(CNCL DEFAULT_MSG CNCL_INCLUDE_DIRS CNCL_LIBRARIES)
if(CNCL_FOUND) # obtaining CNCL version and some sanity checks
set (CNCL_HEADER_FILE "${CNCL_INCLUDE_DIRS}/cncl.h")
message (STATUS "Determining CNCL version from ${CNCL_HEADER_FILE}...")
set (OLD_CMAKE_REQUIRED_INCLUDES ${CMAKE_REQUIRED_INCLUDES})
list (APPEND CMAKE_REQUIRED_INCLUDES ${CNCL_INCLUDE_DIRS})
include(CheckCXXSymbolExists)
check_cxx_symbol_exists(CNCL_VERSION_CODE cncl.h CNCL_VERSION_DEFINED)
if (CNCL_VERSION_DEFINED)
set(file "${PROJECT_BINARY_DIR}/detect_cncl_version.cc")
file(WRITE ${file} "
#include <iostream>
#include <cncl.h>
int main()
{
std::cout << CNCL_MAJOR << '.' << CNCL_MINOR << '.' << CNCL_PATCH << std::endl;
int x;
CNCLGetVersion(&x);
return x == CNCL_VERSION_CODE;
}
")
try_run(CNCL_VERSION_MATCHED compile_result ${PROJECT_BINARY_DIR} ${file}
RUN_OUTPUT_VARIABLE CNCL_VERSION_FROM_HEADER
CMAKE_FLAGS "-DINCLUDE_DIRECTORIES=${CNCL_INCLUDE_DIRS}"
LINK_LIBRARIES ${CNCL_LIBRARIES})
if (NOT CNCL_VERSION_MATCHED)
message(FATAL_ERROR "Found CNCL header version and library version do not match! \
(include: ${CNCL_INCLUDE_DIRS}, library: ${CNCL_LIBRARIES}) Please set CNCL_INCLUDE_DIR and CNCL_LIB_DIR manually.")
endif()
message(STATUS "CNCL version: ${CNCL_VERSION_FROM_HEADER}")
else()
# message(STATUS "CNCL version < 2.3.5-5")
endif ()
set (CMAKE_REQUIRED_INCLUDES ${OLD_CMAKE_REQUIRED_INCLUDES})
message(STATUS "Found CNCL (include: ${CNCL_INCLUDE_DIRS}, library: ${CNCL_LIBRARIES})")
mark_as_advanced(CNCL_ROOT_DIR CNCL_INCLUDE_DIRS CNCL_LIBRARIES)
endif()

cmake/FindXCCL.cmake Normal file

@@ -0,0 +1,27 @@
# Find the xccl libraries
set(XCCL_INCLUDE_DIR $ENV{KUNLUN_HOME}/include CACHE PATH "Folder contains KUNLUN XCCL headers")
set(XCCL_LIB_DIR $ENV{KUNLUN_HOME} CACHE PATH "Folder contains KUNLUN XCCL libraries")
list(APPEND CMAKE_PREFIX_PATH $ENV{KUNLUN_HOME})
find_path(XCCL_INCLUDE_DIRS
NAMES xpu/bkcl.h
HINTS ${XCCL_INCLUDE_DIR})
find_library(XCCL_LIBRARIES
NAMES bkcl
HINTS ${XCCL_LIB_DIR}/lib64)
message(STATUS "XCCL_INCLUDE_DIRS: ${XCCL_INCLUDE_DIRS}")
message(STATUS "XCCL_LIBRARIES: ${XCCL_LIBRARIES}")
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(XCCL DEFAULT_MSG XCCL_INCLUDE_DIRS XCCL_LIBRARIES)
if (XCCL_FOUND)
set (XCCL_HEADER_FILE "${XCCL_INCLUDE_DIRS}/xpu/bkcl.h")
message (STATUS "Determining XCCL version from ${XCCL_HEADER_FILE}...")
list (APPEND CMAKE_REQUIRED_INCLUDES ${XCCL_INCLUDE_DIRS})
message(STATUS "Found XCCL (include: ${XCCL_INCLUDE_DIRS}, library: ${XCCL_LIBRARIES})")
mark_as_advanced(XCCL_INCLUDE_DIRS XCCL_LIBRARIES)
endif()
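For reference, the way the main CMakeLists consumes this module is roughly the following (a sketch of the `BUILD_DIST` branch shown earlier; `InfiniTensor` is the project's real target):

```cmake
# Consume FindXCCL.cmake: put the cmake/ dir on the module path, then link.
list(APPEND CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/cmake)
find_package(XCCL REQUIRED)   # defines XCCL_INCLUDE_DIRS and XCCL_LIBRARIES
include_directories(${XCCL_INCLUDE_DIRS})
target_link_libraries(InfiniTensor ${XCCL_LIBRARIES})
```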


@@ -133,6 +133,13 @@
make install-python BANG=ON
```
Build the CPU part and the Kunlun XPU part at the same time:
```bash
export KUNLUN_HOME=/path/to/your/kunlun_home
make install-python KUNLUN=ON
```
3. Usage
After a successful installation, you can write and run code against this project's Python API. For concrete usage, refer to the sample code example/Resnet/resnet.py and the user manual


@@ -2,6 +2,7 @@
## Table of Contents
- [Supported environments](#环境支持)
- [Supported neural networks](#神经网络支持)
- [Technical support](#技术支持)
@@ -19,10 +20,10 @@
The neural network models verified so far:
- [x] [ResNet18-v2](https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet18-v2-7.onnx)
- [x] [DenseNet-121-12](https://github.com/onnx/models/blob/main/validated/vision/classification/densenet-121/model/densenet-12.onnx)
- [x] [Inception-2](https://github.com/onnx/models/blob/main/validated/vision/classification/inception_and_googlenet/inception_v2/model/inception-v2-9.onnx)
- [x] [EfficientNet-Lite4](https://github.com/onnx/models/blob/main/validated/vision/classification/efficientnet-lite4/model/efficientnet-lite4-11.onnx)
## Technical Support


@@ -3,9 +3,10 @@
## Table of Contents
- [Usage](#使用方法)
- [Python frontend guide](#python-前端应用指南)
- [Importing ONNX models](#导入-onnx-模型)
- [Optimization](#优化)
- [Exporting ONNX models](#导出-onnx-模型)
- [Running inference](#执行推理)
- [Sample code](#样例代码)
- [Technical support](#技术支持)
@@ -13,7 +14,7 @@
## Usage
Project management functionality lives in the [Makefile](../Makefile), which supports:
- Build the project: `make`/`make build`
- Clean generated files: `make clean`
@@ -26,6 +27,7 @@
- `TYPE`: build mode (`debug`/`release`); defaults to `release`
- `CUDA`: whether to build the CUDA backend; defaults to `OFF`, set `ON` to enable
- `BANG`: whether to build the Cambricon backend; defaults to `OFF`, set `ON` to enable
- `KUNLUN`: whether to build the Kunlun backend; defaults to `OFF`, set `ON` to enable
- `BACKTRACE`: whether to enable stack backtraces; defaults to `ON`, set `OFF` to disable (recommended on while debugging)
- `TEST`: whether to build `googletest`; defaults to `ON`, set `OFF` to disable (needed only for `test-cpp`)
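These switches are forwarded to CMake as `-D` definitions (see the `CMAKE_OPT` lines in the Makefile). A minimal sketch of that plumbing in plain shell, assuming the variable names above:

```shell
# Sketch of how the Makefile composes build switches into CMake -D flags.
TYPE=Release
CUDA=OFF
KUNLUN=ON
CMAKE_OPT="-DCMAKE_BUILD_TYPE=${TYPE}"
CMAKE_OPT="${CMAKE_OPT} -DUSE_CUDA=${CUDA}"
CMAKE_OPT="${CMAKE_OPT} -DUSE_KUNLUN=${KUNLUN}"
echo "${CMAKE_OPT}"   # the Makefile then invokes cmake with these options
```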
@@ -37,10 +39,10 @@
Supported models:
- [x] [ResNet18-v2](https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet18-v2-7.onnx)
- [x] [DenseNet-121-12](https://github.com/onnx/models/blob/main/validated/vision/classification/densenet-121/model/densenet-12.onnx)
- [x] [Inception-2](https://github.com/onnx/models/blob/main/validated/vision/classification/inception_and_googlenet/inception_v2/model/inception-v2-9.onnx)
- [x] [EfficientNet-Lite4](https://github.com/onnx/models/blob/main/validated/vision/classification/efficientnet-lite4/model/efficientnet-lite4-11.onnx)
```python
import onnx
@@ -95,7 +97,7 @@ for name, tensor in stub.inputs.items():
print(name, tensor.shape(), tensor)
```
For [resnet18-v2-7.onnx](https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet18-v2-7.onnx), this prints:
```plaintext
data [1, 3, 224, 224] <backend.Tensor object at 0x7efeb828e3b0>
@@ -136,7 +138,7 @@ for name, tensor in stub.outputs.items():
### Sample Code
You can look at the sample code in [resnet.py](https://github.com/wanghailu0717/NNmodel/blob/main/ResNet/resnet.py) and try running it. In this file we build the ResNet network with PyTorch. To see how to use the script:
```python
python resnet.py -h

env.sh

@@ -35,4 +35,4 @@ export LD_LIBRARY_PATH="${NEUWARE_HOME}/lib64:${LD_LIBRARY_PATH}"
# ├── tools
# ├── version
# └── XTDK
export KUNLUN_HOME=/usr/local/xpu


@@ -0,0 +1,17 @@
# Distributed Scripts
#### 1. Run the PyTorch model and generate inputs and reference outputs (optionally exporting ONNX)
Use `--export_onnx` to set the directory the ONNX model is exported to (defaults to the current path `./`); without this flag, only computation and input/output generation are performed.
```bash
python run_pytorch.py --model gpt2 --batch_size 1 --length 1 --export_onnx ./
```
This generates the input and output files `test_inputs.npy` and `test_results.npy` in the current directory; currently only a single input and a single output are supported.
#### 2. Run the InfiniTensor distributed script
```bash
python cuda_launch.py --model "/XXX/XXX.onnx" --nproc_per_node 4
```


@@ -0,0 +1,196 @@
import argparse
import os
import time
import multiprocessing as mp
from pyinfinitensor.onnx import OnnxStub, backend
import onnx
from onnx.shape_inference import infer_shapes_path
import numpy as np
from parallel_opt import parallel_model
def parse_args():
parser = argparse.ArgumentParser(description="launch distributed infinitensor")
parser.add_argument("--num_nodes", type=int, default=1, help="number of nodes")
parser.add_argument(
"--nproc_per_node", type=int, default=2, help="number of processes per node"
)
parser.add_argument(
"--name", type=str, default="test", help="name of this instance."
)
parser.add_argument(
"--model", type=str, default="/data/onnx_models/llama2/llama_bs1_seq1024.onnx",
help="path to the ONNX model file."
)
parser.add_argument("--batch_size", type=int, default=1, help="batch size.")
parser.add_argument("--length", type=int, default=1, help="sequence length.")
parser.add_argument(
"--gen_std",
default=False,
action="store_true",
help="whether to generate the standard results.",
)
args = parser.parse_args()
print("arg setting: ", args)
return (
args.num_nodes,
args.nproc_per_node,
args.name,
args.model,
args.batch_size,
args.length,
args.gen_std,
)
def run_model(model, runtime, world_size=1, rank=0, n=10):
stub = OnnxStub(model, runtime)
load_inputs(stub, world_size, rank)
# stub.tune()
stub.run()
# get outputs
time.sleep(0.01)
outputs = next(stub.outputs.values().__iter__()).copyout_numpy()
# bench
begin = time.time()
for _ in range(n):
stub.run()
end = time.time()
avg_time = (end - begin) / n
print(f"average time: {avg_time}")
return outputs
def run_and_compare(name, model, runtime, world_size=1, rank = 0):
results = np.load(f"./data/output.npy")
outputs = run_model(model, runtime, world_size, rank)
print("answer argmax:", np.argmax(results))
print("output argmax:", np.argmax(outputs))
#np.testing.assert_allclose(outputs, results, rtol=1e-3, atol=1e-3)
getDiff(results, outputs)
def start_worker(
name: str, world_size: int, rank: int, local_rank: int, model: onnx.ModelProto
):
dist_name = name + "_dist"
model = parallel_model(model, world_size, rank)
extern_path = f"./{dist_name}_rank{rank}.pb"
if os.path.exists(extern_path):
os.remove(extern_path)
onnx.save_model(
model,
f"./{dist_name}_rank{rank}.onnx",
save_as_external_data=True,
location=extern_path,
)
infer_shapes_path(f"./{dist_name}_rank{rank}.onnx")
runtime = backend.BangRuntime(local_rank)
# print("init comm")
runtime.init_comm(
dist_name,
world_size,
rank,
)
run_and_compare(name, model, runtime, world_size, rank)
def start_single(name, model):
runtime = backend.BangRuntime(0)
run_and_compare(name, model, runtime)
def generate_input_output(model):
os.makedirs(os.path.dirname("./data/"), exist_ok=True)
runtime = backend.BangRuntime(0)
stub = OnnxStub(model, runtime)
position_id = 0
for i, (name, tensor) in enumerate(stub.inputs.items()):
input = tensor.copyout_numpy()
if np.issubdtype(input.dtype, np.integer):
if input.size == 1:
# input = np.array([position_id])
input = np.random.randint(0,2,size=input.shape, dtype=input.dtype)
else:
input = np.random.randint(0,2,size=input.shape, dtype=input.dtype)
elif input.dtype == np.bool_:
input = np.random.randint(0,2,size=input.shape) > 0
else:
if i == 0:
input = np.ones(input.shape).astype(input.dtype)
position_id = input.shape[-1] - 1
else:
input = np.random.rand(*input.shape).astype(input.dtype)
tensor.copyin_numpy(input)
np.save(f"./data/input_{i}", input)
stub.run()
time.sleep(0.01)
output = next(stub.outputs.values().__iter__()).copyout_numpy()
if np.isnan(output).any():
print("Nan in output")
np.save(f"./data/output", output)
def load_inputs(stub, world_size=1, rank=0):
for i, (name, tensor) in enumerate(stub.inputs.items()):
input = np.load(f"./data/input_{i}.npy")
if all(x == y for x,y in zip(input.shape,tensor.shape())):
tensor.copyin_numpy(input)
else:
tensor.copyin_numpy(np.hsplit(input, world_size)[rank])
def getDiff(base, test):
absolute_diff = np.abs(np.subtract(base, test))
max_absolute_diff = np.max(absolute_diff)
baseCopy = base.astype(np.float64).ravel()
testCopy = test.astype(np.float64).ravel()
upValue = np.sum(np.abs(baseCopy - testCopy))
downValue = np.sum(np.abs(baseCopy)) + np.float64(1e-9)
max_relative_diff = upValue / downValue
print(f"Max absolute difference: {max_absolute_diff}\n"
f"Max relative difference: {max_relative_diff}")
return max_absolute_diff, max_relative_diff
def main():
nnodes, nproc_per_node, name, model_path, bs, length, gen_std = parse_args()
model = onnx.load(model_path)
# generate standard output
if gen_std:
print("Generate inputs and outputs.")
p = mp.Process(target=generate_input_output, args=[model])
p.start()
p.join()
return
# run single process.
# use standalone process to isolate cuda.
print("run model by single MLU.")
p = mp.Process(target=start_single, args=(name, model))
p.start()
p.join()
# run distributed parallel.
world_size = nnodes * nproc_per_node
print(f"run model by {world_size} MLUs in parallel.")
workers = [
mp.Process(
target=start_worker,
args=(name, world_size, rank, rank % nproc_per_node, model),
)
for rank in range(world_size)
]
for w in workers:
w.start()
for w in workers:
w.join()
if __name__ == "__main__":
main()
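The `getDiff` metric above reports the maximum elementwise absolute difference plus a *global* relative difference, sum|base − test| / (sum|base| + 1e-9), rather than an elementwise relative error. A self-contained restatement on toy arrays (mirroring the function, not importing it):

```python
import numpy as np

def get_diff(base, test):
    # Same computation as getDiff in the script above.
    max_absolute_diff = np.max(np.abs(np.subtract(base, test)))
    base64 = base.astype(np.float64).ravel()
    test64 = test.astype(np.float64).ravel()
    max_relative_diff = np.sum(np.abs(base64 - test64)) / (np.sum(np.abs(base64)) + 1e-9)
    return max_absolute_diff, max_relative_diff

base = np.array([1.0, 2.0, 3.0])
test = np.array([1.0, 2.5, 3.0])
max_abs, rel = get_diff(base, test)
```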


@@ -5,6 +5,7 @@ import multiprocessing as mp
from pyinfinitensor.onnx import OnnxStub, backend
import onnx
from onnx.external_data_helper import convert_model_to_external_data
from onnx.shape_inference import infer_shapes_path
import numpy as np
from parallel_opt import parallel_model
@@ -44,16 +45,18 @@ def parse_args():
)
def run_model(model, runtime, inputs, n=10):
stub = OnnxStub(model, runtime)
for tensor, input in zip(stub.inputs.values(), inputs, strict=False):
tensor.copyin_numpy(input)
# stub.tune()
stub.run()
# get outputs
outputs = next(stub.outputs.values().__iter__()).copyout_numpy()
# bench
for tensor, input in zip(stub.inputs.values(), inputs, strict=False):
tensor.copyin_numpy(input)
begin = time.time()
for _ in range(n):
stub.run()
@@ -64,13 +67,12 @@ def run_model(model, runtime, inputs, n=10):
def run_and_compare(name, model, runtime):
input_ids = np.load(f"{name}_inputs.npy")
position_ids = np.arange(input_ids.shape[-1])
results = np.load(f"{name}_results.npy")
outputs = run_model(model, runtime, (input_ids, position_ids))
print("outputs abs mean:", abs(outputs).mean())
print("max abs diff:", abs(outputs - results).max())
def start_worker(
@@ -81,14 +83,13 @@ def start_worker(
extern_path = f"./{dist_name}_rank{rank}.pb"
if os.path.exists(extern_path):
os.remove(extern_path)
onnx.save_model(
model,
f"./{dist_name}_rank{rank}.onnx",
save_as_external_data=True,
location=extern_path,
)
#infer_shapes_path(f"./{dist_name}_rank{rank}.onnx")
runtime = backend.CudaRuntime(local_rank)
# print("init comm")
runtime.init_comm(
@@ -106,10 +107,12 @@ def start_single(name, model):
def gen_standard(name, model, voc_size, bs, len):
# generate standard results
input_ids = np.random.randint(0, voc_size, (bs, len))
position_ids = np.arange(len)
np.save(f"{name}_inputs", input_ids)
runtime = backend.CudaRuntime(0)
outputs = run_model(model, runtime, (input_ids, position_ids), 1)
print("outputs abs mean:", abs(outputs).mean())
np.save(f"{name}_results", outputs)
@@ -128,12 +131,14 @@ def main():
# run single process.
# use standalone process to isolate cuda.
print("run model by single GPU.")
p = mp.Process(target=start_single, args=(name, model))
p.start()
p.join()
# run distributed parallel.
world_size = nnodes * nproc_per_node
print(f"run model by {world_size} GPUs in parallel.")
workers = [
mp.Process(
target=start_worker,


@@ -0,0 +1,213 @@
import argparse
import os
import time
import multiprocessing as mp
from pyinfinitensor.onnx import OnnxStub, backend
import onnx
from onnx.external_data_helper import convert_model_to_external_data
from onnx.shape_inference import infer_shapes_path
import numpy as np
from parallel_opt import parallel_model
st_input_dir = "standard/inputs/"
st_output_dir = "standard/outputs/"
def parse_args():
parser = argparse.ArgumentParser(description="launch distributed infinitensor")
parser.add_argument("--num_nodes", type=int, default=1, help="number of nodes")
parser.add_argument(
"--nproc_per_node", type=int, default=2, help="number of processes per node"
)
parser.add_argument(
"--name", type=str, default="test", help="name of this instance."
)
parser.add_argument(
"--model", type=str, default="/data1/shared/panzezhong/llama/fp32/my_llama_fp32.sim.onnx", help="path to the ONNX model file."
)
parser.add_argument("--batch_size", type=int, default=1, help="batch size.")
parser.add_argument("--length", type=int, default=1, help="sequence length.")
parser.add_argument(
"--gen_std",
default=False,
action="store_true",
help="whether to generate the standard results.",
)
parser.add_argument(
"--run_single",
default=False,
action="store_true",
help="whether run model with single process with standard inputs"
)
args = parser.parse_args()
print("arg setting: ", args)
return (
args.num_nodes,
args.nproc_per_node,
args.name,
args.model,
args.batch_size,
args.length,
args.gen_std,
args.run_single
)
def run_model(model, runtime, world_size=1, rank=0, n=10):
stub = OnnxStub(model, runtime)
load_inputs(stub, world_size, rank)
# stub.tune()
stub.run()
# get outputs
time.sleep(0.01)
outputs = next(stub.outputs.values().__iter__()).copyout_numpy()
# bench
begin = time.time()
for _ in range(n):
stub.run()
end = time.time()
avg_time = (end - begin) / n
print(f"average time: {avg_time}")
return outputs
def run_and_compare(name, model, runtime, world_size=1, rank = 0):
results = np.load(os.path.join(st_output_dir,f"output.npy"))
outputs = run_model(model, runtime, world_size, rank)
print(outputs[:100])
if np.isnan(outputs).any():
print("Nan in output")
print("answer argmax:", np.argmax(results))
print("output argmax:", np.argmax(outputs))
#np.testing.assert_allclose(outputs, results, rtol=1e-3, atol=1e-3)
getDiff(results, outputs)
def start_worker(
name: str, world_size: int, rank: int, local_rank: int, model: onnx.ModelProto
):
dist_name = name + "_dist"
model = parallel_model(model, world_size, rank)
extern_path = f"./{dist_name}_rank{rank}.pb"
if os.path.exists(extern_path):
os.remove(extern_path)
onnx.save_model(
model,
f"./{dist_name}_rank{rank}.onnx",
save_as_external_data=True,
location=extern_path,
)
infer_shapes_path(f"./{dist_name}_rank{rank}.onnx")
runtime = backend.KUNLUNRuntime(local_rank)
# print("init comm")
runtime.init_comm(
dist_name,
world_size,
rank,
)
run_and_compare(name, model, runtime, world_size, rank)
def start_single(name, model):
runtime = backend.KUNLUNRuntime(0)
run_and_compare(name, model, runtime)
def generate_input_output(model):
os.makedirs(st_input_dir, exist_ok=True)
os.makedirs(st_output_dir, exist_ok=True)
runtime = backend.KUNLUNRuntime(0)
stub = OnnxStub(model, runtime)
position_id = 0
for i, (name, tensor) in enumerate(stub.inputs.items()):
input = tensor.copyout_numpy()
if np.issubdtype(input.dtype, np.integer):
if input.size == 1:
# input = np.array([position_id])
input = np.random.randint(0,2,size=input.shape, dtype=input.dtype)
else:
input = np.random.randint(0,2,size=input.shape, dtype=input.dtype)
elif input.dtype == np.bool_:
input = np.random.randint(0,2,size=input.shape) > 0
else:
if i == 0:
input = np.ones(input.shape).astype(input.dtype)
position_id = input.shape[-1] - 1
else:
input = np.random.rand(*input.shape).astype(input.dtype)
tensor.copyin_numpy(input)
np.save(os.path.join(st_input_dir, f"input_{i}"), input)
stub.run()
# print(stub.outputs)
time.sleep(0.01)
output = next(stub.outputs.values().__iter__()).copyout_numpy()
print(output[:100])
if np.isnan(output).any():
print("Nan in output")
np.save(os.path.join(st_output_dir, f"output"), output)
def load_inputs(stub, world_size=1, rank=0):
for i, (name, tensor) in enumerate(stub.inputs.items()):
input = np.load(os.path.join(st_input_dir, f"input_{i}.npy"))
if all(x == y for x,y in zip(input.shape,tensor.shape())):
tensor.copyin_numpy(input)
else:
tensor.copyin_numpy(np.hsplit(input, world_size)[rank])
def getDiff(base, test):
absolute_diff = np.abs(np.subtract(base, test))
max_absolute_diff = np.max(absolute_diff)
baseCopy = base.astype(np.float64).ravel()
testCopy = test.astype(np.float64).ravel()
upValue = np.sum(np.abs(baseCopy - testCopy))
downValue = np.sum(np.abs(baseCopy)) + np.float64(1e-9)
max_relative_diff = upValue / downValue
print(f"Max absolute difference: {max_absolute_diff}\nMax relative difference: {max_relative_diff}")
return max_absolute_diff, max_relative_diff
def main():
nnodes, nproc_per_node, name, model_path, bs, length, gen_std, run_single = parse_args()
model = onnx.load(model_path)
# generate standard output
if gen_std:
print("Generate inputs and outputs.")
p = mp.Process(target=generate_input_output, args=[model])
p.start()
p.join()
return
# # run single process.
# # use standalone process to isolate cuda.
if run_single:
print("run model by single XPU.")
p = mp.Process(target=start_single, args=(name, model))
p.start()
p.join()
return
# run distributed parallel.
world_size = nnodes * nproc_per_node
print(f"run model by {world_size} XPUs in parallel.")
workers = [
mp.Process(
target=start_worker,
args=(name, world_size, rank, rank % nproc_per_node, model),
)
for rank in range(world_size)
]
for w in workers:
w.start()
for w in workers:
w.join()
if __name__ == "__main__":
main()
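`load_inputs` above falls back to `np.hsplit` when a stored input is wider than the tensor a rank expects, giving each rank an equal slice along axis 1. A toy illustration of that slicing:

```python
import numpy as np

world_size = 2
full = np.arange(8).reshape(2, 4)      # stand-in for a full-width input
shards = np.hsplit(full, world_size)   # equal column slices, one per rank

rank0, rank1 = shards                  # what rank 0 and rank 1 would copy in
```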


@@ -11,6 +11,7 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
vinfo = {info.name: info for info in model.graph.value_info}
vinfo.update({info.name: info for info in model.graph.input})
vinfo.update({info.name: info for info in model.graph.output})
output = {info.name: info for info in model.graph.output}
place: Dict[str, Placement] = {}
nodes: List[NodeProto] = []
@@ -56,11 +57,11 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
ndim = len(vinfo[output].type.tensor_type.shape.dim)
out_plc = Shard(ndim - 1) if in_plc.is_replicate() else _Partial()
place[node.output[0]] = out_plc
def shard_concat(node: NodeProto):
# hack for kvcache
in_plc = place[node.input[1]]
if in_plc.is_shard():
seq_len_dim = vinfo[node.input[0]].type.tensor_type.shape.dim.pop(1)
seq_len_dim.dim_value //= tp_world_size
vinfo[node.input[0]].type.tensor_type.shape.dim.insert(1, seq_len_dim)
@@ -114,7 +115,7 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
assert out_dims[s_dim] % tp_world_size == 0, out_dims
out_dims[s_dim] //= tp_world_size
# if ONNX uses the same tensor for multiple Reshape nodes, rename it to distinguish it from the others.
node.input[1] = node.output[0] + "_shape"
data[node.input[1]] = numpy_helper.from_array(out_dims, name=node.input[1])
place[node.output[0]] = Shard(s_dim)
@@ -136,7 +137,7 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
place[node.output[0]] = Shard(list(perm).index(plc.dim))
def shard_node(node: NodeProto):
if node.op_type in ["Relu", "Tanh", "Softmax", "Cast"]:
place[node.output[0]] = place[node.input[0]]
elif node.op_type in ["Where"]:
place[node.output[0]] = place[node.input[1]]
@@ -154,7 +155,7 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
), f"{place[node.input[0]]} != {place[node.input[1]]}"
place[node.output[0]] = place[node.input[0]]
elif node.op_type == "Concat":
shard_concat(node)
def find_successor(op_type: str, idx: int, search_limit: int = 1):
for node in model.graph.node[idx + 1 : idx + 1 + search_limit]:
@@ -175,6 +176,16 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
if (node.op_type == "MatMul" or node.op_type == "Gemm") and any(
input in data for input in node.input
):
# FIXME(constroy): the last MatMul should not be sharded as TP.
if node.output[0] in output or (
index + 1 < len(model.graph.node)
and model.graph.node[index + 1].output[0] in output
):
continue
groups = 1
# If the Gemm or Matmul is followed by a split, then the inputs are concatenated by groups
split_node = find_successor("Split", index, search_limit=2)
@@ -218,7 +229,7 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
new_input = []
for info in model.graph.input:
new_input.append(vinfo[info.name])
graph = helper.make_graph(
nodes,
model.graph.name + f"_{tp_rank}",
@@ -233,5 +244,5 @@ def parallel_model(model: ModelProto, tp_world_size: int = 1, tp_rank: int = 0):
if tt.HasField("shape"):
tt.ClearField("shape")
model = helper.make_model(graph)
#model = onnx.shape_inference.infer_shapes(model)
return model


@@ -0,0 +1,178 @@
import argparse
import torch
from transformers import BertModel, BertConfig
from transformers import GPT2Model, GPT2Config
from transformers import OPTModel, OPTConfig
import time
import numpy as np
import onnx
import os
from onnx.external_data_helper import convert_model_to_external_data
from onnxsim import simplify
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
def parse_args():
parser = argparse.ArgumentParser(description="Run pytorch gpt2/bert/opt and optionally export onnx.")
parser.add_argument(
"--model", type=str, choices=["gpt2", "bert", "opt"], required=True, help="model type"
)
parser.add_argument("--batch_size", type=int, default=1, help="batch size.")
parser.add_argument("--length", type=int, default=1, help="sequence length.")
parser.add_argument(
"--export_onnx",
type=str,
nargs="?",
default=None,
const="./",
help="whether and where to export onnx file",
)
args = parser.parse_args()
print("arg setting: ", args)
return (
args.model,
args.batch_size,
args.length,
args.export_onnx
)
def get_model(modelname):
match modelname:
case "bert":
model = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, hidden_act="gelu_new") # erf is not impl by infini
voc_size = BertConfig().vocab_size
case "gpt2":
model = GPT2Model.from_pretrained("gpt2")
voc_size = GPT2Config().vocab_size
case "opt":
model = OPTModel.from_pretrained("./opt-125m")
voc_size = OPTConfig().vocab_size
case _:
raise KeyError(modelname)
model = model.eval()
return model, voc_size
def run_pytorch(torch_model, voc_size, batchsize, seq_len):
data = np.random.randint(0, voc_size, (batchsize, seq_len), dtype=np.int32)
np.save("test_inputs", data)
inputs = torch.from_numpy(data).to("cuda")
torch_model = torch_model.to("cuda")
n_iter = 20
with torch.no_grad():
for _ in range(10):
outputs = torch_model(inputs)
torch.cuda.synchronize()
begin = time.time()
with torch.no_grad():
for _ in range(n_iter):
torch.cuda.synchronize()
outputs = torch_model(inputs)
torch.cuda.synchronize()
end = time.time()
avg_time = (end - begin) / n_iter
outputs = outputs.last_hidden_state.to("cpu")
print("outputs abs mean:", abs(np.array(outputs)).mean())
print(f"average time: {avg_time}")
torch.cuda.memory.empty_cache()
np.save("test_results", np.array(outputs))
print("Save input & output as test_inputs.npy and test_results.npy")
def export_onnx(model, data, path, extern=False):
torch.onnx.export(model, data, path, verbose=False, do_constant_folding=True)
onnx_model = onnx.load(path)
onnx_model, check = simplify(onnx_model, skipped_optimizers=['eliminate_duplicate_initializer'])
#onnx_model, check = simplify(onnx_model, skipped_optimizers=['fuse_qkv', 'eliminate_duplicate_initializer'])
assert check
add_value_info_for_constants(onnx_model)
onnx_model = onnx.shape_inference.infer_shapes(onnx_model)
if extern:
extern_path = path.replace('.onnx', '.pb')
if os.path.exists(extern_path):
os.remove(extern_path)
convert_model_to_external_data(
onnx_model,
all_tensors_to_one_file=True,
location=extern_path,
size_threshold=1024,
convert_attribute=False,
)
onnx.save(onnx_model, path)
def add_value_info_for_constants(model : onnx.ModelProto):
"""
Currently onnx.shape_inference doesn't use the shape of initializers, so add
that info explicitly as ValueInfoProtos.
Mutates the model.
Args:
model: The ModelProto to update.
"""
# All (top-level) constants will have ValueInfos before IRv4 as they are all inputs
if model.ir_version < 4:
return
def add_const_value_infos_to_graph(graph : onnx.GraphProto):
inputs = {i.name for i in graph.input}
existing_info = {vi.name: vi for vi in graph.value_info}
for init in graph.initializer:
# Check it really is a constant, not an input
if init.name in inputs:
continue
# The details we want to add
elem_type = init.data_type
shape = init.dims
# Get existing or create new value info for this constant
vi = existing_info.get(init.name)
if vi is None:
vi = graph.value_info.add()
vi.name = init.name
# Even though it would be weird, we will not overwrite info even if it doesn't match
tt = vi.type.tensor_type
if tt.elem_type == onnx.TensorProto.UNDEFINED:
tt.elem_type = elem_type
if not tt.HasField("shape"):
# Ensure we set an empty list if the const is scalar (zero dims)
tt.shape.dim.extend([])
for dim in shape:
tt.shape.dim.add().dim_value = dim
# Handle subgraphs
for node in graph.node:
for attr in node.attribute:
# Ref attrs refer to other attrs, so we don't need to do anything
if attr.ref_attr_name != "":
continue
if attr.type == onnx.AttributeProto.GRAPH:
add_const_value_infos_to_graph(attr.g)
if attr.type == onnx.AttributeProto.GRAPHS:
for g in attr.graphs:
add_const_value_infos_to_graph(g)
return add_const_value_infos_to_graph(model.graph)
def main():
modelname, batchsize, seqlen, export_path = parse_args()
model, voc_size = get_model(modelname)
if export_path is not None:
filename = "{}_{}_{}.onnx".format(modelname, batchsize, seqlen)
path = os.path.join(export_path, filename)
param = torch.zeros((batchsize, seqlen), dtype=torch.int)
export_onnx(model, param, path, True)
run_pytorch(model, voc_size, batchsize, seqlen)
if __name__ == "__main__":
main()


@@ -0,0 +1,145 @@
import os
from pyinfinitensor.onnx import OnnxStub, backend
import numpy as np
import onnx
import torch
from transformers import LlamaModel, LlamaForCausalLM
from tqdm import tqdm
import onnx_graphsurgeon as gs
from onnxsim import simplify
import argparse
parser = argparse.ArgumentParser(description='')
parser.add_argument('--batchsize', dest='batchsize', type=int, default=1)
parser.add_argument('--layer', dest='n_layers', type=int, default=2)
parser.add_argument('--iter', dest='n_iter', type=int, default=1)
parser.add_argument('--n_max_length', dest='n_max_length', type=int, default=1024)
parser.add_argument('--pretrained_llama_path', dest='pretrained_llama_path', type=str,
default="/data0/shared/data/public/opensource_models/meta-llama/Llama-2-7b-hf/")
parser.add_argument('--onnx_model_path', dest='onnx_model_path', type=str,
default="/data1/shared/llama")
args = parser.parse_args()
ONNX_MODEL_PATH = "{}/llama_bs{}_layer{}.onnx".format(args.onnx_model_path, args.batchsize, args.n_layers)
ONNX_WEIGHT_PATH = "./llama_bs{}_layer{}.pb".format(args.batchsize, args.n_layers)
def export_onnx(model: LlamaModel, ONNX_MODEL_PATH):
param = torch.zeros(
(args.batchsize, 1024), dtype=torch.long)
logits = model(param, past_key_values=None)
param_kvcache = torch.zeros((args.batchsize, 1), dtype=torch.long)
torch.onnx.export(model, (param_kvcache, {"past_key_values": logits.past_key_values,
"position_ids": param_kvcache}), ONNX_MODEL_PATH, verbose=False,
do_constant_folding=True,)
onnx_model = onnx.load(ONNX_MODEL_PATH)
print("simplifying onnx model")
onnx_model, check = simplify(onnx_model, skipped_optimizers=[
'eliminate_duplicate_initializer'])
assert check
onnx.save(onnx_model, ONNX_MODEL_PATH, save_as_external_data=True, location=ONNX_WEIGHT_PATH)
print("simplifying finished.")
@gs.Graph.register()
def replace_with_attention(self, inputs, outputs, inputs_added, outputs_removed):
for inp in inputs:
inp.outputs.clear()
for out in outputs:
out.inputs.clear()
for inp in inputs_added:
inputs.append(inp)
for out in outputs_removed:
out.inputs.clear()
return self.layer(op="AttentionKVCache", inputs=inputs, outputs=outputs)
def replace_onnx_with_attention_op():
graph = gs.import_onnx(
onnx.load(ONNX_MODEL_PATH))
tmap = graph.tensors()
for i in range(args.n_layers):
inputs = [
tmap["onnx::Concat_" + str((i+1)*2)],
tmap["onnx::Concat_" + str((i+1)*2+1)],
tmap["/model/layers." + str(i) + "/self_attn/Add_output_0"],
tmap["/model/layers." + str(i) + "/self_attn/Add_1_output_0"],
tmap["/model/layers." + str(i) + "/self_attn/Transpose_2_output_0"]]
outputs = [
tmap["/model/layers." + str(i) + "/self_attn/MatMul_1_output_0"]]
inputs_added = [graph.inputs[1]]
outputs_removed = []
graph.replace_with_attention(
inputs, outputs, inputs_added, outputs_removed)
graph.outputs = [tmap[graph.outputs[0].name]]
graph.cleanup(True).toposort()
onnx.save(gs.export_onnx(graph), ONNX_MODEL_PATH, save_as_external_data=True)
if __name__ == "__main__":
kvcache_torch = None
torch_model = LlamaForCausalLM.from_pretrained(
args.pretrained_llama_path, num_hidden_layers=int(args.n_layers)).eval()
n_heads = torch_model.config.num_attention_heads
n_dims = torch_model.config.hidden_size // n_heads
if not os.path.exists(ONNX_MODEL_PATH):
print("exporting onnx graph")
export_onnx(torch_model, ONNX_MODEL_PATH)
replace_onnx_with_attention_op()
else:
print("will use existing onnx graph")
onnx_model = onnx.load(ONNX_MODEL_PATH)
stub = OnnxStub(onnx_model, backend.cuda_runtime())
count_wrong = 0
for i in tqdm(range(0, args.n_max_length)):
query = np.random.randint(
torch_model.config.vocab_size, size=(args.batchsize, 1), dtype=np.int32)
position_id = i*np.ones((args.batchsize, 1), dtype=np.int32)
####################################
# pytorch
####################################
outputs_torch = torch_model(
torch.tensor(query), past_key_values=kvcache_torch)
logit_torch = outputs_torch['logits']
kvcache_torch = outputs_torch['past_key_values']
####################################
# infinitensor
####################################
# copyin input
(list(stub.inputs.items()))[0][1].copyin_int64(
query.reshape(-1).tolist())
(list(stub.inputs.items()))[1][1].copyin_int64(
position_id.reshape(-1).tolist())
stub.run()
####################################
# validation
####################################
# copyout output
logits_it = np.array((list(stub.outputs.items()))
[0][1].copyout_float())
try:
np.testing.assert_allclose(
logit_torch[:, -1, :].detach().cpu().numpy().flatten(), logits_it, rtol=1e-3, atol=1e-3)
except AssertionError:
try:
np.testing.assert_allclose(
np.argmax(logit_torch[:, -1, :].detach().cpu().numpy().flatten()), np.argmax(logits_it), rtol=1e-3, atol=1e-3)
except AssertionError:
count_wrong = count_wrong + 1
result = "{}/{} failed.".format(count_wrong, args.n_max_length)
print(result)
del stub


@@ -0,0 +1,29 @@
import sys
import onnx
import torch
import numpy as np
from pyinfinitensor.onnx import OnnxStub, backend
if __name__ == '__main__':
args = sys.argv
if len(sys.argv) != 2:
print("Usage: python onnx_inference.py model_name.onnx")
exit()
model_path = sys.argv[1]
# print(model_path)
onnx_model = onnx.load(model_path)
onnx_input = onnx_model.graph.input[0]
input_shape = [[d.dim_value for d in _input.type.tensor_type.shape.dim]
for _input in onnx_model.graph.input]
# Assume that there is only one input tensor
input_shape = input_shape[0]
# print(input_shape)
input_data = np.random.random(input_shape).astype(np.float32)
model = OnnxStub(onnx_model, backend.cuda_runtime())
next(iter(model.inputs.values())).copyin_numpy(input_data)
model.run()
outputs = next(iter(model.outputs.values())).copyout_numpy()
outputs = torch.tensor(outputs)
print(outputs.shape)


@@ -0,0 +1,80 @@
import paddle
import paddle.vision.transforms as T
from paddle.vision.datasets import Cifar10
from pyinfinitensor.onnx import OnnxStub, backend
import onnx
import itertools
def run_cifar_train_and_infer():
paddle.device.set_device("gpu")
transform = T.Compose(
[
T.Resize(224),
T.ToTensor(),
T.Normalize(
mean=[0.5, 0.5, 0.5],
std=[0.5, 0.5, 0.5],
to_rgb=True,
),
]
)
# Download the dataset and initialize the DataSet
train_dataset = paddle.vision.datasets.Cifar10(mode='train', transform=transform)
test_dataset = paddle.vision.datasets.Cifar10(mode='test', transform=transform)
# Build and initialize the network
densenet = paddle.vision.models.DenseNet(num_classes=10)
model = paddle.Model(densenet)
# Prepare the training configuration: loss function, optimizer and metric
model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),
paddle.nn.CrossEntropyLoss(),
paddle.metric.Accuracy())
# Train the model
model.fit(train_dataset, epochs=5, batch_size=64, verbose=1)
# Evaluate the model
model.evaluate(test_dataset, batch_size=64, verbose=1)
# export to ONNX
save_path = 'onnx.save/densenet' # path to save the exported model
x_spec = paddle.static.InputSpec([1, 3, 224, 224], 'float32', 'x') # specify the input shape and dtype; a Tensor or an InputSpec is accepted, and InputSpec supports dynamic shapes
paddle.onnx.export(densenet, save_path, input_spec=[x_spec], opset_version=11)
# Load the ONNX model into InfiniTensor
model_path = save_path + ".onnx"
onnx_model = onnx.load(model_path)
gofusion_model = OnnxStub(onnx_model, backend.cuda_runtime())
model = gofusion_model
model.init()
# Run inference
cifar10_test = Cifar10(
mode="test",
transform=transform, # apply transform to every image
backend="cv2", # use OpenCV as image transform backend
)
batch_size = 1
total_size = 0
total_acc = 0.0
for data in itertools.islice(iter(cifar10_test), 10000):
images, labels = data
next(model.inputs.items().__iter__())[1].copyin_float(images.reshape([3*224*224]).tolist())
model.run()
outputs = next(model.outputs.items().__iter__())[1].copyout_float()
outputs = paddle.to_tensor(outputs)
outputs = paddle.reshape(outputs, (1, 10))
labels = paddle.to_tensor(labels)
labels = paddle.reshape(labels, (1,1))
acc = paddle.metric.accuracy(outputs, labels)
total_acc += acc
total_size += batch_size
print("test acc: {}".format(total_acc.numpy() / total_size))
if __name__ == "__main__":
run_cifar_train_and_infer()


@@ -0,0 +1,80 @@
import paddle
import paddle.vision.transforms as T
from paddle.vision.datasets import Cifar10
from pyinfinitensor.onnx import OnnxStub, backend
import onnx
import itertools
def run_cifar_train_and_infer():
paddle.device.set_device("gpu")
transform = T.Compose(
[
T.Resize(224),
T.ToTensor(),
T.Normalize(
mean=[0.5, 0.5, 0.5],
std=[0.5, 0.5, 0.5],
to_rgb=True,
),
]
)
# Download the dataset and initialize the DataSet
train_dataset = paddle.vision.datasets.Cifar10(mode='train', transform=transform)
test_dataset = paddle.vision.datasets.Cifar10(mode='test', transform=transform)
# Build and initialize the network
inception = paddle.vision.models.InceptionV3(num_classes=10)
model = paddle.Model(inception)
# Prepare the training configuration: loss function, optimizer and metric
model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),
paddle.nn.CrossEntropyLoss(),
paddle.metric.Accuracy())
# Train the model
model.fit(train_dataset, epochs=5, batch_size=64, verbose=1)
# Evaluate the model
model.evaluate(test_dataset, batch_size=64, verbose=1)
# export to ONNX
save_path = 'onnx.save/inception' # path to save the exported model
x_spec = paddle.static.InputSpec([1, 3, 224, 224], 'float32', 'x') # specify the input shape and dtype; a Tensor or an InputSpec is accepted, and InputSpec supports dynamic shapes
paddle.onnx.export(inception, save_path, input_spec=[x_spec], opset_version=11)
# Load the ONNX model into InfiniTensor
model_path = save_path + ".onnx"
onnx_model = onnx.load(model_path)
gofusion_model = OnnxStub(onnx_model, backend.cuda_runtime())
model = gofusion_model
model.init()
# Run inference
cifar10_test = Cifar10(
mode="test",
transform=transform, # apply transform to every image
backend="cv2", # use OpenCV as image transform backend
)
batch_size = 1
total_size = 0
total_acc = 0.0
for data in itertools.islice(iter(cifar10_test), 10000):
images, labels = data
next(model.inputs.items().__iter__())[1].copyin_float(images.reshape([3*224*224]).tolist())
model.run()
outputs = next(model.outputs.items().__iter__())[1].copyout_float()
outputs = paddle.to_tensor(outputs)
outputs = paddle.reshape(outputs, (1, 10))
labels = paddle.to_tensor(labels)
labels = paddle.reshape(labels, (1,1))
acc = paddle.metric.accuracy(outputs, labels)
total_acc += acc
total_size += batch_size
print("test acc: {}".format(total_acc.numpy() / total_size))
if __name__ == "__main__":
run_cifar_train_and_infer()


@@ -0,0 +1,31 @@
## Description
This is a doc on how to run the paddle*.py scripts on your machine. If you run on devices other than NVIDIA GPUs, you may need to make some changes.
## What do we do in paddle*.py files?
1. Train and evaluate the model on the Cifar10 dataset
2. Export the paddle model to an onnx model
3. Load the onnx model, run inference with InfiniTensor and calculate the inference accuracy
## Command
1. Go to `/examples/python` folder
2. Run the following command
1. ```
python paddle_resnet.py
python paddle_densenet.py
python paddle_inception.py
```
## What should I do if I use another device (MLU, XPU, NPU)?
You need to change this code:
```
paddle.device.set_device("gpu") # Change gpu to mlu, xpu or npu
```
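
The device string is the only thing that changes between backends. As a convenience, a small hypothetical helper (not part of the repo; the name `pick_device` and the `INFINI_DEVICE` environment variable are ours) could read the device name from the environment instead of editing each script:

```python
import os

def pick_device(default: str = "gpu") -> str:
    """Return the Paddle device name, taken from INFINI_DEVICE if set."""
    dev = os.environ.get("INFINI_DEVICE", default).lower()
    if dev not in {"gpu", "mlu", "xpu", "npu"}:
        raise ValueError(f"unsupported device: {dev}")
    return dev

# usage in the scripts: paddle.device.set_device(pick_device())
```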


@@ -0,0 +1,81 @@
import paddle
import paddle.vision.transforms as T
from paddle.vision.datasets import Cifar10
from pyinfinitensor.onnx import OnnxStub, backend
import onnx
import itertools
from paddle.vision.models.resnet import BasicBlock
def run_cifar_train_and_infer():
paddle.device.set_device("gpu")
transform = T.Compose(
[
T.Resize(224),
T.ToTensor(),
T.Normalize(
mean=[0.5, 0.5, 0.5],
std=[0.5, 0.5, 0.5],
to_rgb=True,
),
]
)
# Download the dataset and initialize the DataSet
train_dataset = paddle.vision.datasets.Cifar10(mode='train', transform=transform)
test_dataset = paddle.vision.datasets.Cifar10(mode='test', transform=transform)
# Build and initialize the network
resnet = paddle.vision.models.ResNet(BasicBlock, depth=18, num_classes=10)
model = paddle.Model(resnet)
# Prepare the training configuration: loss function, optimizer and metric
model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),
paddle.nn.CrossEntropyLoss(),
paddle.metric.Accuracy())
# Train the model
model.fit(train_dataset, epochs=5, batch_size=64, verbose=1)
# Evaluate the model
model.evaluate(test_dataset, batch_size=64, verbose=1)
# export to ONNX
save_path = 'onnx.save/resnet' # path to save the exported model
x_spec = paddle.static.InputSpec([1, 3, 224, 224], 'float32', 'x') # specify the input shape and dtype; a Tensor or an InputSpec is accepted, and InputSpec supports dynamic shapes
paddle.onnx.export(resnet, save_path, input_spec=[x_spec], opset_version=11)
# Load the ONNX model into InfiniTensor
model_path = save_path + ".onnx"
onnx_model = onnx.load(model_path)
gofusion_model = OnnxStub(onnx_model, backend.cuda_runtime())
model = gofusion_model
model.init()
# Run inference
cifar10_test = Cifar10(
mode="test",
transform=transform, # apply transform to every image
backend="cv2", # use OpenCV as image transform backend
)
batch_size = 1
total_size = 0
total_acc = 0.0
for data in itertools.islice(iter(cifar10_test), 10000):
images, labels = data
next(model.inputs.items().__iter__())[1].copyin_float(images.reshape([3*224*224]).tolist())
model.run()
outputs = next(model.outputs.items().__iter__())[1].copyout_float()
outputs = paddle.to_tensor(outputs)
outputs = paddle.reshape(outputs, (1, 10))
labels = paddle.to_tensor(labels)
labels = paddle.reshape(labels, (1,1))
acc = paddle.metric.accuracy(outputs, labels)
total_acc += acc
total_size += batch_size
print("test acc: {}".format(total_acc.numpy() / total_size))
if __name__ == "__main__":
run_cifar_train_and_infer()


@@ -0,0 +1,24 @@
import sys
import onnx
import torch
import numpy as np
from pyinfinitensor.onnx import OnnxStub, backend
import torchvision.models as models
if __name__ == '__main__':
model_path = './resnet18.onnx'
tv_model = models.resnet50(weights=None)
input_shape = (1, 3, 224, 224)
param = torch.rand(input_shape)
torch.onnx.export(tv_model, param, model_path, verbose=False)
onnx_model = onnx.load(model_path)
model = OnnxStub(onnx_model, backend.cuda_runtime())
images = np.random.random(input_shape).astype(np.float32)
next(iter(model.inputs.values())).copyin_numpy(images)
model.run()
outputs = next(iter(model.outputs.values())).copyout_numpy()
outputs = torch.tensor(outputs)
outputs = torch.reshape(outputs, (1, 1000))
_, predicted = torch.max(outputs, 1)
print(predicted)


@@ -2,6 +2,10 @@
#include "cnnl.h"
#include "cnrt.h"
#include "core/common.h"
#include "core/data_type.h"
#ifdef INFINI_USE_CNCL
#include "cncl.h"
#endif
#define checkBangError(call) \
{ \
@@ -27,4 +31,70 @@ namespace infini {
using BangPtr = void *;
inline cnnlDataType_t cnnlDataTypeConvert(DataType dataType) {
if (dataType == DataType::Float32) {
return CNNL_DTYPE_FLOAT;
}
if (dataType == DataType::Float16) {
return CNNL_DTYPE_HALF;
}
if (dataType == DataType::Double) {
return CNNL_DTYPE_DOUBLE;
}
if (dataType == DataType::Int8) {
return CNNL_DTYPE_INT8;
}
if (dataType == DataType::Int32) {
return CNNL_DTYPE_INT32;
}
if (dataType == DataType::UInt8) {
return CNNL_DTYPE_UINT8;
}
if (dataType == DataType::BFloat16) {
return CNNL_DTYPE_BFLOAT16;
}
if (dataType == DataType::Int64) {
return CNNL_DTYPE_INT64;
}
if (dataType == DataType::Bool) {
return CNNL_DTYPE_BOOL;
}
IT_TODO_HALT_MSG("Data type " + dataType.toString() +
" not supported in CNNL.");
}
#ifdef INFINI_USE_CNCL
inline cnclDataType_t cnclDataTypeConvert(DataType dataType) {
if (dataType == DataType::Float32) {
return cnclFloat32;
}
if (dataType == DataType::Float16) {
return cnclHalf;
}
if (dataType == DataType::Int8) {
return cnclInt8;
}
if (dataType == DataType::Int16) {
return cnclInt16;
}
if (dataType == DataType::Int32) {
return cnclInt32;
}
if (dataType == DataType::UInt8) {
return cnclUint8;
}
if (dataType == DataType::UInt16) {
return cnclUint16;
}
if (dataType == DataType::UInt32) {
return cnclUint32;
}
if (dataType == DataType::BFloat16) {
return cnclBfloat16;
}
IT_TODO_HALT_MSG("Data type " + dataType.toString() +
" not supported in CNCL.");
}
#endif
} // namespace infini


@@ -7,16 +7,19 @@ namespace infini {
class BangRuntimeObj : public RuntimeObj {
private:
cnnlHandle_t cnnl;
cnrtQueue_t queue;
std::unique_ptr<CommunicatorObj> comm;
BangPtr workspace;
size_t workspaceSize;
mutable size_t cursor;
public:
explicit BangRuntimeObj(int deviceId = 0)
: RuntimeObj(Device::BANG, deviceId) {
cnInit(0);
CNdev dev;
cnDeviceGet(&dev, deviceId);
checkBangError(cnrtSetDevice(dev));
checkBangError(cnrtQueueCreate(&queue));
checkCnnlError(cnnlCreate(&cnnl));
@@ -24,10 +27,12 @@ class BangRuntimeObj : public RuntimeObj {
// 10GB for Longformer
// size_t longformerNum = 3lu * (1 << 30);
workspaceSize = 7ll << 30; // 7 GB
cursor = 0;
workspace = alloc(workspaceSize);
}
virtual ~BangRuntimeObj() {
dealloc(workspace);
checkBangError(cnrtQueueDestroy(queue));
checkCnnlError(cnnlDestroy(cnnl));
}
string toString() const override;
@@ -45,10 +50,15 @@ class BangRuntimeObj : public RuntimeObj {
void dealloc(void *ptr) override { checkBangError(cnrtFree(ptr)); }
cnnlHandle_t cnnlHandle() const { return cnnl; }
BangPtr getWorkspace(size_t size) const {
IT_ASSERT((cursor + size) <= workspaceSize);
cursor += size;
void *temp = static_cast<char *>(workspace) + (cursor - size);
return temp;
}
void resetWorkspace() const { cursor = 0; }
void copyBlobFromCPU(void *dst, const void *src,
size_t bytes) const override {
checkBangError(cnrtMemcpy(dst, const_cast<void *>(src), bytes,
@@ -66,10 +76,9 @@ class BangRuntimeObj : public RuntimeObj {
checkBangError(cnrtMemcpy(dst, const_cast<void *>(src), bytes,
CNRT_MEM_TRANS_DIR_PEER2PEER));
}
void initComm(const string &name, int worldSize, int rank) final;
CommunicatorObj &getCommunicator() const override { return *comm; }
cnrtQueue_t getBangQueue() const { return queue; }
private:
void runWithoutSync(const Graph &graph, bool tune, bool profiling) const;
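The `getWorkspace` change above turns the single shared workspace into a simple bump allocator: each request advances a cursor and returns the offset where it starts, and `resetWorkspace` rewinds the cursor to reuse the whole buffer. A minimal sketch of the same idea (illustration only, in Python, tracking offsets rather than device pointers):

```python
class Workspace:
    """Bump allocator over one preallocated buffer (illustration only)."""

    def __init__(self, size: int):
        self.size = size
        self.cursor = 0

    def get(self, size: int) -> int:
        # Like getWorkspace: check there is room, advance the cursor,
        # and hand back the offset where this request begins.
        assert self.cursor + size <= self.size
        self.cursor += size
        return self.cursor - size

    def reset(self) -> None:
        # Like resetWorkspace: all previously handed-out sub-buffers
        # are invalidated at once.
        self.cursor = 0
```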


@@ -0,0 +1,79 @@
#pragma once
#include "bang_common.h"
#include "core/communicator.h"
#include <chrono>
#include <cncl.h>
#include <cnrt.h>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <mutex>
#include <thread>
namespace infini {
class CnclCommunicatorObj final : public CommunicatorObj {
private:
cnclComm_t *comms;
public:
CnclCommunicatorObj(const string &name, int worldSize, int rank)
: CommunicatorObj(worldSize, rank) {
const std::string filePath("./" + name + "_cncl_id.bin");
cnclCliqueId clique_id;
if (rank == 0) {
CNCL_CHECK(cnclGetCliqueId(&clique_id));
std::ofstream ofs(filePath, std::ios::binary);
ofs.write((char *)&clique_id, sizeof(cnclCliqueId));
} else {
auto begin = std::chrono::steady_clock::now();
while (!std::filesystem::exists(filePath)) {
auto now = std::chrono::steady_clock::now();
_IT_ASSERT_2(now < begin + std::chrono::seconds(10),
"time limit (10s) exceeded.");
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
std::ifstream ifs(filePath, std::ios::binary);
ifs.read((char *)&clique_id, sizeof(cnclCliqueId));
}
int num_comms = 1;
int *dev_list = new int[num_comms];
int *rank_list = new int[num_comms];
comms = new cnclComm_t[num_comms];
uint32_t num_dev = 0;
checkBangError(cnrtGetDeviceCount(&num_dev));
for (int i = 0; i < num_comms; i++) {
rank_list[i] = rank;
dev_list[i] = rank_list[i] % num_dev;
}
CNCL_CHECK(cnclInitComms(comms, num_comms, dev_list, rank_list,
worldSize, &clique_id));
if (rank == 0) {
std::filesystem::remove(filePath);
}
delete[] dev_list;
delete[] rank_list;
}
~CnclCommunicatorObj() {
CNCL_CHECK(cnclDestroyComms(comms, 1));
delete[] comms;
}
// Get the actual cnclComm_t
cnclComm_t getCnclComm() { return comms[0]; }
virtual string toString() const final {
std::ostringstream oss;
oss << "CNCL communicator";
return oss.str();
}
};
} // namespace infini
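
`CnclCommunicatorObj` bootstraps the clique id through the filesystem: rank 0 writes it to a file, and the other ranks poll for that file with a 10-second limit and a 100 ms sleep between checks, then read it back. The same handshake, sketched in Python for illustration (the function name `file_rendezvous` is ours, not the repo's; the real code also has rank 0 remove the file after initialization):

```python
import os
import time

def file_rendezvous(path: str, rank: int, payload: bytes = b"",
                    timeout: float = 10.0) -> bytes:
    """Rank 0 publishes payload at path; other ranks poll until it appears."""
    if rank == 0:
        with open(path, "wb") as f:
            f.write(payload)
        return payload
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            raise TimeoutError("time limit (10s) exceeded.")
        time.sleep(0.1)  # mirrors the 100 ms sleep in the C++ loop
    with open(path, "rb") as f:
        return f.read()
```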


@@ -61,21 +61,35 @@ template <typename T> auto enum_to_underlying(T e) {
}
template <typename T> std::string vecToString(const std::vector<T> &vec) {
std::stringstream ss;
ss << "[";
for (size_t i = 0; i < vec.size(); ++i) {
ss << vec.at(i);
if (i < vec.size() - 1) {
ss << ",";
}
}
ss << "]";
return ss.str();
}
template <typename T> std::string vecToString(const T *st, size_t length) {
std::stringstream ss;
ss << "[";
size_t i = 0;
for (i = 0; i < length; i++) {
ss << *(st + i);
if (i < length - 1) {
ss << ",";
}
}
ss << "]";
return ss.str();
}
double timeit(
const std::function<void()> &func,
const std::function<void(void)> &sync = []() {}, int warmupRounds = 10,
int timingRounds = 10);
} // namespace infini


@@ -53,6 +53,7 @@ class GraphObj : public Object {
const TensorVec &getTensors() const { return tensors; }
const OpVec &getOperators() const { return ops; }
OpVec getComputeOps() const;
Tensor getTensor(int) const;
/**
* Sort the nodes in topological order.
@@ -64,7 +65,13 @@ class GraphObj : public Object {
void optimize();
void shape_infer();
void dataMalloc(bool useNaiveAllocator = false, size_t memPoolSize = 0);
Tensor cloneKV(Tensor &tensor);
void freeHeap();
/**
* @brief Add an operator and create its outputs. Output tensor arguments


@ -5,6 +5,10 @@
#include <cstdint> #include <cstdint>
#include <iostream> #include <iostream>
#ifdef USE_CUDA
#include "cuda/cuda_runtime.h"
#endif
namespace infini { namespace infini {
class GraphHandlerObj { class GraphHandlerObj {
@ -26,15 +30,19 @@ class GraphHandlerObj {
                 int pw, int sh, int sw, int dh, int dw, int oph,
                 int opw);
     Tensor matmul(Tensor a, Tensor b, Tensor y, bool transA, bool transB,
-                  Tensor bias, ActType act);
+                  Tensor bias, ActType act,
+                  std::string matmul_compute_type = "default");
     Tensor batchNormalization(Tensor input, Tensor output, Tensor mean,
                               Tensor var, Tensor scale, Tensor bias,
                               float momentum, float eps, bool training);
+    Tensor layerNormalization(Tensor input, Tensor scale, Tensor output,
+                              Tensor bias, float eps, int axis, int stash_type);
+    Tensor rmsNorm(Tensor input, Tensor weight, Tensor output);
     Tensor maxPool(Tensor input, Tensor output, int kh, int kw, int dh, int dw,
-                   int ph, int pw, int sh, int sw);
+                   int ph, int pw, int sh, int sw, int ceilMode);
     Tensor avgPool(Tensor input, Tensor output, int kh, int kw, int dh, int dw,
-                   int ph, int pw, int sh, int sw);
+                   int ph, int pw, int sh, int sw, int ceilMode);
     Tensor add(Tensor a, Tensor b, Tensor c);
     Tensor sub(Tensor a, Tensor b, Tensor c);
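The new `matmul_compute_type` string presumably selects the accumulation precision the backend is asked to use (e.g. cuBLAS compute types). A minimal sketch of how such a string might be validated and mapped to an enum; the names and set of values here are illustrative assumptions, not the ones InfiniTensor actually accepts:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical mapping from the matmul_compute_type string to an enum that a
// backend could translate into its own compute-type constants.
enum class MatmulComputeType { Default, Float32, Float16, TF32 };

inline MatmulComputeType parseComputeType(const std::string &s) {
    if (s == "default")
        return MatmulComputeType::Default;
    if (s == "fp32")
        return MatmulComputeType::Float32;
    if (s == "fp16")
        return MatmulComputeType::Float16;
    if (s == "tf32")
        return MatmulComputeType::TF32;
    throw std::invalid_argument("unknown matmul compute type: " + s);
}
```

Defaulting the parameter to `"default"` keeps every existing `matmul` call site source-compatible.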
@ -45,12 +53,17 @@ class GraphHandlerObj {
     Tensor max(Tensor a, Tensor b, Tensor c);
     Tensor relu(Tensor x, Tensor y);
+    Tensor silu(Tensor x, Tensor y);
+    Tensor gelu(Tensor x, Tensor y);
     Tensor sigmoid(Tensor x, Tensor y);
+    Tensor hardSigmoid(Tensor x, Tensor y);
+    Tensor hardSwish(Tensor x, Tensor y);
     Tensor tanh(Tensor x, Tensor y);
     Tensor erf(Tensor x, Tensor y);
     Tensor softmax(Tensor x, Tensor y, int axis);
     Tensor abs(Tensor x, Tensor y);
     Tensor sqrt(Tensor x, Tensor y);
+    Tensor neg(Tensor x, Tensor y);
     Tensor shape(Tensor x, Tensor y);
     Tensor identity(Tensor x, Tensor y);
     Tensor flatten(Tensor s, Tensor y, int axis);
@ -59,12 +72,27 @@ class GraphHandlerObj {
                  std::optional<float> max);
     Tensor transpose(Tensor data, Tensor transposed, Shape perm);
     Tensor reshape(Tensor data, Tensor reshaped, Shape shape);
+    Tensor resize(Tensor input, Tensor output,
+                  const std::optional<vector<int>> &axes, Tensor sizes,
+                  Tensor scales, Tensor roi, vector<uint32_t> sizes_,
+                  vector<float> scales_, vector<float> roi_, string mode,
+                  string ratioPolicy, string nearestMode,
+                  string coordTransMode);
+    Tensor squeeze(Tensor input, Tensor output, Shape axes);
+    Tensor unsqueeze(Tensor input, Tensor output, Shape axes);
     Tensor concat(TensorVec inputs, Tensor output, int dim);
+    Tensor attentionKVCache(Tensor input_k_cache, Tensor input_v_cache,
+                            Tensor input_q, Tensor input_k, Tensor input_v,
+                            Tensor position_id, Tensor output_matmul);
+    Tensor RoPE(Tensor pos, Tensor input, Tensor output);
     TensorVec split(Tensor input, std::optional<TensorVec> outputs, int axis,
-                    int num_outputs);
+                    std::variant<int, vector<int>> numOrRatio);
     Tensor gather(Tensor data, Tensor indices, Tensor output, int axis);
+    Tensor gatherElements(Tensor data, Tensor indices, Tensor output, int axis);
     Tensor reduceMean(Tensor data, Tensor reduced,
                       const optional<vector<int>> &axes, bool keepdims);
+    Tensor reduceSum(Tensor data, Tensor reduced,
+                     const optional<vector<int>> &axes, bool keepdims);
     Tensor slice(Tensor input, Tensor output, const vector<int> &starts,
                  const vector<int> &ends, const optional<vector<int>> &axes,
                  const optional<vector<int>> &steps);
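`split` now takes `std::variant<int, vector<int>>` instead of a plain output count: either an even split count or a list of per-output ratios. A sketch (my own helper, not code from the repo) of how the per-output extents along the split axis could be derived from either form, assuming the dimension divides exactly:

```cpp
#include <cassert>
#include <variant>
#include <vector>

// Compute per-output extents along the split axis from either an even split
// count (int) or a list of ratios (vector<int>). Illustrative only.
inline std::vector<int>
splitSizes(int dim, std::variant<int, std::vector<int>> numOrRatio) {
    if (std::holds_alternative<int>(numOrRatio)) {
        int n = std::get<int>(numOrRatio);
        assert(n > 0 && dim % n == 0); // even split must divide the axis
        return std::vector<int>(n, dim / n);
    }
    const auto &ratio = std::get<std::vector<int>>(numOrRatio);
    int total = 0;
    for (int r : ratio)
        total += r;
    assert(total > 0 && dim % total == 0); // ratios must tile the axis
    std::vector<int> sizes;
    for (int r : ratio)
        sizes.push_back(dim / total * r);
    return sizes;
}
```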
@ -73,6 +101,7 @@ class GraphHandlerObj {
     Tensor cast(Tensor input, Tensor output, int to);
     Tensor expand(Tensor input, Tensor output, Shape dims);
     Tensor where(Tensor inputX, Tensor inputY, Tensor condition, Tensor output);
+    std::vector<int> getDims(Tensor x) { return x->getDims(); }
     Tensor allReduceSum(Tensor input, Tensor output);
     Tensor allReduceProd(Tensor input, Tensor output);
@ -81,6 +110,13 @@ class GraphHandlerObj {
     Tensor allReduceAvg(Tensor input, Tensor output);
     TensorVec allGather(Tensor input, std::optional<TensorVec> outputs, int n);
     Tensor broadcast(Tensor input, Tensor output, int root);
+    Tensor send(Tensor input, int source, int destination, Tensor output);
+    Tensor recv(Tensor output, int source, int destination, Shape dims,
+                int outputType, Tensor input);
+    Tensor depthToSpace(Tensor input, Tensor output, int blocksize,
+                        std::string mode);
+    Tensor lrn(Tensor input, Tensor output, float alpha, float beta, float bias,
+               int size);
     //------ modifiers
@ -88,15 +124,31 @@ class GraphHandlerObj {
     inline void optimize() { g->optimize(); }
+    inline void shape_infer() { g->shape_infer(); }
+    void change_shape(const vector<int> &shape, int tensorId);
     //------ runtime
-    inline void data_malloc() { g->dataMalloc(); }
+    inline void data_malloc(bool useNaiveAllocator = false,
+                            size_t memPoolSize = 0) {
+        g->dataMalloc(useNaiveAllocator, memPoolSize);
+    }
+    inline Tensor clone_KV(Tensor &tensor) { return g->cloneKV(tensor); }
+    inline void free_heap() { g->freeHeap(); }
     inline void tune() { g->getRuntime()->run(g, true); }
     inline void run() { g->getRuntime()->run(g); }
     inline double get_perf_time() { return g->getRuntime()->getPerfTime(g); }
+#ifdef USE_CUDA
+    inline void run_with_cudagraph() {
+        (as<CudaRuntimeObj>(g->getRuntime()))->runWithCudaGraph(g);
+    }
+#endif
 };
 } // namespace infini


@ -2,10 +2,11 @@
 #include "core/common.h"
 #include "core/operator.h"
 #include "core/tensor.h"
+#include "utils/operator_utils.h"
 #include <functional>
 #include <nlohmann/json.hpp>
-using json = nlohmann::json;
 namespace infini {
+using json = nlohmann::json;
 class RuntimeObj; // Forward declaration for Kernel::compute
@ -29,7 +30,6 @@ class Kernel {
   public:
     Kernel() {}
     virtual ~Kernel() {}
     /**
      * @param op The operator to be executed.
      * @param record The parameters for kernel execution. If extra parameters
@ -102,11 +102,9 @@ class KernelRegistry {
     }
     Kernel *getKernel(const KernelAttrs &kernelAttrs) const {
         auto it = kernels.find(kernelAttrs);
-        IT_ASSERT(it != kernels.end(),
-                  "Kernel not found for key {" +
-                      to_string(enum_to_underlying(std::get<0>(kernelAttrs))) +
-                      ", " + std::to_string(std::get<1>(kernelAttrs)) + ", " +
-                      std::get<2>(kernelAttrs).toString() + "}");
+        IT_ASSERT(it != kernels.end(), "Kernel not found for key {" +
+                                           get_kernel_attrs_str(kernelAttrs) +
+                                           "}");
         return std::get<0>(it->second);
     }
     const KernelRecord &getKernelItem(const KernelAttrs &kernelAttrs) const {
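The registry key shrinks here from (device, opType, dataType) to (device, opType): one registration now covers all data types, and dtype dispatch moves inside the kernel. A toy registry illustrating just that two-field lookup, with made-up names and key values:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <tuple>

// Toy version of the two-field kernel key. Data-type dispatch is no longer
// part of the key; it happens inside the registered kernel itself.
using ToyKernelAttrs = std::tuple<int /*device*/, int /*opType*/>;

struct ToyRegistry {
    std::map<ToyKernelAttrs, std::function<double(double)>> kernels;
    void registerKernel(ToyKernelAttrs k, std::function<double(double)> f) {
        kernels[k] = std::move(f);
    }
    std::function<double(double)> getKernel(ToyKernelAttrs k) const {
        auto it = kernels.find(k);
        assert(it != kernels.end() && "Kernel not found");
        return it->second;
    }
};
```

One practical consequence: a kernel registered once for an op can no longer be shadowed per-dtype, so the kernel body must handle (or reject) every dtype it may receive.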
@ -131,15 +129,16 @@ class CpuKernelWithoutConfig : public Kernel {
 } // namespace infini

-#define _REGISTER_KERNEL_1(device, opType, dataType, kernel, name, cnt)        \
+#define _REGISTER_KERNEL_1(device, opType, kernel, name, cnt)                  \
     namespace infini {                                                         \
     static const bool _CAT(_register_kernel_, cnt) =                           \
-        KernelRegistry::getInstance().registerKernel(                          \
-            KernelAttrs{device, opType, dataType}, new kernel(), name);        \
+        KernelRegistry::getInstance().registerKernel(KernelAttrs{device,       \
+                                                                 opType},     \
+                                                     new kernel(), name);     \
     }

-#define REGISTER_KERNEL(device, opType, dataType, kernel, name)                \
-    _REGISTER_KERNEL_1(device, opType, dataType, kernel, name, __COUNTER__)
+#define REGISTER_KERNEL(device, opType, kernel, name)                          \
+    _REGISTER_KERNEL_1(device, opType, kernel, name, __COUNTER__)

 #define _REGISTER_CONSTRUCTOR_1(type, constructor, cnt)                        \
     namespace infini {                                                         \


@ -26,14 +26,23 @@ class LazyAllocator {
     size_t weightPeak = 0;
+    size_t heapPeak = 0;
     size_t alignment;
+    bool hasMemPool = false;
+    size_t memPoolSize = 0;
     // pointer to the memory actually allocated
     void *ptr = nullptr;
     // pointer to the weight memory space
     void *weightPtr = nullptr;
+    // memory pool ptr
+    void *memPoolPtr = nullptr;
     // // a cache designed for a batch size that has already occurred
     // std::unordered_map<size_t, std::unordered_map<TensorObj *, size_t>>
     //     batchsizeToTensorOffset;
@ -68,6 +77,10 @@ class LazyAllocator {
     void init();
+    void setMemPool(size_t memPoolSize);
+    bool getMemPoolStatus();
     // function: simulate memory allocation
     // arguments
     //     size: size of memory block to be allocated
@ -76,6 +89,10 @@ class LazyAllocator {
     size_t allocWeight(size_t size);
+    size_t heapAlloc(size_t size);
+    void freeHeap();
     // function: simulate memory free
     // arguments:
     //     addr: head address offset of memory block to be free
@ -92,6 +109,8 @@ class LazyAllocator {
     void *getWeightPtr();
+    void *getHeapPtr();
     void info();

   private:
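The new `heapAlloc`/`freeHeap` pair and `heapPeak` counter suggest a separate heap-style region that is bump-allocated and reclaimed wholesale. A toy model of that pattern under my own assumptions (aligned bump allocation, one-shot free); the real LazyAllocator also simulates per-block frees and reuse:

```cpp
#include <cassert>
#include <cstddef>

// Toy heap-region allocator: sequential aligned offsets, a running peak, and
// a single freeHeap() that drops everything at once.
struct ToyHeap {
    size_t alignment = 64;
    size_t used = 0;
    size_t heapPeak = 0;

    size_t heapAlloc(size_t size) {
        // round the request up to the alignment boundary
        size = (size + alignment - 1) / alignment * alignment;
        size_t offset = used;
        used += size;
        if (used > heapPeak)
            heapPeak = used;
        return offset; // offset into the heap region, not a raw pointer
    }
    void freeHeap() { used = 0; } // reclaim everything, keep the peak
};
```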


@ -21,10 +21,11 @@ struct OpType {
     Add,                // Binary
     And,                // Binary
     ArgMax,             //
-    Asin,               // Binary
-    Asinh,              // Binary
-    Atan,               // Binary
-    Atanh,              // Binary
+    Asin,               // Unary
+    Asinh,              // Unary
+    Atan,               // Unary
+    Atanh,              // Unary
+    AttentionKVCache,   // Fusion
     AveragePool,        // Pool
     BatchNormalization, //
     Bernoulli,          //
@ -73,6 +74,7 @@ struct OpType {
     GatherElements,
     GatherND,
     Gemm,
+    Gelu,              // Unary
     GlobalAveragePool, // GlobalPool
     GlobalLpPool,      // GlobalPool
     GlobalMaxPool,     // GlobalPool
@ -149,11 +151,14 @@ struct OpType {
     ReduceSum,       // Reduce
     ReduceSumSquare, // Reduce
     Relu,            // Unary
+    Silu,            // Unary
     Reshape,
     Resize,
     ReverseSequence,
     RoiAlign,
-    Round, // Unary
+    RoPE,    // Fusion
+    Round,   // Unary
+    RMSNorm, // Fusion
     STFT,
     Scan,
     Scatter,
@ -230,6 +235,8 @@ struct OpType {
     AllReduceAvg,
     AllGather,
     Broadcast,
+    Send,
+    Recv,
     } type;

     constexpr OpType(decltype(type) t) : type(t) {}


@ -4,7 +4,7 @@
 #include "core/tensor.h"
 namespace infini {

-using KernelAttrs = std::tuple<Device, OpType::underlying_t, DataType>;
+using KernelAttrs = std::tuple<Device, OpType::underlying_t>;

 struct OpPerfKey {
     HashType hash;
@ -55,8 +55,7 @@ class OperatorObj : public Object {
   public:
     OperatorObj(OpType opType, TensorVec inputs, TensorVec outputs);
-    virtual optional<vector<Shape>>
-    inferShape(const TensorVec &inputs) const = 0;
+    virtual optional<vector<Shape>> inferShape(const TensorVec &inputs) = 0;
     virtual vector<DataType> inferDataType(const TensorVec &inputs) const;
     /**
      * @brief Constructs outputs (if requried) and check whether the operator is
@ -91,6 +90,7 @@ class OperatorObj : public Object {
     OpType getOpType() const { return type; }
     // HACK: set correct data type
     DataType getDType() const { return getInputs(0)->getDType(); }
+    DataType getOutDType() const { return getOutput()->getDType(); }
     virtual int numInputs() const = 0;
     virtual int numOutputs() const = 0;
@ -105,7 +105,7 @@ class OperatorObj : public Object {
                          const TensorVec &newOutputs) const = 0;

   protected:
-    optional<vector<Shape>> inferShape() const;
+    optional<vector<Shape>> inferShape();
     vector<DataType> inferDataType() const;

   private:


@ -2,8 +2,8 @@
 #include "core/graph.h"
 #include "core/kernel.h"
 #include <nlohmann/json_fwd.hpp>
-using json = nlohmann::json;
 namespace infini {
+using json = nlohmann::json;
 class PerfEngine {
   public:

@ -15,6 +15,7 @@ class GraphObj;
 class GraphHandlerObj;
 class RuntimeObj;
 class BlobObj;
+template <typename T> class WorkspaceObj;

 using TensorBase = Ref<TensorBaseObj>;
 using Tensor = Ref<TensorObj>;
@ -23,6 +24,7 @@ using Graph = Ref<GraphObj>;
 using GraphHandler = Ref<GraphHandlerObj>;
 using Runtime = Ref<RuntimeObj>;
 using Blob = Ref<BlobObj>;
+template <typename T> using Workspace = Ref<WorkspaceObj<T>>;

 using TensorVec = vector<Tensor>;
 using OpVec = vector<Operator>;
@ -30,7 +32,7 @@ using OpLists = list<Operator>;
 using VType = uint32_t;

-enum class Device { CPU = 1, CUDA, BANG, INTELCPU };
+enum class Device { CPU = 1, CUDA, BANG, INTELCPU, KUNLUN };
 /***************** Forward declaration end *****************/

 class RuntimeObj : public std::enable_shared_from_this<RuntimeObj> {
@ -72,6 +74,7 @@ class RuntimeObj : public std::enable_shared_from_this<RuntimeObj> {
     }
     bool isCuda() const { return device == Device::CUDA; }
     bool isBang() const { return device == Device::BANG; }
+    bool isKUNLUN() const { return device == Device::KUNLUN; }
     void copyBlob(const TensorObj *dst, const TensorObj *src) const;
     // TODO: unify these copy APIs
     virtual void copyBlobFromCPU(void *dst, const void *src,


@ -4,11 +4,14 @@
 #include "utils/data_convert.h"
 #include <cmath>
 #include <cstring>
+#include <fstream>

 #if USE_CUDA
 #include "cuda/cuda_runtime.h"
 #endif
+#if USE_BANG
+#include "bang/bang_runtime.h"
+#endif

 namespace infini {

 // TODO: how to deal with this
@ -31,6 +34,7 @@ class TensorObj : public TensorBaseObj {
     size_t getBytes() const { return _size * dtype.getSize(); }

     Shape getDims() const { return shape; }
+    void setShape(Shape shape_);
     size_t getRank() const { return shape.size(); }
     Shape getStride() const;
     size_t getOffset(const vector<int> &ds) const;
@ -41,8 +45,16 @@ class TensorObj : public TensorBaseObj {
     bool isOutput() const { return tensorType == TensorType::output; }
     bool isOthers() const { return tensorType == TensorType::others; }
     void setWeight() { tensorType = TensorType::weight; }
-    void setInput() { tensorType = TensorType::input; }
-    void setOutput() { tensorType = TensorType::output; }
+    void setInput() {
+        if (!this->isWeight()) {
+            tensorType = TensorType::input;
+        }
+    }
+    void setOutput() {
+        if (!this->isWeight()) {
+            tensorType = TensorType::output;
+        }
+    }
     string tensorTypeToString() const {
         switch (tensorType) {
         case TensorType::weight:
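The guarded setters make `weight` sticky: once a tensor is marked as a weight, later `setInput`/`setOutput` calls no longer reclassify it, so weight tensors keep their placement in the dedicated weight memory region. A standalone model of just that rule:

```cpp
#include <cassert>

// Standalone model of the sticky-weight rule in TensorObj: setWeight wins
// over any later setInput/setOutput call.
enum class TensorType { weight, input, output, others };

struct ToyTensor {
    TensorType tensorType = TensorType::others;
    bool isWeight() const { return tensorType == TensorType::weight; }
    void setWeight() { tensorType = TensorType::weight; }
    void setInput() {
        if (!isWeight())
            tensorType = TensorType::input;
    }
    void setOutput() {
        if (!isWeight())
            tensorType = TensorType::output;
    }
};
```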
@ -132,6 +144,7 @@ class TensorObj : public TensorBaseObj {
     }

     void printData() const;
+    void dumpData(std::ofstream &ofs) const;
     bool equalData(const Tensor &rhs, double relativeError = 1e-6) const;

     template <typename T> bool equalData(const vector<T> &dataVector) {
@ -180,19 +193,27 @@ class TensorObj : public TensorBaseObj {
     }

     template <typename T>
-    bool equalDataImpl(const T *a, const T *b, size_t size) const {
+    bool equalDataImpl(const T *a, const T *b, size_t size,
+                       double relativeError = 1e-6) const {
         for (size_t i = 0; i < size; ++i) {
             if constexpr (std::is_integral_v<T>) {
                 if (a[i] != b[i])
                     return false;
             } else if constexpr (std::is_floating_point_v<T>) {
-                if (fabs(a[i] - b[i]) / std::max(fabs(a[i]), fabs(b[i])) >
-                    1e-6) {
+                if (std::min(fabs(a[i]), fabs(b[i])) == 0. &&
+                    fabs(a[i] - b[i]) > relativeError) {
+                    printf("Error on %lu: %f %f\n", i, a[i], b[i]);
+                    return false;
+                } else if (std::min(fabs(a[i]), fabs(b[i])) != 0. &&
+                           fabs(a[i] - b[i]) /
+                                   std::max(fabs(a[i]), fabs(b[i])) >
+                               relativeError) {
                     printf("Error on %lu: %f %f\n", i, a[i], b[i]);
                     return false;
                 }
-            } else
+            } else {
                 static_assert(!sizeof(T), "Unsupported data type");
+            }
         }
         return true;
     }
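The rewritten comparison fixes a divide-by-zero: when either value is exactly zero, relative error is undefined, so it falls back to an absolute-error check; otherwise it keeps the relative-error check, now with a configurable tolerance. A standalone version of the floating-point branch:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Standalone version of the floating-point branch of equalDataImpl: absolute
// error when either value is exactly zero (relative error would divide by
// zero), relative error otherwise.
inline bool almostEqual(double a, double b, double relativeError = 1e-6) {
    if (std::min(std::fabs(a), std::fabs(b)) == 0.)
        return std::fabs(a - b) <= relativeError;
    return std::fabs(a - b) / std::max(std::fabs(a), std::fabs(b)) <=
           relativeError;
}
```

Under the old formula, comparing `0.0` against `0.0` evaluated `0/0`, and any tiny nonzero value next to an exact zero produced a relative error of 1.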
@ -227,8 +248,8 @@ class TensorObj : public TensorBaseObj {
     // // std::cerr << "Init beginned " << std::endl;
     // #pragma omp parallel for
     // for (size_t i = 0; i < iEnd; ++i)
-    //     data[i] = fastrand(random_seed[omp_get_thread_num() * 16]) %
-    //     10000;
+    //     data[i] = fastrand(random_seed[omp_get_thread_num() *
+    //     16]) % 10000;
     // // std::cerr << "Init finished" << std::endl;
     // computed = ComputedFull;
     // return true;
@ -273,8 +294,8 @@ class TensorObj : public TensorBaseObj {
     // auto nDim = dims.size();
     // auto nBroadcastDim = ds.size() - nDim;
     // for (size_t i = 0; i < nDim; ++i)
-    //     if (ds[nBroadcastDim + i] < 0 || ds[nBroadcastDim + i] >=
-    //     dims[i])
+    //     if (ds[nBroadcastDim + i] < 0 || ds[nBroadcastDim +
+    //     i] >= dims[i])
     //         return (size_t)-1;
     // size_t idx = 0;
     // for (size_t i = 0; i < nDim; ++i)
@ -333,12 +354,14 @@ class TensorObj : public TensorBaseObj {
     //     return (g_seed >> 16) & 0x7FFF;
     // }

-    // std::vector<std::vector<int>> const *getSplittingPoints() const {
+    // std::vector<std::vector<int>> const *getSplittingPoints()
+    // const {
     //     assert(!splittingPoints.empty());
     //     return &splittingPoints;
     // }

-    // bool setSplittingPoints(std::vector<std::vector<int>> value) {
+    // bool setSplittingPoints(std::vector<std::vector<int>> value)
+    // {
     //     assert(!value.empty());
     //     splittingPoints = value;
     //     return true;

include/core/workspace.h (new file, 42 lines)

@ -0,0 +1,42 @@
#pragma once
#include "core/runtime.h"
namespace infini {
template <class T> class WorkspaceObj {
  private:
    T workspace;           // workspace pointer
    size_t workspaceSize;  // size of the workspace
    size_t workspaceAlloc; // currently used workspace size

  public:
    WorkspaceObj(T workspace_, size_t workspaceSize_)
        : workspace(workspace_), workspaceSize(workspaceSize_) {
        workspaceAlloc = 0;
    }
    virtual ~WorkspaceObj() {
        // The workspace memory is deallocated by RuntimeObj;
        // only drop the pointer here.
        workspace = nullptr;
    }
    size_t getWorkspaceSize() const { return workspaceSize; }
    T getWorkspace(size_t size) {
        // Get an unused slice of the workspace
        IT_ASSERT(size + workspaceAlloc <= workspaceSize);
        auto ret = (T)(static_cast<uint8_t *>(workspace) + workspaceAlloc);
        workspaceAlloc += size;
        return ret;
    }
    T getWorkspace() {
        // Overload without a size, so the runtime can deallocate the whole
        // workspace
        return workspace;
    }
    void resetWorkspace() {
        // Reset workspaceAlloc at the end of every kernel
        workspaceAlloc = 0;
    }
    size_t getWorkspaceAlloc() const { return workspaceAlloc; }
};
} // namespace infini
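`WorkspaceObj` is a bump allocator over one pre-allocated device buffer: `getWorkspace(size)` hands out consecutive sub-ranges and `resetWorkspace()` rewinds after each kernel, so scratch buffers within a kernel never overlap yet the memory is reused across kernels. A sketch of the same arithmetic over host memory (my own `ToyWorkspace`, not the repo's class):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-memory sketch of WorkspaceObj's bump allocation: sub-buffers are
// carved off sequentially and the whole space is reclaimed with one reset.
class ToyWorkspace {
    std::vector<uint8_t> buf;
    size_t alloc = 0; // currently used bytes, mirrors workspaceAlloc
  public:
    explicit ToyWorkspace(size_t size) : buf(size) {}
    void *get(size_t size) {
        assert(alloc + size <= buf.size()); // out of workspace
        void *ret = buf.data() + alloc;
        alloc += size;
        return ret;
    }
    void reset() { alloc = 0; } // called after each kernel finishes
    size_t used() const { return alloc; }
};
```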


@ -0,0 +1,17 @@
#pragma once
#include "core/common.h"
#include <cstdio>
struct AttentionKVCacheMetadata {
    int dimSize[4];
    int stride[4];
};
namespace infini {
void attention_kvcache_kernel(float *input_k_cache, float *input_v_cache,
float *input_q, float *input_k, float *input_v,
int *position_id, float *output_matmul,
const AttentionKVCacheMetadata &compMeta,
float *output_O_temp, float *output_sum_temp);
} // namespace infini


@ -5,6 +5,7 @@
 #include <cuda_profiler_api.h>
 #include <cudnn.h>
 #include <curand.h>
+#include <memory>

 #define checkCudaError(call)                                                   \
     if (auto err = call; err != cudaSuccess)                                   \
@ -111,4 +112,20 @@ inline const char *curandGetErrorString(curandStatus_t error) {
 using CudaPtr = void *;

+class CUDAStream {
+  public:
+    CUDAStream(const CUDAStream &) = delete;
+    CUDAStream(CUDAStream &&) = delete;
+    void operator=(const CUDAStream &) = delete;
+    void operator=(CUDAStream &&) = delete;
+    static cudaStream_t getCurrentStream() { return _stream; }
+    static void Init() { CUDAStream::_stream = 0; }
+    static void createStream() { checkCudaError(cudaStreamCreate(&_stream)); }
+    static void destroyStream() { checkCudaError(cudaStreamDestroy(_stream)); }
+
+  private:
+    CUDAStream() {}
+    static cudaStream_t _stream;
+};
 } // namespace infini


@ -1,8 +1,20 @@
 #pragma once

 namespace infini {
-void div_kernel(float *a, float *b, float *c, int a0, int a1, int a2, int a3,
-                int b0, int b1, int b2, int b3, int c0, int c1, int c2, int c3);
-void pow_kernel(float *a, float *b, float *c, int a0, int a1, int a2, int a3,
-                int b0, int b1, int b2, int b3, int c0, int c1, int c2, int c3);
+void div_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
+                int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
+                int c2, int c3);
+void add_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
+                int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
+                int c2, int c3);
+void pow_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
+                int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
+                int c2, int c3);
+void less_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
+                 int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
+                 int c2, int c3);
+void div_const_kernel(int dType, void *a, void *b, void *c, size_t n);
+void pow_const_kernel(int dType, void *a, void *b, void *c, size_t n);
 }; // namespace infini
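The element-wise entry points now take `void *` data plus a `dtypeIndex`, which goes hand in hand with the untyped kernel registry above: the launcher presumably switches on the index and casts to the concrete element type. A host-side sketch of that dispatch pattern; the index values and `div_impl` helper here are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side sketch of dtype-index dispatch: the untyped entry point switches
// on dtypeIndex and forwards to a typed implementation.
template <typename T> void div_impl(const T *a, const T *b, T *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] / b[i];
}

inline void div_dispatch(int dtypeIndex, const void *a, const void *b, void *c,
                         size_t n) {
    switch (dtypeIndex) {
    case 1: // float32 (illustrative index value)
        div_impl(static_cast<const float *>(a), static_cast<const float *>(b),
                 static_cast<float *>(c), n);
        break;
    case 7: // int32 (illustrative index value)
        div_impl(static_cast<const int32_t *>(a),
                 static_cast<const int32_t *>(b), static_cast<int32_t *>(c), n);
        break;
    default:
        assert(false && "unsupported dtype index");
    }
}
```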


@ -3,7 +3,10 @@
 #include "operators/unary.h"
 #include "utils/small_array.h"
 namespace infini {
-void expandKernel(float *input, float *output, int nDims, int outputsize,
-                  SmallArray inputShape, SmallArray outputShape);
+void expandKernel(int dType, void *input, void *output, int nDims,
+                  int outputsize, SmallArray inputShape,
+                  SmallArray outputShape);
+void expandRowKernel(int dType, void *input, void *output, int n_rows,
+                     int row_len);
 }; // namespace infini


@ -0,0 +1,17 @@
#pragma once
#include "operators/unary.h"
namespace infini {
void LaynormKernel(const float *input, const float *scale, const float eps,
int size, int scaleSize, const int dimsize, const int stride,
float *output, const float *bias, int biasSize);
void LaynormKernel(const float *input, const float *scale, const float eps,
int size, int scaleSize, const int dimsize, const int stride,
float *output);
void LaynormKernel(const half *input, const half *scale, const half eps,
int size, int scaleSize, const int dimsize, const int stride,
half *output, const half *bias, int biasSize);
void LaynormKernel(const half *input, const half *scale, const half eps,
int size, int scaleSize, const int dimsize, const int stride,
half *output);
}; // namespace infini


@ -10,10 +10,11 @@ typedef struct {
     int wholeNDim[MAX_DIM];  // dim size after padding or before slicing
     int partNDim[MAX_DIM];   // dim size before padding or after slicing
     int partStride[MAX_DIM]; // stride before padding or after slicing
+    int DType;
 } TransMetaData;

 namespace infini {
-void pad_slice_kernel(float *partData, float *wholeData,
+void pad_slice_kernel(void *partData, void *wholeData,
                       const TransMetaData &metadata, int nDims, int num,
                       bool isPad);
 } // namespace infini


@ -0,0 +1,10 @@
#pragma once
#include "operators/rms_norm.h"
namespace infini {
void rmsnorm_kernel(int dType, void *input, void *weight, void *output,
int num_tokens, int hidden_size);
}; // namespace infini

include/cuda/cuda_rope.h (new file, 12 lines)

@ -0,0 +1,12 @@
#pragma once
#include "operators/rope.h"
#include "utils/small_array.h"
namespace infini {
void rope_kernel(int dType, int *pos, void *input, void *output, int size,
int dim_model, int dim_head, int hidden_stride,
int pos_stride);
}; // namespace infini


@ -14,6 +14,9 @@ class CudaRuntimeObj : public RuntimeObj {
     std::unique_ptr<CommunicatorObj> comm;
     CudaPtr workspace;
     size_t workspaceSize;
+    bool isCudaGraphCreated;
+    cudaGraph_t cudaGraph;
+    cudaGraphExec_t cudaGraphInstance;

   public:
     explicit CudaRuntimeObj(int deviceId = 0)
@ -26,9 +29,16 @@ class CudaRuntimeObj : public RuntimeObj {
         // size_t longformerNum = 3lu * (1 << 30);
         workspaceSize = 7ll << 30; // 7 GB
         workspace = alloc(workspaceSize);
+        isCudaGraphCreated = false;
+        CUDAStream::Init();
     }

     virtual ~CudaRuntimeObj() {
         try {
+            if (isCudaGraphCreated) {
+                checkCudaError(cudaGraphExecDestroy(cudaGraphInstance));
+                checkCudaError(cudaGraphDestroy(cudaGraph));
+                CUDAStream::destroyStream();
+            }
             dealloc(workspace);
             checkCudnnError(cudnnDestroy(cudnn));
             checkCublasError(cublasDestroy(cublas));
@ -75,6 +85,8 @@ class CudaRuntimeObj : public RuntimeObj {
     void runWithoutSync(const Graph &graph) const;
+    void runWithCudaGraph(const Graph &graph);

     // init communicator
     void initComm(const string &name, int worldSize, int rank) final;


@ -0,0 +1,8 @@
#pragma once
#include "utils/small_array.h"
namespace infini {
void softmax_kernel(int num_blocks, float *input, float *output, int size,
int dimsize, int stride);
void softmax_kernel(int num_blocks, half *input, half *output, int size,
int dimsize, int stride);
} // namespace infini


@ -3,13 +3,13 @@
 #include <cstdio>

 const int BATCH_SIZE = 32; // parallel tensor number.
-const int DIM_MAX_SIZE = 4;
+const int DIM_MAX_SIZE = 8;

 // Concat operator acts like element tensors composing to one big tensor, and
 // split operator acts like one big tensor being composed by element
 // tensors.
-struct ElementTensorMetadata {
-    float *data[BATCH_SIZE];
+template <typename T> struct ElementTensorMetadata {
+    T *data[BATCH_SIZE];
     int dimBgNo[BATCH_SIZE]; // the dimension begin no of the element tensor in
                              // the composed tensor.
     int dimSize[BATCH_SIZE]; // the dimension size of the element tensor.
@ -20,16 +20,17 @@ struct ElementTensorMetadata {
                data[i], dimBgNo[i], dimSize[i], nElements[i]);
     }
 };

-struct ComposedTensorMetadata {
+template <typename T> struct ComposedTensorMetadata {
     int dimSize[DIM_MAX_SIZE];
     int stride[DIM_MAX_SIZE];
-    float *data;
+    T *data;
 };

 namespace infini {
-void split_concat_kernel(const ElementTensorMetadata &eleMeta,
-                         const ComposedTensorMetadata &compMeta, int dim,
+void split_concat_kernel(const ElementTensorMetadata<float> &eleMeta,
+                         const ComposedTensorMetadata<float> &compMeta, int dim,
+                         int batchSize, int nDims, bool isSplit);
+void split_concat_kernel(const ElementTensorMetadata<half> &eleMeta,
+                         const ComposedTensorMetadata<half> &compMeta, int dim,
                          int batchSize, int nDims, bool isSplit);
 } // namespace infini
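Templating the metadata structs on the element type lets the same layout and kernel body serve both `float` and the device-side `half` via the two overloads above. A minimal host-side illustration of the templated composed-tensor metadata (my own `ToyComposedMeta` and `readElement`, using `float` since `half` only exists on the device):

```cpp
#include <cassert>

// Minimal illustration of the templated metadata: dims/strides describe the
// composed tensor, and the data pointer now carries the element type.
template <typename T> struct ToyComposedMeta {
    int dimSize[8];
    int stride[8];
    T *data;
};

// Read one element of a 2-D composed tensor through its strides.
template <typename T>
T readElement(const ToyComposedMeta<T> &meta, int row, int col) {
    return meta.data[row * meta.stride[0] + col * meta.stride[1]];
}
```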


@ -5,7 +5,7 @@
 namespace infini {
-void transpose_kernel(float *input, float *output, int nDims, int size,
+void transpose_kernel(int dType, void *input, void *output, int nDims, int size,
                       SmallArray strides, SmallArray outputShape);
 }; // namespace infini


@@ -3,34 +3,22 @@
 #include "operators/unary.h"
 namespace infini {
-// TODO(constroy): num should be size_t.
-void softmax_kernel(float *input, float *output, int num);
-void relu_kernel(float *input, float *output, int num);
-void sigmoid_kernel(float *input, float *output, int num);
-void tanh_kernel(float *input, float *output, int num);
-void abs_kernel(float *input, float *output, int num);
-void sqrt_kernel(float *input, float *output, int num);
-
-void unary_kernel(const Operator &_op) {
-    auto op = as<UnaryObj>(_op);
-    float *const inputData = (op->getInputs(0)->getRawDataPtr<float *>());
-    float *const outputData = (op->getOutput()->getRawDataPtr<float *>());
-
-    size_t num = op->getOutput()->size();
-    if (op->getOpType() == OpType::Softmax)
-        softmax_kernel(inputData, outputData, num);
-    else if (op->getOpType() == OpType::Relu)
-        relu_kernel(inputData, outputData, num);
-    else if (op->getOpType() == OpType::Sigmoid)
-        sigmoid_kernel(inputData, outputData, num);
-    else if (op->getOpType() == OpType::Tanh)
-        tanh_kernel(inputData, outputData, num);
-    else if (op->getOpType() == OpType::Abs)
-        abs_kernel(inputData, outputData, num);
-    else if (op->getOpType() == OpType::Sqrt)
-        sqrt_kernel(inputData, outputData, num);
-    else
-        IT_TODO_HALT();
-}
+template <typename T> void softmax_kernel(T *input, T *output, size_t num);
+template <typename T> void relu_kernel(T *input, T *output, size_t num);
+template <typename T> void silu_kernel(T *input, T *output, size_t num);
+template <typename T> void sigmoid_kernel(T *input, T *output, size_t num);
+template <typename T> void tanh_kernel(T *input, T *output, size_t num);
+template <typename T> void abs_kernel(T *input, T *output, size_t num);
+template <typename T> void sqrt_kernel(T *input, T *output, size_t num);
+template <typename T> void neg_kernel(T *input, T *output, size_t num);
+template <typename T> void gelu_kernel(T *input, T *output, size_t num);
+template <typename T> void erf_kernel(T *input, T *output, size_t num);
+template <typename T> void hard_sigmoid_kernel(T *input, T *output, size_t num);
+template <typename T> void hard_swish_kernel(T *input, T *output, size_t num);
+
+template <typename INPUT, typename OUTPUT>
+void cast_kernel(INPUT *input, OUTPUT *output, size_t num);
+
+void unary_kernel(const Operator &_op);
 }; // namespace infini


@@ -1,11 +1,29 @@
+#pragma once
 #include "core/tensor.h"
+#include "cuda/cuda_common.h"
 namespace infini {
 void cudaPrintFloat(float *x, int len);
-void cudaPrintTensor(const Tensor &tensor) {
-    cudaPrintFloat(tensor->getRawDataPtr<float *>(), tensor->size());
-}
+void cudaPrintTensor(const Tensor &tensor);
+
+cudnnDataType_t cudnnDataTypeConvert(DataType dataType);
+cudaDataType cublasDataTypeConvert(DataType);
+
+template <int index> struct DT_CUDA {};
+template <> struct DT_CUDA<0> { using t = bool; };
+template <> struct DT_CUDA<1> { using t = float; };
+template <> struct DT_CUDA<2> { using t = unsigned char; };
+template <> struct DT_CUDA<3> { using t = char; };
+template <> struct DT_CUDA<4> { using t = unsigned short; };
+template <> struct DT_CUDA<5> { using t = short; };
+template <> struct DT_CUDA<6> { using t = int; };
+template <> struct DT_CUDA<7> { using t = long long; };
+template <> struct DT_CUDA<9> { using t = bool; };
+template <> struct DT_CUDA<10> { using t = half; };
+template <> struct DT_CUDA<11> { using t = double; };
+template <> struct DT_CUDA<12> { using t = unsigned int; };
+template <> struct DT_CUDA<13> { using t = unsigned long long; };
+template <> struct DT_CUDA<16> { using t = nv_bfloat16; };
 } // namespace infini


@@ -3,9 +3,15 @@
 #include "utils/small_array.h"
 namespace infini {
 void whereKernel(const float *inputX, const float *inputY,
                  const uint8_t *condition, float *output, int nDims,
-                 SmallArray inputXShape, SmallArray inputYShape,
-                 SmallArray conditionShape, SmallArray outputShape);
+                 int outputsize, SmallArray inputXShape, SmallArray inputYShape,
+                 SmallArray conditionShape, SmallArray outputShape, int xSize,
+                 int ySize, int cSize);
+void whereKernel(const half *inputX, const half *inputY,
+                 const uint8_t *condition, half *output, int nDims,
+                 int outputsize, SmallArray inputXShape, SmallArray inputYShape,
+                 SmallArray conditionShape, SmallArray outputShape, int xSize,
+                 int ySize, int cSize);
 }; // namespace infini


@@ -1,19 +1,61 @@
 #pragma once
 #include "core/data_type.h"
+#include "core/operator.h"
+#include "operators/gather.h"
 namespace infini {
 struct GatherMetaData {
+    // Pointer to indices
     void *indexValue;
+    // Type of index values
     DataType indexType;
+    // Type of input and output data
+    DataType dataType;
+    // Axis of the gather operation
     int axis;
+    // Rank of input
     int inNDim;
+    // Rank of output
     int outNDim;
+    // Rank of indices
     int idxNDim;
+    // Shape of output
     int outDim[4];
+    // Shape of indices
     int idxDim[4];
+    // Strides of indices
     int idxStride[4];
+    // Strides of input
     int inStride[4];
 };
-void gather_kernel(float *in, float *out, GatherMetaData metaData, size_t num);
+
+inline void initGatherMetaData(GatherMetaData &metaData,
+                               const Ref<OperatorObj> &_op) {
+    memset(&metaData, 0, sizeof(metaData));
+    auto op = as<GatherBaseObj>(_op);
+    Ref<TensorObj> in = op->getInputs(0);
+    Ref<TensorObj> index = op->getInputs(1);
+    Ref<TensorObj> out = op->getOutput();
+    metaData.indexValue = index->getRawDataPtr<void *>();
+    metaData.indexType = index->getDType();
+    metaData.dataType = in->getDType();
+    metaData.axis = op->getAxis();
+    metaData.inNDim = in->getRank();
+    metaData.outNDim = out->getRank();
+    metaData.idxNDim = index->getRank();
+    for (int i = 0; i < metaData.outNDim; ++i)
+        metaData.outDim[i] = out->getDims()[i];
+    for (int i = 0; i < metaData.idxNDim; ++i) {
+        metaData.idxDim[i] = index->getDims()[i];
+        metaData.idxStride[i] = index->getStride()[i];
+    }
+    for (int i = 0; i < metaData.inNDim; ++i) {
+        metaData.inStride[i] = in->getStride()[i];
+    }
+}
+
+template <typename T>
+void gather_kernel(T *in, T *out, GatherMetaData metaData, size_t num);
+
+void gather_elements_kernel(void *in, void *out, GatherMetaData metaData,
+                            size_t num);
 } // namespace infini


@@ -1,6 +0,0 @@
-#pragma once
-namespace infini {
-void softmax_kernel(int max_threadblock_size, int batch_size, float *x,
-                    float *y, int dim, int stride);
-}


@@ -0,0 +1,23 @@
#include "core/op_type.h"
#include "kunlun/kunlun_common.h"

namespace infini {
using KunlunActType = xdnn::Activation_t;
KunlunActType parseActType(ActType act) {
    switch (act) {
    case ActType::None:
        return KunlunActType::LINEAR;
    case ActType::Tanh:
        return KunlunActType::TANH;
    case ActType::Sigmoid:
        return KunlunActType::SIGMOID;
    case ActType::Relu:
        return KunlunActType::RELU6;
    default:
        fprintf(stderr, "Activation Type not support yet!\n");
        break;
    }
    return KunlunActType::LINEAR;
}
}; // namespace infini


@@ -0,0 +1,22 @@
#pragma once
#include "core/common.h"
#include "xpu/runtime_ex.h"
#include "xpu/xdnn.h"

namespace xdnn = baidu::xpu::api;

#define checkKUNLUNError(call)                                                 \
    {                                                                          \
        auto err = call;                                                       \
        if (XPU_SUCCESS != err) {                                              \
            fprintf(stderr, "KUNLUN error in %s:%i : %s.\n", __FILE__,         \
                    __LINE__, xpu_strerror(err));                              \
            exit(EXIT_FAILURE);                                                \
        }                                                                      \
    }

namespace infini {
using KUNLUNPtr = void *;
} // namespace infini


@@ -0,0 +1,24 @@
#pragma once
#include "core/kernel.h"
#include "kunlun/kunlun_runtime.h"

namespace infini {

class KUNLUNKernelWithoutConfig : public Kernel {
  public:
    virtual void compute(const Operator &op, const PerfRecord &record,
                         const RuntimeObj *context) const {
        compute(op, context);
    }
    virtual void compute(const Operator &op,
                         const RuntimeObj *context) const = 0;
    // Premise: op is idempotent since it is called multiple times.
    virtual PerfRecord tune(const Operator &op,
                            const RuntimeObj *_context) const {
        auto context = dynamic_cast<const KUNLUNRuntimeObj *>(_context);
        return make_ref<PerfRecordObj>(timeit([&]() { compute(op, _context); },
                                              [&]() { context->sync(); }));
    }
};
} // namespace infini


@@ -0,0 +1,81 @@
#pragma once
#include "core/runtime.h"
#include "core/workspace.h"
#include "kunlun/kunlun_common.h"
#ifdef INFINI_USE_XCCL
#include "kunlun/xccl_communicator.h"
#endif

namespace infini {

class KUNLUNRuntimeObj : public RuntimeObj {
  private:
    xdnn::Context *ctx;
    std::unique_ptr<CommunicatorObj> comm;
    // KUNLUNPtr workspace;
    // size_t workspaceSize;
    Workspace<KUNLUNPtr> workspace;

  public:
    KUNLUNRuntimeObj(int deviceId = 0) : RuntimeObj(Device::KUNLUN) {
        xpu_set_device(deviceId);
        ctx = xdnn::create_context();
        // 10GB for Longformer
        // size_t longformerNum = 3lu * (1 << 30);
        size_t workspaceSize = 3llu << 30; // 3 GB
        KUNLUNPtr wkspacePtr = alloc(workspaceSize);
        workspace =
            make_ref<WorkspaceObj<KUNLUNPtr>>(wkspacePtr, workspaceSize);
    }
    virtual ~KUNLUNRuntimeObj() {
        KUNLUNPtr wkspacePtr = workspace->getWorkspace();
        dealloc(wkspacePtr);
        xdnn::destroy_context(ctx);
    }
    string toString() const override;
    void run(const Graph &graph, bool tune = false,
             bool profiling = false) const;
    // double runEvaluation(const Graph &graph, int nWarmups,
    //                      int nEvaluations) const;
    void sync() const;
    KUNLUNPtr alloc(size_t size) override {
        void *ptr;
        checkKUNLUNError(
            xpu_malloc_ex((void **)&ptr, size, XPUMemoryKind::XPU_MEM_MAIN));
        return ptr;
    }
    void dealloc(void *ptr) override { xpu_free(ptr); }
    xdnn::Context *KUNLUNHandle() const { return ctx; }
    // Get $size workspace by bytes
    KUNLUNPtr getWorkspace(size_t size) const {
        auto ret = workspace->getWorkspace(size);
        return ret;
    }
    Workspace<KUNLUNPtr> getWorkspaceObj() const { return workspace; }
    void copyBlobFromCPU(void *dst, const void *src,
                         size_t bytes) const override {
        xpu_memcpy(dst, const_cast<void *>(src), bytes,
                   XPUMemcpyKind::XPU_HOST_TO_DEVICE);
    }
    void copyBlobToCPU(void *dst, const void *src,
                       size_t bytes) const override {
        xpu_memcpy(dst, const_cast<void *>(src), bytes,
                   XPUMemcpyKind::XPU_DEVICE_TO_HOST);
    }
    void copyBlobInsideRuntime(void *dst, const void *src,
                               size_t bytes) const override {
        xpu_memcpy(dst, const_cast<void *>(src), bytes,
                   XPUMemcpyKind::XPU_DEVICE_TO_DEVICE);
    }
    void initComm(const string &name, int worldSize, int rank) final;
    CommunicatorObj &getCommunicator() const final { return *comm; }

  private:
    void runWithoutSync(const Graph &graph, bool tune, bool profiling) const;
};
} // namespace infini


@@ -0,0 +1,10 @@
#pragma once

namespace infini {
namespace opTimer {
double getPerfConvXdnn(int n, int c, int h, int w, int f, int r, int s,
                       int padh, int padw, int strideh, int stridew,
                       int dilationh, int dilationw, int group,
                       const char *name);
double getPerfMatmulXdnn(int b, int m, int n, int k, const char *name);
} // namespace opTimer
} // namespace infini


@@ -0,0 +1,60 @@
#pragma once
#include "core/communicator.h"
#include "xpu/bkcl.h"
#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

#define checkXcclError(call)                                                   \
    {                                                                          \
        auto err = call;                                                       \
        if (BKCL_SUCCESS != err) {                                             \
            fprintf(stderr, "XCCL error in %s:%i.\n", __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                                \
        }                                                                      \
    }

namespace infini {

class XcclCommunicatorObj final : public CommunicatorObj {
  private:
    BKCLContext_t comm;

  public:
    XcclCommunicatorObj(const string &name, int worldSize, int rank)
        : CommunicatorObj(worldSize, rank) {
        const std::string filePath("./" + name + "_xccl_id.bin");
        BKCLUniqueId commId;
        if (rank == 0) {
            checkXcclError(bkcl_get_unique_id(&commId));
            std::ofstream ofs(filePath, std::ios::binary);
            ofs.write((char *)&commId, sizeof(BKCLUniqueId));
        } else {
            auto begin = std::chrono::steady_clock::now();
            while (!std::filesystem::exists(filePath)) {
                auto now = std::chrono::steady_clock::now();
                _IT_ASSERT_2(now < begin + std::chrono::seconds(10),
                             "time limit (10s) exceeded.");
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
            std::ifstream ifs(filePath, std::ios::binary);
            ifs.read((char *)&commId, sizeof(BKCLUniqueId));
        }
        checkXcclError(bkcl_init_rank(&comm, rank, worldSize, &commId));
        if (rank == 0) {
            std::filesystem::remove(filePath);
        }
    }
    BKCLContext_t getXcclComm() { return comm; }
    ~XcclCommunicatorObj() final { checkXcclError(bkcl_destroy_context(comm)); }

    virtual string toString() const final {
        std::ostringstream oss;
        oss << "XCCL communicator";
        return oss.str();
    }
};
} // namespace infini


@@ -24,7 +24,7 @@
 // clang-format on
 namespace nnet {
-int matchExprResult(Derivator &derivator, string fn);
-bool checkExprLogSame(string fnPrefix, int start, int end);
+int matchExprResult(Derivator &derivator, string pathRelativeToProjectHome);
+bool checkExprLogSame(string pathRelativeToProjectHome, int start, int end);
 bool checkExprsEquvivalence(VecExpr exprs);
 } // namespace nnet


@@ -35,7 +35,7 @@ class G2BMMObj : public OperatorObj {
     OP_CLONE(G2BMMObj);

     std::string toString() const override;
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;

     int numInputs() const override { return 2; }
     int numOutputs() const override { return 1; }


@@ -33,7 +33,7 @@ class GBMMObj : public OperatorObj {
     OP_CLONE(GBMMObj);

     std::string toString() const override;
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;

     int numInputs() const override { return 2; }
     int numOutputs() const override { return 1; }


@@ -7,7 +7,7 @@ class ActivationBackwardObj : public OperatorObj {
     ActivationBackwardObj(OpType type, GraphObj *graph, Tensor y, Tensor diff_y,
                           Tensor x, Tensor diff_x);
     OP_CLONE(ActivationBackwardObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return 3; }


@@ -27,7 +27,7 @@ class AllGatherObj : public OperatorObj {
     int numInputs() const override { return 1; }
     int numOutputs() const override { return world_size; }

-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;

     std::string toString() const override;


@@ -33,7 +33,7 @@ class AllReduceBaseObj : public OperatorObj {
     int numInputs() const override { return 1; }
     int numOutputs() const override { return 1; }

-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override {
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override {
         return {{inputs[0]->getDims()}};
     };


@@ -0,0 +1,43 @@
#pragma once
#include "core/operator.h"

namespace infini {
/**
 * @brief Fused Attention with KVCache input operator. All the input and output
 * tensors should have the same rank except for the position_id.
 *
 */
class AttentionKVCacheObj : public OperatorObj {
    int dim;

  public:
    /**
     * @brief Construct a new AttentionKVCache object.
     *
     * @param graph The computation graph that this operator belongs to.
     * @param input_k_cache The k_cache input tensor.
     * @param input_v_cache The v_cache input tensor.
     * @param input_q The query input tensor.
     * @param input_k The key input tensor.
     * @param input_v The value input tensor.
     * @param position_id The positon id of the query,
     * @param output_matmul The query output tensor.
     */
    AttentionKVCacheObj(GraphObj *graph, Tensor input_k_cache,
                        Tensor input_v_cache, Tensor input_q, Tensor input_k,
                        Tensor input_v, Tensor position_id,
                        Tensor output_matmul);
    OP_CLONE(AttentionKVCacheObj);

    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;

    std::string toString() const override;
    int numInputs() const override { return 6; }
    int numOutputs() const override { return 1; }
    int getDim() const { return dim; }

  private:
    vector<int> getWorkloadVector() const override;
    vector<int> getOpAttrVector() const override;
};
} // namespace infini


@@ -34,7 +34,7 @@ class BatchNormObj : public OperatorObj {
                  Tensor var, Tensor scale, Tensor bias, float momentum = 0.9,
                  float eps = 1e-5, bool trainingMode = false);
     OP_CLONE(BatchNormObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;

     // output size will be 3 when training


@@ -26,7 +26,7 @@ class BroadcastObj : public OperatorObj {
     int numInputs() const override { return 1; }
     int numOutputs() const override { return 1; }

-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override {
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override {
         return {{inputs[0]->getDims()}};
     };


@@ -22,7 +22,7 @@ class ConcatObj : public OperatorObj {
     ConcatObj(GraphObj *graph, TensorVec inputs, Tensor output, int dim);
     OP_CLONE(ConcatObj);

-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return inputs.size(); }


@@ -142,7 +142,7 @@ class ConvObj : public ConvBaseObj {
              ActType act = ActType::None);
     OP_CLONE(ConvObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     int getNumGroups() const override { return c / getChannelPerGroup(); }

   private:
@@ -164,7 +164,7 @@ class ConvBackwardFilterObj : public ConvBaseObj {
                           int sh = 1, int sw = 1, int dh = 1, int dw = 1,
                           Tensor bias = nullptr, ActType act = ActType::None);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     ActType getAct() const { return act; }
     int getNumGroups() const override { return c / getChannelPerGroup(); }
@@ -191,7 +191,7 @@ class ConvTransposed2dObj : public ConvBaseObj {
                         Tensor bias = nullptr, ActType act = ActType::None);
     OP_CLONE(ConvTransposed2dObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     int getNumGroups() const override { return group; }
     std::pair<int, int> getOutputPadding() const { return {oph, opw}; }
@@ -218,7 +218,7 @@ class ConvTransposed2dNHWCObj : public ConvBaseObj {
                             Tensor bias = nullptr, ActType act = ActType::None);
     OP_CLONE(ConvTransposed2dNHWCObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     int getNumGroups() const override { return group; }

   private:


@@ -7,7 +7,7 @@ class DetObj : public OperatorObj {
     enum Mode { NormalDet = 0, LogDet };
     DetObj(GraphObj *graph, Tensor input, Tensor output, Mode mode);
     OP_CLONE(DetObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return 1; }


@@ -37,7 +37,7 @@ class DropoutObj : public OperatorObj {
     DropoutObj(GraphObj *graph, Tensor data, Tensor output, Tensor mask,
                float ratio, bool training_mode);
     OP_CLONE(DropoutObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return 1; }


@@ -21,7 +21,7 @@ class ElementWiseObj : public OperatorObj {
      */
     ElementWiseObj(OpType type, GraphObj *graph, Tensor input0, Tensor input1,
                    Tensor output);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return 2; }
@@ -38,7 +38,7 @@ class MSELossObj : public OperatorObj {
     MSELossObj(GraphObj *graph, Tensor input0, Tensor input1,
                Reduction reduction, Tensor output);
     OP_CLONE(MSELossObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     Reduction getReduction() const { return reductionMode; }
     std::string toString() const override;


@@ -21,7 +21,7 @@ class ExpandObj : public OperatorObj {
      */
     ExpandObj(GraphObj *graph, Tensor input, Tensor output, Shape dims);
     OP_CLONE(ExpandObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return 1; }


@@ -23,7 +23,7 @@ class ExtendObj : public OperatorObj {
     ExtendObj(GraphObj *graph, Tensor input, Tensor output, int dim,
               int num = 1);
     OP_CLONE(ExtendObj);
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     std::string toString() const override;
     int numInputs() const override { return 1; }


@@ -3,14 +3,28 @@
 #include "core/operator.h"
 namespace infini {
+class GatherBaseObj : public OperatorObj {
+  protected:
+    int axis;
+
+  public:
+    GatherBaseObj(OpType opType, TensorVec inputs, TensorVec outputs, int axis)
+        : OperatorObj(opType, inputs, outputs), axis(axis) {}
+    virtual ~GatherBaseObj() {}
+    int numInputs() const override { return 2; }
+    int numOutputs() const override { return 1; }
+    int getAxis() const { return axis; }
+};
+
 /**
  * @brief Gather and concatenate given positions on a certain dimension of the
  * input tensor using an index tensor.
  *
  */
-class GatherObj : public OperatorObj {
-    int axis;
+class GatherObj : public GatherBaseObj {
   public:
     /**
      * @brief Construct a new Gather object.
@@ -25,10 +39,7 @@ class GatherObj : public OperatorObj {
               int axis);
     OP_CLONE(GatherObj);
     std::string toString() const override;
-    int numInputs() const override { return 2; }
-    int numOutputs() const override { return 1; }
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
-    int getAxis() const { return axis; }
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     vector<DataType> inferDataType(const TensorVec &inputs) const override;

   private:
@@ -36,4 +47,33 @@ class GatherObj : public OperatorObj {
     vector<int> getWorkloadVector() const override;
     vector<int> getOpAttrVector() const override;
 };
+
+/**
+ * @brief GatherElements takes two inputs data and indices of the
+ * same rank r >= 1 and an optional attribute axis that identifies
+ * an axis of data.
+ *
+ */
+class GatherElementsObj : public GatherBaseObj {
+  public:
+    /**
+     * @brief Construct a new GatherElements object.
+     *
+     * @param graph The computation graph that this operator belongs to.
+     * @param input The input tensor.
+     * @param indices The index tensor.
+     * @param output The output tensor. Same shape as indices.
+     * @param axis The axis to gather on.
+     */
+    GatherElementsObj(GraphObj *graph, Tensor input, Tensor indices,
+                      Tensor output, int axis);
+    OP_CLONE(GatherElementsObj);
+    std::string toString() const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
+    vector<DataType> inferDataType(const TensorVec &inputs) const override;
+
+  private:
+    vector<int> getWorkloadVector() const override;
+    vector<int> getOpAttrVector() const override;
+};
 } // namespace infini


@@ -0,0 +1,30 @@
#pragma once
#include "core/operator.h"

namespace infini {
class LayerNormObj : public OperatorObj {
    float eps;
    int axis, stash_type;

  public:
    LayerNormObj(GraphObj *graph, Tensor input, Tensor scale, Tensor output,
                 Tensor bias = nullptr, float eps = 1e-5, int axis = -1,
                 int stash_type = 1);
    OP_CLONE(LayerNormObj);
    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;

    Tensor getBias() const { return inputs.size() > 2 ? inputs[2] : nullptr; }
    int numInputs() const override { return inputs.size(); }
    int numOutputs() const override { return outputs.size(); }
    float getEps() const { return eps; }
    int getAxis() const { return axis; }
    int getStashType() const { return stash_type; }

  private:
    vector<int> getWorkloadVector() const override;
    vector<int> getOpAttrVector() const override;
    vector<DataType> inferDataType(const TensorVec &inputs) const override;
};
} // namespace infini

include/operators/lrn.h (new file)

@@ -0,0 +1,29 @@
#pragma once
#include "core/operator.h"

namespace infini {
class LRNObj : public OperatorObj {
  public:
    LRNObj(GraphObj *graph, Tensor inputX, Tensor inputY, float alpha,
           float beta, float bias, int size);
    OP_CLONE(LRNObj);

    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return inputs.size(); }
    int numOutputs() const override { return 1; }
    auto getAlphaBetaBias() const {
        return tuple(alpha_value, beta_value, bias_value);
    }
    auto getSize() const { return size_value; }

  private:
    float alpha_value, beta_value, bias_value;
    int size_value;
    vector<int> getWorkloadVector() const override;
    vector<int> getOpAttrVector() const override;
};
} // namespace infini


@@ -17,6 +17,9 @@ class MatmulObj : public OperatorObj {
     // Auxiliary attributes which are not a part of operator attributes.
     int b, m, n, k;

+    // Specifies the data precision for the matrix multiply.
+    std::string computeType = "default";
+
   public:
     /**
      * @brief Matmul operator with batch broadcast and tensor transpose
@@ -38,14 +41,15 @@ class MatmulObj : public OperatorObj {
      * @param transB If matrix B should be transposed when computing.
      * @param bias The bias tensor.
      * @param act The activation function.
+     * @param computeType Specifies the data precision for the matrix multiply.
      */
     MatmulObj(GraphObj *graph, Tensor A, Tensor B, Tensor C,
               bool transA = false, bool transB = false, Tensor bias = nullptr,
-              ActType act = ActType::None);
+              ActType act = ActType::None, std::string computeType = "default");
     OP_CLONE(MatmulObj);

     std::string toString() const override;
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;

     int numInputs() const override { return inputs.size(); }
     int numOutputs() const override { return 1; }
@@ -60,6 +64,7 @@ class MatmulObj : public OperatorObj {
     int getN() const { return n; }
     int getK() const { return k; }
     auto getBMNK() const { return tuple{b, m, n, k}; }
+    std::string getComputeType() const { return computeType; }

   private:
     vector<int> getWorkloadVector() const override;


@@ -21,7 +21,7 @@ class MemBoundObj : public OperatorObj {
     OP_CLONE(MemBoundObj);

     std::string toString() const override;
-    optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+    optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
     int numInputs() const override { return inputs.size(); }
     int numOutputs() const override { return outputs.size(); }


@@ -27,7 +27,7 @@ class PadObj : public OperatorObj {
              const vector<int> &pads, const optional<vector<int>> &axes);
    OP_CLONE(PadObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
    int numOutputs() const override { return 1; }


@@ -12,6 +12,7 @@ class PoolingObj : public OperatorObj {
    int dh, dw;
    int ph, pw;
    int sh, sw;
+   int ceilMode;
    int n, c, h, w;

  public:
@@ -32,12 +33,15 @@ class PoolingObj : public OperatorObj {
     * @param pw Padding at the width dimension.
     * @param sh Stride at the height dimension.
     * @param sw Stride at the width dimension.
+    * @param ceilMode Whether to use ceil(1) or floor(0) to compute the output
+    * shape.
     */
    PoolingObj(GraphObj *graph, OpType optype, Tensor input, Tensor output,
-              int kh, int kw, int dh, int dw, int ph, int pw, int sh, int sw);
+              int kh, int kw, int dh, int dw, int ph, int pw, int sh, int sw,
+              int ceilMode);
    OP_CLONE(PoolingObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
    int numOutputs() const override { return 1; }
@@ -50,6 +54,7 @@ class PoolingObj : public OperatorObj {
    int getPw() const { return pw; }
    int getSh() const { return sh; }
    int getSw() const { return sw; }
+   int getCeilMode() const { return ceilMode; }

    auto getPadStrideDilation() const { return tuple(ph, pw, sh, sw, dh, dw); }
    auto getNCHWRS() const { return tuple(n, c, h, w, kh, kw); }
@@ -62,15 +67,15 @@ class PoolingObj : public OperatorObj {
 class MaxPoolObj : public PoolingObj {
   public:
    MaxPoolObj(GraphObj *graph, Tensor input, Tensor output, int kh, int kw,
-              int dh, int dw, int ph, int pw, int sh, int sw)
+              int dh, int dw, int ph, int pw, int sh, int sw, int ceilMode)
        : PoolingObj(graph, OpType::MaxPool, input, output, kh, kw, dh, dw, ph,
-                    pw, sh, sw) {}
+                    pw, sh, sw, ceilMode) {}
 };
 class AvgPoolObj : public PoolingObj {
   public:
    AvgPoolObj(GraphObj *graph, Tensor input, Tensor output, int kh, int kw,
-              int dh, int dw, int ph, int pw, int sh, int sw)
+              int dh, int dw, int ph, int pw, int sh, int sw, int ceilMode)
        : PoolingObj(graph, OpType::AveragePool, input, output, kh, kw, dh, dw,
-                    ph, pw, sh, sw) {}
+                    ph, pw, sh, sw, ceilMode) {}
 };
 }; // namespace infini

include/operators/recv.h (new file, 46 lines)

@@ -0,0 +1,46 @@
#pragma once
#include "core/operator.h"
namespace infini {
/**
*
* https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2193/user-guide/docs/index.html
*/
class RecvObj : public OperatorObj {
public:
/**
* @brief Construct a new SendRecv object
*
* @param graph The computation graph that this operator belongs to.
* @param input default nullptr, because recv does not have input.
* @param output recv output
* @param source the send rank
* @param destination the recv rank
* @param dims The shape of the output tensor.
*/
RecvObj(GraphObj *graph, Tensor output, int source, int destination,
Shape dims, int outputType, Tensor input = nullptr);
OP_CLONE(RecvObj);
int numInputs() const override { return inputs.size(); }
int numOutputs() const override { return 1; }
optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
std::string toString() const override;
DataType getDType() const;
int getSourceRank() const { return source; }
int getDestinationRank() const { return destination; }
inline Shape getShape() const { return dims; }
private:
vector<int> getWorkloadVector() const override;
vector<int> getOpAttrVector() const override;
vector<DataType> inferDataType(const TensorVec &inputs) const override;
protected:
int source;
int destination;
Shape dims;
int outputType;
};
} // namespace infini


@@ -3,27 +3,30 @@
 namespace infini {
 /**
- * @brief Compute the mean of input tensor's elements along certain axes.
+ * @brief Compute the reduction of input tensor's elements along certain axes.
  *
  */
-class ReduceMeanObj : public OperatorObj {
+class ReduceBaseObj : public OperatorObj {
+  protected:
    set<int> axes; // axis to reduce
    bool keepDims;

  public:
    /**
-    * @brief Construct a new ReduceMean object.
+    * @brief Construct a new Reduce object.
     *
     * @param graph The computation graph that this operator belongs to.
+    * @param opType The operation type. Should be a Reduce operation.
     * @param input The input tensor.
     * @param output The output tensor.
     * @param axes Axes to reduce.
     * @param keepDims Keep the reduced dimensions or not.
     */
-   ReduceMeanObj(GraphObj *graph, Tensor input, Tensor output,
-                 const optional<vector<int>> &axes, bool keepDims = true);
-   OP_CLONE(ReduceMeanObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   ReduceBaseObj(GraphObj *graph, OpType opType, Tensor input, Tensor output,
+                 const optional<vector<int>> &axes, bool keepDims);
+   virtual ~ReduceBaseObj() {}
+   OP_CLONE(ReduceBaseObj);
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
@@ -38,4 +41,15 @@ class ReduceMeanObj : public OperatorObj {
    vector<int> getOpAttrVector() const override;
 };

+class ReduceMeanObj : public ReduceBaseObj {
+  public:
+   ReduceMeanObj(GraphObj *graph, Tensor input, Tensor output,
+                 const optional<vector<int>> &axes, bool keepDims = true);
+};
+
+class ReduceSumObj : public ReduceBaseObj {
+  public:
+   ReduceSumObj(GraphObj *graph, Tensor input, Tensor output,
+                const optional<vector<int>> &axes, bool keepDims = true);
+};
 } // namespace infini


@@ -9,6 +9,7 @@ namespace infini {
  */
 class ReshapeObj : public OperatorObj {
    Shape dims;
+   Shape outputShape;

  public:
    /**
@@ -17,18 +18,20 @@ class ReshapeObj : public OperatorObj {
     * @param graph The computation graph that this operator belongs to.
     * @param input The input tensor.
     * @param output The output tensor.
-    * @param dims The shape of the output tensor.
+    * @param dims The shape to infer the output shape.
+    * @param outputShape The real shape of output tensor.
     */
    ReshapeObj(GraphObj *graph, Tensor input, Tensor output, Shape dims);
    OP_CLONE(ReshapeObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
    int numOutputs() const override { return 1; }
-   inline Shape getShape() const { return dims; }
+   inline Shape getShape() const { return outputShape; }
+   inline Shape getDims() const { return dims; }

  private:
    vector<int> getWorkloadVector() const override;
@@ -55,7 +58,7 @@ class FlattenObj : public OperatorObj {
    FlattenObj(GraphObj *graph, Tensor input, Tensor output, int axis);
    OP_CLONE(FlattenObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
@@ -85,7 +88,7 @@ class IdentityObj : public OperatorObj {
    IdentityObj(GraphObj *graph, Tensor input, Tensor output);
    OP_CLONE(IdentityObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }


@@ -27,6 +27,60 @@ class ResizeObj : public OperatorObj {
enum class EKeepAspectRatioPolicy { stretch, notLarger, notSmaller, none }; enum class EKeepAspectRatioPolicy { stretch, notLarger, notSmaller, none };
enum class ECoeffMode { nearest, linear, cubic }; enum class ECoeffMode { nearest, linear, cubic };
static ECoordinateTransMode fromECoordinateTransModeStr(string mode) {
if (mode == "half_pixel") {
return ECoordinateTransMode::halfPixel;
} else if (mode == "asymmetric") {
return ECoordinateTransMode::asymmetric;
} else if (mode == "align_corners") {
return ECoordinateTransMode::alignCorners;
} else if (mode == "pytorch_half_pixel") {
return ECoordinateTransMode::pytorchHalfPixel;
} else if (mode == "tf_crop_and_resize") {
return ECoordinateTransMode::tfCropAndResize;
} else {
IT_TODO_HALT();
}
}
static ENearestMode fromENearestModeStr(string mode) {
if (mode == "round_prefer_floor") {
return ENearestMode::roundPreferFloor;
} else if (mode == "round_prefer_ceil") {
return ENearestMode::roundPreferCeil;
} else if (mode == "floor") {
return ENearestMode::floor;
} else if (mode == "ceil") {
return ENearestMode::ceil;
} else {
return ENearestMode::none;
}
}
static EKeepAspectRatioPolicy fromRatioPolicyStr(string ratioPolicyStr) {
if (ratioPolicyStr == "stretch") {
return EKeepAspectRatioPolicy::stretch;
} else if (ratioPolicyStr == "not_larger") {
return EKeepAspectRatioPolicy::notLarger;
} else if (ratioPolicyStr == "not_smaller") {
return EKeepAspectRatioPolicy::notSmaller;
} else {
return EKeepAspectRatioPolicy::none;
}
}
static ECoeffMode fromECoeffModeStr(string mode) {
if (mode == "nearest") {
return ECoeffMode::nearest;
} else if (mode == "linear") {
return ECoeffMode::linear;
} else if (mode == "cubic") {
return ECoeffMode::cubic;
} else {
IT_TODO_HALT();
}
}
  private:
    vector<int> axes;
    vector<float> scales;
@@ -60,7 +114,7 @@ class ResizeObj : public OperatorObj {
    // Operator clone(TensorVec inputs, TensorVec outputs) override;
    vector<DataType> inferDataType(const TensorVec &inputs) const override;
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return inputs.size(); }
    int numOutputs() const override { return 1; }


@@ -0,0 +1,34 @@
#pragma once
#include "core/operator.h"
namespace infini {
/**
* @brief Fused RMSNorm Operator
*
*/
class RMSNormObj : public OperatorObj {
int dim;
public:
/**
* @brief Construct a new RMSNorm object.
*
* @param graph The computation graph that this operator belongs to.
* @param input The input tensor.
* @param output The output tensor.
*/
RMSNormObj(GraphObj *graph, Tensor input, Tensor weight, Tensor output);
OP_CLONE(RMSNormObj);
optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
std::string toString() const override;
int numInputs() const override { return 2; }
int numOutputs() const override { return 1; }
int getDim() const { return dim; }
private:
vector<int> getWorkloadVector() const override;
vector<int> getOpAttrVector() const override;
};
} // namespace infini

include/operators/rope.h (new file, 29 lines)

@@ -0,0 +1,29 @@
#pragma once
#include "core/operator.h"
namespace infini {
class RoPEObj : public OperatorObj {
public:
/**
* @brief Construct a new RotaryEmbedding object.
*
* @param graph The computation graph that this operator belongs to.
 * @param pos The position id of the query.
* @param input The input tensor.
* @param output The output tensor.
*/
RoPEObj(GraphObj *graph, Tensor pos, Tensor input, Tensor output);
OP_CLONE(RoPEObj);
optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
std::string toString() const override;
int numInputs() const override { return 2; }
int numOutputs() const override { return 1; }
DataType getDType() const { return getInputs(1)->getDType(); }
private:
vector<int> getWorkloadVector() const override;
vector<int> getOpAttrVector() const override;
};
} // namespace infini

include/operators/send.h (new file, 42 lines)

@@ -0,0 +1,42 @@
#pragma once
#include "core/operator.h"
namespace infini {
/**
*
* https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2193/user-guide/docs/index.html
*/
class SendObj : public OperatorObj {
public:
/**
* @brief Construct a new SendRecv object
*
* @param graph The computation graph that this operator belongs to.
* @param input send input
* @param output recv output
* @param source the send rank
* @param destination the recv rank
*/
SendObj(GraphObj *graph, Tensor input, int source, int destination,
Tensor output = nullptr);
OP_CLONE(SendObj);
int numInputs() const override { return 1; }
int numOutputs() const override { return outputs.size(); }
std::string toString() const override;
optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
int getSourceRank() const { return source; }
int getDestinationRank() const { return destination; }
private:
vector<int> getWorkloadVector() const override;
vector<int> getOpAttrVector() const override;
vector<DataType> inferDataType(const TensorVec &inputs) const override;
protected:
int source;
int destination;
};
} // namespace infini


@@ -32,7 +32,7 @@ class SliceObj : public OperatorObj {
             const optional<vector<int>> &steps);
    OP_CLONE(SliceObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    inline int numInputs() const override { return 1; }
    inline int numOutputs() const override { return 1; }


@@ -10,7 +10,7 @@ class SoftmaxObj : public OperatorObj {
    OP_CLONE(SoftmaxObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override {
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override {
        return {{inputs[0]->getDims()}};
    };


@@ -37,7 +37,7 @@ class SplitObj : public OperatorObj {
             int dim, const vector<int> &ratio);
    OP_CLONE(SplitObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }


@@ -0,0 +1,39 @@
#pragma once
#include "core/operator.h"
namespace infini {
/**
* @brief Remove single-dimensional entries from the shape of a tensor.
*
*/
class SqueezeObj : public OperatorObj {
Shape axes;
public:
/**
* @brief Construct a new Squeeze object.
*
* @param graph The computation graph that this operator belongs to.
* @param input The input tensor.
* @param output The output tensor.
* @param axes List of integers indicating the dimensions to squeeze.
*/
SqueezeObj(GraphObj *graph, Tensor input, Tensor output, Shape axes);
OP_CLONE(SqueezeObj);
optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
std::string toString() const override;
int numInputs() const override { return 1; }
int numOutputs() const override { return 1; }
inline Shape getAxes() const { return axes; }
private:
vector<int> getWorkloadVector() const override;
vector<int> getOpAttrVector() const override;
};
} // namespace infini


@@ -7,7 +7,7 @@ class TransposeObj : public OperatorObj {
    TransposeObj(GraphObj *graph, Tensor input, Tensor output,
                 vector<int> permute);
    OP_CLONE(TransposeObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
@@ -19,4 +19,33 @@ class TransposeObj : public OperatorObj {
    vector<int> getWorkloadVector() const override;
    vector<int> getOpAttrVector() const override;
 };
-}; // namespace infini
+
+class DepthToSpaceObj : public OperatorObj {
+  public:
+   DepthToSpaceObj(GraphObj *graph, Tensor input, Tensor output, int blocksize,
+                   std::string mode);
+   OP_CLONE(DepthToSpaceObj);
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
+   std::string toString() const override;
+   int numInputs() const override { return 1; }
+   int numOutputs() const override { return 1; }
+   int getBlockSize() const { return blockSize; }
+   int getMode() const { return D2SMode; }
+   auto getModeString() const { return D2SModeString; }
+   auto getReshapeDim() const { return reshapeDim; }
+   auto getTransposeDim() const { return transposeDim; }
+   auto getOutDim() const { return outDim; }
+
+  private:
+   int blockSize;
+   int D2SMode;
+   std::string D2SModeString;
+   mutable std::vector<int> reshapeDim = {1, 1, 1, 1, 1, 1};
+   mutable std::vector<int> transposeDim = {1, 1, 1, 1, 1, 1};
+   mutable std::vector<int> outDim = {1, 1, 1, 1};
+   vector<int> getWorkloadVector() const override;
+   vector<int> getOpAttrVector() const override;
+};
+} // namespace infini


@@ -17,7 +17,7 @@ class UnaryObj : public OperatorObj {
     * @param output The output tensor.
     */
    UnaryObj(OpType type, GraphObj *graph, Tensor input, Tensor output);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
@@ -33,7 +33,7 @@ class ClipObj : public OperatorObj {
    ClipObj(GraphObj *graph, Tensor input, Tensor output,
            std::optional<float> min, std::optional<float> max);
    OP_CLONE(ClipObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    std::optional<float> getMin() const { return minValue; };
@@ -52,7 +52,7 @@ class HardtanhObj : public OperatorObj {
    HardtanhObj(GraphObj *graph, Tensor input, Tensor output, float min,
                float max);
    OP_CLONE(HardtanhObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    float getMin() const { return minValue; };
@@ -70,7 +70,7 @@ class FlipObj : public OperatorObj {
  public:
    FlipObj(GraphObj *graph, Tensor input, Tensor output, vector<int> axis);
    OP_CLONE(FlipObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    vector<int> getAxis() const { return axisValue; };
@@ -87,7 +87,7 @@ class FillObj : public OperatorObj {
  public:
    FillObj(GraphObj *graph, Tensor input, Tensor output, float value);
    OP_CLONE(FillObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    float getValue() const { return setValue; };
@@ -104,7 +104,7 @@ class L2LossObj : public OperatorObj {
  public:
    L2LossObj(GraphObj *graph, Tensor input, Tensor output);
    OP_CLONE(L2LossObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
@@ -120,7 +120,7 @@ class TransformObj : public OperatorObj {
    TransformObj(GraphObj *graph, Tensor input, Tensor output, float alpha,
                 float beta);
    OP_CLONE(TransformObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    float getAlpha() const { return alphaValue; }
@@ -159,13 +159,14 @@ enum class CastType {
    Uint322Int64,
    Float162Float,
    BFloat162Float,
+   Float2Float,
 };

 class CastObj : public OperatorObj {
   public:
    CastObj(GraphObj *graph, Tensor input, Tensor output, CastType type);
    OP_CLONE(CastObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    vector<DataType> inferDataType(const TensorVec &inputs) const override;
    std::string toString() const override;
@@ -185,7 +186,7 @@ class CumsumObj : public OperatorObj {
    CumsumObj(GraphObj *graph, Tensor input, Tensor output, int axis,
              bool exclusive, bool reverse);
    OP_CLONE(CumsumObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int getAxis() const { return axisValue; }
@@ -205,7 +206,7 @@ class ShapeObj : public OperatorObj {
  public:
    ShapeObj(GraphObj *graph, Tensor input, Tensor output);
    OP_CLONE(ShapeObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 1; }
@@ -216,7 +217,7 @@ class PReluObj : public OperatorObj {
  public:
    PReluObj(GraphObj *graph, Tensor input, Tensor alpha, Tensor output);
    OP_CLONE(PReluObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return 2; }
@@ -236,7 +237,7 @@ class LogObj : public OperatorObj {
    };
    LogObj(GraphObj *graph, Tensor input, Tensor output, LogType type);
    OP_CLONE(LogObj);
-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    LogType getType() const { return logType; }
@@ -258,10 +259,14 @@ class LogObj : public OperatorObj {
 };

 DEFINE_UNARY_OBJ(Relu, OpType::Relu)
+DEFINE_UNARY_OBJ(Silu, OpType::Silu)
+DEFINE_UNARY_OBJ(Gelu, OpType::Gelu)
 DEFINE_UNARY_OBJ(Sigmoid, OpType::Sigmoid)
 DEFINE_UNARY_OBJ(Tanh, OpType::Tanh)
 // DEFINE_UNARY_OBJ(Softmax, OpType::Softmax)
 DEFINE_UNARY_OBJ(Abs, OpType::Abs)
+DEFINE_UNARY_OBJ(HardSigmoid, OpType::HardSigmoid)
+DEFINE_UNARY_OBJ(HardSwish, OpType::HardSwish)
 DEFINE_UNARY_OBJ(Sin, OpType::Sin)
 DEFINE_UNARY_OBJ(Cos, OpType::Cos)


@@ -0,0 +1,38 @@
#pragma once
#include "core/operator.h"
namespace infini {
/**
 * @brief Insert single-dimensional entries to the shape of a tensor.
*
*/
class UnsqueezeObj : public OperatorObj {
Shape axes;
public:
/**
* @brief Construct a new Unsqueeze object.
*
* @param graph The computation graph that this operator belongs to.
* @param input The input tensor.
* @param output The output tensor.
* @param axes List of integers indicating the dimensions to be inserted.
*/
UnsqueezeObj(GraphObj *graph, Tensor input, Tensor output, Shape axes);
OP_CLONE(UnsqueezeObj);
optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
std::string toString() const override;
int numInputs() const override { return 1; }
int numOutputs() const override { return 1; }
inline Shape getAxes() const { return axes; }
private:
vector<int> getWorkloadVector() const override;
vector<int> getOpAttrVector() const override;
};
} // namespace infini


@@ -22,7 +22,7 @@ class WhereObj : public OperatorObj {
             Tensor output);
    OP_CLONE(WhereObj);

-   optional<vector<Shape>> inferShape(const TensorVec &inputs) const override;
+   optional<vector<Shape>> inferShape(const TensorVec &inputs) override;
    std::string toString() const override;
    int numInputs() const override { return inputs.size(); }


@@ -1,14 +0,0 @@
#pragma once
namespace infini {
void broadcastShape(const Shape &originShape, SmallArray &modifyShape,
int nDims, int size) {
for (int i = nDims - 1; i >= 0; --i) {
modifyShape.data[i] = 1;
}
for (int i = size - 1; i >= 0; --i) {
modifyShape.data[i + nDims - size] = originShape[i];
}
}
} // namespace infini

Some files were not shown because too many files have changed in this diff.