Commit Graph

199 Commits

Author SHA1 Message Date
zhangyunze 331f7ab2b8
Support dynamic tensor shape inference and fix memory pool (#176)
* feat: support dynamic tensor part1

* feat: support dynamic-tensor part2

* feat: support dynamic tensor part 3

* fix: fix some ..

* - add kvcache example

* feat: support concat to identity kernel

* add a simple memory pool for the allocator

* fix: rebase to master

* fix bug after merging

* - remove outdated script

* fix: fix as reviewed

---------

Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-23 13:11:50 +08:00
xiaonans 965df4e294
[feature] add fused attention_kvcache operator support (#179)
* [feature] add fused attention_kvcache operator support

* add test to attention_kvcache op

* Add blank line at EOF

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-14 23:44:22 +08:00
Hardy f22fa2766e
add reduce_mean and gather on bang (#167)
* add code

* fix reduce_mean

* add softmax on BANG

* fix gather

* fix broadcast in element-wise kernel when dim size is zero

* add where kernel and fix softmax kernel

* fix convbpdata bug

* fix format

---------

Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 18:02:44 +08:00
Hardy 50862df765
[Kunlun & CUDA & BANG] add depth2space operator (#178)
* add depth2space operator

* fix format

* add depth2space on cambricon bang

* add depth2space on gpu

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 17:58:26 +08:00
Hardy 1ea450882b
add reduce_mean and gather on kunlun (#169)
* add reduce_mean and gather

* fix format

* fix gather

* fix

* fix xpu, add where operation, fix element-wise operation

* fix format

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 17:52:09 +08:00
xgqdut2016 d3e7543291
Cuda softmax (#129)
* "add softmax.cu,.cc,.h"

* Modify cuda softmax

* "modified the introduction of softmax.cu"

* "add format of cuda_softmax.h"

* "modified where.cc(.cu,.h) and softmax.cu"

* "modified format"

* Fix cpu softmax kernel

* "modified the // introduction of softmax.cu"

* "modified softmax.cu and use 1D block"

* "modified softmax.cu,format, and use 1D block"

* "introduce share mem to speed softmax"

* "reduce the input of function"

* modified the format

* remodify 2D block softmax

* remodify 1D block softmax

* modified the share memory

* add warp reduce

* conflict solve two

* remove extra space line

* solve comment

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
2023-11-06 08:56:23 +08:00
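The softmax PR above (#129) walks the kernel through a typical optimization path: a 1D-block layout, shared memory, and finally a warp-level reduction. The PR body does not include the kernel itself, so below is only a minimal CUDA sketch of that technique, not the repository's softmax.cu: one 32-thread warp handles one row, and `__shfl_xor_sync` reduces the row max and the exp-sum. The name `softmaxRowSketch` and the 2D row-major layout are assumptions made for illustration.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Sketch only: one warp per row, warp shuffles for max/sum reduction.
__global__ void softmaxRowSketch(const float *in, float *out, int rowLen) {
    const float *row = in + (size_t)blockIdx.x * rowLen;
    float *dst = out + (size_t)blockIdx.x * rowLen;

    // 1. Row maximum, for numerical stability.
    float maxVal = -FLT_MAX;
    for (int i = threadIdx.x; i < rowLen; i += 32)
        maxVal = fmaxf(maxVal, row[i]);
    for (int off = 16; off > 0; off >>= 1)
        maxVal = fmaxf(maxVal, __shfl_xor_sync(0xffffffff, maxVal, off));

    // 2. Sum of exp(x - max) across the row.
    float sum = 0.f;
    for (int i = threadIdx.x; i < rowLen; i += 32)
        sum += expf(row[i] - maxVal);
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_xor_sync(0xffffffff, sum, off);

    // 3. Normalize.
    for (int i = threadIdx.x; i < rowLen; i += 32)
        dst[i] = expf(row[i] - maxVal) / sum;
}

// Launch with one 32-thread block per row, e.g.:
//   softmaxRowSketch<<<numRows, 32>>>(d_in, d_out, rowLen);
```

With a single warp per row no shared memory is needed; the shared-memory variant mentioned in the commits becomes relevant once a block holds several warps whose partial results must be combined.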
Derui Yang 1a6fccccbe
test: support building the einnet unit tests, though not all of them pass yet (#174)
* test: support building the einnet unit tests, though not all of them pass yet

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* Fix: locating resource files and skip codegen

- Change the path parameters in `matchExprResult` and `checkExprLogSame` to paths relative to the project home
- Skip NNetMemboundOp tests as they require codegen

---------

Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2023-11-03 13:21:49 +08:00
xgqdut2016 ec3adf6fa7
support 8D tensor, add test example (#170)
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-31 10:47:36 +08:00
Bolun Zhang 23b825efc4
Xpu task4 support: add softmax (#172)
* add softmax on kunlun

* format

---------

Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-30 16:01:05 +08:00
constroy Li feccd4f318
fix tensor parallel for llama (#159)
* fix Slice

* change default rounds of timeit to 10 to reduce time

* fix slice with large ends

* Reshape support Int64

* support position_ids as input

* skip last MatMul in Llama

* skip infer_shapes to parse large model

* update launch.py

* fix split_concat_kernel

* print more message in launch.py

* Reshape supports both Int32 and Int64

* try infer_shapes and warn about failure

* fix format

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-30 15:04:16 +08:00
Haojie Wang 7f5188bedd
remove dimension limit of elementwise operators on xpu (#168) 2023-10-25 14:38:47 +08:00
baominghelly 07ef587c65
Change onnx-simplifier to onnxsim to resolve build issue on xpu (#164) 2023-10-21 02:58:32 +08:00
Derui Yang d0f9792613
Fix: add building option for NNet (#162)
Signed-off-by: YdrMaster <ydrml@hotmail.com>
2023-10-16 19:53:28 +08:00
Hardy 1184fa131f
Xpu (#82)
* support kunlun xpu and add an operator named Add

* add sub, mul, div, pow, maximum, minimum

* add code

* add xpu code

* add code

* add matmul

* add transpose

* add unary operator

* add unary operator

* add some operator

* add code

* support run resnet18 on xpu

* add code

* add max pool2d

* fix xpu code so it can run.

* Add XPU operators (#120)

* add floordiv for xpu

* add batchnorm for xpu

* add more cast types for xpu

* add conv_trans for xpu

* add pad for xpu

* add logical ops for xpu

* fix format for xpu src and include

* fix format for xpu test

* fix format for xpu src

---------

Co-authored-by: Bolun <bolunz@u.nus.edu>

* Xpu abs (#121)

* add: unary kernel for xpu

* formatting

* format

* format

* format

* fix: pointer jump

* fix optype comments

* fix bug introduced while resolving conflict

* change cmake option for kunlunxin xpu from 'xpu' to 'kunlun'; fix bug after merging distributed infrastructure

* Add doc support for xpu (#141)

* fix

* fix

* fix pooling test

* format

* format

* fix

* fix

* set cmake version requirement

* fix cmakelists

* rename xpu to kunlun

* fix

* fix format

* fix format

* fix format

* fix change name to kunlun

* format

* fix format

* clang format

* fix format

---------

Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: zhangyue207 <138768300+zhangyue207@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: baominghelly <41820386+baominghelly@users.noreply.github.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
2023-10-16 10:57:08 +08:00
Haojie Wang 8e4d88fb9f
add transpose, concat and split for native cpu (#158) 2023-10-12 10:14:28 +08:00
PanZezhong1725 36ae7b7fb6
Add GatherElements op and cuda kernel (#149)
* Add GatherElements op and cuda kernel

* fix format

* remove print

* remove unused var

* fix spacing

* fix format

---------

Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-12 09:18:12 +08:00
PanZezhong1725 ed3034f878
Add HardSigmoid and HardSwish (#156)
* Add HardSigmoid and HardSwish

* fix format
2023-10-10 22:41:06 +08:00
kilinchange 1151101fb9
add naive allocator for debugging (#140)
* add naive allocator only for debugging

* merge redundant api

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-10 16:42:23 +08:00
Haojie Wang 90b9a80f72
add onnx simplify (#153)
* add onnx simplify

* fix test bug

* update ci policy

* fix onnx simplify bug

* update ci workflow
2023-10-10 15:45:27 +08:00
ChengXiang Qi 7f16fa353e
【Hackathon No.108】Add Gelu operator, ffi, kernel for cpu and gpu. (#148)
feat: Add Gelu kernel, operator, ffi.
2023-10-10 15:21:13 +08:00
PanZezhong1725 7600fe688c
Add Neg operator and kernel (#152)
* Add Neg operator and kernel

* handle neg in to_onnx

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-10 10:54:56 +08:00
Haojie Wang 7a9fcd93b2
Pooling ceil mode (#155)
* add ceil mode for pooling

* do not print debug info for allocator by default

* fix test bugs after introducing pooling ceil mode

* fix onnx import bug
2023-10-09 20:51:39 +08:00
PanZezhong1725 785853b0a3
Add erf kernel for cpu and gpu (#147)
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
2023-10-09 09:36:55 +08:00
Haojie Wang c0ff584e04
add constant op; fix concat bug (#151) 2023-10-08 21:42:41 +08:00
Haojie Wang f25bcca076
add python examples (#143)
* add python examples

* use copy*_numpy instead of copy*_float
2023-09-28 10:40:45 +08:00
kilinchange 877db21021
Fix support kvcache (#142)
* - fix onnx.py

* - fix shard_concat
2023-09-27 11:08:44 +08:00
PanZezhong1725 62be816f53
Fix incorrect split/concat results when dim=0 (#138)
Fix split_concat kernel not supporting dim=0

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-25 10:25:54 +08:00
Haojie Wang 8f2597a508
fix bang runtime bug after merging distributed branch (#137) 2023-09-19 14:10:39 +08:00
kilinchange 48ec730579
Support kvcache (#134)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use a unique name to distinguish the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* export min/max to python

* fix MatMul

* modify launch.py to run opt

* hack to treat ReduceSum as AllReduceSum

* throw exception in cuda error

* fix parallel_opt.py

* improve the error prompt and cuda error check

* fix GatherObj::GatherObj member init

* fix size calculation for scalar (rank = 0) tensor

* MatMul supports bias

* fix add bias for row parallel gemm

* add --gen_std to launch.py

* fix AllReduceNCCL

* update launch.py

* less log

* update parallel_opt

* update launch.py

* add __eq__ for Placement sub-classes

* less benchmark run

* fix placement infer for matmul

* fix vocabulary size

* fix Exception

* Add shard tensor with group to support gpt2

* Add find successor function to find split op at different depth

* recover CommunicatorObj

* improve error message

* optimize parallel_opt.py

* optimize launch.py

* recover docs for all_reduce and all_gather

* - support concat for kvcache

* - modify allocator

* - add tensorType
- modify allocator to support memory allocation based on tensorType

* - fix allocator init

* - support kvcache by running 2 stub distributively

* - fix name

* - remove unused flag

* - fix wrong pb name

* - fix as constroy suggested

* - fix launch.py format

---------

Co-authored-by: constroy <constroy.li@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
2023-09-18 14:17:02 +08:00
PanZezhong1725 c6b82cfda0
Copyout numpy interface (#135)
* Add copy out numpy interface, delete returning buffer directly, add api test

* Add dtype interface
2023-09-15 16:40:44 +08:00
constroy Li 4c321c8a91
tensor parallel for transformer (#125)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use a unique name to distinguish the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* export min/max to python

* fix MatMul

* modify launch.py to run opt

* hack to treat ReduceSum as AllReduceSum

* throw exception in cuda error

* fix parallel_opt.py

* improve the error prompt and cuda error check

* fix GatherObj::GatherObj member init

* fix size calculation for scalar (rank = 0) tensor

* MatMul supports bias

* fix add bias for row parallel gemm

* add --gen_std to launch.py

* fix AllReduceNCCL

* update launch.py

* less log

* update parallel_opt

* update launch.py

* add __eq__ for Placement sub-classes

* less benchmark run

* fix placement infer for matmul

* fix vocabulary size

* fix Exception

* Add shard tensor with group to support gpt2

* Add find successor function to find split op at different depth

* recover CommunicatorObj

* improve error message

* optimize parallel_opt.py

* optimize launch.py

* recover docs for all_reduce and all_gather

* Fix API

* fix format

---------

Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-14 14:19:45 +08:00
xgqdut2016 dda668fd16
"modified where" (#131)
* "modified where"

* "adapt int or bool condition datatype"

* "add broadcast_shape.h,error"

* add broadcast.h

* "modified broadcast_shape.h and where.cc,.cu"
2023-09-14 10:45:57 +08:00
constroy Li f60767a770
impl distributed launch with NCCL (#106)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use a unique name to distinguish the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* Add broadcast operator and cuda kernel

* Add comments for operators

* remove const of class member

* move communicator to CudaRuntimeObj

* Add an empty line at EOF.

---------

Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-05 09:47:35 +08:00
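The distributed-launch series (#106 here, extended by #125 and #134 above) builds on a thin NCCL communicator: rank 0 creates an `ncclUniqueId` and shares it through a uniquely named ID file, every rank calls `ncclCommInitRank`, and collectives such as AllReduce/AllGather then run on that communicator. The host-side flow is roughly the sketch below, assuming a file-based ID exchange; the helper `shareUniqueId` and all other names are illustrative and not taken from the repository, and the real code adds the timeout/polling the commits mention.

```cuda
#include <cuda_runtime.h>
#include <fstream>
#include <nccl.h>
#include <string>

// Hypothetical ID exchange: rank 0 writes the NCCL unique ID to a per-task
// file, the other ranks read it back. Real code would poll with a timeout.
static ncclUniqueId shareUniqueId(int rank, const std::string &idFile) {
    ncclUniqueId id;
    if (rank == 0) {
        ncclGetUniqueId(&id);
        std::ofstream(idFile, std::ios::binary)
            .write(reinterpret_cast<char *>(&id), sizeof(id));
    } else {
        std::ifstream(idFile, std::ios::binary)
            .read(reinterpret_cast<char *>(&id), sizeof(id));
    }
    return id;
}

void allReduceSumSketch(int rank, int worldSize, size_t count) {
    // 1. Every rank joins the communicator with the same unique ID.
    ncclUniqueId id = shareUniqueId(rank, "/tmp/task_nccl_id");
    ncclComm_t comm;
    ncclCommInitRank(&comm, worldSize, id, rank);

    // 2. Device buffers (filling the input is omitted).
    float *sendBuf = nullptr, *recvBuf = nullptr;
    cudaMalloc(&sendBuf, count * sizeof(float));
    cudaMalloc(&recvBuf, count * sizeof(float));

    // 3. Sum-reduce across all ranks on a CUDA stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    ncclAllReduce(sendBuf, recvBuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    // 4. Tear down.
    cudaStreamDestroy(stream);
    cudaFree(sendBuf);
    cudaFree(recvBuf);
    ncclCommDestroy(comm);
}
```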
Hardy b4eda85e67
Fix mlu (#87)
* fix some operator code

* fix some code of mlu operator

* fix some code of cast and elementwise

* clang format

* remove copy kernel

* fix cast

* fix clang-format

---------

Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-04 08:33:28 +08:00
PanZezhong1725 2412c25e67
Issue 107: Add copyin Numpy and conversion to Numpy (#126)
* Add copyin_numpy and to_numpy for pybind TensorObj

* fix copyin size assertion

* fix size calculation for scalar (rank = 0) tensor

* Use pybind buffer instead of returning array

* fix format
2023-09-01 11:20:26 +08:00
zhangyunze 3e6ef305f1
Framework support for building bert/gpt2 model graphs (#94)
* feat: support to sqrt op

* feat: support to erf op

* feat: support to expand op

* feat: support to where op

* fix: gather op index can be int64_t(hard coding)

* fix: some wrong use

* style: fix the format style

* test: add test for change op

* fix: rebase to master

* fix: fix matmul b compute wrong

* add expand and where kernel

* Add int64 support for cuda gather kernel

* add test_where.cc

* add "expand.(cu/cc,test,cuda),modified where.cu"

* Separate initialization of datatypes to avoid compile error

* modify where.(cu/cc/h,test), expand and clip

* Format fix

* Format fix

---------

Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-08-29 16:06:52 +08:00
ChengXiang Qi d8ffd8a4b7
feat(env): add docker support. (#122)
This PR adds Docker support for running this project, and it primarily
accomplishes the following tasks:
- Added the necessary `Dockerfile` for running the project in a CPU environment.
- Added commands to the `Makefile` for convenient Docker startup.
- Added documentation in `docs/INSTALL_GUIDE_CN.md` explaining how to
launch the Docker environment.
2023-08-28 18:34:36 +08:00
kuangjux a8a5c037ca feat(env): add docker support.
- Added the necessary `Dockerfile` for running the project in CPU and CUDA environments.
- Added commands to the `Makefile` for convenient Docker startup.
- Added documentation in `docs/INSTALL_GUIDE_CN.md` explaining how to launch the Docker environment.

Co-authored-by: Xiaonan Song <songxiaonan@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-08-28 16:28:09 +08:00
PanZezhong1725 69fd251e5d
Fix kernel arguments, add debug mode (#119)
Add debug mode macro in cmakelist.
2023-08-28 08:58:38 +08:00
panzezhong 0ce7e7651f Fix kernel arguments, add debug mode 2023-08-24 13:39:22 +08:00
constroy Li 1e91979c76
add CUDNN impl for Min and Max (#118)
* add cudnn impl for Min and Max

* fix onnx _search_shape with output shape
2023-08-22 16:19:29 +08:00
zhangyunze 1438f14a25
fix: fix castType mlu (#117) 2023-08-22 14:54:32 +08:00
PanZezhong1725 9cf6c30e1c
Add cuda transpose kernel (#115)
* Add cuda transpose kernel

* Empty line cuda_transpose.h

* Empty line small_array.h

* empty line transpose.cc

* empty line transpose.cu

* empty line test_cuda_transpose.cc
2023-08-22 14:22:15 +08:00
constroy Li 384407421b
cudnn activations support ND-Tensor (#116)
* refine TensorObj::getStride

* ActivationCudnn supports ND-Tensor
2023-08-22 14:21:59 +08:00
constroy Li 48847958d0
impl sqrt on CUDA (#109)
* impl sqrt on CUDA
fix parser of Gather and ReduceMean

* fix test_gather

* fix test_cuda_gather

* impl sqrt cpu and add sqrt to test_cuda_unary

* cuda_unary supports arbitrary shapes

* fix SplitOp with dim=-1

* fix SplitOp with dim=-1
2023-08-18 12:17:47 +08:00
zhangyunze ef672894d0
support mixed dtype (#102)
* feat: support mixed dtype

* feat: support cast op

* test: add test for cast op

* feat: support datatype BFloat16

* feat: support data convert fp32 <-> bfp16

* fix: fix all op's infershape func

* fix as review comment
2023-08-16 21:49:43 +08:00
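One bullet above adds fp32 <-> bfp16 data conversion. bfloat16 is simply the top 16 bits of an IEEE-754 float (same exponent range, truncated mantissa), so a conversion routine needs little more than a shift plus rounding. The following is a generic hedged sketch of that technique, not the code added by this PR, and it ignores NaN special-casing for brevity; the function names are illustrative.

```cuda
#include <cstdint>
#include <cstring>

// fp32 -> bf16 with round-to-nearest-even on the dropped mantissa bits.
inline uint16_t fp32ToBf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u); // RNE tie handling
    return static_cast<uint16_t>((bits + rounding) >> 16);
}

// bf16 -> fp32 is exact: pad the low 16 mantissa bits with zeros.
inline float bf16ToFp32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```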
kilinchange 0dc5347089
memory_allocator (#103)
* - add LazyAllocator class
- calculate memory consumption at present

* - basic function of lazy_allocator, remaining test

* - modify LazyAllocator

* - modify InfiniTensor to fit LazyAllocator

* - add setDataBlob
- modify alignment
- fix GraphObj::dataMalloc

* - modified alignment value (64 bytes -> 8 bytes)
- fix LazyAllocator::getPtr()
- some debug codes and comments
- do alignment by changing size instead of tailAddr

* - fix some problem

* - translate chinese comments to english

* - format codes

* - fix test

* - code format

* - modify codes as YdrMaster and bitzyz suggested

* - code format

* - modify codes as constroy suggested

* - codes format

* - modify alignment on cuda

* - code format

* - add test_lazy_allocator
- fix tests that did not add the input tensor into graph.tensors
- fix tests that initialized tensor data before calling graph->dataMallocate()

* - code format

* - remove gpu runtime in test_lazy_allocator

* - fix test_lazy_allocator: remove cuda include

* - add test

* - code format

* - add ifdef for test of allocator

* - code format

* - fix test: remove unused ifdef

* - fix bang test

* - code format

* Merge branch 'master' into dcj/memory_allocator

* fix: fix cuda conv_fp16 run fail

* fix bang_runtime.cc and cuda_runtime.cc

* - update mkl code

* - fix codes for mkl

* - code format

* - remove unused commented codes
- add an empty line at the end of the blob.cc

---------

Co-authored-by: zhangyunze <z13785159769@163.com>
2023-08-13 13:39:35 +08:00
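The allocator PR above replaces eager per-tensor allocation with a lazy scheme: during graph planning each tensor only reserves an aligned offset, a single device block is allocated once the total size is known, and `getPtr()` resolves `base + offset`. Below is a minimal bump-pointer sketch of that idea; it keeps only the offset/alignment mechanics (8-byte alignment, aligning the size rather than the tail address, as the commits describe) and omits any reuse of freed regions that the real allocator may perform. The class name `LazyAllocSketch` is illustrative, not the project's `LazyAllocator` interface.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative sketch only -- not the repository's LazyAllocator.
class LazyAllocSketch {
    static constexpr size_t kAlign = 8; // 8-byte alignment, per the commits
    size_t used = 0;                    // total reserved during planning
    void *base = nullptr;               // real device block, allocated later

  public:
    // Planning pass: reserve an aligned offset, no device memory yet.
    size_t alloc(size_t bytes) {
        bytes = (bytes + kAlign - 1) / kAlign * kAlign; // align the size
        size_t offset = used;
        used += bytes;
        return offset; // tensors remember offsets instead of raw pointers
    }

    // After planning: one allocation covering everything that was reserved.
    void materialize() { cudaMalloc(&base, used); }

    // Resolve a reserved offset to an actual device pointer.
    void *getPtr(size_t offset) const {
        return static_cast<uint8_t *>(base) + offset;
    }

    ~LazyAllocSketch() {
        if (base)
            cudaFree(base);
    }
};
```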
zhangyunze bd9e1aeb3f
fix: fix cuda conv_fp16 run fail (#105) 2023-08-10 15:22:18 +08:00
Derui Yang 57ac94d893
refactor(core): add a new `OpType` definition (#99)
* feat: add a new OpType definition

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* refactor: replace the old OpType with the new one throughout the project

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: onnx import

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: fix issues in the cuda and bang kernels

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: filter bang tests

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: filter bang tests

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix bang code.

* fix code on bang

* fmt

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: delete the specified files

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* fix: delete two unused files and remove a comment of unclear purpose

Signed-off-by: YdrMaster <ydrml@hotmail.com>

---------

Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
2023-08-07 11:17:05 +08:00
zhangyunze 9b10a74788
Support fp16 dtype (#96)
* add conv_half kernel

* Conv Kernel FP16

* dcj:
replace "DataType::Float32" with "op->getDType()" to support more DataType

* feat: support Float16 dtype

* fix: set default clang-format to 14 version

* fix: modify according to review comments

* fix: add data convert to convfp16 kernel test

* test: add conv_fp16 kernel test

---------

Co-authored-by: zhangyue207 <zhangyue@qiyuanlab.com>
Co-authored-by: kilinchange <kilinchange@163.com>
2023-08-02 16:38:16 +08:00