Commit Graph

149 Commits

Author SHA1 Message Date
kilinchange dc6befb549 fix: fix re-dataMalloc for weight tensor and use of naive allocator 2023-12-29 17:27:36 +08:00
OdinaryWord 8ae5958b29 feat: add support for dynamic_quantize_linear 2023-12-19 16:41:17 +08:00
kilinchange 0e75f99e7e feat: add dynamic quantize linear kernel 2023-12-19 14:46:14 +08:00
zhangyunze 97e3377ca5 feat: add matmulinteger op 2023-12-19 14:14:33 +08:00
xgqdut2016 9c82936386 add half and float dequantizeLinear 2023-12-18 17:47:53 +08:00
kilinchange c63ed4326d feat: add frontend DynamicQuantizeLinear and DequantizeLinear kernels 2023-12-18 13:58:20 +08:00
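The ONNX DynamicQuantizeLinear op these quantization commits implement derives scale and zero point from the input's observed range, then quantizes to uint8. A minimal host-side sketch of the semantics (names are illustrative, not the repo's API; ONNX specifies round-half-to-even where this uses `std::round`):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

void dynamicQuantizeLinear(const float *x, size_t n, uint8_t *y,
                           float &scale, uint8_t &zeroPoint) {
    // The range is forced to include 0 so real zero stays representable.
    float lo = 0.f, hi = 0.f;
    for (size_t i = 0; i < n; ++i) {
        lo = std::min(lo, x[i]);
        hi = std::max(hi, x[i]);
    }
    scale = (hi - lo) / 255.f; // uint8 target: qmax - qmin = 255
    float zp = scale > 0.f ? std::round(std::clamp(-lo / scale, 0.f, 255.f))
                           : 0.f;
    zeroPoint = static_cast<uint8_t>(zp);
    for (size_t i = 0; i < n; ++i) {
        float q = scale > 0.f ? std::round(x[i] / scale) + zp : zp;
        y[i] = static_cast<uint8_t>(std::clamp(q, 0.f, 255.f));
    }
}
```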
xgqdut2016 f51ce3231a where support int8 2023-12-15 15:40:30 +08:00
xgqdut2016 c859e655d3 Merge branch 'master' into support_fp16 2023-12-15 10:02:03 +08:00
xgqdut2016 a3929c25f8
Add send and recv operators based on NCCL (#182)
* baseline sendrecv, bug

* success sendrecv

* get rank from comm

* set output shape

* successful: set output shape equal to input shape

* shape as attribute

* success: shape as attribute

* success send recv, output 0

* add onnx test

* split send and recv

* success split send and recv

* test-onnx bug

* success test-onnx

* modified onnx.py

* solve review
2023-12-14 16:38:03 +08:00
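For orientation, point-to-point transfer in NCCL (2.7+) is a matched `ncclSend`/`ncclRecv` pair. A hedged sketch of the host-side calls such operators typically wrap (communicator and stream setup assumed to live elsewhere); it also shows why the PR moves the shape into an attribute — the receiver must know the element count up front:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative only, not the repo's kernel code.
void sendTensor(const float *devBuf, size_t count, int peer,
                ncclComm_t comm, cudaStream_t stream) {
    ncclSend(devBuf, count, ncclFloat, peer, comm, stream);
}

void recvTensor(float *devBuf, size_t count, int peer,
                ncclComm_t comm, cudaStream_t stream) {
    // NCCL moves raw elements only; the tensor's shape (and hence
    // `count`) must come from op attributes, as in the PR above.
    ncclRecv(devBuf, count, ncclFloat, peer, comm, stream);
}
```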
OdinaryWord db8c3eec15 style: fix style 2023-12-14 11:32:07 +08:00
OdinaryWord c29dcf1e6d add cuda cast & support half-precision for gather 2023-12-14 11:24:25 +08:00
zhangyunze bdb8d8d65f feat: support matmulOp/expandOp fp16 2023-12-14 11:07:45 +08:00
kilinchange 5af7f1e753 - unary support fp16 2023-12-13 17:06:27 +08:00
zhangyunze d5e775397d feat: support transpose fp16 2023-12-13 16:36:37 +08:00
kilinchange 4b02de7e17 - element_wise support fp16 2023-12-13 15:57:25 +08:00
xgqdut2016 dd4a90fb5e add split_concat fp16 2023-12-11 16:45:16 +08:00
xgqdut2016 fda0a5f982 add layernorm fp16 2023-12-11 15:05:34 +08:00
xgqdut2016 8b2e3b8e19 add where fp16 2023-12-08 16:57:49 +08:00
xgqdut2016 a000cb0304 modified all kernel registrations 2023-12-07 17:53:28 +08:00
kilinchange 4db6699e09 - Remove dataType from the kernel registration. 2023-11-30 13:51:24 +08:00
Hardy 3ead20a23a
Fix workspace & bang conv (#183)
* fix bang workspace

* fix convbpdata

* fix code

* add code

* fix

* fix

* fix conv

* fix test conv

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 15:16:25 +08:00
xgqdut2016 a7293c12ba
Add layer normalization (#181)
* - add layernorm kernel

* success: add layernorm kernel and test

* fix: remove unusable comments

* fix: modify code as reviewer suggested

* debug, modified .cu and test

* optional bias support

* overloading function

* fix bug after merging; remove time constraint in conv test

---------

Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 15:15:14 +08:00
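The layer normalization computed here is y = (x − mean) / sqrt(var + eps) · scale, plus an optional bias per the "optional bias support" commit. A deliberately naive one-thread-per-row CUDA sketch of the math (the PR's kernel is reduction-based and far more parallel):

```cpp
__global__ void layernormNaive(const float *x, const float *scale,
                               const float *bias, float *y, int nRows,
                               int dim, float eps) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRows) return;
    const float *in = x + (size_t)r * dim;
    float *out = y + (size_t)r * dim;
    float mean = 0.f, var = 0.f;
    for (int i = 0; i < dim; ++i) mean += in[i];
    mean /= dim;
    for (int i = 0; i < dim; ++i) {
        float d = in[i] - mean;
        var += d * d;
    }
    float rstd = rsqrtf(var / dim + eps);
    for (int i = 0; i < dim; ++i) {
        float v = (in[i] - mean) * rstd * scale[i];
        out[i] = bias ? v + bias[i] : v; // bias is optional
    }
}
```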
PanZezhong1725 6ece3f4a77
Add ReduceSum op and kernel (#160)
* Add reduceSum op and kernel

* fix merge and format

* Reduce: reuse cat macro, add doc string

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 09:29:58 +08:00
xgqdut2016 595a9906d2
add infer index function (#175)
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-24 09:24:25 +08:00
zhangyunze 331f7ab2b8
support Dynamic tensor infer shape and fix memory pool (#176)
* feat: support dynamic tensor part1

* feat: support dynamic-tensor part2

* feat: support dynamic tensor part 3

* fix: fix some ..

* - add kvcache example

* feat: support concat to identity kernel

* add a simple memory pool for allocator

* fix: rebase to master

* fix bug after merging

* - remove outdated script

* fix: fix as review

---------

Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-23 13:11:50 +08:00
xiaonans 965df4e294
[feature] add fused attention_kvcache operator support (#179)
* [feature] add fused attention_kvcache operator support

* add test to attention_kvcache op

* Add space line at EOF

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-14 23:44:22 +08:00
Hardy 50862df765
[Kunlun & CUDA & BANG] add depth2space operator (#178)
* add depth2space operator

* fix format

* add depth2space on cambricon bang

* add depth2space on gpu

---------

Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-11-10 17:58:26 +08:00
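DepthToSpace rearranges an [N, C, H, W] tensor into [N, C/(b·b), H·b, W·b] for block size b. A one-thread-per-output-element CUDA sketch of the DCR-mode index math (illustrative; the actual kernels differ per backend):

```cpp
__global__ void depth2spaceDCR(const float *in, float *out, int N, int C,
                               int H, int W, int bs) {
    int Co = C / (bs * bs), Ho = H * bs, Wo = W * bs;
    size_t total = (size_t)N * C * H * W;
    size_t o = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (o >= total) return;
    // Decompose the flat output index into [n, co, ho, wo].
    int wo = o % Wo;
    int ho = (o / Wo) % Ho;
    int co = (o / ((size_t)Wo * Ho)) % Co;
    int n  = o / ((size_t)Wo * Ho * Co);
    int h = ho / bs, bh = ho % bs;
    int w = wo / bs, bw = wo % bs;
    // DCR mode: the block offset (bh, bw) comes from the leading part
    // of the input channel dimension.
    int ci = (bh * bs + bw) * Co + co;
    out[o] = in[(((size_t)n * C + ci) * H + h) * W + w];
}
```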
xgqdut2016 d3e7543291
Cuda softmax (#129)
* "add softmax.cu,.cc,.h"

* Modify cuda softmax

* "modified the introduction of softmax.cu"

* "add format of cuda_softmax.h"

* "modified where.cc(.cu,.h) and softmax.cu"

* "modified format"

* Fix cpu softmax kernel

* "modified the // introduction of softmax.cu"

* "modified softmax.cu and use 1D block"

* "modified softmax.cu,format, and use 1D block"

* "introduce share mem to speed softmax"

* "reduce the input of function"

* modified the format

* remodify 2D block softmax

* remodify 1D block softmax

* modified the share memory

* add warp reduce

* conflict solve two

* remove extra space line

* solve comment

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
2023-11-06 08:56:23 +08:00
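The PR's progression (shared memory → 1D block → warp reduce) ends at the standard three-pass softmax: row max for stability, sum of exponentials, normalize. A hedged warp-per-row sketch using `__shfl_down_sync` (assumes `blockDim.x == 32` and one block per row; the real kernel also covers longer rows and 2D blocks):

```cpp
__global__ void softmaxWarp(const float *x, float *y, int nRows, int dim) {
    int row = blockIdx.x;  // one 32-thread warp per row
    int lane = threadIdx.x;
    if (row >= nRows) return;
    const float *in = x + (size_t)row * dim;
    float *out = y + (size_t)row * dim;

    // 1. row max (numerical stability)
    float m = -INFINITY;
    for (int i = lane; i < dim; i += 32) m = fmaxf(m, in[i]);
    for (int off = 16; off > 0; off >>= 1)
        m = fmaxf(m, __shfl_down_sync(0xffffffff, m, off));
    m = __shfl_sync(0xffffffff, m, 0); // broadcast lane 0's max

    // 2. sum of exp
    float s = 0.f;
    for (int i = lane; i < dim; i += 32) s += expf(in[i] - m);
    for (int off = 16; off > 0; off >>= 1)
        s += __shfl_down_sync(0xffffffff, s, off);
    s = __shfl_sync(0xffffffff, s, 0);

    // 3. normalize
    for (int i = lane; i < dim; i += 32) out[i] = expf(in[i] - m) / s;
}
```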
Derui Yang 1a6fccccbe
test: support compiling the einnet unit tests, though not all of them pass (#174)
* test: support compiling the einnet unit tests, though not all of them pass

Signed-off-by: YdrMaster <ydrml@hotmail.com>

* Fix: locating resource files and skip codegen

- Change the path parameters in `matchExprResult` and `checkExprLogSame` to paths relative to the project home
- Skip NNetMemboundOp tests as they require codegen

---------

Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2023-11-03 13:21:49 +08:00
xgqdut2016 ec3adf6fa7
support 8D tensor, add test example (#170)
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-31 10:47:36 +08:00
constroy Li feccd4f318
fix tensor parallel for llama (#159)
* fix Slice

* change default rounds of timeit to 10 to reduce time

* fix slice with large ends

* Reshape support Int64

* support position_ids as input

* skip last MatMul in Llama

* skip infer_shapes to parse large model

* update launch.py

* fix split_concat_kernel

* print more message in launch.py

* Reshape supports both Int32 and Int64

* try infer_shapes and warn about failure

* fix format

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-30 15:04:16 +08:00
Hardy 1184fa131f
Xpu (#82)
* support kunlun xpu and add an operator named Add

* add sub, mul, div, pow, maximum, minimum

* add code

* add xpu code

* add code

* add matmul

* add transpose

* add unary operator

* add unary operator

* add some operator

* add code

* support run resnet18 on xpu

* add code

* add max pool2d

* fix xpu code, let it can run.

* Add XPU operators (#120)

* add floordiv for xpu

* add batchnorm for xpu

* add more cast types for xpu

* add conv_trans for xpu

* add pad for xpu

* add logical ops for xpu

* fix format for xpu src and include

* fix format for xpu test

* fix format for xpu src

---------

Co-authored-by: Bolun <bolunz@u.nus.edu>

* Xpu abs (#121)

* add: unary kernel for xpu

* formatting

* format

* format

* format

* fix: pointer jump

* fix optype comments

* fix bug introduced while resolving conflict

* change cmake option for kunlunxin xpu from 'xpu' to 'kunlun'; fix bug after merging distributed infrastructure

* Add doc support for xpu (#141)

* fix

* fix

* fix pooling test

* format

* format

* fix

* fix

* set cmake version requirement

* fix cmakelists

* rename xpu to kunlun

* fix

* fix format

* fix format

* fix format

* fix change name to kunlun

* format

* fix format

* clang format

* fix format

---------

Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: zhangyue207 <138768300+zhangyue207@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: baominghelly <41820386+baominghelly@users.noreply.github.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
2023-10-16 10:57:08 +08:00
Haojie Wang 8e4d88fb9f
add transpose, concat and split for native cpu (#158) 2023-10-12 10:14:28 +08:00
PanZezhong1725 36ae7b7fb6
Add GatherElements op and cuda kernel (#149)
* Add GatherElements op and cuda kernel

* fix format

* remove print

* remove unused var

* fix spacing

* fix format

---------

Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-12 09:18:12 +08:00
PanZezhong1725 ed3034f878
Add HardSigmoid and HardSwish (#156)
* Add HardSigmoid and HardSwish

* fix format
2023-10-10 22:41:06 +08:00
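Both ops in this PR are simple pointwise formulas: HardSigmoid(x) = max(0, min(1, αx + β)) with ONNX defaults α = 0.2, β = 0.5, and HardSwish(x) = x · max(0, min(1, x/6 + 1/2)). A hedged device-function sketch:

```cpp
__device__ __forceinline__ float hardSigmoid(float x) {
    return fmaxf(0.f, fminf(1.f, 0.2f * x + 0.5f));
}

__device__ __forceinline__ float hardSwish(float x) {
    return x * fmaxf(0.f, fminf(1.f, x / 6.f + 0.5f));
}
```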
kilinchange 1151101fb9
add naive allocator for debugging (#140)
* add naive allocator only for debugging

* merge redundant api

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-10 16:42:23 +08:00
ChengXiang Qi 7f16fa353e
[Hackathon No.108] Add Gelu operator, ffi, kernel for cpu and gpu. (#148)
feat: Add Gelu kernel, operator, ffi.
2023-10-10 15:21:13 +08:00
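Gelu in its exact form is Gelu(x) = 0.5 · x · (1 + erf(x/√2)); whether this PR uses the exact erf form or a tanh approximation isn't stated, but the exact version is a one-liner on top of CUDA's `erff`:

```cpp
__device__ __forceinline__ float gelu(float x) {
    return 0.5f * x * (1.0f + erff(x * 0.70710678f)); // 0.70710678 = 1/sqrt(2)
}
```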
PanZezhong1725 7600fe688c
Add Neg operator and kernel (#152)
* Add Neg operator and kernel

* handle neg in to_onnx

---------

Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-10-10 10:54:56 +08:00
Haojie Wang 7a9fcd93b2
Pooling ceil mode (#155)
* add ceil mode for pooling

* do not print debug info for allocator by default

* fix test bugs after introducing pooling ceil mode

* fix onnx import bug
2023-10-09 20:51:39 +08:00
PanZezhong1725 785853b0a3
Add erf kernel for cpu and gpu (#147)
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
2023-10-09 09:36:55 +08:00
Haojie Wang 8f2597a508
fix bang runtime bug after merging distributed branch (#137) 2023-09-19 14:10:39 +08:00
kilinchange 48ec730579
Support kvcache (#134)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use unique name to distinguish the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* export min/max to python

* fix MatMul

* modify launch.py to run opt

* hack to treat ReduceSum as AllReduceSum

* throw exception in cuda error

* fix parallel_opt.py

* improve the error prompt and cuda error check

* fix GatherObj::GatherObj member init

* fix size calculation for scalar (rank = 0) tensor

* MatMul supports bias

* fix add bias for row parallel gemm

* add --gen_std to launch.py

* fix AllReduceNCCL

* update launch.py

* less log

* update parallel_opt

* update launch.py

* add __eq__ for Placement sub-classes

* less benchmark run

* fix placement infer for matmul

* fix vocabulary size

* fix Exception

* Add shard tensor with group to support gpt2

* Add find successor function to find split op at different depth

* recover CommunicatorObj

* improve error message

* optimize parallel_opt.py

* optimize launch.py

* recover docs for all_reduce and all_gather

* - support concat for kvcache

* - modify allocator

* - add tensorType
- modify allocator to support memory allocation based on tensorType

* - fix allocator init

* - support kvcache by running 2 stub distributively

* - fix name

* - remove unused flag

* - fix wrong pb name

* - fix as constroy suggested

* - fix launch.py format

---------

Co-authored-by: constroy <constroy.li@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
2023-09-18 14:17:02 +08:00
PanZezhong1725 c6b82cfda0
Copyout numpy interface (#135)
* Add copy out numpy interface, delete returning buffer directly, add api test

* Add dtype interface
2023-09-15 16:40:44 +08:00
constroy Li 4c321c8a91
tensor parallel for transformer (#125)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use unique name to distinguish the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* export min/max to python

* fix MatMul

* modify launch.py to run opt

* hack to treat ReduceSum as AllReduceSum

* throw exception in cuda error

* fix parallel_opt.py

* improve the error prompt and cuda error check

* fix GatherObj::GatherObj member init

* fix size calculation for scalar (rank = 0) tensor

* MatMul supports bias

* fix add bias for row parallel gemm

* add --gen_std to launch.py

* fix AllReduceNCCL

* update launch.py

* less log

* update parallel_opt

* update launch.py

* add __eq__ for Placement sub-classes

* less benchmark run

* fix placement infer for matmul

* fix vocabulary size

* fix Exception

* Add shard tensor with group to support gpt2

* Add find successor function to find split op at different depth

* recover CommunicatorObj

* improve error message

* optimize parallel_opt.py

* optimize launch.py

* recover docs for all_reduce and all_gather

* Fix API

* fix format

---------

Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-14 14:19:45 +08:00
xgqdut2016 dda668fd16
"modified where" (#131)
* "modified where"

* "adapt int or bool condition datatype"

* "add broadcast_shape.h,error"

* add broadcast.h

* "modified broadcast_shape.h and where.cc,.cu"
2023-09-14 10:45:57 +08:00
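Where is a pointwise select, out = cond ? X : Y; the work in this PR is mostly in accepting int/bool condition dtypes and broadcasting the three inputs to a common shape (broadcast_shape.h). Ignoring broadcasting, the kernel core reduces to a sketch like this:

```cpp
__global__ void whereKernel(const uint8_t *cond, const float *x,
                            const float *y, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = cond[i] ? x[i] : y[i];
    // With broadcasting, cond/x/y would each be indexed through their
    // own stride decomposition of i instead of i directly.
}
```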
constroy Li f60767a770
impl distributed launch with NCCL (#106)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use unique name to distinguish the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* Add broadcast operator and cuda kernel

* Add comments for operators

* remove const of class member

* move communicator to CudaRuntimeObj

* Add an empty line at EOF.

---------

Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-05 09:47:35 +08:00
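The standard NCCL bring-up these commits wrap: rank 0 mints a `ncclUniqueId`, every rank obtains it out of band (this PR uses a uniquely named ID file; the sketch below substitutes an MPI broadcast for brevity), then all ranks call `ncclCommInitRank` collectively before issuing collectives such as `ncclAllReduce`:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

// Hedged sketch; the repo shares the ID via a file, not MPI.
void allReduceDemo(float *devBuf, size_t n) {
    int rank, world;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world);
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id); // one rank creates the ID
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, world, id, rank); // collective across ranks
    ncclAllReduce(devBuf, devBuf, n, ncclFloat, ncclSum, comm,
                  /*stream=*/0);
    cudaStreamSynchronize(0);
    ncclCommDestroy(comm);
}
```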
PanZezhong1725 2412c25e67
Issue 107: Add copyin Numpy and conversion to Numpy (#126)
* Add copyin_numpy and to_numpy for pybind TensorObj

* fix copyin size assertion

* fix size calculation for scalar (rank = 0) tensor

* Use pybind buffer instead of returning array

* fix format
2023-09-01 11:20:26 +08:00
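The copy-in/copy-out pattern this PR describes — validate the numpy buffer, memcpy into the tensor, and return a fresh array on the way out rather than exposing the internal buffer — looks roughly like the pybind11 sketch below. `Tensor` is a minimal stand-in, not the repo's TensorObj:

```cpp
#include <cstring>
#include <stdexcept>
#include <vector>
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;

struct Tensor { // stand-in for the real tensor type
    std::vector<float> data;
    size_t size() const { return data.size(); }
};

// Copy-in: check the element count, then memcpy.
void copyinNumpy(Tensor &t, py::array_t<float, py::array::c_style> arr) {
    py::buffer_info info = arr.request();
    if ((size_t)info.size != t.size()) // the PR fixes exactly this assertion
        throw std::runtime_error("copyin: size mismatch");
    std::memcpy(t.data.data(), info.ptr, t.size() * sizeof(float));
}

// Copy-out: allocate a fresh numpy array instead of handing out the
// tensor's buffer directly (the behavior #135 removed).
py::array_t<float> toNumpy(const Tensor &t) {
    py::array_t<float> out(t.size());
    std::memcpy(out.mutable_data(), t.data.data(), t.size() * sizeof(float));
    return out;
}
```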
zhangyunze 3e6ef305f1
Framework support for constructing bert/gpt2 model graphs (#94)
* feat: support to sqrt op

* feat: support to erf op

* feat: support to expand op

* feat: support to where op

* fix: gather op index can be int64_t (hard coding)

* fix: some wrong use

* style: fix the format style

* test: add test for change op

* fix: rebase to master

* fix: fix matmul b compute wrong

* add expand and where kernel

* Add int64 support for cuda gather kernel

* add test_where.cc

* add "expand.(cu/cc,test,cuda),modified where.cu"

* Separate initialization of datatypes to avoid compile error

* modify where.(cu/cc/h,test), expand and clip

* Format fix

* Format fix

---------

Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-08-29 16:06:52 +08:00
constroy Li 1e91979c76
add CUDNN impl for Min and Max (#118)
* add cudnn impl for Min and Max

* fix onnx _search_shape with output shape
2023-08-22 16:19:29 +08:00
PanZezhong1725 9cf6c30e1c
Add cuda transpose kernel (#115)
* Add cuda transpose kernel

* Empty line cuda_transpose.h

* Empty line small_array.h

* empty line transpose.cc

* empty line transpose.cu

* empty line test_cuda_transpose.cc
2023-08-22 14:22:15 +08:00