* [feature] support kvcache with static graph
* use workspace to optimize kvcache attention
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* support more data type
* clang format
* fix little bug
* fix cncl datatype
* fix format
---------
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Zhang Bolun <Chamberlain0w0@gmail.com>
* feat: SqueezeOp lift the dependency of onnx infershape.
* feat: UnsqueezeOp lift the dependency of onnx infershape.
* feat: lift the dependency of onnx infershape
* fix: fix Makefile off nccl
* MLU CNCL base
* add FindCNCL.cmake; -lcncl not found
* bangPrintFloat not found
* docker: make successful, test error
* delete net file and onnxtest.py
* init
* fix cncl
* format
* fix
* format
* fix cncl
* run dist gpt2 on mlu
* format
* fix import error on mlu docker
* run llama single card
* run distributed llama2
* add test for slice/reduce on mlu
* fix cncl related test
* fix format
* format
* delete comments
* change GPU to MLU
* modify launch script
* fix name
* fix format
* fix gather
* format python script
---------
Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
* - add layernorm kernel
* success: add layernorm kernel and test
* fix: remove unusable comments
* fix: modify code as reviewer suggested
* debug, modified .cu and test
* optional bias support
* overloading function
* fix bug after merging; remove time constraint in conv test
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add reduceSum op and kernel
* fix merge and format
* Reduce: reuse cat macro, add doc string
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* feat: support dynamic tensor part1
* feat: support dynamic-tensor part2
* feat: support dynamic tensor part 3
* fix: fix some ..
* - add kvcache example
* feat: support concat to identity kernel
* add a simple memory pool for allocator
* fix: rebase to master
* fix bug after merging
* - remove outdated script
* fix: fix as review
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* [feature] add fused attention_kvcache operator support
* add test to attention_kvcache op
* Add space line at EOF
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* "add softmax.cu,.cc,.h"
* Modify cuda softmax
* "modified the introduction of softmax.cu"
* "add format of cuda_softmax.h"
* "modified where.cc(.cu,.h) and softmax.cu"
* "modified format"
* Fix cpu softmax kernel
* "modified the // introduction of softmax.cu"
* "modified softmax.cu and use 1D block"
* "modified softmax.cu,format, and use 1D block"
* "introduce shared memory to speed up softmax"
* "reduce the input of function"
* modified the format
* remodify 2D block softmax
* remodify 1D block softmax
* modified the shared memory
* add warp reduce
* resolve conflicts (second pass)
* remove extra space line
* address review comment
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
* test: support compiling the einnet unit tests, although not all of them pass
Signed-off-by: YdrMaster <ydrml@hotmail.com>
* Fix: locating resource files and skip codegen
- Change the path parameters in `matchExprResult` and `checkExprLogSame` to paths relative to the project home
- Skip NNetMemboundOp tests as they require codegen
---------
Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* fix Slice
* change default rounds of timeit to 10 to reduce time
* fix slice with large ends
* Reshape support Int64
* support position_ids as input
* skip last MatMul in Llama
* skip infer_shapes to parse large model
* update launch.py
* fix split_concat_kernel
* print more message in launch.py
* Reshape supports both Int32 and Int64
* try infer_shapes and warn about failure
* fix format
---------
Co-authored-by: whjthu <haojie0429@gmail.com>
* support kunlun xpu and add an operator named Add
* add sub, mul, div, pow, maximum, minimum
* add code
* add xpu code
* add code
* add matmul
* add transpose
* add unary operator
* add unary operator
* add some operator
* add code
* support run resnet18 on xpu
* add code
* add max pool2d
* fix xpu code so that it runs
* Add XPU operators (#120)
* add floordiv for xpu
* add batchnorm for xpu
* add more cast types for xpu
* add conv_trans for xpu
* add pad for xpu
* add logical ops for xpu
* fix format for xpu src and include
* fix format for xpu test
* fix format for xpu src
---------
Co-authored-by: Bolun <bolunz@u.nus.edu>
* Xpu abs (#121)
* add: unary kernel for xpu
* formatting
* format
* format
* format
* fix: pointer jump
* fix optype comments
* fix bug introduced while resolving conflict
* change cmake option for kunlunxin xpu from 'xpu' to 'kunlun'; fix bug after merging distributed infrastructure
* Add doc support for xpu (#141)
* fix
* fix
* fix pooling test
* format
* format
* fix
* fix
* set cmake version requirement
* fix cmakelists
* rename xpu to kunlun
* fix
* fix format
* fix format
* fix format
* fix change name to kunlun
* format
* fix format
* clang format
* fix format
---------
Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: zhangyue207 <138768300+zhangyue207@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: baominghelly <41820386+baominghelly@users.noreply.github.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
* Add GatherElements op and cuda kernel
* fix format
* remove print
* remove unused var
* fix spacing
* fix format
---------
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* add ceil mode for pooling
* do not print debug info for allocator by default
* fix test bugs after introducing pooling ceil mode
* fix onnx import bug
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use a unique name to distinguish the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* export min/max to python
* fix MatMul
* modify launch.py to run opt
* hack to treat ReduceSum as AllReduceSum
* throw exception in cuda error
* fix parallel_opt.py
* improve the error prompt and cuda error check
* fix GatherObj::GatherObj member init
* fix size calculation for scalar (rank = 0) tensor
* MatMul supports bias
* fix add bias for row parallel gemm
* add --gen_std to launch.py
* fix AllReduceNCCL
* update launch.py
* less log
* update parallel_opt
* update launch.py
* add __eq__ for Placement sub-classes
* less benchmark run
* fix placement infer for matmul
* fix vocabulary size
* fix Exception
* Add shard tensor with group to support gpt2
* Add find successor function to find split op at different depth
* recover CommunicatorObj
* improve error message
* optimize parallel_opt.py
* optimize launch.py
* recover docs for all_reduce and all_gather
* - support concat for kvcache
* - modify allocator
* - add tensorType
- modify allocator to support memory allocation based on tensorType
* - fix allocator init
* - support kvcache by running 2 stubs in a distributed manner
* - fix name
* - remove unused flag
* - fix wrong pb name
* - fix as constroy suggested
* - fix launch.py format
---------
Co-authored-by: constroy <constroy.li@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use a unique name to distinguish the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* export min/max to python
* fix MatMul
* modify launch.py to run opt
* hack to treat ReduceSum as AllReduceSum
* throw exception in cuda error
* fix parallel_opt.py
* improve the error prompt and cuda error check
* fix GatherObj::GatherObj member init
* fix size calculation for scalar (rank = 0) tensor
* MatMul supports bias
* fix add bias for row parallel gemm
* add --gen_std to launch.py
* fix AllReduceNCCL
* update launch.py
* less log
* update parallel_opt
* update launch.py
* add __eq__ for Placement sub-classes
* less benchmark run
* fix placement infer for matmul
* fix vocabulary size
* fix Exception
* Add shard tensor with group to support gpt2
* Add find successor function to find split op at different depth
* recover CommunicatorObj
* improve error message
* optimize parallel_opt.py
* optimize launch.py
* recover docs for all_reduce and all_gather
* Fix API
* fix format
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use a unique name to distinguish the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* Add broadcast operator and cuda kernel
* Add comments for operators
* remove const of class member
* move communicator to CudaRuntimeObj
* Add an empty line at EOF.
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>