* [feature] support kvcache with static graph
* use workspace to optimize kvcache attention
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* feat: SqueezeOp lift the dependency of onnx infershape.
* feat: UnsqueezeOp lift the dependency of onnx infershape.
* feat: lift the dependency of onnx infershape
* fix: fix Makefile off nccl
* MLU CNCL base
* add FindCNCL.cmake, not find -lcncl
* bangPrintFloat not find
* docker:make sucessful, test error
* delete net file and onnxtest.py
* init
* fix cncl
* format
* fix
* format
* fix cncl
* run dist gpt2 on mlu
* format
* fix import error on mlu docker
* run llama single card
* run distributed llama2
* add test for slice/reduce on mlu
* fix cncl related test
* fix format
* format
* delete comments
* change GPU to MLU
* MLU CNCL base
* add FindCNCL.cmake, not find -lcncl
* bangPrintFloat not find
* docker:make sucessful, test error
* delete net file and onnxtest.py
* init
* fix cncl
* format
* fix
* format
* fix cncl
* run dist gpt2 on mlu
* format
* fix import error on mlu docker
* run llama single card
* run distributed llama2
* add test for slice/reduce on mlu
* fix cncl related test
* fix format
* format
* delete comments
* change GPU to MLU
* modify launch script
* fix name
* fix format
* fix gather
* format python script
---------
Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
* - add layernorm kernel
* success:add layernorm kernel and test
* fix: remove unusalble comments
* fix: modify code as reviewer suggested
* debug,modified .cu and test
* optional bias support
* overloading function
* fix bug after merging; remove time constrain in conv test
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add reduceSum op and kernel
* fix merge and format
* Reduce: reuse cat macro, add doc string
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* feat: support dynamic tensor part1
* feat: support dynamic-tensor part2
* feat: support dynamic tensor part 3
* fix: fix some ..
* - add kvcache example
* feat: support concat to identity kernel
* add a simple mempory pool for allocator
* fix: rebase to master
* fix bug after merging
* - remove outdated script
* fix: fix as review
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* [feature] add fused attention_kvcache operator support
* add test to attention_kvcache op
* Add space line at EOF
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* "add softmax.cu,.cc,.h"
* Modify cuda softmax
* "modified the introduction of softmax.cu"
* "add format of cuda_softmax.h"
* "modified where.cc(.cu,.h) and softmax.cu"
* "modified format"
* Fix cpu softmax kernel
* "modified the // introduction of softmax.cu"
* "modified softmax.cu and use 1D block"
* "modified softmax.cu,format, and use 1D block"
* "introduce share mem to speed softmax"
* "reduce the input of function"
* modified the format
* remodify 2D block softmax
* remodify 1D block softmax
* modified the share memory
* add warp reduce
* conflict solve two
* remove extra space line
* solve comment
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
* test: 支持编译 einnet 单元测试,但不是所有测试都能通过
Signed-off-by: YdrMaster <ydrml@hotmail.com>
* Fix: locating resource files and skip codegen
- Change the path parameters in `matchExprResult` and `checkExprLogSame` to paths relative to the project home
- Skip NNetMemboundOp tests as they require codegen
---------
Signed-off-by: YdrMaster <ydrml@hotmail.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* support kunlun xpu and add an operator named Add
* add sub, mul, div, pow, maximum, minimum
* add code
* add xpu code
* add code
* add matmul
* add transpose
* add unary operator
* add unary operator
* add some operator
* add code
* support run resnet18 on xpu
* add code
* add max pool2d
* fix xpu code, let it can run.
* 添加XPU算子 (#120)
* add floordiv for xpu
* add batchnorm for xpu
* add more cast types for xpu
* add conv_trans for xpu
* add pad for xpu
* add logical ops for xpu
* fix format for xpu src and include
* fix format for xpu test
* fix format for xpu src
---------
Co-authored-by: Bolun <bolunz@u.nus.edu>
* Xpu abs (#121)
* add: unary kernel for xpu
* formatting
* format
* format
* format
* fix: pointer jump
* fix optype comments
* fix bug introduced while resolving conflict
* change cmake option for kunlunxin xpu from 'xpu' to 'kunlun'; fix bug after merging distributed infrastructure
* Add doc support for xpu (#141)
* fix
* fix
* fix pooling test
* format
* format
* fix
* fix
* set cmake version requirement
* fix cmakelists
* rename xpu to kunlun
* fix
* fix format
* fix format
* fix format
* fix change name to kunlun
* format
* fix format
* clang format
* fix format
---------
Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: zhangyue207 <138768300+zhangyue207@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: baominghelly <41820386+baominghelly@users.noreply.github.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
* Add GatherElements op and cuda kernel
* fix format
* remove print
* remove unused var
* fix spacing
* fix format
---------
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* add ceil mode for pooling
* do not print debug info for allocator by default
* fix test bugs after introducing pooling ceil mode
* fix onnx import bug