* [feature] add cudagraph support
* modify code to pass the cuda_all_reduce test
* modify rope op
* support rmsnorm
* add fp16 support to silu cuda op (sketch below)
* fix bugs in rmsnorm op
* uncomment simplify in onnx.py
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
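A minimal sketch of what the fp16 SiLU kernel could look like; the kernel name and launch shape are illustrative, not the repository's actual code. SiLU is x * sigmoid(x); computing in float and storing back as half is a common accuracy-preserving pattern:

```cuda
#include <cuda_fp16.h>

// Illustrative fp16 SiLU kernel: y = x * sigmoid(x).
// Load half, compute in float, store half.
__global__ void silu_fp16(const half *x, half *y, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(x[i]);
        y[i] = __float2half(v / (1.0f + expf(-v)));
    }
}
```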
* feat: SqueezeOp lift the dependency of onnx infershape.
* feat: UnsqueezeOp lift the dependency of onnx infershape.
* feat: lift the dependency of onnx infershape
* fix: fix Makefile off nccl
* - add layernorm kernel
* success: add layernorm kernel and test
* fix: remove unusable comments
* fix: modify code as reviewer suggested
* debug: modified .cu and test
* optional bias support (see layernorm sketch below)
* overloading function
* fix bug after merging; remove time constraint in conv test
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
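A sketch of a layernorm kernel with the optional bias mentioned above: one block per row, shared-memory tree reduction for mean and variance, and `bias` allowed to be nullptr. Names, layout, and launch configuration are assumptions, not the PR's actual code:

```cuda
// Illustrative LayerNorm: one block per row; `bias` may be nullptr.
// Assumes blockDim.x is a power of two and shared memory holds
// 2 * blockDim.x floats (partial sums and partial square sums).
__global__ void layernorm(const float *x, const float *scale,
                          const float *bias, float *y, int dim, float eps) {
    extern __shared__ float buf[];
    const float *row = x + (size_t)blockIdx.x * dim;
    float *out = y + (size_t)blockIdx.x * dim;

    float sum = 0.f, sqsum = 0.f;
    for (int i = threadIdx.x; i < dim; i += blockDim.x) {
        float v = row[i];
        sum += v;
        sqsum += v * v;
    }
    buf[threadIdx.x] = sum;
    buf[blockDim.x + threadIdx.x] = sqsum;
    __syncthreads();
    // tree reduction over the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            buf[threadIdx.x] += buf[threadIdx.x + s];
            buf[blockDim.x + threadIdx.x] += buf[blockDim.x + threadIdx.x + s];
        }
        __syncthreads();
    }
    float mean = buf[0] / dim;
    float var = buf[blockDim.x] / dim - mean * mean;
    float rstd = rsqrtf(var + eps);
    for (int i = threadIdx.x; i < dim; i += blockDim.x) {
        float v = (row[i] - mean) * rstd * scale[i];
        out[i] = bias ? v + bias[i] : v; // optional bias
    }
}
```

Launched as, e.g., `layernorm<<<nRows, 256, 2 * 256 * sizeof(float)>>>(...)`.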
* Add reduceSum op and kernel (warp-reduction sketch below)
* fix merge and format
* Reduce: reuse cat macro, add doc string
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
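A reduceSum kernel typically builds on a warp-level reduction; a minimal building block such a kernel could use (illustrative, not the repository's code):

```cuda
// Each warp sums its 32 lanes' partial values via shuffles;
// block- and grid-level reduction would stack on top of this.
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}
```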
* feat: support dynamic tensor part1
* feat: support dynamic-tensor part2
* feat: support dynamic tensor part 3
* fix: fix some ..
* - add kvcache example
* feat: support concat to identity kernel
* add a simple memory pool for allocator (sketch below)
* fix: rebase to master
* fix bug after merging
* - remove outdated script
* fix: fix as reviewer suggested
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
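A toy version of the "simple memory pool" idea above: a best-fit free list handing out offsets, with a bump pointer as fallback. Class and member names are hypothetical, and the real allocator presumably coalesces blocks and handles alignment:

```cuda
#include <cstddef>
#include <map>

// Sketch of a simple offset-based memory pool for an allocator.
class SimpleMemPool {
    std::multimap<size_t, size_t> freeBlocks; // size -> offset
    size_t used = 0;                          // high-water mark

  public:
    size_t alloc(size_t size) {
        auto it = freeBlocks.lower_bound(size); // smallest block that fits
        if (it != freeBlocks.end()) {
            size_t off = it->second;
            freeBlocks.erase(it);
            return off;
        }
        size_t off = used; // no fit: grow the pool
        used += size;
        return off;
    }
    void free(size_t offset, size_t size) {
        freeBlocks.emplace(size, offset); // no coalescing in this sketch
    }
};
```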
* [feature] add fused attention_kvcache operator support (sketch below)
* add test to attention_kvcache op
* Add blank line at EOF
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
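The fused attention_kvcache operator is more involved than this, but the cache-append step it fuses away (instead of a separate concat) could look roughly as follows; all names and layouts are illustrative assumptions:

```cuda
// Sketch of the KV-cache update inside a fused attention_kvcache
// kernel: write the current step's K and V at position `seqPos`.
__global__ void appendKV(const float *k, const float *v,
                         float *kCache, float *vCache,
                         int headDim, int maxSeqLen, int seqPos) {
    int h = blockIdx.x; // one block per head
    for (int d = threadIdx.x; d < headDim; d += blockDim.x) {
        size_t dst = ((size_t)h * maxSeqLen + seqPos) * headDim + d;
        kCache[dst] = k[h * headDim + d];
        vCache[dst] = v[h * headDim + d];
    }
    // ...the fused kernel would then attend over kCache[0..seqPos]
    // without materializing a concatenated tensor.
}
```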
* fix Slice
* change default rounds of timeit to 10 to reduce time
* fix slice with large ends (clamping sketch below)
* Reshape support Int64
* support position_ids as input
* skip last MatMul in Llama
* skip infer_shapes to parse large model
* update launch.py
* fix split_concat_kernel
* print more message in launch.py
* Reshape supports both Int32 and Int64
* try infer_shapes and warn about failure
* fix format
---------
Co-authored-by: whjthu <haojie0429@gmail.com>
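The "fix slice with large ends" commit matches ONNX Slice semantics, where ends may exceed the dimension (e.g. INT64_MAX as "to the end"); a sketch of the clamping with a hypothetical helper name, assuming a positive step:

```cuda
#include <algorithm>
#include <cstdint>

// Normalize and clamp an ONNX Slice `end` into [0, dimSize].
inline int64_t clampSliceEnd(int64_t end, int64_t dimSize) {
    if (end < 0)
        end += dimSize; // negative ends count from the back
    return std::min(std::max(end, int64_t(0)), dimSize);
}
```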
* support kunlun xpu and add an operator named Add
* add sub, mul, div, pow, maximum, minimum
* add code
* add xpu code
* add code
* add matmul
* add transpose
* add unary operator
* add unary operator
* add some operator
* add code
* support run resnet18 on xpu
* add code
* add max pool2d
* fix xpu code so it can run
* Add XPU operators (#120)
* add floordiv for xpu
* add batchnorm for xpu
* add more cast types for xpu
* add conv_trans for xpu
* add pad for xpu
* add logical ops for xpu
* fix format for xpu src and include
* fix format for xpu test
* fix format for xpu src
---------
Co-authored-by: Bolun <bolunz@u.nus.edu>
* Xpu abs (#121)
* add: unary kernel for xpu
* formatting
* format
* format
* format
* fix: pointer jump
* fix optype comments
* fix bug introduced while resolving conflict
* change cmake option for kunlunxin xpu from 'xpu' to 'kunlun'; fix bug after merging distributed infrastructure
* Add doc support for xpu (#141)
* fix
* fix
* fix pooling test
* format
* format
* fix
* fix
* set cmake version requirement
* fix cmakelists
* rename xpu to kunlun
* fix
* fix format
* fix format
* fix format
* fix: change name to kunlun
* format
* fix format
* clang format
* fix format
---------
Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
Co-authored-by: Bolun Zhang <48948016+Chamberlain0w0@users.noreply.github.com>
Co-authored-by: Bolun <bolunz@u.nus.edu>
Co-authored-by: zhangyue207 <138768300+zhangyue207@users.noreply.github.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: baominghelly <41820386+baominghelly@users.noreply.github.com>
Co-authored-by: Bolun <chamberlain0w0@gmail.com>
* Add GatherElements op and cuda kernel (sketch below)
* fix format
* remove print
* remove unused var
* fix spacing
* fix format
---------
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
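A minimal 2-D sketch of GatherElements (ONNX semantics) with the int64 indices mentioned in the gather commits; higher ranks and other axes follow the same index-substitution pattern. The kernel name and flattened layout are illustrative:

```cuda
#include <cstdint>

// GatherElements for rank 2, axis = 1: out[i][j] = in[i][idx[i][j]].
// `out` and `idx` share the shape rows x cols; `in` is rows x inCols.
__global__ void gatherElements2D(const float *in, const int64_t *idx,
                                 float *out, int rows, int cols,
                                 int inCols) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < rows * cols) {
        int i = t / cols;
        out[t] = in[(size_t)i * inCols + idx[t]];
    }
}
```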
* add ceil mode for pooling (output-size sketch below)
* do not print debug info for allocator by default
* fix test bugs after introducing pooling ceil mode
* fix onnx import bug
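For reference, pooling ceil mode only changes the output-length rounding: floor keeps full windows, ceil keeps the last partial window. A sketch following the ONNX formula (dilation omitted), with an illustrative helper name:

```cuda
// Pooling output length per ONNX: floor((in + 2p - k) / s) + 1 by
// default; ceil_mode rounds the division up instead.
inline int poolOutLen(int in, int kernel, int pad, int stride,
                      bool ceilMode) {
    int num = in + 2 * pad - kernel;
    return (ceilMode ? (num + stride - 1) / stride : num / stride) + 1;
}
```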
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use unique name to distinguish the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel (host-side sketch below)
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output a list of tensors, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* export min/max to python
* fix MatMul
* modify launch.py to run opt
* hack to treat ReduceSum as AllReduceSum
* throw exception in cuda error
* fix parallel_opt.py
* improve the error prompt and cuda error check
* fix GatherObj::GatherObj member init
* fix size calculation for scalar (rank = 0) tensor
* MatMul supports bias
* fix add bias for row parallel gemm
* add --gen_std to launch.py
* fix AllReduceNCCL
* update launch.py
* less log
* update parallel_opt
* update launch.py
* add __eq__ for Placement sub-classes
* less benchmark run
* fix placement infer for matmul
* fix vocabulary size
* fix Exception
* Add shard tensor with group to support gpt2
* Add find successor function to find split op at different depth
* recover CommunicatorObj
* improve error message
* optimize parallel_opt.py
* optimize launch.py
* recover docs for all_reduce and all_gather
* - support concat for kvcache
* - modify allocator
* - add tensorType
- modify allocator to support memory allocation based on tensorType
* - fix allocator init
* - support kvcache by running 2 stub distributively
* - fix name
* - remove unused flag
* - fix wrong pb name
* - fix as constroy suggested
* - fix launch.py format
---------
Co-authored-by: constroy <constroy.li@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
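The host side of the allReduce kernel is presumably a thin wrapper over NCCL's real API; a minimal sketch using `ncclAllReduce`, with an illustrative error-check macro (the "improve the error prompt and cuda error check" commits suggest the actual code throws exceptions instead of exiting):

```cuda
#include <cstdio>
#include <cstdlib>
#include <nccl.h>

#define NCCL_CHECK(cmd)                                                  \
    do {                                                                 \
        ncclResult_t r = (cmd);                                          \
        if (r != ncclSuccess) {                                          \
            printf("NCCL error %s:%d '%s'\n", __FILE__, __LINE__,        \
                   ncclGetErrorString(r));                               \
            exit(1);                                                     \
        }                                                                \
    } while (0)

// In-place sum across all ranks of the communicator, on `stream`.
void allReduceSum(float *buf, size_t count, ncclComm_t comm,
                  cudaStream_t stream) {
    NCCL_CHECK(ncclAllReduce(buf, buf, count, ncclFloat, ncclSum,
                             comm, stream));
}
```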
* Fix API
* fix format
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add broadcast operator and cuda kernel (host-side sketch below)
* Add comments for operators
* remove const of class member
* move communicator to CudaRuntimeObj
* Add an empty line at EOF.
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
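As with allReduce, the broadcast kernel's host side is presumably a direct call into NCCL; a sketch against the real `ncclBroadcast` signature, with an illustrative wrapper name and float data assumed:

```cuda
#include <nccl.h>

// Copy the root rank's buffer to every rank (in place here, since
// send and receive buffers may alias on the root).
void broadcast(float *data, size_t count, int root, ncclComm_t comm,
               cudaStream_t stream) {
    ncclBroadcast(data, data, count, ncclFloat, root, comm, stream);
}
```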
* feat: support to sqrt op
* feat: support to erf op
* feat: support to expand op
* feat: support to where op
* fix: gather op index can be int64_t (hard-coded)
* fix: some wrong use
* style: fix the format style
* test: add test for change op
* fix: rebase to master
* fix: fix wrong computation of matmul input b
* add expand and where kernel (where sketch below)
* Add int64 support for cuda gather kernel
* add test_where.cc
* add "expand.(cu/cc,test,cuda),modified where.cu"
* Separate initialization of datatypes to avoid compile error
* modify where.(cu/cc/h,test), expand and clip
* Format fix
* Format fix
---------
Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
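The elementwise core of a Where kernel is small once inputs share a shape; a sketch assuming the operands were already expanded/broadcast to a common element count, with a bool condition stored as uint8 (names illustrative):

```cuda
#include <cstdint>

// ONNX Where: out[i] = cond[i] ? x[i] : y[i].
__global__ void whereKernel(const uint8_t *cond, const float *x,
                            const float *y, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = cond[i] ? x[i] : y[i];
}
```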
* impl sqrt on CUDA; fix parser of Gather and ReduceMean
* fix test_gather
* fix test_cuda_gather
* impl sqrt cpu and add sqrt to test_cuda_unary
* cuda_unary supports arbitrary shapes
* fix SplitOp with dim=-1 (axis normalization sketch below)
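The usual fix for negative split dimensions is to normalize the axis before any shape math; a one-liner sketch with a hypothetical helper name:

```cuda
// Map an axis in [-rank, rank) onto [0, rank), so dim=-1 means
// the last dimension, as in the SplitOp fix above.
inline int normalizeAxis(int axis, int rank) {
    return axis < 0 ? axis + rank : axis;
}
```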