* - add layernorm kernel
* success:add layernorm kernel and test
* fix: remove unusalble comments
* fix: modify code as reviewer suggested
* debug,modified .cu and test
* optional bias support
* overloading function
* fix bug after merging; remove time constrain in conv test
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add reduceSum op and kernel
* fix merge and format
* Reduce: reuse cat macro, add doc string
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* feat: support dynamic tensor part1
* feat: support dynamic-tensor part2
* feat: support dynamic tensor part 3
* fix: fix some ..
* - add kvcache example
* feat: support concat to identity kernel
* add a simple mempory pool for allocator
* fix: rebase to master
* fix bug after merging
* - remove outdated script
* fix: fix as review
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* [feature] add fused attention_kvcache operator support
* add test to attention_kvcache op
* Add space line at EOF
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* "add softmax.cu,.cc,.h"
* Modify cuda softmax
* "modified the introduction of softmax.cu"
* "add format of cuda_softmax.h"
* "modified where.cc(.cu,.h) and softmax.cu"
* "modified format"
* Fix cpu softmax kernel
* "modified the // introduction of softmax.cu"
* "modified softmax.cu and use 1D block"
* "modified softmax.cu,format, and use 1D block"
* "introduce share mem to speed softmax"
* "reduce the input of function"
* modified the format
* remodify 2D block softmax
* remodify 1D block softmax
* modified the share memory
* add warp reduce
* conflict solve two
* remove extra space line
* solve comment
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
* Add GatherElements op and cuda kernel
* fix format
* remove print
* remove unused var
* fix spacing
* fix format
---------
Co-authored-by: panzezhong@qiyuanlab.com <panzezhong@zezhongpan>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* add ceil mode for pooling
* do not print debug info for allocator by default
* fix test bugs after introducing pooling ceil mode
* fix onnx import bug
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use unique name to distingush the the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* Add broadcast operator and cuda kernel
* Add comments for operators
* remove const of class member
* move communicator to CudaRuntimeObj
* Add an empty line at EOF.
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* feat: support to sqrt op
* feat: support to erf op
* feat: support to expand op
* feat: support to where op
* fix: gather op index can be int64_t(hard coding)
* fix: some wrong use
* style: fix the format style
* test: add test for change op
* fix: rebase to master
* fix: fix matmul b compute wrong
* add expand and where kernel
* Add int64 support for cuda gather kernel
* add test_where.cc
* add "expand.(cu/cc,test,cuda),modified where.cu"
* Separate initialization of datatypes to avoid compile error
* modify where.(cu/cc/h,test), expand and clip
* Format fix
* Format fix
---------
Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add cuda transpose kernel
* Empty line cuda_transpose.h
* Empty line small_array.h
* empty line transpose.cc
* empty line transpose.cu
* empty line test_cuda_transpose.cc
* impl sqrt on CUDA
fix parser of Gather and ReduceMean
* fix test_gather
* fix test_cuda_gather
* impl sqrt cpu and add sqrt to test_cuda_unary
* cuda_unary supports arbitary shapes
* fix SplitOp with dim=-1
* fix SplitOp with dim=-1
* - add LazyAllocator class
- calculate memory consumption at present
* - basic function of lazy_allocator, remaining test
* - modify LazyAllocator
* - modify InfiniTensor to fit LazyAllocator
* - add setDataBlob
- modify alignment
- fix GraphObj::dataMalloc
* - modified alignment value(64bytes -> 8bytes)
- fix LazyAllocator::getPtr()
- some dubug codes and commonts
- do alignment by chaning size instead of tailAddr
* - fix some problem
* - translate chinese comments to english
* - format codes
* - fix test
* - code format
* - modify codes as YdrMaser and bitzyz suggested
* - code format
* - modify codes as constroy suggested
* - codes format
* - modify alignment on cuda
* - code format
* - add test_lazy_allocator
- fix tests where not add input tensor into graph.tensors
- fix tests where init tensor's data before calling graph->dataMallocate()
* - code format
* - remove gpu runtime in test_lazy_allocator
* - fix test_lazy_allocator: remove cuda include
* - add test
* - code format
* - add ifdef for test of allocator
* - code format
* - fix test: remove unused ifdef
* - fix bang test
* - code format
* Merge branch 'master' into dcj/memory_allocator
* fix: fix cuda conv_fp16 run fail
* fix bang_runtime.cc and cuda_runtime.cc
* - update mkl code
* - fix codes for mkl
* - code format
* - remove unused commented codes
- add an empty line at the end of the blob.cc
---------
Co-authored-by: zhangyunze <z13785159769@163.com>
* add conv_half kernel
* Conv Kernel FP16
* dcj:
replace "DataType::Float32" with "op->getDType()" to support more DataType
* feat: support Float16 dtype
* fix: set default clang-format to 14 version
* fix: 按照review意见修改
* fix: add data convert to convfp16 kernel test
* test: add conv_fp16 kernel test
---------
Co-authored-by: zhangyue207 <zhangyue@qiyuanlab.com>
Co-authored-by: kilinchange <kilinchange@163.com>
fix review
change Device::MKL to Device::INTELCPU
fix mkl linkage
fix errors according to merge from master
now can call mkl backend
fix softmax/flatten with axis from onnx.
modify README.md
fix memory refree
add env_lotus_intelcpu.sh
fix compile
merge from branch cpu_backend
fix something add gather
fix something
FIX: directory rename from "mkl" to "intelcpu"
ADD: use oneMKL dpcpp interface to implement matmul kernel.
ADD: add dpcpp as compiler for mkl, and fix warnings for clang compiling.
add dpcpp kernel for pow.
ADD: mkl kernel for pad.
ADD: slice mkl kernel.
ADD: reshape/flatten/identity mkl kernel.
ADD: split mkl kernel.
fix compile error
FIX: fix flattenObj with axis.
ADD reduce_mean mkl kernel.
Add concat mkl kernel.
bathNorm for mkl kernel.
sigmoid mkl kernel.
ADD:add mkl kernel for pooling
add more tests for softmax
Now softmax cuda kernel supports any axises.
mkl kernel for softmax
softmax
add axis to softmax operator
add mkl kernel for abs tanh
ADD: relu kernel for mkl
fix binary mkl primitives.
add mkl kernel for binary operators
fix compiler error
move stream to runtime
clang format
add MemoryFormat for tensorObj.
use post_ops for fused conv/deconv
Distinguish mkl op_timer from cuda op timer.
add act optype to conv and deconv
add operator timer
add mkl kernel for convTransposed
minor fix for group conv
do not use cblas_sgemm_batch
CpuRuntimeObj->NativeCpuRuntimeObj
add matmul op for mkl
* move memory format transformation to TensorObj
clang format
add MemoryFormat for tensorObj.
use post_ops for fused conv/deconv
Distinguish mkl op_timer from cuda op timer.
add act optype to conv and deconv
add operator timer
add mkl kernel for convTransposed
minor fix for group conv
do not use cblas_sgemm_batch
CpuRuntimeObj->NativeCpuRuntimeObj
add matmul op for mkl
* fix: fix bugs when rebasing from master
fix: fix bugs when rebasing from master
* fix: update api after rebasing
* fix: fix format; fix onnx import
* fix: fix clang-format
* [fix] fix conv_transpose test
* [fix] use stronger test case for transposed conv
* [fix] remove tensor memory format; fix mkl transpose conv
* [fix] add FIXME tag for op_timer python api
---------
Co-authored-by: whjthu <haojie0429@gmail.com>
ADD: resize operator and cuda kernel,support nearest/linear coef.
fix some
fix tests
add more tests for linear mode.
add linear coef mode.
add scales
add tests
fix tests.
add notLarger notSmaller
fix
add test
ADD:resize operator and cuda kernel
fix numInputs of batchNorm, add new line in file ending.
ADD: batch norm operator and cuda kernel.
add training
remove comments.
fix compile error.
add batch norm operator and cuda kernel.
* ADD:concat/split operator and cuda kernels
refector
minor change comment
ADD:concat/split operator and cuda kernels
merge split_kernel and concat_kernel to split_concat_kernel.
Revert "fix"
This reverts commit 459926be09a838658ec55f1e0a72b3cf17037d5c.
fix
ADD:concat/split operator and cuda kernels
change whole tensor name to composed tensor
fix some
remove unused header.
rebase
add CudaKernel
add test for split.
ADD split operator and cuda kernel.
modify test.
ADD:concat operator and cuda kernel.
ADD:concat/split operator and cuda kernels
fix some
remove unused header.
rebase
add CudaKernel
ADD:concat/split operator and cuda kernels
add test for split.
ADD split operator and cuda kernel.
modify test.
ADD:concat operator and cuda kernel.
* remove extra comment; typo fix.
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* ADD:reshape/flatten/identity operators and cuda kernel.
fix: use cudaMemcpyAsync
clang format.
ADD flatten/identity operator.
add test for reshape.
ADD: reshape operator and cuda kernel.
* Fix: seperate CUDA tests & remove old header
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>