* add ceil mode for pooling
* do not print debug info for allocator by default
* fix test bugs after introducing pooling ceil mode
* fix onnx import bug
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use unique name to distingush the the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* Add broadcast operator and cuda kernel
* Add comments for operators
* remove const of class member
* move communicator to CudaRuntimeObj
* Add an empty line at EOF.
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* feat: support to sqrt op
* feat: support to erf op
* feat: support to expand op
* feat: support to where op
* fix: gather op index can be int64_t(hard coding)
* fix: some wrong use
* style: fix the format style
* test: add test for change op
* fix: rebase to master
* fix: fix matmul b compute wrong
* add expand and where kernel
* Add int64 support for cuda gather kernel
* add test_where.cc
* add "expand.(cu/cc,test,cuda),modified where.cu"
* Separate initialization of datatypes to avoid compile error
* modify where.(cu/cc/h,test), expand and clip
* Format fix
* Format fix
---------
Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add cuda transpose kernel
* Empty line cuda_transpose.h
* Empty line small_array.h
* empty line transpose.cc
* empty line transpose.cu
* empty line test_cuda_transpose.cc
* impl sqrt on CUDA
fix parser of Gather and ReduceMean
* fix test_gather
* fix test_cuda_gather
* impl sqrt cpu and add sqrt to test_cuda_unary
* cuda_unary supports arbitary shapes
* fix SplitOp with dim=-1
* fix SplitOp with dim=-1
* feat: support mixed dtype
* feat: support cast op
* test: add test for cast op
* feat: support datatype BFloat16
* feat: support data convert fp32 <-> bfp16
* fix: fix all op's infershape func
* fix as review comment
* - add LazyAllocator class
- calculate memory consumption at present
* - basic function of lazy_allocator, remaining test
* - modify LazyAllocator
* - modify InfiniTensor to fit LazyAllocator
* - add setDataBlob
- modify alignment
- fix GraphObj::dataMalloc
* - modified alignment value(64bytes -> 8bytes)
- fix LazyAllocator::getPtr()
- some dubug codes and commonts
- do alignment by chaning size instead of tailAddr
* - fix some problem
* - translate chinese comments to english
* - format codes
* - fix test
* - code format
* - modify codes as YdrMaser and bitzyz suggested
* - code format
* - modify codes as constroy suggested
* - codes format
* - modify alignment on cuda
* - code format
* - add test_lazy_allocator
- fix tests where not add input tensor into graph.tensors
- fix tests where init tensor's data before calling graph->dataMallocate()
* - code format
* - remove gpu runtime in test_lazy_allocator
* - fix test_lazy_allocator: remove cuda include
* - add test
* - code format
* - add ifdef for test of allocator
* - code format
* - fix test: remove unused ifdef
* - fix bang test
* - code format
* Merge branch 'master' into dcj/memory_allocator
* fix: fix cuda conv_fp16 run fail
* fix bang_runtime.cc and cuda_runtime.cc
* - update mkl code
* - fix codes for mkl
* - code format
* - remove unused commented codes
- add an empty line at the end of the blob.cc
---------
Co-authored-by: zhangyunze <z13785159769@163.com>
* add conv_half kernel
* Conv Kernel FP16
* dcj:
replace "DataType::Float32" with "op->getDType()" to support more DataType
* feat: support Float16 dtype
* fix: set default clang-format to 14 version
* fix: 按照review意见修改
* fix: add data convert to convfp16 kernel test
* test: add conv_fp16 kernel test
---------
Co-authored-by: zhangyue207 <zhangyue@qiyuanlab.com>
Co-authored-by: kilinchange <kilinchange@163.com>
reconfig: connections among op and tensor now is managered by GraphObj .
add some comments
merge from master
merge from master
ADD: sub graph replacement
reconfig inputs of op resize, due to the check of operator inputs.
ResizeObj::clone
clang format
fix some and add test for multi-output.
replacement support multi-inputs and multi-outputs.
add clone for all operators
add replaceSubGraph addSubGraph
remove extra code
add more test
remove extra print
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
fix review
change Device::MKL to Device::INTELCPU
fix mkl linkage
fix errors according to merge from master
now can call mkl backend
fix softmax/flatten with axis from onnx.
modify README.md
fix memory refree
add env_lotus_intelcpu.sh
fix compile
merge from branch cpu_backend
fix something add gather
fix something
FIX: directory rename from "mkl" to "intelcpu"
ADD: use oneMKL dpcpp interface to implement matmul kernel.
ADD: add dpcpp as compiler for mkl, and fix warnings for clang compiling.
add dpcpp kernel for pow.
ADD: mkl kernel for pad.
ADD: slice mkl kernel.
ADD: reshape/flatten/identity mkl kernel.
ADD: split mkl kernel.
fix compile error
FIX: fix flattenObj with axis.
ADD reduce_mean mkl kernel.
Add concat mkl kernel.
bathNorm for mkl kernel.
sigmoid mkl kernel.
ADD:add mkl kernel for pooling
add more tests for softmax
Now softmax cuda kernel supports any axises.
mkl kernel for softmax
softmax
add axis to softmax operator
add mkl kernel for abs tanh
ADD: relu kernel for mkl
fix binary mkl primitives.
add mkl kernel for binary operators
fix compiler error
move stream to runtime
clang format
add MemoryFormat for tensorObj.
use post_ops for fused conv/deconv
Distinguish mkl op_timer from cuda op timer.
add act optype to conv and deconv
add operator timer
add mkl kernel for convTransposed
minor fix for group conv
do not use cblas_sgemm_batch
CpuRuntimeObj->NativeCpuRuntimeObj
add matmul op for mkl
* support matmul
* add matmul
* add matmul
* add code for cnnl matmul operation and test
* add conv
* add code for conv test on mlu
* add code for test cnnl conv on mlu
* add code for perf conv and matmul on mlu
* clang format
* fix convolution operation
* fxi cmaklist
* code format
* fix code
* code format
---------
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: wanghailu <wanghailu0717@163.com>
* move memory format transformation to TensorObj
clang format
add MemoryFormat for tensorObj.
use post_ops for fused conv/deconv
Distinguish mkl op_timer from cuda op timer.
add act optype to conv and deconv
add operator timer
add mkl kernel for convTransposed
minor fix for group conv
do not use cblas_sgemm_batch
CpuRuntimeObj->NativeCpuRuntimeObj
add matmul op for mkl
* fix: fix bugs when rebasing from master
fix: fix bugs when rebasing from master
* fix: update api after rebasing
* fix: fix format; fix onnx import
* fix: fix clang-format
* [fix] fix conv_transpose test
* [fix] use stronger test case for transposed conv
* [fix] remove tensor memory format; fix mkl transpose conv
* [fix] add FIXME tag for op_timer python api
---------
Co-authored-by: whjthu <haojie0429@gmail.com>
ADD: resize operator and cuda kernel,support nearest/linear coef.
fix some
fix tests
add more tests for linear mode.
add linear coef mode.
add scales
add tests
fix tests.
add notLarger notSmaller
fix
add test
ADD:resize operator and cuda kernel
fix numInputs of batchNorm, add new line in file ending.
ADD: batch norm operator and cuda kernel.
add training
remove comments.
fix compile error.
add batch norm operator and cuda kernel.
* fix a little bug which found by new verison CMake
* add code for support BangC language kernel , just like Cuda kernel, not
library
* add bangc kernel
* support BangC kernel
* add code for support BangC kernel
* support bangc kernel
* fix some code from reviewer
* fix code of template fumction
* add code for support bangc kernel
* fix bangc format
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* ADD:concat/split operator and cuda kernels
refector
minor change comment
ADD:concat/split operator and cuda kernels
merge split_kernel and concat_kernel to split_concat_kernel.
Revert "fix"
This reverts commit 459926be09a838658ec55f1e0a72b3cf17037d5c.
fix
ADD:concat/split operator and cuda kernels
change whole tensor name to composed tensor
fix some
remove unused header.
rebase
add CudaKernel
add test for split.
ADD split operator and cuda kernel.
modify test.
ADD:concat operator and cuda kernel.
ADD:concat/split operator and cuda kernels
fix some
remove unused header.
rebase
add CudaKernel
ADD:concat/split operator and cuda kernels
add test for split.
ADD split operator and cuda kernel.
modify test.
ADD:concat operator and cuda kernel.
* remove extra comment; typo fix.
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add: interface for membound TVM kernel and test
* add getAnsorCode
* add evaluation, but link failed
* add evaluation of kernel, but link failed
* Fix: link libcuda and nvrtc
* add print
* Add: const for source of copy
* compile and evaluate the kernel
* add compute
* fix gen_ansor_op.py
* fix membound_TVM
* format and fix CMakeLists.txt
* fix memory leak
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
Co-authored-by: huangshuhong <huangsh19@mails.tsinghua.edu.cn>
* add code for cambricon mlu, bang, cnnl
* add code for support cambricon mlu,cnnl,cnrt
* add code for support mlu
* add code for support cambricon cnnl
* add code for support mlu
* add code for mlu
* add code for mlu
`
* Update CMakeLists.txt
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: zhengly123 <zhengly123@outlook.com>
* ADD:reshape/flatten/identity operators and cuda kernel.
fix: use cudaMemcpyAsync
clang format.
ADD flatten/identity operator.
add test for reshape.
ADD: reshape operator and cuda kernel.
* Fix: seperate CUDA tests & remove old header
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* add code for activation operation
* add code for activation operation on GPU
* add test code for activation operation
* add code for activation operation
* add code for activation on gpu ,use cudnn
* add code for activation on GPU use cudnn
* Chore: add constants.h and remove comments
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* use protobuf for tensor data save,write,read, in chinese 序列化和反序列化
* add protobuf
* add code for tensor load & save from/to file
* add code for tensor laod & save
* add code for tensor load & save
* add code for tensor save & load
* add code for tensor save & load
* add code for save & load
* add code for load & save
* add code for tensor load & save
* add code for tensor save & load
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Fix some
remove useless code.
add div/pow kernel
Add add/mul/sub operators.
fix cpu kernel.
add element wise kenerl for cuda.
ADD element wise operator.
* Function tune and corresponding testcase.
*Add: Tune function in /src/kernel/cuda/conv.cc and corresponding testcase in test_conv.
*Fix: A little bug of perfRecord using in /src/core/runtime.cc.
* Tune part debug
*Add: recover the code, fixed the commit error.
*Add: some anotations in tune function
* clang formmat test
* Fix: mem leak in CUDA Runtime and Conv
* Fix: sync in conv and default sync in timeit
* Change the way to tune operator conv.
Timeit function cudNNUnfused -> Timeit function cudnnConvolutionForward.
* Change: merge the common part of cudnnunfused&tune into cudnndescriptoraccess
* clang test
* clang-format
* clang-format bash.
* Added operator G2BMM and corresponding testcase.
*Added files related to operator G2BMM creating&calling.
*Added custom_ops.cuh&custom_op.h.
* Add operator GBMML
* new version
* Fix: G2BMM and GBMM kernel bugs
* Added testcase of operator GBMML
* clang format
* Added cmake option REQUIRE_GCC9
* Delete redundent file
* Renamed class GBMML into GBMM
* clang format
* Reviewed.
* Added cudahostcompier option.
* Add: explicit CMAKE_CUDA_HOST_COMPILER
* Rename gbmm kernel
* Fix: nvcc warning in GBMM and G2BMM
Co-authored-by: wcz112 <wcz19@mails.tsinghua.edu.cn>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* Function tune and corresponding testcase.
*Add: Tune function in /src/kernel/cuda/conv.cc and corresponding testcase in test_conv.
*Fix: A little bug of perfRecord using in /src/core/runtime.cc.
* Tune part debug
*Add: recover the code, fixed the commit error.
*Add: some anotations in tune function
* clang formmat test
* Fix: mem leak in CUDA Runtime and Conv
* Fix: sync in conv and default sync in timeit
* Change the way to tune operator conv.
Timeit function cudNNUnfused -> Timeit function cudnnConvolutionForward.
* Change: merge the common part of cudnnunfused&tune into cudnndescriptoraccess
* clang test
* clang-format
* clang-format bash.
* Chore: remove print and blank lines
Co-authored-by: wcz112 <wcz19@mails.tsinghua.edu.cn>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
Class "Cuda Runtime" fulfills function "tune" and adds corresponding testcase.
*Add: convCudnn::tune, convCudnn::cuDNNdescriptorAccess.
*Add: testcase tune.
*Fix: a brief bug in CPU Runtime.
* Fix: add warm-up and repetition in timing
* Add: CUDA runtime and float support
* Refactor: Cuda and Cpu runtimes inherit Runtime
* Add: environment script for Lotus
* Add: Lotus build instructions
* Update README.md
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* Refactor: operator hash and inferShape
* Add: hash without shape
* Add: inferShape interface for given input tensors
* Add: construct outputs in op ctor
* Add: comments for matmul
* Add: opType in AttrVector and WorkloadVector
* Chore: _graph -> graph in Op ctor
* Chore: change the "Node" suffix to "Obj"
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>