* Add reduceSum op and kernel
* fix merge and format
* Reduce: reuse cat macro, add doc string
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* add ceil mode for pooling
* do not print debug info for allocator by default
* fix test bugs after introducing pooling ceil mode
* fix onnx import bug
* feat: support mixed dtype
* feat: support cast op
* test: add test for cast op
* feat: support datatype BFloat16
* feat: support data convert fp32 <-> bfp16
* fix: fix all op's infershape func
* fix per review comments
* - add LazyAllocator class
- calculate memory consumption at present
* - basic functionality of lazy_allocator; tests remaining
* - modify LazyAllocator
* - modify InfiniTensor to fit LazyAllocator
* - add setDataBlob
- modify alignment
- fix GraphObj::dataMalloc
* - modify alignment value (64 bytes -> 8 bytes)
- fix LazyAllocator::getPtr()
- some debug code and comments
- do alignment by changing size instead of tailAddr
* - fix some problems
* - translate Chinese comments to English
* - format codes
* - fix test
* - code format
* - modify code as YdrMaster and bitzyz suggested
* - code format
* - modify codes as constroy suggested
* - codes format
* - modify alignment on cuda
* - code format
* - add test_lazy_allocator
- fix tests that did not add the input tensor to graph.tensors
- fix tests that initialized tensor data before calling graph->dataMallocate()
* - code format
* - remove gpu runtime in test_lazy_allocator
* - fix test_lazy_allocator: remove cuda include
* - add test
* - code format
* - add ifdef for test of allocator
* - code format
* - fix test: remove unused ifdef
* - fix bang test
* - code format
* Merge branch 'master' into dcj/memory_allocator
* fix: fix cuda conv_fp16 run fail
* fix bang_runtime.cc and cuda_runtime.cc
* - update mkl code
* - fix codes for mkl
* - code format
* - remove unused commented codes
- add an empty line at the end of the blob.cc
---------
Co-authored-by: zhangyunze <z13785159769@163.com>
* add conv_half kernel
* Conv Kernel FP16
* dcj:
replace "DataType::Float32" with "op->getDType()" to support more DataType
* feat: support Float16 dtype
* fix: set default clang-format to 14 version
* fix: revise according to review comments
* fix: add data convert to convfp16 kernel test
* test: add conv_fp16 kernel test
---------
Co-authored-by: zhangyue207 <zhangyue@qiyuanlab.com>
Co-authored-by: kilinchange <kilinchange@163.com>
reconfig: connections among ops and tensors are now managed by GraphObj.
add some comments
merge from master
merge from master
ADD: sub graph replacement
reconfig inputs of op resize, due to the check of operator inputs.
ResizeObj::clone
clang format
fix some issues and add test for multi-output.
replacement support multi-inputs and multi-outputs.
add clone for all operators
add replaceSubGraph addSubGraph
remove extra code
add more test
remove extra print
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* move memory format transformation to TensorObj
clang format
add MemoryFormat for tensorObj.
use post_ops for fused conv/deconv
Distinguish mkl op_timer from cuda op timer.
add act optype to conv and deconv
add operator timer
add mkl kernel for convTransposed
minor fix for group conv
do not use cblas_sgemm_batch
CpuRuntimeObj->NativeCpuRuntimeObj
add matmul op for mkl
* fix: fix bugs when rebasing from master
fix: fix bugs when rebasing from master
* fix: update api after rebasing
* fix: fix format; fix onnx import
* fix: fix clang-format
* [fix] fix conv_transpose test
* [fix] use stronger test case for transposed conv
* [fix] remove tensor memory format; fix mkl transpose conv
* [fix] add FIXME tag for op_timer python api
---------
Co-authored-by: whjthu <haojie0429@gmail.com>
* use protobuf for tensor data save, write, read (serialization and deserialization)
* add protobuf
* add code for tensor load & save from/to file
* add code for tensor load & save
* add code for tensor load & save
* add code for tensor save & load
* add code for tensor save & load
* add code for save & load
* add code for load & save
* add code for tensor load & save
* add code for tensor save & load
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
* Fix: add warm-up and repetition in timing
* Add: CUDA runtime and float support
* Refactor: Cuda and Cpu runtimes inherit Runtime
* Add: environment script for Lotus
* Add: Lotus build instructions
* Update README.md
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
* Refactor: operator hash and inferShape
* Add: hash without shape
* Add: inferShape interface for given input tensors
* Add: construct outputs in op ctor
* Add: comments for matmul
* Add: opType in AttrVector and WorkloadVector
* Chore: _graph -> graph in Op ctor
* Chore: change the "Node" suffix to "Obj"
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>