* feat: support dynamic tensor part1
* feat: support dynamic-tensor part2
* feat: support dynamic tensor part 3
* fix: fix some ..
* - add kvcache example
* feat: support concat to identity kernel
* add a simple mempory pool for allocator
* fix: rebase to master
* fix bug after merging
* - remove outdated script
* fix: fix as review
---------
Co-authored-by: kilinchange <kilinchange@163.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use unique name to distingush the the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* export min/max to python
* fix MatMul
* modify launch.py to run opt
* hack to treat ReduceSum as AllReduceSum
* throw exception in cuda error
* fix parallel_opt.py
* improve the error prompt and cuda error check
* fix GatherObj::GatherObj member init
* fix size calculation for scalar (rank = 0) tensor
* MatMul supports bias
* fix add bias for row parallel gemm
* add --gen_std to launch.py
* fix AllReduceNCCL
* update launch.py
* less log
* update parallel_opt
* update launch.py
* add __eq__ for Placement sub-classes
* less benchmark run
* fix placement infer for matmul
* fix vacabuary size
* fix Exception
* Add shard tensor with group to support gpt2
* Add find successor function to find split op at different depth
* recover CommunicatorObj
* improve error mesasge
* optimize parallel_opt.py
* optimize launch.py
* recover docs for all_reduce and all_gather
* - support concat for kvcache
* - modify allocator
* - add tensorType
- modify allocator to support memory allocation based on tensorType
* - fix allocator init
* - support kvcache by running 2 stub distributively
* - fix name
* - remove unused flag
* - fix wrong pb name
* - fix as constroy suggessed
* - fix launch.py format
---------
Co-authored-by: constroy <constroy.li@gmail.com>
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
* - add LazyAllocator class
- calculate memory consumption at present
* - basic function of lazy_allocator, remaining test
* - modify LazyAllocator
* - modify InfiniTensor to fit LazyAllocator
* - add setDataBlob
- modify alignment
- fix GraphObj::dataMalloc
* - modified alignment value(64bytes -> 8bytes)
- fix LazyAllocator::getPtr()
- some dubug codes and commonts
- do alignment by chaning size instead of tailAddr
* - fix some problem
* - translate chinese comments to english
* - format codes
* - fix test
* - code format
* - modify codes as YdrMaser and bitzyz suggested
* - code format
* - modify codes as constroy suggested
* - codes format
* - modify alignment on cuda
* - code format
* - add test_lazy_allocator
- fix tests where not add input tensor into graph.tensors
- fix tests where init tensor's data before calling graph->dataMallocate()
* - code format
* - remove gpu runtime in test_lazy_allocator
* - fix test_lazy_allocator: remove cuda include
* - add test
* - code format
* - add ifdef for test of allocator
* - code format
* - fix test: remove unused ifdef
* - fix bang test
* - code format
* Merge branch 'master' into dcj/memory_allocator
* fix: fix cuda conv_fp16 run fail
* fix bang_runtime.cc and cuda_runtime.cc
* - update mkl code
* - fix codes for mkl
* - code format
* - remove unused commented codes
- add an empty line at the end of the blob.cc
---------
Co-authored-by: zhangyunze <z13785159769@163.com>