* add cmake bits about NCCL
* move example to examples/NNmodel
* impl NCCL communicator
* add comm related function to Runtime
* export runtime interface
* add launch.py
* use unique name to distingush the the NCCL ID file
* add timeout to communicator init
* expose communicator obj from runtime obj, add unit test for nccl communicator
* reformat files
* Add allReduce operator and cuda nccl allReduce kernel
* impl model parallel for resnet
* add allGather nccl kernel and operator
* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output
* fix format of onnx.py
* use concat following AllGather
* get tensor parallel for resnet
* fix format of graph_handler.cc
* change BUILD_DIST default to OFF
* polish code of communicator
* update .gitignore
* export min/max to python
* fix MatMul
* modify launch.py to run opt
* hack to treat ReduceSum as AllReduceSum
* throw exception in cuda error
* fix parallel_opt.py
* improve the error prompt and cuda error check
* fix GatherObj::GatherObj member init
* fix size calculation for scalar (rank = 0) tensor
* MatMul supports bias
* fix add bias for row parallel gemm
* add --gen_std to launch.py
* fix AllReduceNCCL
* update launch.py
* less log
* update parallel_opt
* update launch.py
* add __eq__ for Placement sub-classes
* less benchmark run
* fix placement infer for matmul
* fix vacabuary size
* fix Exception
* Add shard tensor with group to support gpt2
* Add find successor function to find split op at different depth
* recover CommunicatorObj
* improve error mesasge
* optimize parallel_opt.py
* optimize launch.py
* recover docs for all_reduce and all_gather
* Fix API
* fix format
---------
Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
* Add cuda transpose kernel
* Empty line cuda_transpose.h
* Empty line small_array.h
* empty line transpose.cc
* empty line transpose.cu
* empty line test_cuda_transpose.cc
* feat: support mixed dtype
* feat: support cast op
* test: add test for cast op
* feat: support datatype BFloat16
* feat: support data convert fp32 <-> bfp16
* fix: fix all op's infershape func
* fix as review comment
* add conv_half kernel
* Conv Kernel FP16
* dcj:
replace "DataType::Float32" with "op->getDType()" to support more DataType
* feat: support Float16 dtype
* fix: set default clang-format to 14 version
* fix: 按照review意见修改
* fix: add data convert to convfp16 kernel test
* test: add conv_fp16 kernel test
---------
Co-authored-by: zhangyue207 <zhangyue@qiyuanlab.com>
Co-authored-by: kilinchange <kilinchange@163.com>
fix numInputs of batchNorm, add new line in file ending.
ADD: batch norm operator and cuda kernel.
add training
remove comments.
fix compile error.
add batch norm operator and cuda kernel.
* use protobuf for tensor data save,write,read, in chinese 序列化和反序列化
* add protobuf
* add code for tensor load & save from/to file
* add code for tensor laod & save
* add code for tensor load & save
* add code for tensor save & load
* add code for tensor save & load
* add code for save & load
* add code for load & save
* add code for tensor load & save
* add code for tensor save & load
Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>