Commit Graph

12 Commits

Author SHA1 Message Date
constroy Li f60767a770
impl distributed launch with NCCL (#106)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use unique name to distingush the the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* Add broadcast operator and cuda kernel

* Add comments for operators

* remove const of class member

* move communicator to CudaRuntimeObj

* Add an empty line at EOF.

---------

Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-05 09:47:35 +08:00
zhengly123 a1974aabcd
NNET supports TVM backend and kernels (#78)
* Add: mutator InfoGAN minimum test

* Add: cache and padding (bugs!!)

* Add: expression reader as a cmake target

* Fix: [Intermediate] NMutator::expressionToGraph

To be fix: matmul with implicit broadcast

* Add: matmul broadcast

* Fix: GraphObj ctor should use cloneTensor

* Fix: cuBLAS failure when codegen is enabled

* Add: Exception for checkCuError

* Fix: graph OpList ctor

* Add: expr simplication for TVM

* Add: TVM headers and CMake include paths

* Add: CMake config

* Add: PackedFunc (broken)

* Fix: remove cuCtxCreate which makes TVM fails

* Fix: membound_tvm

* Fix: test_memboundOp

* Add: PRelu Expr and AsTVMVisitor

* Add: Random generator

* Add: support TVM packed function

* Fix: specify runtime

* Add: CMake support of TVM

* Add: detailed output of Matmul

* Add: comments for Matmul

* Chore: format and comments

* Chore: GraphObj::selfCheck without assert control

* Fix: CMAKE_CXX_FLAGS in CMakeLists

* fix merge bug

* update api for mkl batchnorm test

* fix lotus env

* fig header bug

---------

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
Co-authored-by: huangshuhong <huangsh19@mails.tsinghua.edu.cn>
Co-authored-by: whjthu <haojie0429@gmail.com>
2023-04-18 00:26:36 +08:00
zhengly123 c7ec9ee6e7
Add search engine (#64)
* Add: tensor fuid

* [Intermediate state] Add: Graph ctor for OpVec

* Add: clone for operators

* tmp: search_engine

* search: init search Engine.

* Add: dummy mutator for the test of search engine

* search: add print graph.

* search: add partition.

* search: update comments.

* Fix: remain FUID in Tensor::clone

* Chore: rename GUidBaseType to UidBaseType

* Fix: connect NMutator to SearchEngine

* Chore: output

* Fix test_memboundOp: nmutator uses input runtime

* Chore: clang-format

* Chore: clang-format

* Fix: comments in the review

---------

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
Co-authored-by: mazx <dyxdy@live.com>
2023-02-12 18:27:52 +08:00
zhengly123 63d8aff985
Fix: cuCtxCreate before other initialization (#49)
Fix: create cuCtx at the very beginning

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2022-10-19 15:41:48 +08:00
wendy12022 a4d6426589
ADD: batch norm operator and cuda kernel. (#44)
fix numInputs of batchNorm, add new line in file ending.

ADD: batch norm operator and cuda kernel.

add training

remove comments.

fix compile error.

add batch norm operator and cuda kernel.
2022-10-15 16:29:28 +08:00
zhengly123 1152adc94a
Add: python API for timing ConvTranspose (#46)
* Add: python interfaced for timing operators

* Fix: CUDA Runtime run

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2022-10-07 16:03:11 +08:00
deathwings602 11d5aa1ccc
Add TVM codegen for MemboundOp (#35)
* Add:  interface for membound TVM kernel and test

* add getAnsorCode

* add evaluation, but link failed

* add evaluation of kernel, but link failed

* Fix: link libcuda and nvrtc

* add print

* Add: const for source of copy

* compile and evaluate the kernel

* add compute

* fix gen_ansor_op.py

* fix membound_TVM

* format and fix CMakeLists.txt

* fix memory leak

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
Co-authored-by: huangshuhong <huangsh19@mails.tsinghua.edu.cn>
2022-09-22 18:06:45 +08:00
Anmuliar bd63f738dc
cuDNN conv tuning (#16)
* Function tune and corresponding testcase.

*Add: Tune function in /src/kernel/cuda/conv.cc and corresponding testcase in test_conv.

*Fix: A little bug of perfRecord using in /src/core/runtime.cc.

* Tune part debug

*Add: recover the code, fixed the commit error.

*Add: some anotations in tune function

* clang formmat test

* Fix: mem leak in CUDA Runtime and Conv

* Fix: sync in conv and default sync in timeit

* Change the way to tune operator conv.

Timeit function cudNNUnfused -> Timeit function cudnnConvolutionForward.

* Change: merge the common part of cudnnunfused&tune into cudnndescriptoraccess

* clang test

* clang-format

* clang-format bash.

* Chore: remove print and blank lines

Co-authored-by: wcz112 <wcz19@mails.tsinghua.edu.cn>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2022-08-29 21:37:07 +08:00
Anmuliar e076991f2f
Revert "Operator serialization (#14)" (#15)
This reverts commit 25f0c441d2.
2022-08-29 16:02:48 +08:00
Anmuliar 25f0c441d2
Operator serialization (#14)
Class "Cuda Runtime" fulfills function "tune" and adds corresponding testcase.

*Add: convCudnn::tune, convCudnn::cuDNNdescriptorAccess.

*Add: testcase tune.

*Fix: a brief bug in CPU Runtime.
2022-08-29 15:59:03 +08:00
zhengly123 af08df32d2
Extended DataType class and Runtime interaction (#9)
* Add: DataType class

* Add: data-type-oblivious tensor interface

* Rename: copyBlobToCPU

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2022-08-23 16:55:59 +08:00
zhengly123 04ea5eed38
Add CUDA runtime (#6)
* Fix: add warm-up and repetition in timing

* Add: CUDA runtime and float support

* Refactor: Cuda and Cpu runtimes inherit Runtime

* Add: environment script for Lotus

* Add: Lotus build instructions

* Update README.md

Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2022-08-22 15:01:03 +08:00