Commit Graph

2 Commits

Author SHA1 Message Date
constroy Li 4c321c8a91
tensor parallel for transformer (#125)
* add cmake bits about NCCL

* move example to examples/NNmodel

* impl NCCL communicator

* add comm related function to Runtime

* export runtime interface

* add launch.py

* use unique name to distingush the the NCCL ID file

* add timeout to communicator init

* expose communicator obj from runtime obj, add unit test for nccl communicator

* reformat files

* Add allReduce operator and cuda nccl allReduce kernel

* impl model parallel for resnet

* add allGather nccl kernel and operator

* Add allreduce allgather operator tests, change allgather kernel to output list of tensor, fix shape infer, handle nullptr output

* fix format of onnx.py

* use concat following AllGather

* get tensor parallel for resnet

* fix format of graph_handler.cc

* change BUILD_DIST default to OFF

* polish code of communicator

* update .gitignore

* export min/max to python

* fix MatMul

* modify launch.py to run opt

* hack to treat ReduceSum as AllReduceSum

* throw exception in cuda error

* fix parallel_opt.py

* improve the error prompt and cuda error check

* fix GatherObj::GatherObj member init

* fix size calculation for scalar (rank = 0) tensor

* MatMul supports bias

* fix add bias for row parallel gemm

* add --gen_std to launch.py

* fix AllReduceNCCL

* update launch.py

* less log

* update parallel_opt

* update launch.py

* add __eq__ for Placement sub-classes

* less benchmark run

* fix placement infer for matmul

* fix vacabuary size

* fix Exception

* Add shard tensor with group to support gpt2

* Add find successor function to find split op at different depth

* recover CommunicatorObj

* improve error mesasge

* optimize parallel_opt.py

* optimize launch.py

* recover docs for all_reduce and all_gather

* Fix API

* fix format

---------

Co-authored-by: panzezhong <panzezhong@qiyuanlab.com>
Co-authored-by: Haojie Wang <haojie0429@gmail.com>
2023-09-14 14:19:45 +08:00
Hardy 32a01efbbe
add code for backtrace (#21)
* add code for backtrace

* Add: infini::Exception

```
Test project /home/zly/InfiniTensor_aux/build
    Start 1: test_graph
1/4 Test #1: test_graph .......................   Passed    0.05 sec
    Start 2: test_hash
2/4 Test #2: test_hash ........................   Passed    0.02 sec
    Start 3: test_conv
3/4 Test #3: test_conv ........................   Passed    4.40 sec
    Start 4: test_pooling
4/4 Test #4: test_pooling .....................   Passed    2.47 sec

100% tests passed, 0 tests failed out of 4

Total Test time (real) =   6.94 sec
```

* Fix: USE_BACKTRACE in cmake

Co-authored-by: wanghailu <wanghailu@qiyuanlab.com>
Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>
2022-09-01 20:30:12 +08:00