InfiniTensor/include
constroy Li feccd4f318
fix tensor parallel for llama (#159)
* fix Slice

* change default rounds of timeit to 10 to reduce benchmarking time

* fix slice with large ends

* Reshape supports Int64

* support position_ids as input

* skip last MatMul in Llama

* skip infer_shapes to parse large models

* update launch.py

* fix split_concat_kernel

* print more message in launch.py

* Reshape supports both Int32 and Int64

* try infer_shapes and warn about failure

* fix format

---------

Co-authored-by: whjthu <haojie0429@gmail.com>
2023-10-30 15:04:16 +08:00
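
The "try infer_shapes and warn about failure" item above describes a fallback pattern: attempt a shape-inference pass when loading a model, and continue without inferred shapes if it fails. A minimal sketch of that pattern, where `infer_shapes_or_warn` and `infer_fn` are hypothetical names (the real pass would be something like `onnx.shape_inference.infer_shapes`), not InfiniTensor's actual code:

```python
import warnings


def infer_shapes_or_warn(model, infer_fn):
    """Run a shape-inference pass, falling back to the original model
    with a warning if the pass fails (e.g. on very large models).

    `infer_fn` stands in for a real pass such as
    onnx.shape_inference.infer_shapes; this wrapper is an
    illustrative sketch, not the repository's actual implementation.
    """
    try:
        return infer_fn(model)
    except Exception as exc:  # inference may reject large or unusual graphs
        warnings.warn(f"infer_shapes failed, continuing without shapes: {exc}")
        return model
```

The point of the design is that shape inference is an optimization aid, not a hard requirement, so a failure should degrade gracefully instead of aborting the model load.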
| Path     | Latest commit                                                | Date                      |
|----------|--------------------------------------------------------------|---------------------------|
| bang     | fix bang runtime bug after merging distributed branch (#137) | 2023-09-19 14:10:39 +08:00 |
| core     | fix tensor parallel for llama (#159)                         | 2023-10-30 15:04:16 +08:00 |
| cuda     | Add GatherElements op and cuda kernel (#149)                 | 2023-10-12 09:18:12 +08:00 |
| ffi      | Add TVM codegen for MemboundOp (#35)                         | 2022-09-22 18:06:45 +08:00 |
| intelcpu | Cpu backend2 (#77)                                           | 2023-04-17 12:15:23 +08:00 |
| kunlun   | Xpu (#82)                                                    | 2023-10-16 10:57:08 +08:00 |
| nnet     | Dev for 202303ddl (#66)                                      | 2023-04-18 15:10:33 +08:00 |
| operators | add transpose, concat and split for native cpu (#158)       | 2023-10-12 10:14:28 +08:00 |
| utils    | tensor parallel for transformer (#125)                       | 2023-09-14 14:19:45 +08:00 |
| test.h   | Add python interface for CUDA operator evaluation (#42)      | 2022-09-27 10:41:12 +08:00 |