InfiniTensor/include/cuda/cuda_split_concat.h

#pragma once
#include <cstdio>
#include <cuda_fp16.h> // for `half`, used by the fp16 declarations below
const int BATCH_SIZE = 32; // max number of element tensors processed in parallel.
const int DIM_MAX_SIZE = 8;
// The concat operator composes multiple element tensors into one composed
// tensor; the split operator decomposes one composed tensor back into
// multiple element tensors.
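// For example, concatenating element tensors of shapes (2,3) and (2,5) along
// dim 1 should yield a composed tensor of shape (2,8): in the metadata below,
// the first element tensor gets dimBgNo = 0 and dimSize = 3, the second
// dimBgNo = 3 and dimSize = 5.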
template <typename T> struct ElementTensorMetadata {
    T *data[BATCH_SIZE];
    int dimBgNo[BATCH_SIZE];   // the dimension begin index of the element
                               // tensor in the composed tensor.
    int dimSize[BATCH_SIZE];   // the dimension size of the element tensor.
    int nElements[BATCH_SIZE]; // the number of elements of the element tensor.
    void print() const {
        for (int i = 0; i < BATCH_SIZE; i++)
            printf("%d:(data=%p,dimBgNo=%d,dimSize=%d,nElements=%d)\n", i,
                   data[i], dimBgNo[i], dimSize[i], nElements[i]);
    }
};
template <typename T> struct ComposedTensorMetadata {
    int dimSize[DIM_MAX_SIZE];
    int stride[DIM_MAX_SIZE];
    T *data;
};
namespace infini {
void split_concat_kernel(const ElementTensorMetadata<float> &eleMeta,
                         const ComposedTensorMetadata<float> &compMeta, int dim,
                         int batchSize, int nDims, bool isSplit);
void split_concat_kernel(const ElementTensorMetadata<half> &eleMeta,
                         const ComposedTensorMetadata<half> &compMeta, int dim,
                         int batchSize, int nDims, bool isSplit);
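
// Usage sketch (host side; a hedged illustration assuming isSplit == false
// selects concat): composing the (2,3) and (2,5) example above into a (2,8)
// row-major tensor. `aDev`, `bDev`, and `outDev` are hypothetical device
// pointers the caller has already allocated and filled.
//
//   ElementTensorMetadata<float> eleMeta;
//   eleMeta.data[0] = aDev;   eleMeta.data[1] = bDev;
//   eleMeta.dimBgNo[0] = 0;   eleMeta.dimBgNo[1] = 3;  // offsets along dim 1
//   eleMeta.dimSize[0] = 3;   eleMeta.dimSize[1] = 5;  // extents along dim 1
//   eleMeta.nElements[0] = 6; eleMeta.nElements[1] = 10;
//   ComposedTensorMetadata<float> compMeta;
//   compMeta.dimSize[0] = 2;  compMeta.dimSize[1] = 8;
//   compMeta.stride[0] = 8;   compMeta.stride[1] = 1;  // row-major strides
//   compMeta.data = outDev;
//   split_concat_kernel(eleMeta, compMeta, /*dim=*/1, /*batchSize=*/2,
//                       /*nDims=*/2, /*isSplit=*/false);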
} // namespace infini