InfiniTensor/include/cuda/cuda_element_wise.h

#pragma once

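namespace infini {
// Element-wise binary kernels computing c = a (op) b.
// dtypeIndex selects the element type behind the untyped pointers; a0..a3,
// b0..b3 and c0..c3 are presumably the 4-D shapes of a, b and c.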
void div_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
                int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
                int c2, int c3);
void add_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
                int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
                int c2, int c3);
void pow_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
                int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
                int c2, int c3);
void less_kernel(int dtypeIndex, void *a, void *b, void *c, int a0, int a1,
                 int a2, int a3, int b0, int b1, int b2, int b3, int c0, int c1,
                 int c2, int c3);
} // namespace infini