fix numInputs of batchNorm, add new line in file ending. ADD: batch norm operator and cuda kernel. add training remove comments. fix compile error. add batch norm operator and cuda kernel.
* Fix: PerfData in a shared pointer * Add: abstraction for kernels without configuration Co-authored-by: Liyan Zheng <liyan-zheng@outlook.com>