* [feature] add CUDA graph support
* modify code to pass the cuda_all_reduce test
* modify rope op
* support rmsnorm
* add fp16 support to silu cuda op
* fix bugs in rmsnorm op
* uncomment simplify in onnx.py
---------
Co-authored-by: Haojie Wang <haojie0429@gmail.com>