Tidy up CUDA implementation of BwdTrans and update CUDA kernels with additional parallelism
This MR tidies up the previous implementation of the CUDA version of the PhysDeriv operator.
Edited by Jacques Xing