Tidy up CUDA implementation of PhysDeriv and add CUDA kernels with additional parallelism
This MR tidies up the previous CUDA implementation of the PhysDeriv operator and introduces new CUDA kernels with additional parallelism across quadrature points.
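For reviewers unfamiliar with the idea, here is a minimal sketch of what parallelism across quadrature points can look like for a 1D derivative. It is not the code in this MR: the kernel name `PhysDeriv1DSketch`, the flat data layout, and the per-element scale factor `rx` are all assumptions made for illustration. Each block handles one element and each thread handles one quadrature point of that element, instead of one thread looping over every point of an element.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only, not the kernel in this MR).
// One block per element, one thread per quadrature point:
//   out[e*nq + i] = rx[e] * sum_j D[i*nq + j] * in[e*nq + j]
__global__ void PhysDeriv1DSketch(int nElmt, int nq, const double *D,
                                  const double *in, const double *rx,
                                  double *out)
{
    int e = blockIdx.x; // element index
    if (e >= nElmt)
    {
        return;
    }
    // Stride loop so the kernel stays correct when blockDim.x < nq.
    for (int i = threadIdx.x; i < nq; i += blockDim.x)
    {
        double sum = 0.0;
        for (int j = 0; j < nq; ++j)
        {
            sum += D[i * nq + j] * in[e * nq + j];
        }
        out[e * nq + i] = rx[e] * sum;
    }
}

int main()
{
    const int nElmt = 2, nq = 3;
    // Differentiation matrix for Lagrange interpolation at points -1, 0, 1.
    std::vector<double> D = {-1.5, 2.0, -0.5, -0.5, 0.0, 0.5, 0.5, -2.0, 1.5};
    // u = xi^2 sampled at the three points, in both elements.
    std::vector<double> in = {1.0, 0.0, 1.0, 1.0, 0.0, 1.0};
    // Assumed per-element geometric factor d(xi)/dx.
    std::vector<double> rx = {1.0, 2.0};
    std::vector<double> out(nElmt * nq, 0.0);

    double *dD, *dIn, *dRx, *dOut;
    cudaMalloc(&dD, D.size() * sizeof(double));
    cudaMalloc(&dIn, in.size() * sizeof(double));
    cudaMalloc(&dRx, rx.size() * sizeof(double));
    cudaMalloc(&dOut, out.size() * sizeof(double));
    cudaMemcpy(dD, D.data(), D.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dRx, rx.data(), rx.size() * sizeof(double), cudaMemcpyHostToDevice);

    // Grid covers elements, block covers quadrature points.
    PhysDeriv1DSketch<<<nElmt, nq>>>(nElmt, nq, dD, dIn, dRx, dOut);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(double), cudaMemcpyDeviceToHost);

    // Compare against the same computation done on the host.
    double maxErr = 0.0;
    for (int e = 0; e < nElmt; ++e)
    {
        for (int i = 0; i < nq; ++i)
        {
            double ref = 0.0;
            for (int j = 0; j < nq; ++j)
            {
                ref += D[i * nq + j] * in[e * nq + j];
            }
            ref *= rx[e];
            maxErr = std::fmax(maxErr, std::fabs(ref - out[e * nq + i]));
        }
    }
    std::printf("max abs error vs host reference: %g\n", maxErr);

    cudaFree(dD);
    cudaFree(dIn);
    cudaFree(dRx);
    cudaFree(dOut);
    return maxErr < 1e-12 ? 0 : 1;
}
```

Mapping quadrature points to threads exposes more work per element to the GPU, which is likely to matter most when the number of elements alone is too small to fill the device. The actual kernels in this MR may use a different thread mapping, shared memory, or tensor-product structure.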