Tidy up CUDA implementation of IProductWRTDerivBase and add CUDA kernels with additional parallelism
This MR tidies up the existing CUDA implementation of the IProductWRTDerivBase operator and introduces new CUDA kernels that expose additional parallelism across quadrature points.
Edited by Jacques Xing