This MR tidies up the previous CUDA implementation of the IProductWRTBase operator and introduces new CUDA kernels that add parallelism across quadrature points.
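For reviewers unfamiliar with the operator, here is a minimal CPU-side sketch of the two work decompositions, written in plain C++ so it can be run without a GPU. It is not the code in this MR: the function names, the flat array layout, and the combined `weightJac` array are all assumptions made for illustration. Variant A mimics one thread per element; variant B mimics one thread per (element, quadrature point) pair, where the accumulation would presumably need an atomic add or a shared-memory reduction on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat layout (an assumption, not the library's actual layout):
//   basis[q * nModes + m]    : value of basis mode m at quadrature point q
//   weightJac[e * nQuad + q] : quadrature weight times Jacobian at point q of element e
//   in[e * nQuad + q]        : input values at the quadrature points of element e
//   result[e * nModes + m]   : inner product of the input with mode m on element e

// Variant A: one work item per element. Each work item loops serially
// over all modes and all quadrature points of its element.
std::vector<double> iproductPerElement(const std::vector<double>& basis,
                                       const std::vector<double>& weightJac,
                                       const std::vector<double>& in,
                                       std::size_t nElmt, std::size_t nQuad,
                                       std::size_t nModes)
{
    std::vector<double> out(nElmt * nModes, 0.0);
    for (std::size_t e = 0; e < nElmt; ++e)          // would be one GPU thread each
    {
        for (std::size_t m = 0; m < nModes; ++m)
        {
            double sum = 0.0;
            for (std::size_t q = 0; q < nQuad; ++q)
            {
                sum += basis[q * nModes + m] * weightJac[e * nQuad + q] *
                       in[e * nQuad + q];
            }
            out[e * nModes + m] = sum;
        }
    }
    return out;
}

// Variant B: one work item per (element, quadrature point) pair. Each work
// item scatters its contribution into every mode of its element.
std::vector<double> iproductPerQuadPoint(const std::vector<double>& basis,
                                         const std::vector<double>& weightJac,
                                         const std::vector<double>& in,
                                         std::size_t nElmt, std::size_t nQuad,
                                         std::size_t nModes)
{
    std::vector<double> out(nElmt * nModes, 0.0);
    for (std::size_t e = 0; e < nElmt; ++e)
    {
        for (std::size_t q = 0; q < nQuad; ++q)      // would be one GPU thread each
        {
            const double v = weightJac[e * nQuad + q] * in[e * nQuad + q];
            for (std::size_t m = 0; m < nModes; ++m)
            {
                // On a GPU this update would race between threads, so it would
                // need an atomic add or a reduction; here the loop is serial.
                out[e * nModes + m] += basis[q * nModes + m] * v;
            }
        }
    }
    return out;
}
```

Both variants compute the same sums; they differ only in how the work is split. The second exposes `nElmt * nQuad` independent work items instead of `nElmt`, which is the kind of extra parallelism the new kernels aim for, at the cost of a reduction step.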