Implement CUDA BwdTrans sum-factorization kernels
CUDA kernels for the `BwdTrans` operation have been implemented for all element types (seg, quad, tri, hex, tet, prism, and pyr). Each implementation comes in two variants: one that threads only across the elements (analogous to the SIMD-based matrix-free implementation), and one, denoted `QP`, that threads across both the elements and the quadrature points.
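
The difference between the two threading strategies can be sketched for the simplest case, a segment, where the backward transform reduces to a single contraction of the basis with the coefficients. This is a minimal illustration and not the code in this MR: the kernel names, the `basis[p * nq + q]` layout, the launch configuration, and the `RunBwdTransSeg` host helper are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant 1 (hypothetical name): one thread per element.
// Each thread loops over every quadrature point of its element.
__global__ void BwdTransSegKernel(int nm, int nq, int nelmt,
                                  const double *basis, const double *in,
                                  double *out)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nelmt)
    {
        return;
    }
    for (int q = 0; q < nq; ++q)
    {
        double sum = 0.0;
        for (int p = 0; p < nm; ++p)
        {
            sum += basis[p * nq + q] * in[e * nm + p];
        }
        out[e * nq + q] = sum;
    }
}

// Variant 2, "QP" (hypothetical name): blocks stride over elements and
// the threads of a block stride over the quadrature points.
__global__ void BwdTransSegKernelQP(int nm, int nq, int nelmt,
                                    const double *basis, const double *in,
                                    double *out)
{
    for (int e = blockIdx.x; e < nelmt; e += gridDim.x)
    {
        for (int q = threadIdx.x; q < nq; q += blockDim.x)
        {
            double sum = 0.0;
            for (int p = 0; p < nm; ++p)
            {
                sum += basis[p * nq + q] * in[e * nm + p];
            }
            out[e * nq + q] = sum;
        }
    }
}

// Host helper (assumption, for illustration only; error checks omitted).
std::vector<double> RunBwdTransSeg(bool useQP, int nm, int nq, int nelmt,
                                   const std::vector<double> &basis,
                                   const std::vector<double> &in)
{
    double *dBasis, *dIn, *dOut;
    cudaMalloc(&dBasis, basis.size() * sizeof(double));
    cudaMalloc(&dIn, in.size() * sizeof(double));
    cudaMalloc(&dOut, nelmt * nq * sizeof(double));
    cudaMemcpy(dBasis, basis.data(), basis.size() * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(double),
               cudaMemcpyHostToDevice);
    if (useQP)
    {
        // One block per element, one thread per quadrature point.
        BwdTransSegKernelQP<<<nelmt, nq>>>(nm, nq, nelmt, dBasis, dIn, dOut);
    }
    else
    {
        // One thread per element, 128 threads per block.
        BwdTransSegKernel<<<(nelmt + 127) / 128, 128>>>(nm, nq, nelmt,
                                                        dBasis, dIn, dOut);
    }
    std::vector<double> out(nelmt * nq);
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(dBasis);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Both kernels compute the same result. The element-only variant keeps all work for an element in one thread, while the QP variant exposes more parallelism when the number of elements is small relative to the number of quadrature points.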
To avoid repeated copies from CPU to GPU, the basis data are copied to the GPU once, in the constructor.
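
One way to picture the upload-once pattern is a small holder class that owns the device buffer. This is only a sketch under assumptions: the class name `DeviceBasis`, its members, and the `CopyToHost` helper are hypothetical and do not correspond to the classes in this MR.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical holder: uploads the basis once at construction and frees
// the device buffer on destruction (error checks omitted for brevity).
class DeviceBasis
{
public:
    explicit DeviceBasis(const std::vector<double> &hostBasis)
        : m_size(hostBasis.size())
    {
        cudaMalloc(&m_data, m_size * sizeof(double));
        cudaMemcpy(m_data, hostBasis.data(), m_size * sizeof(double),
                   cudaMemcpyHostToDevice);
    }

    ~DeviceBasis()
    {
        cudaFree(m_data);
    }

    // Non-copyable, so the device buffer has a single owner.
    DeviceBasis(const DeviceBasis &) = delete;
    DeviceBasis &operator=(const DeviceBasis &) = delete;

    // Device pointer to pass to kernels on every call, with no new copy.
    const double *data() const
    {
        return m_data;
    }

    std::size_t size() const
    {
        return m_size;
    }

    // Copies the buffer back to the host; used here only for checking.
    std::vector<double> CopyToHost() const
    {
        std::vector<double> host(m_size);
        cudaMemcpy(host.data(), m_data, m_size * sizeof(double),
                   cudaMemcpyDeviceToHost);
        return host;
    }

private:
    double *m_data = nullptr;
    std::size_t m_size = 0;
};
```

Every subsequent kernel launch can then reuse `data()`, so the host-to-device transfer cost is paid once per operator rather than once per call.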
This MR also removes `square2.xml` and introduces `segment.xml`, `square_all_elements.xml`, and `cube_all_elements.xml`, which collectively cover the seg, quad, tri, hex, tet, prism, and pyr element types. The current implementation has been partially validated for all element types.