
BwdTrans operator in CUDA

Ganlin requested to merge bwd_trans_sum_fac_cuda_draft into master

The proposed implementation of the CUDA version of the BwdTrans operator allocates memory on the GPU during the instantiation of the operator object. More specifically, the matrices required for the BwdTrans operation are stored in a std::vector<MemoryRegionCUDA<double>> member variable, combined with a std::map<ShapeKey, size_t> index map. The latter links a particular ShapeKey, which corresponds to a particular element type, to an integer index. This allows easy access to the relevant BwdTrans matrix during the operator function call and avoids repeated copies from CPU to GPU.
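As a rough illustration of this storage pattern, the sketch below uploads each basis matrix to the GPU once and indexes it through a shape-key map. The names ShapeKey, BwdTransMatrixStore, and their members are simplified placeholders, not the actual Nektar++ classes (in particular, MemoryRegionCUDA<double> is replaced here by a raw device pointer); only the vector-plus-index-map idea is taken from the description above.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical shape key: element type plus number of modes.
struct ShapeKey
{
    int shapeType;
    int numModes;
    bool operator<(const ShapeKey &rhs) const
    {
        return std::tie(shapeType, numModes) <
               std::tie(rhs.shapeType, rhs.numModes);
    }
};

class BwdTransMatrixStore
{
public:
    // Copy one basis matrix to the GPU at operator construction time and
    // remember its slot under the corresponding shape key.
    void AddMatrix(const ShapeKey &key, const std::vector<double> &hostMat)
    {
        double *devPtr = nullptr;
        cudaMalloc(&devPtr, hostMat.size() * sizeof(double));
        cudaMemcpy(devPtr, hostMat.data(), hostMat.size() * sizeof(double),
                   cudaMemcpyHostToDevice);
        m_index[key] = m_matrices.size();
        m_matrices.push_back(devPtr);
    }

    // During the operator call, fetch the device pointer for an element's
    // shape without any further host-to-device transfer.
    double *GetMatrix(const ShapeKey &key) const
    {
        return m_matrices[m_index.at(key)];
    }

    ~BwdTransMatrixStore()
    {
        for (double *ptr : m_matrices)
        {
            cudaFree(ptr);
        }
    }

private:
    std::vector<double *> m_matrices;   // device-resident BwdTrans matrices
    std::map<ShapeKey, size_t> m_index; // shape key -> slot in m_matrices
};
```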

The CUDA kernel for BwdTrans is only a preliminary implementation and requires additional work.
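For reference, a naive kernel of this kind could look like the sketch below, which evaluates the physical values of each element as a matrix-vector product u = B * u_hat with the shared (nq x nm) backward-transform matrix. This is only an assumed, simplified form and not the sum-factorisation kernel targeted by this branch; the layout (row-major B, contiguous per-element coefficients) is likewise an assumption.

```cuda
// Naive BwdTrans kernel sketch: one thread per quadrature point per element.
__global__ void BwdTransNaiveKernel(const double *B,      // nq x nm, row-major
                                    const double *coeffs, // nelmt x nm
                                    double *phys,         // nelmt x nq
                                    int nm, int nq, int nelmt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // quadrature point index
    int e = blockIdx.y;                            // element index
    if (i < nq && e < nelmt)
    {
        double sum = 0.0;
        for (int p = 0; p < nm; ++p)
        {
            sum += B[i * nm + p] * coeffs[e * nm + p];
        }
        phys[e * nq + i] = sum;
    }
}
```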

Provisions have been made to add unit tests for the CUDA implementation, since the initial tests covered only the CPU implementation. However, these provisions currently only serve as a placeholder.

Edited by Jacques Xing
