Update BwdTrans CUDA kernel implementation

Issue/feature addressed

This MR tidies up the BwdTrans CUDA kernels in preparation for future benchmarking tests.

Here is a summary of the changes:

  • Tidy up BwdTrans CUDA kernel implementation
  • Simplify loops in kernel functions
  • Improve coalesced access to global memory
  • Add templated option for shared memory
  • Add 1D indexing version of QP (multilevel) kernels
  • Interleave data for non-QP kernels
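
To make the shared-memory template option, the 1D indexing, and the coalesced access pattern concrete, here is a minimal sketch. It is not the code in this MR: the kernel name `BwdTrans1DSketch`, the helper `MaxErrorSketch`, the data layout (basis stored with the quadrature index fastest) and the launch parameters are all assumptions made for illustration, and it only covers a single 1D basis.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 1D backward transform (illustration only):
//   out[e * nq + q] = sum_m basis[m * nq + q] * in[e * nm + m]
// Assumed layout: basis is stored with q fastest, so neighbouring threads
// (neighbouring q) read neighbouring addresses.
template <bool UseShared>
__global__ void BwdTrans1DSketch(int nm, int nq, int nelmt,
                                 const double *basis, const double *in,
                                 double *out)
{
    extern __shared__ double s_basis[];
    if (UseShared)
    {
        // Cooperative copy of the basis into shared memory.
        for (int i = threadIdx.x; i < nm * nq; i += blockDim.x)
        {
            s_basis[i] = basis[i];
        }
        __syncthreads();
    }
    const double *b = UseShared ? s_basis : basis;

    // 1D indexing: one flat index per (element, quadrature point) pair,
    // walked with a grid-stride loop.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < nelmt * nq;
         idx += gridDim.x * blockDim.x)
    {
        const int e = idx / nq;
        const int q = idx % nq;
        double sum = 0.0;
        for (int m = 0; m < nm; ++m)
        {
            sum += b[m * nq + q] * in[e * nm + m];
        }
        out[idx] = sum; // consecutive threads write consecutive entries
    }
}

// Hypothetical host helper: runs the kernel on small integer-valued data and
// returns the maximum absolute difference from a host reference.
template <bool UseShared> double MaxErrorSketch()
{
    const int nm = 3, nq = 4, nelmt = 5;
    std::vector<double> basis(nm * nq), in(nelmt * nm), out(nelmt * nq, 0.0);
    for (int i = 0; i < nm * nq; ++i) basis[i] = 1.0 + i % 5;
    for (int i = 0; i < nelmt * nm; ++i) in[i] = 1.0 + i % 3;

    double *d_basis, *d_in, *d_out;
    cudaMalloc(&d_basis, basis.size() * sizeof(double));
    cudaMalloc(&d_in, in.size() * sizeof(double));
    cudaMalloc(&d_out, out.size() * sizeof(double));
    cudaMemcpy(d_basis, basis.data(), basis.size() * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(double),
               cudaMemcpyHostToDevice);

    // Dynamic shared memory is only requested when the option is enabled.
    const size_t shmem = UseShared ? nm * nq * sizeof(double) : 0;
    BwdTrans1DSketch<UseShared><<<2, 32, shmem>>>(nm, nq, nelmt, d_basis,
                                                  d_in, d_out);
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_basis);
    cudaFree(d_in);
    cudaFree(d_out);

    double err = 0.0;
    for (int e = 0; e < nelmt; ++e)
    {
        for (int q = 0; q < nq; ++q)
        {
            double ref = 0.0;
            for (int m = 0; m < nm; ++m)
            {
                ref += basis[m * nq + q] * in[e * nm + m];
            }
            err = std::fmax(err, std::fabs(out[e * nq + q] - ref));
        }
    }
    return err;
}
```

With a CUDA device available, `MaxErrorSketch<true>()` and `MaxErrorSketch<false>()` should both agree with the host reference. Because the choice is a template parameter, the shared-memory branch is resolved at compile time, so the two variants can be benchmarked against each other without a runtime branch in the kernel.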

To do (future MRs):

  • Interleave data for non-QP kernels
  • Consider templating nm0, nm1, nm2, nq0, nq1, nq2 parameters
  • Optimize CUDA grid parameters

Proposed solution

Implementation

Tests

Suggested reviewers

Please suggest any people who would be appropriate to review your code.

Notes

Please add any other information that could be useful for reviewers.

Checklist

  • [x] Functions and classes, or changes to them, are documented.
  • [ ] User guide/documentation is updated.
  • [ ] Changelog is updated.
  • [ ] Suitable tests added for new functionality.
  • [x] Contributed code is correctly formatted. (See the contributing guidelines).
  • [ ] License added to any new files.
  • [x] No extraneous files have been added (e.g. compiler output or test data files).
Edited by Jacques Xing
