Update PhysDeriv CUDA kernel implementation
Issue/feature addressed
This MR tidies up the PhysDeriv CUDA kernels in anticipation of future benchmarking tests.
Here is a summary of the changes:
- Tidy up PhysDeriv CUDA kernel implementation
- Fuse kernel functions for better efficiency
- Simplify loops in kernel functions
- Improve coalescing of global memory accesses
- Add templated option for shared memory
- Add 1D indexing version of QP (multilevel) Kernels
- Interleave data for non-QP kernels
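As an illustration of the 1D indexing and interleaving items above, here is a minimal host-side C++ sketch of the index arithmetic involved. It is not the code from this MR: the helper names (`contiguousIndex`, `interleavedIndex`, `decompose`) are hypothetical, and the layouts are assumptions. "Interleaved" is taken to mean that the same quadrature point of consecutive elements is stored at consecutive addresses, and "1D indexing" to mean that a flat thread index is unpacked into tensor-product coordinates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element-major layout, where the nq quadrature
// points of each element are stored contiguously.
inline std::size_t contiguousIndex(std::size_t e, std::size_t q,
                                   std::size_t nq)
{
    assert(q < nq);
    return e * nq + q;
}

// Hypothetical helper: interleaved (point-major) layout, where point q of
// all nElmt elements is stored contiguously. If one thread handles one
// element, neighbouring threads then read neighbouring addresses.
inline std::size_t interleavedIndex(std::size_t e, std::size_t q,
                                    std::size_t nElmt)
{
    assert(e < nElmt);
    return q * nElmt + e;
}

// Tensor-product coordinates of a quadrature point.
struct Index3D
{
    std::size_t i, j, k;
};

// Hypothetical helper: unpack a flat (1D) index into (i, j, k), assuming
// i runs fastest, then j, then k.
inline Index3D decompose(std::size_t tid, std::size_t nq0, std::size_t nq1)
{
    assert(nq0 > 0 && nq1 > 0);
    return Index3D{tid % nq0, (tid / nq0) % nq1, tid / (nq0 * nq1)};
}
```

With the interleaved layout, elements `e` and `e + 1` are one entry apart at a fixed point `q`. With the contiguous layout they are `nq` entries apart.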
To do (future MRs):
- Interleave data for non-QP kernels
- Consider templating nq0, nq1, nq2 parameters
- Optimize CUDA grid parameters
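To show what templating nq0, nq1, nq2 could look like, here is a hedged host-side C++ sketch for a single 2D element. It is an assumption about the general technique, not the planned implementation. The function names, the row-major derivative matrix `D`, and the `i`-fastest data layout are all hypothetical. The runtime version takes the point counts as arguments, while the templated version fixes them at compile time, so the compiler knows the loop bounds and can unroll them.

```cpp
#include <cstddef>

// Runtime sizes: derivative along direction 0 on one nq0 x nq1 element.
// Assumed layout: in/out index is i + j * nq0, D is row-major (nq0 x nq0).
inline void derivDir0Runtime(std::size_t nq0, std::size_t nq1,
                             const double *D, const double *in, double *out)
{
    for (std::size_t j = 0; j < nq1; ++j)
    {
        for (std::size_t i = 0; i < nq0; ++i)
        {
            double sum = 0.0;
            for (std::size_t p = 0; p < nq0; ++p)
            {
                sum += D[i * nq0 + p] * in[p + j * nq0];
            }
            out[i + j * nq0] = sum;
        }
    }
}

// Compile-time sizes: same arithmetic, but NQ0 and NQ1 are template
// parameters, so all loop bounds are constants known to the compiler.
template <std::size_t NQ0, std::size_t NQ1>
inline void derivDir0Static(const double *D, const double *in, double *out)
{
    for (std::size_t j = 0; j < NQ1; ++j)
    {
        for (std::size_t i = 0; i < NQ0; ++i)
        {
            double sum = 0.0;
            for (std::size_t p = 0; p < NQ0; ++p)
            {
                sum += D[i * NQ0 + p] * in[p + j * NQ0];
            }
            out[i + j * NQ0] = sum;
        }
    }
}
```

The trade-off is that every (NQ0, NQ1) combination in use needs its own instantiation, which increases compile time and binary size.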
Proposed solution
Implementation
Tests
Suggested reviewers
Please suggest any people who would be appropriate to review your code.
Notes
Please add any other information that could be useful for reviewers.
Checklist
- Functions and classes, or changes to them, are documented.
- [ ] User guide/documentation is updated.
- [ ] Changelog is updated.
- [ ] Suitable tests added for new functionality.
- Contributed code is correctly formatted. (See the contributing guidelines).
- [ ] License added to any new files.
- No extraneous files have been added (e.g. compiler output or test data files).
Edited by Jacques Xing