Split Serial, AVX, Kokkos, and CUDA implementation of math kernels, add MathKernel unit tests, fixes for padding and defaulting
Issue/feature addressed
- Split the Serial, AVX, Kokkos, and CUDA implementations of the Math kernels into separate files (Note: this allows back-end-specific implementations for better efficiency)
- Complete the Serial and Kokkos implementations of some basic Math kernels
- Add an AVX implementation of the Math kernels. The implementation is currently based on the legacy `VmathSIMD.hpp` and uses explicit unrolling. However, aligned memory is assumed here; using non-aligned memory with these Math kernels will result in a memory fault. (This might be of interest to you, @xby2233; please feel free to further optimize these kernels if you wish. Previously there was no explicit vectorization of these Math kernels in the redesign branch.)
- Add unit tests for the Serial, AVX, and Kokkos Math kernels (previously only the CUDA Math kernels were covered)
- Use a static variable for memory allocation in the CUDA Math kernels for better efficiency (Note: this static variable is currently not explicitly deallocated. A possible solution is to encapsulate the Math kernels in a class to better manage the static variable's allocation/deallocation; the Math kernel API would not change. @ARSanderson)
- Add CUDA implementations of `parallel_reduce` and `parallel_for` (Note: while convenient to use, CUDA `parallel_reduce` and `parallel_for`, at least in their current form, prevent vector loading, which can improve performance by up to 20%)
- Add a Field/MemoryRegion-based wrapper for the MathKernel functions to allow automatic defaulting behaviour without repeated manual calls to `GetPtr`
- Add provision for padding consideration to the MathKernel functions wrapper
- Fix `DiagPrecond` for padding consideration (aka "alignmentMismatch")
- Fix `ConjGrad` for padding consideration (aka "alignmentMismatch")
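As a rough illustration of the Field/MemoryRegion wrapper idea, the sketch below shows the pattern of resolving the raw pointer via `GetPtr` and applying the padded size once inside the wrapper, so call sites no longer repeat that boilerplate. All names here (`Field`, `ScalKernel`, `Scal`, `nUnpadded`) are hypothetical stand-ins, not the actual Nektar++ API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Field/MemoryRegion type with padded storage.
struct Field
{
    std::vector<double> data;      // padded storage
    std::size_t         nUnpadded; // logical (unpadded) element count
    double       *GetPtr()       { return data.data(); }
    const double *GetPtr() const { return data.data(); }
};

// Raw pointer-based kernel, as in the back-end implementations.
inline void ScalKernel(std::size_t n, double alpha, const double *x, double *y)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        y[i] = alpha * x[i];
    }
}

// Wrapper: call sites pass Fields; GetPtr and the padded size are applied
// here once, instead of manually at every call site.
inline void Scal(double alpha, const Field &x, Field &y)
{
    // Operating on the padded size lets SIMD kernels assume full chunks.
    ScalKernel(x.data.size(), alpha, x.GetPtr(), y.GetPtr());
}
```

The same pattern extends to defaulting behaviour (e.g. picking a default output region when none is supplied), which is what removes the repeated manual `GetPtr` calls at the call sites.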
Proposed solution
Implementation
Two sets of benchmark results are presented below.
daxpy
Results are obtained with:
- Serial run with SIMD-off build
- Kokkos run with SIMD-off build
- Serial run with SIMD-on build (auto-vectorization)
- AVX run (explicit vectorization and unrolling)
- AVX run (`parallel_for`)
- Kokkos run with SIMD-on build (auto-vectorization)
- Kokkos run with SIMD-on build (explicit vectorization)
The explicitly vectorized Kokkos implementation is given by (Note: it does not handle spill-over):

```cpp
using namespace tinysimd;
using vec_t = simd<TData>;

Nektar::parallel_for<ExecSpace>(0, nsize / vec_t::width, [&](const unsigned int i) {
    vec_t xChunk;
    xChunk.load(x + i * vec_t::width, is_aligned);
    vec_t yChunk;
    yChunk.load(y + i * vec_t::width, is_aligned);
    yChunk.fma(vec_t(xChunk), alpha);
    yChunk.store(z + i * vec_t::width, is_aligned);
});
```
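The spill-over (remainder) elements that the snippet above skips would typically be handled with a scalar tail loop. The sketch below is a plain-C++ stand-in for that pattern (no tinysimd; `WIDTH` and `daxpy_with_tail` are illustrative names, with `WIDTH` playing the role of `vec_t::width`), not the MR implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t WIDTH = 4; // stand-in for vec_t::width

// daxpy-style kernel z = alpha * x + y: a main loop over full WIDTH-sized
// chunks (one vec_t load/fma/store per chunk in the real kernel), followed
// by a scalar tail for the n % WIDTH trailing elements.
void daxpy_with_tail(double alpha, const std::vector<double> &x,
                     const std::vector<double> &y, std::vector<double> &z)
{
    const std::size_t n    = x.size();
    const std::size_t nVec = (n / WIDTH) * WIDTH; // last full-chunk boundary

    // Main loop: written per element here; each group of WIDTH iterations
    // corresponds to one vectorized chunk in the SIMD kernel.
    for (std::size_t i = 0; i < nVec; ++i)
    {
        z[i] = alpha * x[i] + y[i];
    }

    // Scalar spill-over loop for the remaining elements.
    for (std::size_t i = nVec; i < n; ++i)
    {
        z[i] = alpha * x[i] + y[i];
    }
}
```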
Note:
- AVX: explicit vectorization and unrolling is the current MR implementation
- Kokkos: auto-vectorization is the current MR implementation
Observation:
- Significant overhead associated with the Kokkos `parallel_for` launch

reduction
Results are obtained with:
- Serial run with SIMD-off build
- Kokkos run with SIMD-off build
- Serial run with SIMD-on build (auto-vectorization)
- AVX run (explicit vectorization and unrolling)
- AVX run (explicitly vectorized `parallel_reduce`)
- Kokkos run with SIMD-on build (auto-vectorization)
The explicitly vectorized `parallel_reduce` implementation is given by (Note: it does not handle spill-over):
```cpp
using namespace tinysimd;
using vec_t = simd<TData>;

vec_t outChunk;
*out = 0.0;

Nektar::parallel_reduce<ExecSpace, NektarSpaces::Sum<vec_t>>(
    0, nsize / vec_t::width,
    [&](const unsigned int i, vec_t &sum) {
        vec_t xChunk;
        xChunk.load(x + i * vec_t::width, is_aligned);
        vec_t yChunk;
        yChunk.load(y + i * vec_t::width, is_aligned);
        sum.fma(vec_t(xChunk), vec_t(yChunk));
    },
    outChunk);

alignas(vec_t::alignment) typename vec_t::scalarArray tmp;
outChunk.store(tmp, is_aligned);
for (unsigned int i = 0; i < vec_t::width; i++)
{
    *out += tmp[i];
}
```
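The same spill-over caveat applies to the reduction above. The sketch below shows the overall pattern in plain, self-contained C++ (illustrative names only; `WIDTH` stands in for `vec_t::width`, and a `WIDTH`-wide partial-sum array stands in for the `vec_t` accumulator): per-lane accumulation over full chunks, a horizontal sum like the `tmp` store-and-accumulate step, then a scalar tail:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t WIDTH = 4; // stand-in for vec_t::width

// Dot product with chunked accumulation and a scalar spill-over loop.
double dot_with_tail(const std::vector<double> &x, const std::vector<double> &y)
{
    const std::size_t n    = x.size();
    const std::size_t nVec = (n / WIDTH) * WIDTH;

    // Plays the role of the vec_t accumulator (outChunk/sum above).
    std::array<double, WIDTH> partial{};
    for (std::size_t i = 0; i < nVec; i += WIDTH)
    {
        for (std::size_t l = 0; l < WIDTH; ++l)
        {
            partial[l] += x[i + l] * y[i + l]; // per-lane fma
        }
    }

    // Horizontal sum, like the tmp store/accumulate step above.
    double out = 0.0;
    for (double p : partial)
    {
        out += p;
    }

    // Scalar tail for the n % WIDTH trailing elements.
    for (std::size_t i = nVec; i < n; ++i)
    {
        out += x[i] * y[i];
    }
    return out;
}
```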
Note:
- AVX: explicit vectorization and unrolling is the current MR implementation
- Kokkos: auto-vectorization is the current MR implementation
Observation:
- Kokkos run with SIMD-on build (auto-vectorization) performs worse than the Kokkos run with SIMD-off build
- The explicitly vectorized `parallel_reduce` does not work with Kokkos because of a template instantiation error with `vec_t`
- Significant overhead associated with the Kokkos `parallel_reduce` launch
Tests
Suggested reviewers
Please suggest any people who would be appropriate to review your code.
Notes
Please add any other information that could be useful for reviewers.
Checklist
- [x] Functions and classes, or changes to them, are documented.
- [ ] User guide/documentation is updated.
- [ ] Changelog is updated.
- [x] Suitable tests added for new functionality.
- [x] Contributed code is correctly formatted. (See the contributing guidelines.)
- [x] License added to any new files.
- [x] No extraneous files have been added (e.g. compiler output or test data files).
Edited by Jacques Xing