
Fix padding and vector width for device backend and add initial multi-GPU implementation for ConjGrad

Issue/feature addressed

  • Update the vector width to account for the CUDA warp size of 32 and the AMD wavefront size of 64.
  • Add device (CUDA, Kokkos, and SYCL) backends for interleave/deinterleave (see the sketch after this list).
  • Fix DirBndCondImpl.hpp for padding
  • Fix HelmholtzImplShared.hpp for padding
  • Fix HelmholtzSerialStdMat.hpp for padding
  • Fix MassImpl.hpp for padding
  • Fix AssmbScatrImplShared.hpp for padding
  • Tidy MathKernels
  • Tidy ConjGradImpl.hpp
  • Tidy AssmbScatr operator implementation
  • Remove Identity operators
  • Remove Matrix operators
  • Add CommDataType specialization based on MemoryRegion
  • Initial multi-GPU implementation for ConjGrad
  • Remove default argument values for alignment and vector width in the Field and MemoryRegion objects.
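
As an illustration of the layout transform these backends implement, below is a minimal host-side sketch; the function name, signature, and zero-padding of the trailing partial block are assumptions for illustration, not the actual Nektar++ device kernels.

#include <cstddef>

// Minimal sketch: interleave nElmt element blocks of blockSize entries each
// into groups of `width` (warp/wavefront size), so that `width` consecutive
// threads read/write contiguous memory. Deinterleave is the inverse mapping.
template <class T>
void Interleave(const T *in, T *out, std::size_t nElmt, std::size_t blockSize,
                std::size_t width)
{
    const std::size_t nGroups = (nElmt + width - 1) / width; // padded groups
    for (std::size_t g = 0; g < nGroups; ++g)
    {
        for (std::size_t i = 0; i < blockSize; ++i)
        {
            for (std::size_t lane = 0; lane < width; ++lane)
            {
                const std::size_t e = g * width + lane;
                out[(g * blockSize + i) * width + lane] =
                    (e < nElmt) ? in[e * blockSize + i] : T(0); // zero-pad tail
            }
        }
    }
}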

Proposed solution

  1. Specify device vector width in NektarSpaces
class Serial
{
public:
    using memory_space = NektarSpaces::HostSpace;
    static constexpr size_t width     = vec_t::width;
    static constexpr size_t alignment = vec_t::alignment;
};

class AVX
{
public:
    using memory_space = NektarSpaces::HostSpace;
    static constexpr size_t width     = vec_t::width;
    static constexpr size_t alignment = vec_t::alignment;
};

class CUDA
{
public:
    using memory_space = NektarSpaces::DeviceSpace;
    static constexpr size_t width     = 32;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
};

class HIP
{
public:
    using memory_space = NektarSpaces::DeviceSpace;
    static constexpr size_t width     = 64;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
};

class SYCL
{
public:
    using memory_space = NektarSpaces::DeviceSpace;
    // SYCLQueue::GetInstance().get_device().get_info<sycl::info::device::sub_group_sizes>();
    static constexpr size_t width     = 64;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
};

class KOKKOS
{
public:
    using memory_space = NektarSpaces::DeviceSpace;
#if defined(KOKKOS_ENABLE_CUDA)
    static constexpr size_t width     = Kokkos::Impl::CudaTraits::WarpSize;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#elif defined(KOKKOS_ENABLE_HIP)
    static constexpr size_t width     = Kokkos::Impl::HIPTraits::WarpSize;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#elif defined(KOKKOS_ENABLE_SYCL)
    static constexpr size_t width     = 64;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#else
    static constexpr size_t width     = vec_t::width;
    static constexpr size_t alignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#endif
};
  2. Use the backend-specific vector width (instead of vec_t::width) when calling GetBlockAttributes
GetBlockAttributes(FieldState::Coeff, expansionList, ExecSpace::width);
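
The padding implied by this width follows the usual round-up-to-multiple pattern; here is a sketch with an illustrative helper name (PadToWidth is not part of the actual code):

#include <cstddef>

// Illustrative helper: round a count up to a multiple of the backend width
// (e.g. 32 for CUDA, 64 for HIP), mirroring the padding applied above.
template <class ExecSpace>
constexpr std::size_t PadToWidth(std::size_t n)
{
    return ((n + ExecSpace::width - 1) / ExecSpace::width) * ExecSpace::width;
}
// Example: PadToWidth<CUDA>(100) == 128, PadToWidth<HIP>(100) == 128.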
  3. Add CommDataType specialization based on MemoryRegion
namespace Nektar::LibUtilities
{
/**
 * Partial specialisation for memory region
 */
template <class elemT> class CommDataTypeTraits<MemoryRegion<elemT>>
{
public:
    static CommDataType &GetDataType()
    {
        return CommDataTypeTraits<elemT>::GetDataType();
    }
    static void *GetPointer(MemoryRegion<elemT> &val)
    {
        return val.template GetPtr<NektarSpaces::HostSpace, ReadWrite>();
    }
    static const void *GetPointer(const MemoryRegion<elemT> &val)
    {
        return val.template GetPtr<NektarSpaces::HostSpace, ReadWrite>();
    }
    static size_t GetCount(const MemoryRegion<elemT> &val)
    {
        return val.size();
    }
    const static bool IsVector = true;
};

} // namespace Nektar::LibUtilities

This allows the MPI communicator to be called directly on a MemoryRegion without first copying to an Array:

        m_rowComm->AllReduce(m_vExchange, Nektar::LibUtilities::ReduceSum);

Here, m_vExchange is a MemoryRegion object.
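
For context, a minimal sketch of how this fits into the multi-GPU ConjGrad reduction step follows; the three-entry layout and the constructor shown are illustrative assumptions rather than the exact implementation.

// Illustrative only: per-rank partial inner products are accumulated into a
// MemoryRegion by device-side reductions, then summed across ranks in place.
// The CommDataTypeTraits specialisation above supplies the host pointer,
// element count, and datatype directly from the MemoryRegion.
MemoryRegion<NekDouble> m_vExchange(3); // hypothetical: three partial sums
// ... device kernels write the local partial sums into m_vExchange ...
m_rowComm->AllReduce(m_vExchange, Nektar::LibUtilities::ReduceSum);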

Implementation

Tests

  • FwdTrans working with 2 processors for the Serial, Kokkos-Serial, and Kokkos-CUDA backends
  • HelmSolve working with 2 processors for the Serial, CUDA, Kokkos-Serial, and Kokkos-CUDA backends

Note: Current results suggest there are still some memory bugs somewhere in the code base, but the basic infrastructure for MPI + GPU is in place.

Suggested reviewers

Please suggest any people who would be appropriate to review your code.

Notes

Please add any other information that could be useful for reviewers.

Checklist

  • [x] Functions and classes, or changes to them, are documented.
  • [ ] User guide/documentation is updated.
  • [ ] Changelog is updated.
  • [x] Suitable tests added for new functionality.
  • [x] Contributed code is correctly formatted. (See the contributing guidelines.)
  • [x] License added to any new files.
  • [x] No extraneous files have been added (e.g. compiler output or test data files).