Feature/redesign/execution

James Edgeley requested to merge jedgeley/nektar:execution into feature/redesign

Issue/feature addressed

Addition of Execution library to provide a DAG execution structure for efficient ordering of the redesign operators

Proposed solution

This MR adds the standalone library but does not connect it to other code. It currently requires C++20.

Implementation

The Execution library aims to improve the performance of Nektar++ operations by reducing them to bare-bones kernels, arranging the kernels into a directed acyclic graph (DAG) which respects their dependencies, and then executing the kernels using the full range of hardware available to the user. Part of the optimisation comes from avoiding unnecessary copies and allocations that would otherwise be repeated for every time step or iteration in a solver, at the cost of a single graph setup phase. Further performance improvements can come from spreading the workload over a range of devices: in some cases one operation can be split into several nodes, executed on different devices, each using the optimal algorithm for its particular combination of device, geometry and order.

Currently the library offers a choice of backends of type ExecutionModel, which determine the specialisations for a number of key template classes. The chosen backend is held in the Execution_t type alias, as sketched after the following list. The options are:

  • SerialExecutionModel: Nodes are executed sequentially on the CPU.
  • ThreadExecutionModel: Nodes may be executed in parallel where possible, using CPU threads.
  • SyclExecutionModel (WIP): Nodes are executed by a SYCL queue.
  • CudaExecutionModel: Nodes are executed in parallel on the GPU and CPU. Nodes on the GPU can contain a CUDA kernel, providing multiple levels of parallelism.
  • DebugExecutionModel: There is no graph; kernels are instead executed as they are added. This may not produce the correct result if nodes are added out of order.
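A minimal sketch of how backend selection might look (the exact selection mechanism is not shown in this MR, so the compile-time alias below is an assumption):

```cpp
// Hypothetical sketch: selecting a backend at compile time via the
// Execution_t alias. The library's actual selection mechanism may differ.
using Execution_t = ThreadExecutionModel;

// The per-backend specialisations then follow from this alias:
using Node_t  = NodeDetail<Execution_t>;
using Graph_t = GraphDetail<Execution_t>;
```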

The important template classes for this library are as follows:

  • GraphDetail: This class contains a list of NodeDetail objects, and information about their connectivity.
  • NodeDetail: This class typically contains a kernel, or another operation such as a copy or allocation. The type of operation (kernel or otherwise) is held in a std::variant. The operation in the node is performed only when all prior connected nodes have completed execution.

The typedefs Node_t = NodeDetail<Execution_t> and Graph_t = GraphDetail<Execution_t> are frequently employed.

All GraphDetail specialisations have functionality in common (see the sketch after this list):

  • A call to Launch() commences execution of the graph structure, starting with the root nodes.
  • A call to Wait() prevents further progression through the program until all of the leaf nodes have completed.
  • A call to Print(std::ostream &os) writes the structure of the graph to the os stream in DOT format.
  • AddDependency(Node_t *pre, Node_t *suc) adds an edge from *pre to *suc.
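A minimal usage sketch of this common interface, assuming a and b are nodes obtained from the node-construction API (which is not detailed in this MR):

```cpp
#include <iostream>

// Minimal sketch of the common GraphDetail interface described above.
Graph_t graph;
Node_t *a = /* node wrapping kernel A (construction API not shown here) */ nullptr;
Node_t *b = /* node wrapping kernel B */ nullptr;

graph.AddDependency(a, b); // b executes only after a has completed
graph.Launch();            // begin execution from the root nodes
graph.Wait();              // block until all leaf nodes have finished
graph.Print(std::cout);    // write the DAG to stdout in DOT format
```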

GraphDetail<ThreadExecutionModel> and GraphDetail<SerialExecutionModel> contain an instance of the Executor template class, which handles the specifics of the execution ordering. GraphDetail<CudaExecutionModel> contains a CUGraph, which handles execution using the CUDA driver API's own graph functionality.

Objects referred to in the graph (for example the arguments of any kernels) must remain in scope until execution has completed; otherwise the kernel will not be able to find them when it is executed.
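For example, the following hypothetical pattern would be invalid, because the argument is destroyed before the graph runs:

```cpp
#include <vector>

// Hypothetical pitfall sketch: a kernel argument that does not outlive
// execution of the graph.
Graph_t graph;
{
    std::vector<double> coeffs(128); // kernel argument with block-local lifetime
    /* ... add a node whose kernel reads coeffs ... */
} // coeffs is destroyed here, but the node still refers to it
graph.Launch(); // undefined behaviour: the kernel sees a dangling reference
graph.Wait();
```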

Calls to MPI functions are also possible, to allow for communication between graphs on different ranks; there are dedicated NodeDetail variants for this purpose. In CudaExecutionModel, an assigned CPU thread repeatedly checks a memory flag whose value indicates whether it should make an MPI call. CUDA memory operations are used to write and wait on this flag, signalling when the CPU should make the call and when the call has been made (this feature currently necessitates the use of the driver API).
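A hedged sketch of this handshake, using the driver API's stream memory operations (the flag protocol and function names below are illustrative, not the library's actual code):

```cpp
#include <cuda.h>
#include <mpi.h>

// Illustrative flag protocol: 0 = idle, 1 = MPI call requested, 2 = call done.
void enqueueMpiHandshake(CUstream stream, CUdeviceptr flagDev)
{
    // Enqueued on the stream: raise the flag, then block the stream until
    // the CPU proxy thread acknowledges that the MPI call has been made.
    cuStreamWriteValue32(stream, flagDev, 1u, CU_STREAM_WRITE_VALUE_DEFAULT);
    cuStreamWaitValue32(stream, flagDev, 2u, CU_STREAM_WAIT_VALUE_EQ);
}

// Runs on the assigned CPU thread: spin on the host-visible mapping of the
// flag, perform the communication, then acknowledge so the stream proceeds.
void proxyThreadBody(volatile unsigned int *flagHost, MPI_Comm comm)
{
    while (*flagHost != 1u) { /* spin */ }
    MPI_Barrier(comm); // stand-in for the node's actual MPI call
    *flagHost = 2u;
}
```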

It is also possible to condense a graph into a subgraph node in another graph; this is useful for composite operators. With some restrictions, it is possible to mix execution models: for instance, a GraphDetail<ThreadExecutionModel> can be subsumed into a NodeDetail<CudaExecutionModel> in a GraphDetail<CudaExecutionModel>, allowing work to be performed simultaneously on the CPU and GPU. The variable HostExecution_t determines which ExecutionModel (SerialExecutionModel or ThreadExecutionModel) should be used by CudaExecutionModel for host-side operations.
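A hypothetical sketch of such nesting (AddSubGraph is an invented name; the MR does not specify the embedding API):

```cpp
// Hypothetical sketch: embedding a CPU-thread graph as one node of a CUDA
// graph, so host and device work can proceed simultaneously.
GraphDetail<ThreadExecutionModel> hostGraph; // host-side work
GraphDetail<CudaExecutionModel>   deviceGraph;

// AddSubGraph is illustrative only: condense hostGraph into a single node.
NodeDetail<CudaExecutionModel> *hostWork = deviceGraph.AddSubGraph(hostGraph);

// hostWork can now be ordered against GPU nodes via AddDependency.
```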

The library also provides memory management classes:

  • MemoryRegion, which contains the data and a pointer to a MemoryResource.
  • MemoryResource, which governs how memory is allocated, copied and freed. It has several child classes, depending on the purpose of the MemoryRegions or operations that use it.

For instance, users may use:

  • CudaPinnedMemory<CudaHostAllocWriter::HostWritten> to stage host-side data that will be copied to a GPU,
  • then CudaMemory for all the intermediate device-side operations,
  • and then CudaPinnedMemory<CudaHostAllocWriter::DeviceWritten> for host-side data that is copied back from the device after execution.
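A hedged sketch of this staging pattern (the MemoryRegion constructor and copy interface are not shown in this MR, so the calls below are assumptions):

```cpp
// Hypothetical staging sketch: host -> device -> host.
MemoryRegion in  (n, CudaPinnedMemory<CudaHostAllocWriter::HostWritten>{});
MemoryRegion work(n, CudaMemory{});
MemoryRegion out (n, CudaPinnedMemory<CudaHostAllocWriter::DeviceWritten>{});

// 1. The host writes input data into 'in' (pinned, host-written).
// 2. Graph nodes copy 'in' -> 'work' and run device kernels on 'work'.
// 3. A final node copies 'work' -> 'out' (pinned, device-written) so the
//    host can read the results after graph.Wait().
```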

It is also possible to interleave data during copies, for users who wish to exploit vectorisation.
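As an illustration of the idea only (this is not the library's API), an interleaving copy that packs element j of VW consecutive blocks contiguously, ready for SIMD loads, might look like:

```cpp
#include <cstddef>

// Illustrative AoSoA-style interleave during a copy. Assumes nBlocks is a
// multiple of the vector width VW; src holds nBlocks blocks of blockLen
// elements, stored block-major.
constexpr std::size_t VW = 4; // assumed SIMD vector width

void interleaveCopy(const double *src, double *dst,
                    std::size_t nBlocks, std::size_t blockLen)
{
    for (std::size_t b = 0; b < nBlocks; b += VW)  // groups of VW blocks
        for (std::size_t j = 0; j < blockLen; ++j) // element within block
            for (std::size_t k = 0; k < VW; ++k)   // lane within group
                dst[(b / VW) * blockLen * VW + j * VW + k] =
                    src[(b + k) * blockLen + j];
}
```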

Tests

Some basic tests have been added to UnitTests.

Suggested reviewers

Please suggest any people who would be appropriate to review your code.

Notes

Please add any other information that could be useful for reviewers.

Checklist

  • Functions and classes, or changes to them, are documented.
  • User guide/documentation is updated.
  • Changelog is updated.
  • Suitable tests added for new functionality.
  • Contributed code is correctly formatted. (See the contributing guidelines).
  • License added to any new files.
  • No extraneous files have been added (e.g. compiler output or test data files).