Encapsulate kernels/operators to facilitate the building of higher-level fused operators.
Motivation
When building higher-level operators, the design below should be more efficient (better cache efficiency)
Fused structure
// Fused structure (pseudocode sketch): all kernels of the operator chain run
// inside the SAME per-meta-block loop, so the intermediate result (tmp) only
// needs to hold one meta-block and stays cache-resident.
Operator(Field &in, Field &out)
{
// Some initialization here...
for (auto &block: in.GetBlocks())
{
// Some initialization here...
for (size_t e = 0; e < MetaBlockNum; e++)
{
// First kernel: reads the input pointer, writes into a small temporary.
Oper0_kernel(inptr, tmp, JacPtr, m_B,...);
// ...
// more kernels...
// ...
// Last kernel: consumes the temporary and writes the output pointer.
OperN_kernel(tmp, outptr, dfPtr, m_D,...);
// incremental pointers and subscripts here...
}
// incremental pointers and subscripts here...
}
}
than this design:
Flat structure
// Flat structure (pseudocode sketch): each sub-operator runs its own complete
// block/meta-block loop over the whole field, so each stage's intermediate
// result must be materialised for ALL elements before the next stage starts —
// larger temporaries, worse cache reuse.
Operator(Field &in, Field &out)
{
//Operator0(in, tmp1);
{
// Some initialization...
for (auto &block: in.GetBlocks())
{
// Some initialization...
for (size_t e = 0; e < MetaBlock; e++)
{
Oper0_kernel(inptr, tmp, wsp, JacPtr, m_B,...);
// incremental pointers, subscripts, etc...
}
}
}
// more operators...
//operatorN(tmp, out)
{
// Some initialization...
for (auto &block: in.GetBlocks())
{
// Some initialization...
for (size_t e = 0; e < MetaBlock; e++)
{
OperN_kernel(inptr, tmp, wsp, JacPtr, m_B,...);
// incremental pointers, subscripts, etc...
}
}
}
}
The fused structure requires much smaller temporary workspaces and improves the cache efficiency, which is vital for CPU-end and therefore used in MatFree implementation.
In the current MatFree implementation, we have to handle all of the raw pointers, temporary output and workspace manually, which is complicated and error-prone.
The redesigned operator base class so far is only suitable for the flat structure design. For example, HelmholtzCUDA and HelmholtzStdMat are built in a flat structure, which takes minimal effort.
In the fused structure, building a higher-level operator is essentially a puzzle game of re-arranging the kernel functions. So the question is: How to encapsulate kernels/operators so that we can easily build higher-level operators in the fused structure?
Challenges
- How to preserve efficiency after encapsulation?
- How to create a nested operator?
What should the Fused structure be like after the encapsulation?
Here is my proposal: (incomplete, draft)
Operator
- Inputs and outputs of the sub-operators should be exposed to the users.
- Basis data, geometric data, and internal workspace should be hidden inside each sub-operator.
- Users have the right to decide whether sub-operators share the same geometric or basis data set.
Class UserOper
{
UserOper(ExpList &expansionList)
{
auto blocks = GetBlockAttributes(FieldState::Phys, expansionList, vec_t::alignment);
// Geom data set is created by linking to a explist.
m_geomDataSet = VectorisedGeomData::create(expansionList, blocks)
for (size block = 0; block < in.GetBlocks().size(); block++)
{
// find some way to instantiate templates here...
m_operator0Kernel[block] = OperatorKernel_0<>(m_geomDataSet,...);
m_operator1Kernel[block] = OperatorKernel_1<>(m_geomDataSet,...);
// wsp handled by m_geomDataSet. to unify the interface, it always
// return a workbench (e.g. m_wsp has only 1 metaBlock)
m_wsp = m_geomDataSet->GetWSP(m_operator0Kernel[block]->GetOutputSize() );
}
}
void Apply(Field &in, Field &out)
{
// Workbench will handle the data pointer and offset automatically.
// users doesn'handle them directly.
MultiBlockWorkbench inWB = in.GetWorkbench();
MultiBlockWorkbench outWB = out.GetWorkbench();
for (size block = 0; block < in.GetBlocks().size(); block++)
{
// if operator uses geometric data. he will also handled
m_operator0Kernel[block].initializeWorkbench(block);
while (!inWorkbench.MetaBlockEnd())
{
// m_B, JacPtr, wsp, are handled internally
m_operator0Kernel[block]->Operate(inWB.GetPtr(), m_wsp.GetPtr());
// m_D, dfPtr, wsp, are handled internally
m_operator1Kernel[block]->Operate(m_wsp.GetPtr(), outWB.GetPtr());
// incremental pointers, subscripts, etc...
inWB.moveToNextMetaBlock();
outWB.moveToNextMetaBlock();
m_operator0Kernel[block].moveToNextMetaBlock();
m_operator1Kernel[block].moveToNextMetaBlock();
}
inWB.moveToNextBlock();
inWB.moveToNextBlock();
}
}
private:
VectorisedGeomDataSharedPtr m_geomDataSet;
Workbench m_wsp;
}
Kernel
- The original MatFree kernel functions are now wrapped into Kernel classes.
- Make sure that operate() can be inlined.
- We can build higher-level kernels by including sub-kernels as class members. Here is an example:
// The template parameter is dimensional-dependent. Set to 0 by default.
// BwdTransKernel<Seg, 4, 4>
// BwdTransKernel<Quad, 4, 5, 4, 5>
template <LibUtilities::ShapeType SHAPE_TYPE,
int nm0, int nq0,
int nm1=0, int nq1=0,
int nm2=0, int nq2=0>
Class BwdTransKernel
{
BwdTransKernel(MultiRegions::ExpListSharedPtr &expansionList,
std::vector<BlockAttributes> &blocks, int blockid,
VectorisedGeomDataSharedPtr &VectorisedGeomData)
{
constexpr DIM = GetShapeDimension(SHAPE_TYPE);
for (int d = 0; d < DIM; d++)
{
m_B[d] = VectorisedGeomData->GetB(blockid, d);
}
if constexpr (DIM == 2)
{
BwdTrans2DWorkspace<SHAPE_TYPE>(nm0, nm1, nq0, nq1, wsp0Size);
m_wsp0 = VectorisedGeomData->GetWSP(wsp0Size, 0);
}
if constexpr (DIM == 3)
{
BwdTrans3DWorkspace<SHAPE_TYPE>(nm0, nm1, nm2, nq0, nq1, nq2,
wsp0Size, wsp1Size);
// differernt wsp is marked with id 0,1,2...
// wsp with different id will never overlap.
// wsp with same id may overlap.
// if no id is given, allocate new workspace
m_wsp0 = VectorisedGeomData->GetWSP(wsp0Size, 0);
m_wsp1 = VectorisedGeomData->GetWSP(wsp0Size, 1);
}
}
inline Operate(TData *in, TData *out)
{
constexpr DIM = GetShapeDimension(SHAPE_TYPE);
if constexpr (DIM == 1)
{
// also force inline
BwdTrans1DKernel<SHAPE_TYPE>(nm0, nq0, in, *(m_B[0]), out);
}
if constexpr (DIM == 2)
{
// also force inline
BwdTrans2DKernel<SHAPE_TYPE>(nm0, nm1, nq0, nq1, correct, in,
*(m_B[0]), *(m_B[1]), m_wsp0.GetPtr(), out);
}
if constexpr (DIM == 3)
{
// also force inline
BwdTrans3DKernel<SHAPE_TYPE>(nm0, nm1, nm2, nq0, nq1, nq2, correct,
in, *(m_B[0]), *(m_B[1]), *(m_B[2]),
m_wsp0.GetPtr(), m_wsp1.GetPtr(), out;
}
}
private:
std::array< std::shared_ptr<VecVec_t>, 3> m_B;
Workbench m_wsp0, m_wsp1;
}
template <LibUtilities::ShapeType SHAPE_TYPE, bool DEFORMED,
int nm0, int nq0,
int nm1=0, int nq1=0,
int nm2=0, int nq2=0>
Class IProductKernel
{
IProductKernel(MultiRegions::ExpListSharedPtr &expansionList,
std::vector<BlockAttributes> &blocks, int blockid,
VectorisedGeomDataSharedPtr &VectorisedGeomData)
{
constexpr DIM = GetShapeDimension(SHAPE_TYPE);
for (int d = 0; d < DIM; d++)
{
m_B[d] = VectorisedGeomData->GetB(basisKey?);
}
m_Jac = VectorisedGeomData->GetJ(blockid);
}
inline Operate(TData *in, TData *out)
{
constexpr DIM = GetShapeDimension(SHAPE_TYPE);
if constexpr (DIM == 1)
{
IProduct1DKernel<SHAPE_TYPE, false, false, DEFORMED>(
nm0, nq0, in, *(m_B[0]), *(m_w[0]), m_Jac.GetPtr(), out);
}
if constexpr (DIM == 2)
{
}
if constexpr (DIM == 3)
{
}
}
inline initiailizeWorkbench(blockid)
{
m_Jac.initialize(blockid);
}
inline moveToNextMetaBlock()
{
m_Jac.moveToNextMetaBlock();
}
private:
std::array< std::shared_ptr<VecVec_t>, 3> m_B;
Workbench m_wsp;
SingleBlockWorkbench m_Jac;
}
template <LibUtilities::ShapeType SHAPE_TYPE, bool DEFORMED,
int nm0, int nq0,
int nm1=0, int nq1=0,
int nm2=0, int nq2=0>
Class MassMatrixKernels
{
public:
MassMatrixKernel(MultiRegions::ExpListSharedPtr &expansionList,
std::vector<BlockAttributes> &blocks, int blockid, VectorisedGeomDataSharedPtr &VectorisedGeomData)
{
m_iproduct = IProductKernel<>();
m_bwdtrans = BwdTransKernel<>();
tmp = VectorisedGeomData->GetWSP( m_bwdtrans->GetOutputSize() );
}
inline Operate(TData *in, TData *out)
{
m_bwdtrans->Operate(in, tmp.GetPtr());
m_iproduct->Operate(tmp.GetPtr(), out);
}
private:
IProductKernel m_iproduct;
BwdTransKernel m_bwdtrans;
Workbench tmp;
}
Workbench
- It's a general placeholder for data. It should be as simple as possible.
- The Workbench class is for a single metablock, such as wsp.
- The SingleBlockWorkbench class is for a single block, such as m_Jac.
- The MultiBlockWorkbench class is for multiple blocks, such as Field.
Class Workbench
{
public:
Workbench()
{
}
inline void initialize(TData *dataPtr, size_t metaBlockSize)
{
m_dataPtr = dataPtr;
m_metaBlockSize = metaBlockSize;
}
inline TData *GetPtr()
{
return m_dataPtr;
}
inline size_t *GetmetaBlockSize()
{
return m_metaBlockSize;
}
TData *m_dataPtr;
size_t m_metaBlockSize;
}
Class SingleBlockWorkbench
{
public:
SingleBlockWorkbench() : public Workbench
{
}
inline void initialize(TData *dataPtr, size_t metaBlockNum, size_t metaBlockSize)
{
m_blockPtr = dataPtr;
m_metaBlockNum = metaBlockNum;
m_metaBlockSize = metaBlockSize;
m_dataPtr = m_blockPtr;
m_metaBlockID = 0;
}
inline void returnToFirstMetaBlock()
{
m_dataPtr = m_blockPtr;
m_metaBlockID = 0;
}
inline void moveToNextMetaBlock()
{
m_dataPtr += m_metaBlockSize;
m_metaBlockID++;
}
inline bool isMetaBlockEnd()
{
return m_metaBlockID == m_nMetaBlocks - 1;
}
private:
TData *m_blockPtr;
size_t m_metaBlockID;
size_t m_metaBlockNum;
}
Class MultiBlockWorkbench : public SingleBlockWorkbench
{
public:
MultiBlockWorkbench()
{
}
Create(TData *dataPtr, std::vector<BlockAttributes> &blocks, size_t width)
{
m_fieldPtr = dataPtr;
m_multiBlockInfo.resize( blocks.size() );
for (it = m_multiBlockInfo.begin(); it != m_multiBlockInfo.end(); it++)
{
// metaBlockSize
std::Get<1>(*it) = blocks[i].num_Pts * width;
if (width > 1)
{
// metaBlockNum
std::Get<2>(*it) = (blocks[i].num_elements + blocks[i].num_padding_elements) / width;
}
else // width == 1 scalar, no padding included
{
std::Get<2>(*it) = blocks[i].num_elements;
}
if (i>0)
{
std::Get<0>(*it) = std::Get<0>(*(it-1)) + blocks[i].blocksize;
}
else
{
std::Get<0>(*it) = 0;
}
}
// call base method to initialize wb to first block;
thisBlock = m_multiBlockInfo.data();
initialize(m_fieldPtr, std::get<2>(*thisBlock), std::get<1>(*thisBlock) );
}
inline void moveToNextBlock()
{
thisBlock++;
// call base method to initialize wb to certain block;
initialize( m_fieldPtr + std::get<0>(*thisBlock),
std::get<2>(*thisBlock), std::get<1>(*thisBlock) );
}
inline void moveToBlock(size_t blockid)
{
thisBlock = &m_multiBlockInfo[blockid];
initialize( m_fieldPtr + std::get<0>(*thisBlock),
std::get<2>(*thisBlock), std::get<1>(*thisBlock) ); );
}
inline bool isBlockEnd()
{
return thisBlock == m_multiBlockInfo.end();
}
private;
TData *m_fieldPtr;
// typedef std::tuple<size_t, size_t, size_t> BlockInfo
// = {BlockOffset, metaBlockSize, metaBlockNum};
BlockInfo *thisBlock;
std::vector<BlockInfo> m_multiBlockInfo;
}