Implement DiagPrecon operator for CUDA
This MR implements the CUDA version of the DiagPrecon operator.
The StdMat version of the DiagPrecon operator has also been updated to use the redesigned AssmbScat operator.
Note: The current CUDA implementation of the DiagPrecon operator applies the preconditioner on the GPU, but the preconditioner itself is still computed on the CPU during initialization (to be updated).
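
For reviewers unfamiliar with the split described in the note, here is a minimal host-side C++ sketch of the two phases. It assumes DiagPrecon is a diagonal (Jacobi) preconditioner, which this MR does not state explicitly. The names BuildInverseDiagonal and ApplyDiagPrecon are hypothetical and are not taken from the code in this MR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical setup phase: invert the diagonal entries once.
// In the MR as described, this step still runs on the CPU at initialization.
std::vector<double> BuildInverseDiagonal(const std::vector<double>& diag)
{
    std::vector<double> invDiag(diag.size());
    for (std::size_t i = 0; i < diag.size(); ++i)
    {
        assert(diag[i] != 0.0); // a zero diagonal entry cannot be inverted
        invDiag[i] = 1.0 / diag[i];
    }
    return invDiag;
}

// Hypothetical apply phase: out[i] = invDiag[i] * in[i].
// Each iteration is independent, so in a CUDA version one thread
// could handle one index i (assumption, not the MR's actual kernel).
std::vector<double> ApplyDiagPrecon(const std::vector<double>& invDiag,
                                    const std::vector<double>& in)
{
    assert(invDiag.size() == in.size());
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = invDiag[i] * in[i];
    }
    return out;
}
```

The setup result would be copied to the device once, and only the element-wise apply step would then run on the GPU for each solver iteration.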
Edited by Jacques Xing