Incorrect MPI used by CMake
I use openSUSE Leap 15.4, where the MPI implementation is selected through the module system. The system packages are built against OpenMPI 1, but I use OpenMPI 2 through the loaded module.
Specifically, the system OpenMPI 1 provides /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.12, while OpenMPI 2 is the one selected by my environment:
> module list
Currently Loaded Modules:
1) gnu/7 2) openmpi/2.1.6 3) phdf5/1.12.2 4) ptscotch/6.1.0
> mpi-selector --query
default:openmpi2
level:user
> mpicc --show
gcc -I/usr/lib/hpc/gnu7/mpi/openmpi/2.1.6/include -pthread -L/usr/lib64 -L/usr/lib/hpc/gnu7/mpi/openmpi/2.1.6/lib64 -lmpi
> mpirun -version
mpirun (Open MPI) 2.1.6.0.d9b9e59e5c5a
However, when configuring the build with ccmake, I can only find an on/off selection of USE_MPI; there is no MPI directory or similar option to set. During compilation and linking I then get:
Linking CXX executable ShallowWaterSolver
cd /home/lada/f/nektar-build/solvers/ShallowWaterSolver && /usr/bin/cmake -E cmake_link_script CMakeFiles/ShallowWaterSolver.dir/link.txt --verbose=1
/usr/lib/hpc/compiler/gnu/7/bin/c++ -O3 -DNDEBUG -DNEKTAR_RELEASE -pthread CMakeFiles/ShallowWaterSolver.dir/ShallowWaterSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/EquationSystems/ShallowWaterSystem.cpp.o CMakeFiles/ShallowWaterSolver.dir/EquationSystems/LinearSWE.cpp.o CMakeFiles/ShallowWaterSolver.dir/EquationSystems/MMFSWE.cpp.o CMakeFiles/ShallowWaterSolver.dir/EquationSystems/NonlinearSWE.cpp.o CMakeFiles/ShallowWaterSolver.dir/EquationSystems/NonlinearPeregrine.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/LinearSWESolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/NonlinearSWESolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/NoSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/LinearAverageSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/LinearHLLSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/AverageSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/LaxFriedrichsSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/HLLSolver.cpp.o CMakeFiles/ShallowWaterSolver.dir/RiemannSolvers/HLLCSolver.cpp.o -o ShallowWaterSolver -Wl,-rpath,/home/lada/f/nektar-build/dist/lib64/nektar++:/home/lada/f/nektar-build/library/SolverUtils:/home/lada/f/nektar-build/library/FieldUtils:/home/lada/f/nektar-build/library/GlobalMapping:/home/lada/f/nektar-build/library/MultiRegions:/home/lada/f/nektar-build/library/Collections:/home/lada/f/nektar-build/library/LocalRegions:/home/lada/f/nektar-build/library/SpatialDomains:/home/lada/f/nektar-build/library/StdRegions:/usr/lib/hpc/gnu7/openmpi2/ptscotch/6.1.0/lib64:/home/lada/f/nektar-build/library/MatrixFreeOps:/home/lada/f/nektar-build/library/LibUtilities:/usr/lib64/mpi/gcc/openmpi/lib64: ../../library/SolverUtils/libSolverUtils.so.5.3.0 ../../library/FieldUtils/libFieldUtils.so.5.3.0 ../../library/GlobalMapping/libGlobalMapping.so.5.3.0 ../../library/MultiRegions/libMultiRegions.so.5.3.0 ../../library/Collections/libCollections.so.5.3.0 
../../library/LocalRegions/libLocalRegions.so.5.3.0 ../../library/SpatialDomains/libSpatialDomains.so.5.3.0 ../../library/StdRegions/libStdRegions.so.5.3.0 /usr/lib/hpc/gnu7/openmpi2/ptscotch/6.1.0/lib64/libptscotch.so /usr/lib/hpc/gnu7/openmpi2/ptscotch/6.1.0/lib64/libptscotcherr.so ../../library/MatrixFreeOps/libMatrixFreeOps.so.5.3.0 ../../library/LibUtilities/libLibUtilities.so.5.3.0 /usr/lib64/libfftw3.so /usr/lib64/libboost_thread.so /usr/lib64/libboost_iostreams.so /usr/lib64/libboost_program_options.so /usr/lib64/libboost_filesystem.so /usr/lib64/libboost_system.so /usr/lib64/libboost_regex.so /usr/lib64/libz.so ../../ThirdParty/dist/lib/libtinyxml.a -lrt /usr/lib64/mpi/gcc/openmpi/lib64/libmpi_cxx.so /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so /usr/lib64/liblapack.so /usr/lib64/libblas.so /usr/lib/hpc/gnu7/openmpi2/hdf5/1.12.2/lib64/libhdf5.so /usr/lib/hpc/gnu7/openmpi2/hdf5/1.12.2/lib64/libhdf5_hl.so /usr/lib/hpc/gnu7/openmpi2/hdf5/1.12.2/lib64/libhdf5.so /usr/lib/hpc/gnu7/openmpi2/hdf5/1.12.2/lib64/libhdf5_hl.so /usr/lib64/libvtkIOLegacy.so.9.2.5 /usr/lib64/libvtkIOXML.so.9.2.5 /usr/lib64/libvtkIOXMLParser.so.9.2.5 /usr/lib64/libvtkIOCore.so.9.2.5 /usr/lib64/libvtkFiltersCore.so.9.2.5 /usr/lib64/libvtkCommonExecutionModel.so.9.2.5 /usr/lib64/libvtkCommonDataModel.so.9.2.5 /usr/lib64/libvtkCommonMisc.so.9.2.5 /usr/lib64/libvtkCommonTransforms.so.9.2.5 /usr/lib64/libvtkCommonMath.so.9.2.5 /usr/lib64/libvtkkissfft.so.9.2.5 /usr/lib64/libvtkCommonCore.so.9.2.5 -lpthread /usr/lib64/libvtksys.so.9.2.5 -ldl
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: warning: libmpi.so.20, needed by /usr/lib/hpc/gnu7/openmpi2/ptscotch/6.1.0/lib64/libptscotch.so, may conflict with libmpi.so.12
The link command picks up the wrong libraries, /usr/lib64/mpi/gcc/openmpi/lib64/libmpi_cxx.so and /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so (i.e. libmpi.so.12 from the system OpenMPI 1), and the linker only warns that the library needed by ptscotch expects a higher version (libmpi.so.20).
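The two installations are easy to tell apart, because Open MPI encodes its ABI in the library soname: the 1.x series ships libmpi.so.12 and the 2.x series ships libmpi.so.20, which is exactly the pair the linker warns about. A trivial illustration of the naming (just string handling, to make the version pairing explicit):

```shell
# Extract the ABI version suffix from an Open MPI library name.
soname_ver() { printf '%s\n' "${1##*.so.}"; }
soname_ver libmpi.so.12   # system OpenMPI 1 -- the one that got linked
soname_ver libmpi.so.20   # module OpenMPI 2 -- the one ptscotch needs
```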
I do not know how to persuade CMake to use the right MPI version. As a result, all tests in ctest fail with an error in MPI_INIT caused by the incompatibility. For example, I get this kind of error when running /home/lada/f/nektar-build/utilities/NekMesh/NekMesh -m bl:surf=74 -m jac:list bl_quad.xml bl_quad-out.xml:xml:test:
--------------------------------------------------------------------------
1: It looks like MPI_INIT failed for some reason; your parallel process is
1: likely to abort. There are many reasons that a parallel process can
1: fail during MPI_INIT; some of which are due to configuration or environment
1: problems. This failure appears to be an internal failure; here's some
1: additional information (which may only be relevant to an Open MPI
1: developer):
1:
1: ompi_mpi_init: ompi_rte_init failed
1: --> Returned "(null)" (-43) instead of "Success" (0)
1: --------------------------------------------------------------------------
1:
1: === Errors ===
1: *** An error occurred in MPI_Init
1: *** on a NULL communicator
1: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
1: *** and potentially your MPI job)
I get similar errors when I run it through mpirun, both with the system OpenMPI 1 and with the module's OpenMPI 2:
> /usr/lib64/mpi/gcc/openmpi/bin/mpirun -n 1 /home/lada/f/nektar-build/utilities/NekMesh/NekMesh -m bl:surf=74 -m jac:list bl_quad.xml bl_quad-out.xml:xml:test
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[meop37.suse:25916] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[54239,1],0]
Exit code: 1
--------------------------------------------------------------------------
> /usr/lib64/mpi/gcc/openmpi2/bin/mpirun -n 1 /home/lada/f/nektar-build/utilities/NekMesh/NekMesh -m bl:surf=74 -m jac:list bl_quad.xml bl_quad-out.xml:xml:test
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: meop37.suse
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[meop37.suse:25944] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[meop37.suse:25944] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[54198,1],0]
Exit code: 1
--------------------------------------------------------------------------
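What is the supported way to point the build at the module's MPI? My best guess, based on the standard FindMPI hint variables (MPI_C_COMPILER, MPI_CXX_COMPILER), would be to wipe the cache and pass the wrapper compilers explicitly, but I have not found this documented for Nektar++. The source path below is a placeholder:

```shell
# Guess: re-run CMake from the build tree with explicit FindMPI hints.
# The module is loaded, so `which mpicc` resolves to the OpenMPI 2 wrapper.
rm -f CMakeCache.txt   # drop the cached (wrong) FindMPI results
cmake -DMPI_C_COMPILER="$(which mpicc)" \
      -DMPI_CXX_COMPILER="$(which mpicxx)" \
      /path/to/nektar-source   # placeholder; adjust to the real source tree
```

Is this the right approach, or does Nektar++ expose its own option for selecting the MPI installation?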