OpenFOAM v1812 over InfiniBand
Hello everyone,
Summary
I am trying to run OpenFOAM v1812 over InfiniBand (Mellanox) with OpenMPI 4, but it crashes at launch. I wonder whether this is a compatibility issue between OpenFOAM and OpenMPI. With a simple C program I can use OpenMPI over InfiniBand, although that program does not exchange any data.
Steps to reproduce
I have an OpenFOAM case and use the following command:
foamJob -p -s snappyHexMesh
I added the following to my ~/.bashrc:
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_1:1"
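Open MPI treats any environment variable named OMPI_MCA_<param> as an MCA parameter, so it can help to confirm that both settings are really present in the shell that launches foamJob. The following is only a minimal sketch: `show_ompi_mca` is a hypothetical helper name, not part of Open MPI or OpenFOAM, and the two exports simply repeat the values above.

```shell
#!/usr/bin/env bash
# Sketch: list the Open MPI MCA parameters visible in the current environment.
# show_ompi_mca is a hypothetical helper, not an Open MPI or OpenFOAM command.
show_ompi_mca() {
    # Print every exported variable whose name starts with OMPI_MCA_, sorted.
    env | grep '^OMPI_MCA_' | sort
}

# Same settings as in ~/.bashrc above.
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_1:1"

show_ompi_mca
```

If either variable is missing from the output, the shell that runs foamJob did not pick up the ~/.bashrc settings.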
Environment information
OpenFOAM version : v1812
Operating system : Ubuntu 18.04
Hardware info : InfiniBand (Mellanox)
Compiler : gcc
Possible fixes
I am using OpenMPI 4, but OpenFOAM doesn't accept it, so I added a symbolic link from libmpi.so.40 to libmpi.so.20, because OpenFOAM looks for v2. With the link in place it actually uses v4. For a calculation on a single server this works perfectly (v4 is faster than v2).
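The link workaround can be sketched as follows. This is an illustration only, under stated assumptions: it runs in a scratch directory created with mktemp, and the libmpi.so.40 file there is an empty stand-in, not the real library. The actual library directory on the system is not shown in this report.

```shell
#!/usr/bin/env bash
# Sketch of the symlink workaround, done in a scratch directory instead of
# the real library directory. The file created here is an empty stand-in.
workdir="$(mktemp -d)"          # hypothetical scratch location
touch "$workdir/libmpi.so.40"   # stand-in for the Open MPI 4 library

# Create libmpi.so.20 as a symbolic link pointing at libmpi.so.40,
# so that a lookup of the v2 name lands on the v4 file.
ln -s libmpi.so.40 "$workdir/libmpi.so.20"

# Show where the link points.
link_target="$(readlink "$workdir/libmpi.so.20")"
echo "libmpi.so.20 -> $link_target"   # prints "libmpi.so.20 -> libmpi.so.40"
```

Note that a change in the soname number normally signals an ABI change, so a binary built against libmpi.so.20 is not guaranteed to work correctly when it is handed libmpi.so.40 through such a link.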
I tried to switch back to OpenMPI v2, but when I install it I get:
[maui:16468] PMIX ERROR: UNPACK-PAST-END in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/bfrops/v12/unpack.c at line 206
What is the current bug behaviour?
I get a segmentation fault from snappyHexMesh:
cws@maui:~/Molokai/bench/run_32$ foamJob -p -s snappyHexMesh
Parallel processing using SYSTEMOPENMPI with 2 processors
Executing: /opt/openfoam1812/OpenFOAM-v1812/bin/mpirun -np 2 -hostfile hostfile -x FOAM_SETTINGS /opt/openfoam1812/OpenFOAM-v1812/bin/foamExec snappyHexMesh -parallel | tee log
[maui:22883] Warning: could not find environment variable "FOAM_SETTINGS"
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: oahu
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1812                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : v1812 OPENFOAM=1812
Arch : "LSB;label=32;scalar=64"
Exec : snappyHexMesh -parallel
Date : Jul 18 2019
Time : 10:43:48
Host : maui
PID : 22891
I/O : uncollated
[maui:22891:0:22891] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
0 /usr/lib/libucs.so.0(+0x1ec4c) [0x7fc7ab995c4c]
1 /usr/lib/libucs.so.0(+0x1eec4) [0x7fc7ab995ec4]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node maui exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------