solver crashes when running in SPDP mode
Summary
For the same case, when running in SPDP mode, it can run normally on some HPCs, while it crashes on others. Moreover, the number of cores used also affects whether the crash occurs or not. The general pattern is that the more cores used, the higher the probability of a crash occurring.
Steps to reproduce
"Prepare a case with a very large number of cells, and run it using more than 512 cores. Generate the mesh using snappyHexMesh in DP mode, then switch to SPDP mode to run the solver."
Example case
The occurrence of this bug is random, therefore it's not possible to provide a specific case that guarantees reproduction. The larger the case and the more cores used, the higher the probability of its occurrence.
What is the current bug behaviour?
I was running an aerodynamics simulation case on a supercomputer, with a grid of approximately 70 million cells, using 512 cores. The mesh was generated using snappyHexMesh in DP mode, then I switched to SPDP mode to run simpleFoam
, and the program crashed. The crash information is as follows:
[238] #2 ? in /lib64/libpthread.so.0
[238] #3 Foam::primitiveMeshTools::makeCellCentresAndVols(Foam::primitiveMesh const&, Foam::Field<Foam::Vector<float> > const&, Foam::Field<Foam::Vector<float> > const&, Foam::Field<Foam::Vector<float> >&, Foam::Field<float>&) at ??:?
[238] #4 Foam::stabilisedFvGeometryScheme::movePoints() at ??:?
[238] #5 Foam::fvGeometryScheme::adddictConstructorToTable<Foam::stabilisedFvGeometryScheme>::New(Foam::fvMesh const&, Foam::dictionary const&) at ??:?
[238] #6 Foam::fvGeometryScheme::New(Foam::fvMesh const&, Foam::dictionary const&, Foam::word const&) at ??:?
[238] #7 Foam::surfaceInterpolation::geometry() const at ??:?
[238] #8 Foam::fvMesh::init(bool) at ??:?
[238] #9 main at ??:?
[238] #10 __libc_start_main in /lib64/libc.so.6
[238] #11 ? at ??:?
[x5104:56977:0:56977] Caught signal 8 (Floating point exception: tkill(2) or tgkill(2))
==== backtrace (tid: 56977) ====
0 0x000000000000f4fb raise() ???:0
1 0x00000000007bcb87 Foam::primitiveMeshTools::makeCellCentresAndVols() ???:0
2 0x000000000091840a Foam::stabilisedFvGeometryScheme::movePoints() ???:0
3 0x0000000000918775 Foam::fvGeometryScheme::adddictConstructorToTable<Foam::stabilisedFvGeometryScheme>::New() ???:0
4 0x00000000008fda2a Foam::fvGeometryScheme::New() ???:0
5 0x000000000091a818 Foam::surfaceInterpolation::geometry() ???:0
6 0x000000000089900e Foam::fvMesh::init() ???:0
7 0x000000000042535b main() ???:0
8 0x0000000000022555 __libc_start_main() ???:0
9 0x000000000042a760 _start() ???:0
=================================
For the same case, if I ran simpleFoam in DP mode, no crash happened.
What is the expected correct behavior?
simpleFoam can also run normally in SPDP mode
Environment information
- OpenFOAM version : V2012
- Operating system : Centos 7.6
- Hardware info : AMD EPYC series
- Compiler : gcc 7.5
Possible fixes
From the backtrace information above, we can see that the program crashes when calling Foam::stabilisedFvGeometryScheme::movePoints()
, and Foam::primitiveMeshTools::makeCellCentresAndVols()
is called inside stabilisedFvGeometryScheme::movePoints()
. Some important code snippets are listed below.
Foam::stabilisedFvGeometryScheme::movePoints()
{
vectorField cellCentres(mesh_.nCells());
scalarField cellVolumes(mesh_.nCells());
primitiveMeshTools::makeCellCentresAndVols
(
mesh_,
faceCentres,
faceAreas,
cellCentres,
cellVolumes
);
}
void Foam::primitiveMeshTools::makeCellCentresAndVols
(
const primitiveMesh& mesh,
const vectorField& fCtrs,
const vectorField& fAreas,
vectorField& cellCtrs_s,
scalarField& cellVols_s
)
{
typedef Vector<solveScalar> solveVector;
PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s);
Field<solveScalar>& cellVols = tcellVols.ref();
PrecisionAdaptor<solveVector, vector> tcellCtrs(cellCtrs_s);
Field<solveVector>& cellCtrs = tcellCtrs.ref();
// Clear the fields for accumulation
cellCtrs = Zero;
cellVols = 0.0;
}
// PrecisionAdaptor
//- A non-const Field/List wrapper with possible data conversion
template<class Type, class InputType, template<class> class Container = Field>
class PrecisionAdaptor
:
public refPtr<Container<Type>>
{
// Private Data
//- Reference to underlying (input) data
Container<InputType>& ref_;
// Private Member Functions
//- Copy in field
void copyInput(const Container<InputType>& input, const bool copy)
{
this->reset(new Container<Type>(input.size()));
if (copy)
{
std::copy(input.cbegin(), input.cend(), this->ref().begin());
}
}
public:
//- The adapted field type
typedef Container<Type> FieldType;
// Constructors
//- Construct from Container<InputType>, copying on input if required
PrecisionAdaptor(Container<InputType>& input, const bool copy = true)
:
refPtr<Container<Type>>(),
ref_(input)
{
if (std::is_same<Type, InputType>::value)
{
// Use non-const reference directly
this->ref(reinterpret_cast<FieldType&>(ref_));
}
else
{
this->copyInput(input, copy);
}
}
//- Destructor, copy back on destroy
~PrecisionAdaptor()
{
if (this->is_pointer())
{
const FieldType& store = this->cref();
ref_.resize(store.size()); // extra safety
std::copy(store.cbegin(), store.cend(), ref_.begin());
}
}
};
In the function movePoints
, variables cellCentres
and cellVolumes
are defined, and then these two variables are passed as parameters into makeCellCentresAndVols
. In the function makeCellCentresAndVols
, the line PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s)
constructs an object of the PrecisionAdaptor
class, with cellVols_s
as the parameter for the constructor. Examining the constructor of PrecisionAdaptor
, we find that when solveScalar
and scalar
are not the same type, the copyInput
function is called, which copies the variable cellVols_s
. cellVols_s
is of scalar
type, and after copying, the resulting variable is of solveScalar
type.
The problem occurs in this copyInput
function. Because the variables cellCentres
and cellVolumes
are not given initial values when they are defined, on some hardware platforms, the initial values of these two variables are random. This leads to unpredictable behavior when copying the values of these two variables to the solveScalar
type.
There are two ways to solve this bug. One is to specify initial values when defining the variables cellCentres
and cellVolumes
in the function stabilisedFvGeometryScheme::movePoints()
.
//vectorField cellCentres(mesh_.nCells());
//scalarField cellVolumes(mesh_.nCells());
vectorField cellCentres(mesh_.nCells(), Zero);
scalarField cellVolumes(mesh_.nCells(), Zero);
The other is to cancel the copying by specifying copy=false when constructing the PrecisionAdaptor
object in primitiveMeshTools::makeCellCentresAndVols
. This copying behavior is unnecessary in makeCellCentresAndVols
, because the purpose of this function is to calculate cellCentres
and cellVolumes
. It will definitely reassign values to these two variables, so there's no need to copy their initial values.
//PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s);
PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s, false);
Field<solveScalar>& cellVols = tcellVols.ref();
//PrecisionAdaptor<solveVector, vector> tcellCtrs(cellCtrs_s);
PrecisionAdaptor<solveVector, vector> tcellCtrs(cellCtrs_s, false);
Field<solveVector>& cellCtrs = tcellCtrs.ref();