Skip to content

solver crashes when running in SPDP mode

Summary

For the same case, when running in SPDP mode, it can run normally on some HPCs, while it crashes on others. Moreover, the number of cores used also affects whether the crash occurs or not. The general pattern is that the more cores used, the higher the probability of a crash occurring.

Steps to reproduce

"Prepare a case with a very large number of cells, and run it using more than 512 cores. Generate the mesh using snappyHexMesh in DP mode, then switch to SPDP mode to run the solver."

Example case

The occurrence of this bug is random, therefore it's not possible to provide a specific case that guarantees reproduction. The larger the case and the more cores used, the higher the probability of its occurrence.

What is the current bug behaviour?

I was running an aerodynamics simulation case on a supercomputer, with a grid of approximately 70 million cells, using 512 cores. The mesh was generated using snappyHexMesh in DP mode, then I switched to SPDP mode to run simpleFoam, and the program crashed. The crash information is as follows:

[238] #2  ? in /lib64/libpthread.so.0
[238] #3  Foam::primitiveMeshTools::makeCellCentresAndVols(Foam::primitiveMesh const&, Foam::Field<Foam::Vector<float> > const&, Foam::Field<Foam::Vector<float> > const&, Foam::Field<Foam::Vector<float> >&, Foam::Field<float>&) at ??:?
[238] #4  Foam::stabilisedFvGeometryScheme::movePoints() at ??:?
[238] #5  Foam::fvGeometryScheme::adddictConstructorToTable<Foam::stabilisedFvGeometryScheme>::New(Foam::fvMesh const&, Foam::dictionary const&) at ??:?
[238] #6  Foam::fvGeometryScheme::New(Foam::fvMesh const&, Foam::dictionary const&, Foam::word const&) at ??:?
[238] #7  Foam::surfaceInterpolation::geometry() const at ??:?
[238] #8  Foam::fvMesh::init(bool) at ??:?
[238] #9  main at ??:?
[238] #10  __libc_start_main in /lib64/libc.so.6
[238] #11  ? at ??:?
[x5104:56977:0:56977] Caught signal 8 (Floating point exception: tkill(2) or tgkill(2))
==== backtrace (tid:  56977) ====
 0 0x000000000000f4fb raise()  ???:0
 1 0x00000000007bcb87 Foam::primitiveMeshTools::makeCellCentresAndVols()  ???:0
 2 0x000000000091840a Foam::stabilisedFvGeometryScheme::movePoints()  ???:0
 3 0x0000000000918775 Foam::fvGeometryScheme::adddictConstructorToTable<Foam::stabilisedFvGeometryScheme>::New()  ???:0
 4 0x00000000008fda2a Foam::fvGeometryScheme::New()  ???:0
 5 0x000000000091a818 Foam::surfaceInterpolation::geometry()  ???:0
 6 0x000000000089900e Foam::fvMesh::init()  ???:0
 7 0x000000000042535b main()  ???:0
 8 0x0000000000022555 __libc_start_main()  ???:0
 9 0x000000000042a760 _start()  ???:0
=================================

For the same case, if I ran simpleFoam in DP mode, no crash happened.

What is the expected correct behavior?

simpleFoam can also run normally in SPDP mode

Environment information

  • OpenFOAM version : V2012
  • Operating system : Centos 7.6
  • Hardware info : AMD EPYC series
  • Compiler : gcc 7.5

Possible fixes

From the backtrace information above, we can see that the program crashes when calling Foam::stabilisedFvGeometryScheme::movePoints(), and Foam::primitiveMeshTools::makeCellCentresAndVols() is called inside stabilisedFvGeometryScheme::movePoints(). Some important code snippets are listed below.

Foam::stabilisedFvGeometryScheme::movePoints()
{
        vectorField cellCentres(mesh_.nCells());
        scalarField cellVolumes(mesh_.nCells());

        primitiveMeshTools::makeCellCentresAndVols
        (
            mesh_,
            faceCentres,
            faceAreas,
            cellCentres,
            cellVolumes
        );

}


void Foam::primitiveMeshTools::makeCellCentresAndVols
(
    const primitiveMesh& mesh,
    const vectorField& fCtrs,
    const vectorField& fAreas,
    vectorField& cellCtrs_s,
    scalarField& cellVols_s
)
{
    typedef Vector<solveScalar> solveVector;

    PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s);
    Field<solveScalar>& cellVols = tcellVols.ref();
    PrecisionAdaptor<solveVector, vector> tcellCtrs(cellCtrs_s);
    Field<solveVector>& cellCtrs = tcellCtrs.ref();

    // Clear the fields for accumulation
    cellCtrs = Zero;
    cellVols = 0.0;
}

// PrecisionAdaptor

//- A non-const Field/List wrapper with possible data conversion
template<class Type, class InputType, template<class> class Container = Field>
class PrecisionAdaptor
:
    public refPtr<Container<Type>>
{
    // Private Data

        //- Reference to underlying (input) data
        Container<InputType>& ref_;


    // Private Member Functions

        //- Copy in field
        void copyInput(const Container<InputType>& input, const bool copy)
        {
            this->reset(new Container<Type>(input.size()));
            if (copy)
            {
                std::copy(input.cbegin(), input.cend(), this->ref().begin());
            }
        }


public:

    //- The adapted field type
    typedef Container<Type> FieldType;


    // Constructors

        //- Construct from Container<InputType>, copying on input if required
        PrecisionAdaptor(Container<InputType>& input, const bool copy = true)
        :
            refPtr<Container<Type>>(),
            ref_(input)
        {
            if (std::is_same<Type, InputType>::value)
            {
                // Use non-const reference directly
                this->ref(reinterpret_cast<FieldType&>(ref_));
            }
            else
            {
                this->copyInput(input, copy);
            }
        }


    //- Destructor, copy back on destroy
    ~PrecisionAdaptor()
    {
        if (this->is_pointer())
        {
            const FieldType& store = this->cref();
            ref_.resize(store.size());  // extra safety
            std::copy(store.cbegin(), store.cend(), ref_.begin());
        }
    }
};

In the function movePoints , variables cellCentres and cellVolumes are defined, and then these two variables are passed as parameters into makeCellCentresAndVols. In the function makeCellCentresAndVols, the line PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s) constructs an object of the PrecisionAdaptor class, with cellVols_s as the parameter for the constructor. Examining the constructor of PrecisionAdaptor, we find that when solveScalar and scalar are not the same type, the copyInput function is called, which copies the variable cellVols_s. cellVols_s is of scalar type, and after copying, the resulting variable is of solveScalar type.

The problem occurs in this copyInput function. Because the variables cellCentres and cellVolumes are not given initial values when they are defined, on some hardware platforms, the initial values of these two variables are random. This leads to unpredictable behavior when copying the values of these two variables to the solveScalar type.

There are two ways to solve this bug. One is to specify initial values when defining the variables cellCentres and cellVolumes in the function stabilisedFvGeometryScheme::movePoints().

//vectorField cellCentres(mesh_.nCells());
//scalarField cellVolumes(mesh_.nCells());

vectorField cellCentres(mesh_.nCells(), Zero);
scalarField cellVolumes(mesh_.nCells(), Zero);

The other is to cancel the copying by specifying copy=false when constructing the PrecisionAdaptor object in primitiveMeshTools::makeCellCentresAndVols. This copying behavior is unnecessary in makeCellCentresAndVols, because the purpose of this function is to calculate cellCentres and cellVolumes. It will definitely reassign values to these two variables, so there's no need to copy their initial values.

//PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s);
PrecisionAdaptor<solveScalar, scalar> tcellVols(cellVols_s, false);
Field<solveScalar>& cellVols = tcellVols.ref();
//PrecisionAdaptor<solveVector, vector> tcellCtrs(cellCtrs_s);
PrecisionAdaptor<solveVector, vector> tcellCtrs(cellCtrs_s, false);
Field<solveVector>& cellCtrs = tcellCtrs.ref();