Crash using ParticleCollector in parallel
Summary
When using the ParticleCollector cloud function in parallel, the simulation may crash. Whether the crash occurs depends on the number of processors used and the memory state of the computer.
Steps to reproduce
I have been able to replicate this problem on multiple computers using the splashPanel tutorial (tutorials/lagrangian/reactingParcelFoam/splashPanel). To trigger the crash, this tutorial needs to be run with at least 10 processors. The small number of cells in the wall region at this processor count is not the cause of the crash. I have also seen this problem on much bigger cases that I cannot share.
Example case
What is the current bug behaviour?
When the bug is encountered you will get an error message similar to that shown below. Note that the processor on which the problem occurs can differ between crashes.
[3] --> FOAM FATAL IO ERROR: (openfoam-2012 patch=210618)
[3] Wrong token type - expected scalar value, found on line 3: punctuation '-'
[3]
[3] file: /tmp/06_splashPLanel_parallel/processor3/0/uniform/lagrangian/reactingCloud1/reactingCloud1OutputProperties.cloudFunctionObject.particleCollector1.massTotal at line 3.
[3]
[3] From Foam::Istream& Foam::operator>>(Foam::Istream&, Foam::doubleScalar&)
[3] in file lnInclude/Scalar.C at line 154.
[3]
FOAM parallel run exiting
What is the expected correct behavior?
The correct behaviour is for the cloud function to calculate the massTotal and massFlowRate without crashing.
Environment information
- OpenFOAM version : v2106 and v2012 (older versions have not been tested)
- Operating system : Ubuntu and CentOS
- Hardware info :
- Compiler : gcc
Possible fixes
The line numbers below refer to the v2106 version of src/lagrangian/intermediate/submodels/CloudFunctionObjects/ParticleCollector/ParticleCollector.C.
The problem is caused by some processors ending up with -nan for massTotal and/or massFlowRate. These values get written to each processor's uniform/lagrangian/reactingCloud1/reactingCloud1OutputProperties dictionary, which then results in a read failure at lines 427 and 430 on the next timestep.
The reason for the NaNs is as follows. The massTotal and massFlowRate are calculated for each collector face in the loop on lines 435-459. The scalar lists allProcMass and allProcMassFlowRate are not initialised to any value at declaration, i.e. they hold whatever happens to be in the allocated memory. A gather function is then applied to these lists. After the gather, each processor only knows its own data and that of the processors below it, so only the master process has the complete list; the other processors' lists may still contain junk from the missing initialisation. The sum operations performed on lines 440 and 445 sum the local list on each processor, but only the master's list is correct. The correct values are still reported because of the Info<< usage on line 461. Lines 497 and 498 then write the massTotal and massFlowRate values into the reactingCloud1OutputProperties dictionary on every processor, which is what can trigger the read failure on the next timestep.
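A minimal sketch of the pattern described above, paraphrased from the behaviour reported here rather than copied verbatim from ParticleCollector.C (variable names follow the report; exact code may differ slightly):

    // Inside the per-face loop of ParticleCollector::write() (sketch only)
    forAll(faces_, facei)
    {
        // Sized but NOT value-initialised: on every processor the slots other
        // than [myProcNo()] hold whatever happened to be in memory.
        scalarList allProcMass(Pstream::nProcs());
        allProcMass[Pstream::myProcNo()] = massTotal_[facei];

        // gatherList only assembles the complete list on the master; the
        // other processors keep their uninitialised entries.
        Pstream::gatherList(allProcMass);

        // sum() is evaluated locally on every processor, so only the master
        // produces a meaningful value; the others can get junk or NaN.
        faceMassTotal[facei] += sum(allProcMass);

        // ... same pattern for allProcMassFlowRate / faceMassFlowRate ...
    }

    // Later, every processor writes its own (possibly junk) values back into
    // its reactingCloud1OutputProperties dictionary, which then fails to
    // parse on the next timestep.
    this->setModelProperty("massTotal", faceMassTotal);
    this->setModelProperty("massFlowRate", faceMassFlowRate);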
Proposed solution: there are a couple of ways to fix this. One would be to scatter the lists after the gather, making sure all processors hold the same information. Since it appears that only the master actually needs this information, an alternative is to initialise the allProcMass and allProcMassFlowRate lists to zero at construction, so that the sums on lines 440 and 445 no longer produce large (overflow) or NaN values on the non-master processors. Both options are sketched below.
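Hedged sketches of the two candidate fixes (illustrative, not verbatim patches); the same change would also be applied to allProcMassFlowRate:

    // Option 1: value-initialise the list so the entries this processor does
    // not own start at zero instead of arbitrary memory contents.
    scalarList allProcMass(Pstream::nProcs(), Zero);
    allProcMass[Pstream::myProcNo()] = massTotal_[facei];
    Pstream::gatherList(allProcMass);
    faceMassTotal[facei] += sum(allProcMass);

    // Option 2: keep the existing declaration but scatter after the gather,
    // so every processor holds the same complete list before summing:
    //
    //     Pstream::gatherList(allProcMass);
    //     Pstream::scatterList(allProcMass);
    //     faceMassTotal[facei] += sum(allProcMass);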