ENH: GAMG: processor agglomeration extended for all interfaces
Support for `cyclicAMI` in processorAgglomeration (inside GAMG).

Problem description
When running in parallel on many cores, the coarsest-level solver (e.g. PCG) can become the scaling bottleneck. One way around this is to agglomerate coarse-level matrices onto fewer (or even a single) processor: the processors' matrices are collected into a single, larger matrix with all inter-processor boundaries replaced by internal faces. This was not supported for other boundary types, e.g. `cyclicAMI`. When used in combination with processor agglomeration these would fail with an error when serialising the boundary, e.g.
```
[2] --> FOAM FATAL ERROR:
[2] Not implemented
[2]
[2] From virtual void Foam::cyclicAMIGAMGInterface::write(Foam::Ostream&) const
[2] in file AMIInterpolation/GAMG/interfaces/cyclicAMIGAMGInterface/cyclicAMIGAMGInterface.H at line 160.
[2]
FOAM parallel run aborting
```
Solution
The handling of boundaries was generalised to all coupled boundaries (e.g. `cyclic`, `cyclicAMI`). All implement
- writing to / constructing from a stream (i.e. serialisation), to collect all parts of a coupled boundary on the agglomerating processor;
- cloning (on the agglomerating processor) from the received parts. For `cyclicAMI` this involves assembling the local face-to-cell addressing from the individual parts and adapting the stencils accordingly (see below).
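For reference, processor agglomeration is selected per solver in `fvSolution`; a minimal sketch, assuming a GAMG pressure solve (the smoother and tolerances are illustrative; `processorAgglomerator` is the relevant keyword):

```
p
{
    solver                  GAMG;
    smoother                GaussSeidel;
    tolerance               1e-06;
    relTol                  0.01;

    // Collect all coarse-level matrices onto the master rank
    processorAgglomerator   masterCoarsest;
}
```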
Effect
A case with two 20x10x1 blocks coupled through `cyclicAMI` (decomposed into 4) was compared to a single 40x10x1 block (so using processor boundaries), using the `masterCoarsest` processor agglomeration (output from running with the `-debug-switch GAMGAgglomeration` command line option):
- processor boundaries:

```
               nCells     nInterfaces
Level  nProcs  avg   max  avg   max
-----  ------  ---   ---  ---   ---
0      4       100   100  1.5   2
1      4       50    50   1.5   2
2      1       100   100  0     0
3      1       48    48   0     0
```
The number of boundaries ('nInterfaces') becomes 0 as all processor faces become internal.
- `cyclicAMI` boundaries:

```
               nCells     nInterfaces
Level  nProcs  avg   max  avg   max
-----  ------  ---   ---  ---   ---
0      4       100   100  3     3
1      4       50    50   3     3
2      1       100   100  2     2
3      1       48    48   2     2
```
Here the number of boundaries goes from 3 to 2, since only the two `cyclicAMI` patches are preserved.
Distributed cyclicAMI
A big benefit of `cyclicAMI` is that the source and target faces do not have to reside on the same processor. This is handled internally using a distribution map:
- `AMI.srcMap()` : transfers the source-side data to the target side.
- `AMI.tgtMap()` : transfers the target-side data to the source side.
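Schematically, transferring the source-side values to the target side then looks like the following sketch (the field set-up is hypothetical; only the map call mirrors the text):

```cpp
// Local source-side values (hypothetical); srcMap() behaves like an
// OpenFOAM mapDistribute
List<scalar> fld(localSrcValues);

// After distribution, fld is sized to constructSize:
//   [ local slots | slots received from remote procs ]
AMI.srcMap().distribute(fld);

// fld can now be indexed with the target-side addressing
```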
When assembling the `cyclicAMI` interface from the various contributing processors, a large part is the assembling of the src and tgt maps. Each map consists of local data (myProcNo) followed by data from the various remote (proci != myProcNo) procs:

index | contents
---|---
0 .. localSize-1 | local
localSize .. | remote from 0
.. | remote from 1
.. | ..
.. constructSize-1 | remote from n
The data is constructed by
- starting from the local data
- using the `subMap` to indicate which elements of this local data need to go where
- making additional space to receive remote data
- using the `constructMap` to indicate where the received data slots in
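A self-contained toy version of these steps (plain C++; all names and sizes are hypothetical, this is not the OpenFOAM implementation) for one receiving rank and a single remote rank:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // My local data (receiving rank) and the data held on remote rank 1
    std::vector<double> myLocal{10, 11, 12};
    std::vector<double> remoteLocal{20, 21};

    // subMap (on rank 1): which of its local elements go to me, in order
    std::vector<int> subMap{1, 0};

    // constructMap (on my rank): where the received elements slot into
    // my assembled array; slots 0..2 already hold my local data
    std::vector<int> constructMap{3, 4};

    const std::size_t constructSize = myLocal.size() + subMap.size();
    std::vector<double> assembled(constructSize);

    // 1. start from the local data
    for (std::size_t i = 0; i < myLocal.size(); ++i)
        assembled[i] = myLocal[i];

    // 2./3. receive the remote data (here: copied directly) and slot it in
    for (std::size_t i = 0; i < subMap.size(); ++i)
        assembled[constructMap[i]] = remoteLocal[subMap[i]];

    for (double v : assembled) std::cout << v << ' ';  // 10 11 12 21 20
    std::cout << '\n';
}
```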
If we start from four processors and combine processors 0,1 into new 0 and 2,3 into new 1, the assembled layout is agglomerated:
- the local data is the agglomeration of the old local data, so the local data on new 0 is that of old 0 followed by old 1
- the remote data is sorted according to the originating 'new' processor (so new 0 agglomerates the data sent to old procs 0,1 from old procs 2,3)
- any remote data from assembled processors is removed (since it is now in the assembled local slots)

The two maps indexing the data are renumbered accordingly. In general most maps will have lots of local data and just a bit of remote data (note that this might not be optimal for `cyclicAMI` purposes, since quite likely the two sides get decomposed onto separate processors), so the new numbering is
- `startOfLocal[mapi]` : gives for map `mapi` (assumed to originate from rank `mapi`) the offset in the assembled data
- `compactMaps[mapi][index]` : gives for map `mapi` the new index for every old index
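A minimal sketch of this renumbering for the purely local blocks (names taken from the text, everything else hypothetical):

```cpp
#include <iostream>
#include <vector>

int main()
{
    // Old ranks 0 and 1 are combined into new rank 0; their local
    // data sizes (hypothetical) determine the offsets below.
    std::vector<int> localSizes{3, 2};

    // startOfLocal[mapi]: offset of old rank mapi's local data
    // inside the assembled local block
    std::vector<int> startOfLocal(localSizes.size() + 1, 0);
    for (std::size_t i = 0; i < localSizes.size(); ++i)
        startOfLocal[i + 1] = startOfLocal[i] + localSizes[i];

    // compactMaps[mapi][index]: new index for every old local index;
    // for purely local entries this is just the offset shift
    std::vector<std::vector<int>> compactMaps(localSizes.size());
    for (std::size_t i = 0; i < localSizes.size(); ++i)
        for (int k = 0; k < localSizes[i]; ++k)
            compactMaps[i].push_back(startOfLocal[i] + k);

    std::cout << compactMaps[1][0] << '\n';  // old index 0 on old rank 1 -> 3
}
```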
Notes
- `cyclicAMI` with all faces becoming local will be reset to become non-distributed, i.e. directly operating on the provided fields without any additional copying.
- `cyclicAMI` with a rotational transformation is not yet supported. This is not a fundamental limitation but requires additional rewriting of the stencils to take transformations into account.
- `processorCyclic` (a `cyclic` with owner and neighbour cells on different processors) is not yet supported. It is treated as a normal processor boundary, so will lose any transformation. Note that `processorCyclic` can be avoided by using the `patches` constraint in decomposeParDict, e.g.
```
constraints
{
    patches
    {
        //- Keep owner and neighbour on same processor for faces in patches
        //  (only makes sense for cyclic patches and cyclicAMI)
        type        preservePatches;
        patches     (cyclic);
    }
}
```
- Only `masterCoarsest` has been tested, but the code should support any other processor-agglomeration method.