
ENH: GAMG: processor agglomeration extended for all interfaces

Support for cyclicAMI in processorAgglomeration (inside GAMG)

Problem description:

When running in parallel on many cores the coarsest-level solver (e.g. PCG) might become the scaling bottleneck. One way around this is to agglomerate the coarse-level matrices onto fewer processors, or even a single one. This collects the processors' matrices into a single, larger matrix, with all the inter-processor boundaries replaced by internal faces. This was not supported for other boundary types, e.g. cyclicAMI. When such boundaries are used in combination with processor agglomeration, serialising the boundary fails with an error message, e.g.

[2] --> FOAM FATAL ERROR:
[2] Not implemented
[2]
[2]     From virtual void Foam::cyclicAMIGAMGInterface::write(Foam::Ostream&) const
[2]     in file AMIInterpolation/GAMG/interfaces/cyclicAMIGAMGInterface/cyclicAMIGAMGInterface.H at line 160.
[2]
FOAM parallel run aborting

Solution

The handling of boundaries was generalised to all coupled boundaries (e.g. cyclic, cyclicAMI). All coupled boundaries now implement the following (a minimal sketch of the pattern follows the list):

  • writing to / constructing from a stream (i.e. serialisation), so that all parts of a coupled boundary can be collected on the agglomerating processor.
  • cloning (on the agglomerating processor) from the received parts. For cyclicAMI this involves assembling the local face-to-cell addressing from the individual parts and adapting the stencils accordingly (see below).
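
The serialise/clone pattern can be summarised as follows; the class and method names below are illustrative stand-ins, not the actual OpenFOAM GAMGInterface declarations:

// Minimal sketch of the serialise/clone pattern described above. The class
// and method names are illustrative, not the actual OpenFOAM declarations.
#include <memory>
#include <ostream>
#include <vector>

class coupledGAMGInterface
{
public:
    virtual ~coupledGAMGInterface() = default;

    // Serialisation: write everything needed to rebuild this part of the
    // interface (face-cell addressing, weights, ...) so it can be sent to
    // the agglomerating processor
    virtual void write(std::ostream& os) const = 0;

    // Cloning: on the agglomerating processor, assemble a single interface
    // from the parts received from the contributing processors
    virtual std::unique_ptr<coupledGAMGInterface> clone
    (
        const std::vector<const coupledGAMGInterface*>& receivedParts
    ) const = 0;
};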

Effect

A case with two 20x10x1 blocks coupled through cyclicAMI (decomposed onto 4 processors) was compared to a single 40x10x1 block (i.e. using processor boundaries), both using the masterCoarsest processor agglomeration (output from running with the -debug-switch GAMGAgglomeration command line option):

  • processor boundaries:
                              nCells       nInterfaces
   Level  nProcs         avg     max       avg     max
   -----  ------         ---     ---       ---     ---
       0       4         100     100       1.5       2
       1       4          50      50       1.5       2
       2       1         100     100         0       0
       3       1          48      48         0       0

The number of boundaries ('nInterfaces') becomes 0 as all processor faces become internal.

  • cyclicAMI boundaries:
                              nCells         nInterfaces
   Level  nProcs         avg     max         avg     max
   -----  ------         ---     ---         ---     ---
       0       4         100     100           3       3
       1       4          50      50           3       3
       2       1         100     100           2       2
       3       1          48      48           2       2

Here the number of boundaries goes from 3 to 2 since only the two cyclicAMI boundaries are preserved.
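
For reference, a minimal sketch of how a GAMG solver with masterCoarsest processor agglomeration might be configured in fvSolution; the keyword names (in particular processorAgglomerator) are assumed here and may differ between OpenFOAM versions:

p
{
    solver                  GAMG;
    smoother                GaussSeidel;
    tolerance               1e-6;
    relTol                  0.01;

    // Collect coarse-level matrices onto the master processor
    processorAgglomerator   masterCoarsest;
}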

Distributed cyclicAMI

A big benefit of cyclicAMI is that the source and target faces do not have to reside on the same processor. This is handled internally using a distribution map:

  • AMI.srcMap() : transfers the source-side data to the target side.
  • AMI.tgtMap() : transfers the target-side data to the source side.

When assembling the cyclicAMI interface from the various contributing processors, a large part of the work is assembling the src and tgt maps. Each map consists of local data (from myProcNo) followed by data from the various remote processors (proci != myProcNo):

   index                    contents
   -----                    --------
   0 .. localSize-1         local data
   localSize ..             remote data from proc 0
     ..                     remote data from proc 1
     ..                     ..
   .. constructSize-1       remote data from proc n

The data is constructed by (see the sketch after this list):

  • starting from the local data
  • using the subMap to indicate which elements of this local data need to go where
  • making additional space to receive remote data
  • using the constructMap to indicate where the received data slots in
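
A single-process sketch of the layout and construction described above, using plain std::vector stand-ins rather than OpenFOAM's mapDistribute (the numbers are made up for illustration):

// Single-process illustration of the local + remote layout and of the
// constructMap mechanism. Plain C++, not the OpenFOAM mapDistribute API.
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // Local data on this rank (slots 0 .. localSize-1 of the assembled array)
    std::vector<double> localData{10.0, 11.0, 12.0};

    // Data that other ranks would send us; simulated in-process here.
    // The remote side used its subMap to select which of *its* local
    // elements to send; we only see the result of that selection.
    std::vector<std::vector<double>> received
    {
        {20.0, 21.0},   // from rank 0
        {30.0}          // from rank 1
    };

    // constructMap[proci][i] : slot in the assembled array for the i-th
    // element received from proci. Local data occupies slots 0..localSize-1,
    // remote data fills the remaining slots up to constructSize.
    std::vector<std::vector<int>> constructMap
    {
        {3, 4},         // from rank 0
        {5}             // from rank 1
    };
    const int constructSize = 6;

    // Assemble: local data first, then slot in the received remote data
    std::vector<double> assembled(constructSize, 0.0);
    for (std::size_t i = 0; i < localData.size(); ++i)
    {
        assembled[i] = localData[i];
    }
    for (std::size_t proci = 0; proci < received.size(); ++proci)
    {
        for (std::size_t i = 0; i < received[proci].size(); ++i)
        {
            assembled[constructMap[proci][i]] = received[proci][i];
        }
    }

    for (double v : assembled) std::cout << v << ' ';
    std::cout << '\n';   // prints: 10 11 12 20 21 30
}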

If we start from four processors and combine processors 0,1 into new processor 0 and processors 2,3 into new processor 1, the assembled layout is agglomerated as follows:

  • the local data is the concatenation of the individual local data, so the local data on new0 is that of old0 followed by old1
  • the remote data is sorted according to the originating 'new' processor (so new0 agglomerates the data sent to old procs 0,1 from old procs 2,3)
  • any remote data from the assembled processors is removed (since it now sits in the assembled local slots)

The two maps indexing the data are renumbered accordingly. In general most maps will have lots of local data and only a little remote data (note that this might not be optimal for cyclicAMI, since the two sides quite likely get decomposed onto separate processors). The new numbering is described by (see the sketch after this list):

  • startOfLocal[mapi] : gives, for map mapi (assumed to originate from rank mapi), the offset of its local data in the assembled data
  • compactMaps[mapi][index] : gives, for map mapi, the new index for every old index
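
A minimal sketch of this renumbering for the local parts of the merged maps; illustrative only (the actual code also renumbers and prunes the remote slots, as described above):

// Compute startOfLocal and compactMaps for the local data of the maps
// that are agglomerated onto one new rank. Plain C++ illustration.
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // Local sizes of the maps contributed by the old ranks agglomerated
    // onto this new rank (e.g. old0 and old1 -> new0)
    const std::vector<int> localSizes{3, 2};

    // startOfLocal[mapi] : offset of map mapi's local data in the assembled data
    std::vector<int> startOfLocal(localSizes.size(), 0);
    for (std::size_t mapi = 1; mapi < localSizes.size(); ++mapi)
    {
        startOfLocal[mapi] = startOfLocal[mapi - 1] + localSizes[mapi - 1];
    }

    // compactMaps[mapi][index] : new index in the assembled data for every
    // old (per-map) local index
    std::vector<std::vector<int>> compactMaps(localSizes.size());
    for (std::size_t mapi = 0; mapi < localSizes.size(); ++mapi)
    {
        compactMaps[mapi].resize(localSizes[mapi]);
        for (int i = 0; i < localSizes[mapi]; ++i)
        {
            compactMaps[mapi][i] = startOfLocal[mapi] + i;
        }
    }

    // Old local index 1 of the map from old rank 1 now sits at assembled
    // index 3 + 1 = 4
    std::cout << compactMaps[1][1] << '\n';
}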

Notes

  • cyclicAMI with all faces becoming local will be reset to become non-distributed, i.e. it operates directly on the provided fields without any additional copying.
  • cyclicAMI with a rotational transformation is not yet supported. This is not a fundamental limitation but requires additional rewriting of the stencils to take the transformations into account.
  • processorCyclic (a cyclic with owner and neighbour cells on different processors) is not yet supported. It is treated as a normal processor boundary and so will lose any transformation. Note that processorCyclic can be avoided by using the patches constraint in decomposeParDict, e.g.
constraints
{
    patches
    {
        //- Keep owner and neighbour on same processor for faces in patches
        //  (only makes sense for cyclic patches and cyclicAMI)
        type    preservePatches;
        patches (cyclic);
    }
}
  • only masterCoarsest has been tested but the code should support any other processor-agglomeration method.
