# Optimization of the ldu Matrix Solvers

Merge request to merge optimizationBranch into develop

## 1 Goal

The goal of this change is to improve the computational efficiency of the sparse solvers in OpenFOAM and to reduce their communication overhead.

The modifications fall into three categories:

• remove redundant code in symGaussSeidelSmoother.C;
• fuse “ACf = Ax” with “coarseSources -= ACf” into a single “coarseSources -= Ax” step in GAMGSolverSolve.C;
• reduce the number of “Allreduce” calls in PCG.C.

## 2 Proposed Change and Results

### 2.1 symGaussSeidelSmoother.C

The symGaussSeidel smoother runs the following algorithm:

• Forward: $x_i^{k+1}=\frac{1}{a_{ii}}\left(b_i-\sum_{j=1}^{i-1}a_{ij}x_j^{k+1}-\sum_{j=i+1}^{n}a_{ij}x_j^{k}\right)$
• Backward: $x_i^{k+2}=\frac{1}{a_{ii}}\left(b_i-\sum_{j=1}^{i-1}a_{ij}x_j^{k+1}-\sum_{j=i+1}^{n}a_{ij}x_j^{k+2}\right)$

The realization in OpenFOAM is as follows:

We can see that the last step, i.e. the update $bPrime_j = bPrime_j - a_{ij}\,x_i^{k+2}$ for $j > i$, is redundant: those entries of bPrime are never read again after the backward sweep finishes. Mapping this to the code, the “distribute neighbour side …” step can be removed as long as, for all facei > ownStartPtr[celli], uPtr[facei] > celli.
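The two sweeps can be sketched on a small dense matrix as follows. This is illustrative only: it mirrors the formulas above, not the OpenFOAM ldu addressing, and the `Mat`/`Vec` types and function names are assumptions for the sketch.

```cpp
#include <array>
#include <cmath>

using Mat = std::array<std::array<double, 3>, 3>;
using Vec = std::array<double, 3>;

// One symmetric Gauss-Seidel sweep: forward pass then backward pass.
void symGaussSeidelSweep(const Mat& A, const Vec& b, Vec& x)
{
    const int n = 3;

    // Forward: x_i uses updated x_j (j < i) and old x_j (j > i).
    for (int i = 0; i < n; ++i)
    {
        double sum = b[i];
        for (int j = 0; j < n; ++j)
        {
            if (j != i) sum -= A[i][j]*x[j];
        }
        x[i] = sum/A[i][i];
    }

    // Backward: x_i uses x_j (j < i) from the forward pass and
    // x_j (j > i) already updated in this backward pass. Note that no
    // trailing "distribute to neighbours" pass follows: in the ldu
    // realization that final bPrime update writes values that are never
    // read again, which is the redundancy removed by this change.
    for (int i = n - 1; i >= 0; --i)
    {
        double sum = b[i];
        for (int j = 0; j < n; ++j)
        {
            if (j != i) sum -= A[i][j]*x[j];
        }
        x[i] = sum/A[i][i];
    }
}

// ||b - A x||_2, used to check that the sweep actually smooths.
double residualNorm(const Mat& A, const Vec& b, const Vec& x)
{
    double r2 = 0.0;
    for (int i = 0; i < 3; ++i)
    {
        double ri = b[i];
        for (int j = 0; j < 3; ++j) ri -= A[i][j]*x[j];
        r2 += ri*ri;
    }
    return std::sqrt(r2);
}
```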

We tested the modification on a 256*512 cavity case. The results show that the symGS operation is about 5% faster, and the solution remains identical.

• Original: total 28.19 s, symGS time = 28.19 * 62.32% = 17.57 s
• Modified: total 27.39 s, symGS time = 27.39 * 61.16% = 16.75 s

### 2.2 GAMGSolverSolve.C

The original code writes “ACf = Ax” followed by “coarseSources -= ACf”. We can fuse them into a single “coarseSources -= Ax” step. Furthermore, “ACf” is not used at this point, so its declaration can be moved to where it is first needed; the declaration itself can also be simplified, which is done here.

The fusion and the relocated declaration improve performance (not by much, but better than nothing) and reduce the number of kernels used.
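The fusion can be sketched as follows, using a plain dense matrix-vector product in place of the ldu `Amul`. The names `ACf` and `coarseSources` mirror GAMGSolverSolve.C, but the code below is an illustration, not the OpenFOAM implementation.

```cpp
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Unfused: two passes over memory plus a temporary ACf.
void unfused(const Mat& A, const Vec& x, Vec& coarseSources)
{
    Vec ACf(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)          // ACf = A*x
    {
        for (std::size_t j = 0; j < x.size(); ++j)
        {
            ACf[i] += A[i][j]*x[j];
        }
    }
    for (std::size_t i = 0; i < coarseSources.size(); ++i)
    {
        coarseSources[i] -= ACf[i];                     // coarseSources -= ACf
    }
}

// Fused: one pass, no temporary, so fewer kernels and less memory traffic.
void fused(const Mat& A, const Vec& x, Vec& coarseSources)
{
    for (std::size_t i = 0; i < A.size(); ++i)
    {
        double Axi = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j)
        {
            Axi += A[i][j]*x[j];
        }
        coarseSources[i] -= Axi;                        // coarseSources -= A*x
    }
}
```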

### 2.3 PCG.C

The original PCG has three “Allreduce” procedures per iteration, which are the most time-consuming part when running large-scale systems. Note that the last “Allreduce” is used to compute the residual for the convergence check. We can therefore move it into the next iteration and combine it with the first “Allreduce”. The figure illustrates this change.

The proposed change removes iterNo of the original 3 * iterNo “Allreduce” calls (one third) and saves iterNo loads of the “r” variable, at the cost of one extra preconditioning step (iterNo: the number of iterations).
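The combined reduction can be sketched as follows: the scalar product (r, z) and the residual sum are accumulated in one pass, so in the parallel code both values can travel in a single “Allreduce” of a two-element array (e.g. one MPI_Allreduce on a double[2]) instead of two separate calls. The names and the use of the 1-norm here are assumptions for the sketch, not the exact PCG.C code.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// Both values that previously required separate global reductions.
struct FusedSums
{
    double rDotZ;     // needed to form beta (and then alpha) in PCG
    double residual;  // convergence check, piggy-backed on the same pass
};

// Accumulate (r, z) and sum|r_i| in a single loop; in parallel the two
// partial sums would then be combined in one Allreduce of length 2.
FusedSums fusedReduction(const Vec& r, const Vec& z)
{
    FusedSums s{0.0, 0.0};
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        s.rDotZ    += r[i]*z[i];
        s.residual += std::fabs(r[i]);
    }
    return s;  // one combined reduction instead of two
}
```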

The proposed change will improve performance when:

• running PCG as the solver for the coarsest mesh;
• running PCG with the DIC/diagonal/none preconditioners (tested with a 512*512 cavity case on a Kunpeng 920 with 128 cores, DIC-preconditioned PCG):
  • Original: total 2.46 s, shared-memory MPI = 2.46 * 33.92% = 0.83 s
  • Modified: total 2.11 s, shared-memory MPI = 2.11 * 28.38% = 0.60 s
• running PCG with a GAMG preconditioner on many cores or with many iterations.