# Significant DILU preconditioner speedup

### Functionality to add/problem to solve

The current implementation of the DILU preconditioner uses the losort addressing for the forward sweep; see line 129 of DILUPreconditioner.C:

```
for (label face=0; face<nFaces; face++)
{
    const label sface = losortPtr[face];

    wAPtr[uPtr[sface]] -=
        rDPtr[uPtr[sface]]*lowerPtr[sface]*wAPtr[lPtr[sface]];
}
```

To understand the proposed change, consider the following 4x4 matrix and its LDU representation. For illustrative purposes, the matrix is dense:

```
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix} =
\begin{bmatrix}
& & & \\
a_{10} & & & \\
a_{20} & a_{21} & & \\
a_{30} & a_{31} & a_{32} &
\end{bmatrix} +
\begin{bmatrix}
a_{00} & & & \\
& a_{11} & & \\
& & a_{22} & \\
& & & a_{33}
\end{bmatrix} +
\begin{bmatrix}
& a_{01} & a_{02} & a_{03} \\
& & a_{12} & a_{13} \\
& & & a_{23} \\
& & &
\end{bmatrix}
```

The current code traverses the coefficients of the lower matrix (`L`) in sorted order, i.e., `a_{10}, a_{20}, a_{21}, a_{30}, a_{31}, a_{32}`, and adds their contribution to the **row** that they are part of:

```
wA[1] -= rD[1]*lower[0]*wA[0];
// done calculating wA[1]
wA[2] -= rD[2]*lower[1]*wA[0];
wA[2] -= rD[2]*lower[3]*wA[1];
// done calculating wA[2]
wA[3] -= rD[3]*lower[2]*wA[0];
wA[3] -= rD[3]*lower[4]*wA[1];
wA[3] -= rD[3]*lower[5]*wA[2];
// done calculating wA[3]
```

In other words, the current code calculates `wA` row by row. This is not optimal because it requires an additional memory lookup (via `losortPtr`) and, more importantly, the lower coefficients are not accessed contiguously. This is hard on the cache and leads to cache misses.

I propose the following change:

```
for (label face=0; face<nFaces; face++)
{
    wAPtr[uPtr[face]] -=
        rDPtr[uPtr[face]]*lowerPtr[face]*wAPtr[lPtr[face]];
}
```

This code traverses the faces in their native storage order, which is the sorted order of the upper matrix (`U`), i.e., `a_{01}, a_{02}, a_{03}, a_{12}, a_{13}, a_{23}`. Equivalently, it visits the lower coefficients contiguously, grouped by the **column** that they are part of, and adds each one's contribution as soon as it is visited:

```
wA[1] -= rD[1]*lower[0]*wA[0];
// done calculating wA[1]
wA[2] -= rD[2]*lower[1]*wA[0];
wA[3] -= rD[3]*lower[2]*wA[0];
wA[2] -= rD[2]*lower[3]*wA[1];
// done calculating wA[2]
wA[3] -= rD[3]*lower[4]*wA[1];
wA[3] -= rD[3]*lower[5]*wA[2];
// done calculating wA[3]
```

In other words, after the algorithm finishes the calculation of a row, it distributes the contribution of the finished row to other rows.

From the perspective of the algorithm, it is important that no row of `wA` is used before it is fully calculated. This is the case in both implementations. The only difference is the order in which the coefficients are traversed. The result is numerically identical.

I ran a quick benchmark on the `scalarTransportFoam/pitzDaily` test case. On average, the preconditioner (forward + backward sweep) is around 31% faster.
As an interesting note, the DILU smoother already uses the proposed implementation.

### Target audience

Users of the DILU preconditioner.

### Proposal

See above.

### What does success look like, and how can we measure that?

N/A