Significant DILU preconditioner speedup
Functionality to add/problem to solve
The current implementation of the DILU preconditioner uses the losort addressing (the faces reordered by their upper, i.e. row, index) for the forward sweep. See line 129 of DILUPreconditioner.C:
for (label face=0; face<nFaces; face++)
{
    const label sface = losortPtr[face];
    wAPtr[uPtr[sface]] -=
        rDPtr[uPtr[sface]]*lowerPtr[sface]*wAPtr[lPtr[sface]];
}
To understand the proposed change, consider the following 4x4 matrix and its LDU representation. For illustrative purposes, the matrix is dense:
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix} =
\begin{bmatrix}
& & & \\
a_{10} & & & \\
a_{20} & a_{21} & & \\
a_{30} & a_{31} & a_{32} &
\end{bmatrix} +
\begin{bmatrix}
a_{00} & & & \\
& a_{11} & & \\
& & a_{22} & \\
& & & a_{33}
\end{bmatrix} +
\begin{bmatrix}
& a_{01} & a_{02} & a_{03} \\
& & a_{12} & a_{13} \\
& & & a_{23} \\
& & &
\end{bmatrix}
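For reference, here is a minimal sketch of the LDU addressing this example corresponds to, assuming OpenFOAM's upper-triangular face ordering (faces sorted by their lower, i.e. owner, index first and then by their upper index). The array names mirror the pointers used in the snippets; the concrete values are illustrative only:

// Faces (off-diagonal positions) in storage order: (0,1) (0,2) (0,3) (1,2) (1,3) (2,3)
const int lPtr[6]      = {0, 0, 0, 1, 1, 2};    // lower (owner) cell of each face
const int uPtr[6]      = {1, 2, 3, 2, 3, 3};    // upper (neighbour) cell of each face
const int losortPtr[6] = {0, 1, 3, 2, 4, 5};    // faces reordered by their upper (row) index

// lowerPtr[face] stores a_{uPtr[face], lPtr[face]}:  a_{10}, a_{20}, a_{30}, a_{21}, a_{31}, a_{32}
// upperPtr[face] stores a_{lPtr[face], uPtr[face]}:  a_{01}, a_{02}, a_{03}, a_{12}, a_{13}, a_{23}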
The current code traverses the coefficients of the lower matrix (L) in losort order, i.e. sorted by row: a_{10}, a_{20}, a_{21}, a_{30}, a_{31}, a_{32}, and adds each contribution to the row it belongs to:
wA[1] -= rD[1]*lower[0]*wA[0];
// done calculating wA[1]
wA[2] -= rD[2]*lower[1]*wA[0];
wA[2] -= rD[2]*lower[3]*wA[1];
// done calculating wA[2]
wA[3] -= rD[3]*lower[2]*wA[0];
wA[3] -= rD[3]*lower[4]*wA[1];
wA[3] -= rD[3]*lower[5]*wA[2];
// done calculating wA[3]
In other words, the current code calculates wA row by row. This is not optimal because it requires an additional memory lookup (via losortPtr) and, more importantly, the lower coefficients are not accessed in a contiguous manner. This is hard on the cache and leads to cache misses.
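As a small illustration of the indirection, the following standalone snippet (plain C++, not OpenFOAM code; the losort values are those of the 4x4 example above) prints the order in which the stored lower coefficients are read through losortPtr, next to a plain contiguous pass over the same array:

#include <cstdio>

int main()
{
    // losort permutation of the 4x4 example: faces reordered by their upper (row) index
    const int losortPtr[6] = {0, 1, 3, 2, 4, 5};
    const int nFaces = 6;

    std::printf("losort order    : ");
    for (int face = 0; face < nFaces; face++)
    {
        std::printf("lower[%d] ", losortPtr[face]);  // extra lookup, non-contiguous reads
    }

    std::printf("\ncontiguous order: ");
    for (int face = 0; face < nFaces; face++)
    {
        std::printf("lower[%d] ", face);             // direct, contiguous reads
    }
    std::printf("\n");

    return 0;
}

On this tiny example the difference is minor; the point is that on a real mesh the losort permutation is far from sequential, whereas face order walks the lower array linearly.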
I propose the following change:
for (label face=0; face<nFaces; face++)
{
    wAPtr[uPtr[face]] -=
        rDPtr[uPtr[face]]*lowerPtr[face]*wAPtr[lPtr[face]];
}
This code traverses the coefficients of the upper matrix (U) in sorted order, i.e. a_{01}, a_{02}, a_{03}, a_{12}, a_{13}, a_{23} (equivalently, it reads the lower coefficients in their contiguous storage order), and adds their contribution to the column that they are part of:
wA[1] -= rD[1]*lower[0]*wA[0];
// done calculating wA[1]
wA[2] -= rD[2]*lower[1]*wA[0];
wA[3] -= rD[3]*lower[2]*wA[0];
wA[2] -= rD[2]*lower[3]*wA[1];
// done calculating wA[2]
wA[3] -= rD[3]*lower[4]*wA[1];
wA[3] -= rD[3]*lower[5]*wA[2];
// done calculating wA[3]
In other words, after the algorithm finishes the calculation of a row, it distributes the contribution of the finished row to other rows.
From the perspective of the algorithm, it is important that no row of wA is used before it is fully calculated. This is the case in both implementations: because the faces are stored in upper-triangular order, every face that contributes to wA[l] (upper index l) appears before any face that reads wA[l] (lower index l). The only difference is the order in which the coefficients are traversed. The result is numerically identical.
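As a sanity check, here is a standalone sketch (plain C++ with arbitrary coefficient values, independent of OpenFOAM) that runs the forward sweep of the 4x4 example in both orderings and compares the results:

#include <cstdio>

int main()
{
    const int nFaces = 6;
    const int lPtr[6]      = {0, 0, 0, 1, 1, 2};   // lower (owner) cell per face
    const int uPtr[6]      = {1, 2, 3, 2, 3, 3};   // upper (neighbour) cell per face
    const int losortPtr[6] = {0, 1, 3, 2, 4, 5};   // faces sorted by upper index

    const double lower[6] = {-1.0, -2.0, -0.5, -3.0, -1.5, -2.5};  // arbitrary L coefficients
    const double rD[4]    = {0.5, 0.25, 0.2, 0.1};                 // arbitrary reciprocal diagonal

    double wA1[4] = {1.0, 2.0, 3.0, 4.0};   // forward sweep, current (losort) ordering
    double wA2[4] = {1.0, 2.0, 3.0, 4.0};   // forward sweep, proposed (face) ordering

    for (int face = 0; face < nFaces; face++)
    {
        const int sface = losortPtr[face];
        wA1[uPtr[sface]] -= rD[uPtr[sface]]*lower[sface]*wA1[lPtr[sface]];
    }

    for (int face = 0; face < nFaces; face++)
    {
        wA2[uPtr[face]] -= rD[uPtr[face]]*lower[face]*wA2[lPtr[face]];
    }

    for (int cell = 0; cell < 4; cell++)
    {
        std::printf("wA[%d]: losort = %.17g, face = %.17g, diff = %g\n",
                    cell, wA1[cell], wA2[cell], wA1[cell] - wA2[cell]);
    }

    return 0;
}

On this example both orderings accumulate the per-row contributions in the same sequence, so the two sweeps agree to the last bit.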
I ran a quick benchmark on the scalarTransportFoam/pitzDaily test case. On average, the preconditioner (forward + backward sweep) is around 31% faster.
As an interesting side note, the DILU smoother already uses the proposed implementation.
Target audience
Users of the DILU preconditioner.
Proposal
See above.
What does success look like, and how can we measure that?
N/A