Missing header file for amgxwrapper
Hi!
I’ve been successfully running GPU-accelerated solvers via petsc4foam for three months now and decided to give the amgxwrapper branch of this project a try to see if there are any performance gains over PETSc.
The amgxwrapper compilation starts fine on OF2012, but at some point the compiler reports that the AmgXCSRMatrix.H file is missing. However, I cannot seem to locate this file anywhere.
Has this AmgXCSRMatrix.H header file been made publicly available?
Activity
- Developer
Yes, for licensing reasons it is hosted in another public repo:
https://gitlab.hpc.cineca.it/openfoam/foam2csr
Keep in mind that it is still at a development stage; we recently updated the APIs of foam2csr.
Regarding documentation, you can find some presentations made by NVIDIA online. I will provide compilation instructions as soon as possible.
- Developer
@WinstonMechanics We recently merged an improved, GPU-resident assembly routine into master that should work with the latest PETSc release (3.15). Did you try it? Any feedback?
- Developer
Sorry @szampini, but is this improvement merged into the master branch of PETSc or petsc4foam?
- Developer
Support for the GPU-resident assembly in PETSc (available starting from 3.15) has been merged into 'develop' of this repo (cc029f24).
- Author
Many thanks! I'll take a look at the foam2csr library.
This new master-branch matrix assembly routine seems promising. I'm currently running PETSc 3.14.5; I'll update to 3.15 shortly and run some tests!
- Developer
When working with GPUs, I suggest you use the development version of PETSc to be up to date with the most recent contributions.
- Author
Thanks for the tip @szampini, I'll compile the latest versions of PETSc4Foam and PETSc today.
- Author
The newer matrix assembly routine seems very well optimised now. According to PETSc profiling, the conversion overhead dropped from around 3% to around 0.5%-0.7% of total run time, which is great!
The latest development version of PETSc (main branch) consumes a lot more GPU memory than v3.14.5. The memory consumption has more than doubled; is this expected?
Edit: I forgot to add that the increased memory consumption can be seen even with the basic PETSc tutorials, e.g. ksp/ex2, so it is most likely caused by PETSc itself.
- Developer
Thanks for reporting this. Can you be more specific so that I can take a look?
- Author
I'll send some logs tomorrow which will show the solver and the compilation settings for more details.
- Author
Attached you will find logs produced by two different PETSc versions (v3.14.5 and the latest one from main branch) and a snapshot of peak memory consumption through nvidia-smi. These are obtained by running the tutorial in src/ksp/ksp/tutorials/ex2 with the command
mpirun -np 4 ./ex2 -n 1500 -m 1500 -ksp_type cg -pc_type gamg -mat_type mpiaijcusparse -vec_type mpicuda -log_view
PETSc_main.log memory_PETSc_main.log PETSc_v3_14_5.log memory_PETSC_v3_14_5.log
This system is running on NVIDIA GP104 architecture.
The PETSc configuration can be seen from the logs and I'll happily provide any other info that's needed!
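(As a side note on how such a memory snapshot can be captured: something along the following lines, run alongside the solver, produces a comparable log. The exact nvidia-smi flags are just one possible choice, not necessarily what was used for the attached files.)
```
# sample GPU memory use once per second while the solver runs
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > memory.log
```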
- Developer
@WinstonMechanics The increase in memory is likely due to using the matrix-matrix operations on the GPU; cuSPARSE requires an insane amount of workspace memory. Can you try using -PtAP_matproduct_ptap_via scalable -matptap_via scalable?
- Author
@szampini I tried running ./ex2 with
mpirun -np 4 ./ex2 -n 1500 -m 1500 -ksp_type cg -pc_type gamg -mat_type mpiaijcusparse -vec_type mpicuda -PtAP_matproduct_ptap_via scalable -matptap_via scalable -log_view -options_left
Here are the logs for the latest main branch version of PETSc and the older v3.14.5.
PETSc_main_via_scalable.log memory_PETSc_main_via_scalable.log PETSC_v3_14_5_via_scalable.log memory_PETSC_v3_14_5_via_scalable.log
The -PtAP_matproduct_ptap_via scalable and -matptap_via scalable options were unrecognized by the newer PETSc version for some reason. The older version seems to recognize the -matptap_via scalable option.
- Developer
Sorry, try -matptap_backend_cpu and -pc_gamg_square_matmatmult_backend_cpu with main. These options should force the operations to be done on the CPU.
- Author
Many thanks! Adding those commands seemed to have the desired effect. Now peak memory consumption on main is very close to what previous versions had.
mpirun -np 4 ./ex2 -n 1500 -m 1500 -ksp_type cg -pc_type gamg -mat_type mpiaijcusparse -vec_type mpicuda -PtAP_matproduct_ptap_via scalable -matptap_via scalable -matmatmult_backend_cpu -matptap_backend_cpu -log_view -options_left
PETSc_main_cpu_backend.log memory_PETSc_main_cpu_backend.log
The -PtAP_matproduct_ptap_via scalable option is still ignored, but the performance seems good already.
- Author
@sbna I was able to get the AmgX v2.2.0 solver running using the latest AmgXWrapper + foam2csr. I need to run some more tests later this week.
Is there a good way to get AmgX to compute the scaled L1 norm that is usually used by OpenFOAM?
- Developer
Yes, but it is in a private branch of an NVIDIA developer. It will be included in the next release of AmgX. You can ask @mmartineau for access to https://github.com/mattmartineau/amgx-prerelease (branch openfoam) for testing purposes.
- Author
@mmartineau I'm running some GPU tests using AmgX + foam2csr but would like to access a version that has the OpenFOAM L1 norm built in.
Is it possible to get testing access for user https://github.com/WinstonMechanics to the amgx-prerelease repo that was mentioned by @sbna?
- Author
Thanks for the AmgX prerelease access! @mmartineau @sbna @szampini
Here are some test results gathered from several runs using AmgX Prerelease and PETSc 3.15.2 main (with the CPU backend options) with OpenFOAM v2012 solvers on icoFoam and simpleFoam test cases. Only the pressure equation has been accelerated on the GPU, as this seemed to give the best overall performance. I also included results from the PETSc 3.14.6 version that uses the older matrix assembly routine.
Speedups over 1x on the graph mean that the total solution time was faster than with the foam-amg-pcg solver; less than 1x means that the solver did not outperform the foam-amg-pcg solver.
The AmgX solver does indeed give a nice speedup over PETSc on a mid-spec desktop machine, and it's faster than the foam-AMG-PCG solver out of the box for pretty much all domain sizes ranging from 0.1M to 10M cells. This is different from PETSc, which starts to outperform the stock solvers only once the domain size exceeds about 1M cells. The scaling of the AmgX solver is fairly close to PETSc's.
There doesn't seem to be much difference in matrix conversion overhead between AmgX and the newest PETSc matrix assembly routine; they both nicely reduce the matrix conversion costs to very small levels compared to the older implementation. The matrix conversion overhead is calculated as the total time spent in matrix conversion divided by the total solution time of each run.
The icoFoam results are obtained from the HPC Lid_driven_cavity-3d benchmark, in which the cell count is scaled and the case is run for 100 timesteps at CFL 0.5. Similarly, the simpleFoam results are from the pitzDaily tutorial, which has been scaled up, with the pressure equation relTol set to 0.01, tolerance to 1e-18, and run for 100 timesteps.
The PETSc solver setup used in these runs is essentially the same as https://develop.openfoam.com/modules/external-solver/-/blob/develop/tutorials/basic/laplacianFoam/pipeOneD/system/fvSolution-petsc-gamg-device, apart from using a value of around 0.01 for pc_gamg_threshold and 4 smoother iterations. The foam-amg-pcg setup was taken from https://develop.openfoam.com/committees/hpc/-/blob/develop/Lid_driven_cavity-3d/XL/system/fvSolution.FOAM-GAMG-PCG.fixedNORM. The AmgX setup is similar to https://github.com/NVIDIA/AMGX/blob/main/core/configs/PCG_CLASSICAL_V_JACOBI.json with aggressive PMIS coarsening, more smoothing and, obviously, the L1_SCALED norm.
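Going back to the PETSc setup: expressed as plain PETSc options, the delta over the linked fvSolution-petsc-gamg-device file was roughly the following (a sketch from memory; -mg_levels_ksp_max_it is the usual knob for the smoother iteration count and may not be the exact entry used in these runs).
```
# changes relative to the fvSolution-petsc-gamg-device reference setup (sketch)
-pc_gamg_threshold 0.01        # coarsening threshold used in these runs
-mg_levels_ksp_max_it 4        # assumed knob for the 4 smoother iterations
```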
One thing that popped up was that when the CPU backend on PETSc is used, it is more prone to producing indefinite PCs. I had to use pc_gamg_reuseinterpolation false on the pitzDaily test case to obtain convergence, which results in a significant performance drop. The problem seems to somehow be related to the matptap_backend_cpu option and needs further study.
Test setup details:
OS: Ubuntu 20.04
CPU: i5-4690K
GPU: GTX 1070, Driver 465.19.01, CUDA 11.3
OpenFOAM: v2012
MPI: OpenMPI 4.1.0
- Developer
Thanks for these numbers. PETSc is expected to perform badly for small problem sizes, due to higher latencies and the many blocking stream calls (this will be improved in the next weeks). You can try -pc_gamg_cpu_pin_coarse_grids to run the coarser grids on the CPU backend, which will probably improve the results.
It would be interesting to compare the number of iterations. I assume these are sequential runs.
One thing that popped up was that when the CPU backend on PETSc is used, it is more prone to producing indefinite PCs. I had to use pc_gamg_reuseinterpolation false on the pitzDaily test case to obtain convergence, which results in a significant performance drop. The problem seems to somehow be related to the matptap_backend_cpu option and needs further study.
Can you attach the fvSolution and petscOptions file to reproduce it? I will fix it. Thanks for reporting it.
- Author
Good to hear that active development is taking place!
Here's the pitzDaily case that's giving the issue, with fvSolution and petscOptions included. There are also logs from two different runs: one with the CPU backend issue and one performing as usual when no CPU backend options are used.
- Author
I did some more testing with the suggested -pc_gamg_cpu_pin_coarse_grids option. The performance is pretty much identical to the previous numbers. The logs show very little difference apart from some additional back-and-forth CPU-GPU communication.
Should the coarse grid matrix type be manually changed to something like seqaij? The logs show that it's still set to seqaijcusparse even when the coarse grids are pinned to the CPU, which leads me to think the coarse grids are still run on the GPU; see the attached logs. These are from the lid-driven cavity flow case with around 1M cells.
Were you able to reproduce the bug I reported regarding the CPU backend?
icoFoam_cpu_pin_coarse_grids_true.log icoFoam_cpu_pin_coarse_grids_false.log
Hi @WinstonMechanics, I tried to test the performance of AmgX as a preconditioner together with the PCG solver, as mentioned in "AMGX GPU SOLVER DEVELOPMENTS FOR OPENFOAM".
I have installed AmgX, AmgXWrapper, foam2csr and petsc4foam (amgxwrapper branch) with OpenFOAM-v2012, and modified the lid-driven cavity test case with the following fvSolution file, but the amgx preconditioner is still reported as missing. fvSolution.PETSc-AMGX-PCG.fixedNORM log.icoFoam.PETSC-AMGX-PCG.gpu0_1_2_3.np4.08-05-16-37
Please let me know if there is any problem with my installation or fvSolution setting.
- Author
Hello @li12242, I haven't tried using AmgX only as a preconditioner; I'm using it as a full solver instead. You can try putting, for example,
p { solver amgx; amgx {}; tolerance 1e-06; relTol 0.1; }
into your fvSolution file and then start modifying the additional amgxpOptions configuration file to suit your needs.
One additional note: I had to add
libs ("libpetscFoam.so");
to the controlDict to get the solver to load properly. Otherwise the entry solver amgx; in fvSolution would not have been properly recognized.
I've attached a sample fvSolution file and a minimal AmgX solver configuration file from a pitzDaily case.
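For reference, laid out as they would sit in the case files, the two snippets above look roughly like this (a sketch with the usual OpenFOAM dictionary layout; the values are just the example values from above):
```
// system/controlDict (excerpt): load the petsc4foam library so that
// "solver amgx;" is recognized
libs            ("libpetscFoam.so");

// system/fvSolution (excerpt): pressure solved by AmgX; the AmgX solver
// itself is configured through the separate amgxpOptions file
solvers
{
    p
    {
        solver          amgx;
        amgx            {}
        tolerance       1e-06;
        relTol          0.1;
    }
}
```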
Hi Winston. Thank you for your help. I am getting an error while running simpleFoam in the pitzDaily case. During the pressure iteration there seems to be a problem regarding the linking of the libraries.
`Initializing PETSC
simpleFoam: symbol lookup error: .../OpenFOAM-v2206/platforms/linux64GccDPInt320pt/lib/libfoam2csr.so: undefined symbol: AMGX_initialize`
Any idea what the problem might be and how I can fix it?
- Author
Hi @cudagu,
You can try running ldd on libfoam2csr.so and see if any of the linked libraries are missing.
This is the Make/options file that has worked for me for the foam2csr compilation; maybe it is useful in your case also.
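For example (assuming the library was installed into $FOAM_LIBBIN as in the error message above; adjust the paths to your install):
```
# list the shared libraries libfoam2csr.so depends on and flag unresolved ones
ldd $FOAM_LIBBIN/libfoam2csr.so | grep "not found"

# if the AmgX shared library (libamgxsh.so in a default AmgX build) is the one
# that is missing, add its install directory to the runtime search path
export LD_LIBRARY_PATH=/path/to/amgx/lib:$LD_LIBRARY_PATH
```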
#33 (comment 58566) Hi Winston, The problem was indeed related to a shared object file that was missing. Thanks for that.
I was able to solve a 3d lid-driven cavity problem with icoFoam using some of the parts of "fvSolution" and "amgxpOptions" that you shared.
For the 100x100x100 domain size, AMGX-PCG solver with A100 80gb resulted in 4x speed up over OpenFOAM PCG solver 27cpu.
For the 200x200x200 domain size, AMGX did not offer any speed up. In fact, it was slower.
Do you expect such behavior?
Also, I see that you used the amgx solver only for the pressure equation. Is it possible to use it also for the velocity equations (and turbulence for more complex flows)? Do I need to set amgx_Options for each equation? Would you expect any improvement if Ux, Uy and Uz were solved with amgx?
Thank you for sharing your time.
- Author
Good to hear!
The solver config I shared was just a minimal example to check the installation to see if everything is working all right. It's just a Jacobi preconditioned CG solver.
For actual performance runs you can try using https://github.com/NVIDIA/AMGX/blob/main/src/configs/PCG_CLASSICAL_V_JACOBI.json as a starting point to get the advantages of multigrid. Just remember to use the option "norm": "L1_SCALED" to use the same residual norm definition as OpenFOAM. Also set "convergence": "RELATIVE_INI_CORE" or "convergence": "ABSOLUTE" depending on your case setup, and of course set the residual tolerance appropriately, e.g. "tolerance": 1e-04, to get a decent comparison against the default solver. You may also need to add "store_res_history": 1.
Usually the Jacobi-CG solver is fast for small domain sizes and gradually becomes less performant for larger matrices. The multigrid-preconditioned CG is usually the way to go for larger problems. It's a good idea to check which setup has the best scaling with respect to problem size; often multigrid has the best scaling properties.
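Putting those options together, a pressure config along these lines could serve as a starting point (a rough sketch only: the preconditioner block follows the linked PCG_CLASSICAL_V_JACOBI.json, and the exact keys and values should be checked against your AmgX version; as far as I understand, the amgxpOptions file is simply an AmgX configuration in this JSON format):
```
{
    "config_version": 2,
    "solver": {
        "solver": "PCG",
        "preconditioner": {
            "solver": "AMG",
            "algorithm": "CLASSICAL",
            "cycle": "V",
            "max_iters": 1,
            "scope": "amg"
        },
        "max_iters": 1000,
        "monitor_residual": 1,
        "store_res_history": 1,
        "norm": "L1_SCALED",
        "convergence": "RELATIVE_INI_CORE",
        "tolerance": 1e-04,
        "scope": "main"
    }
}
```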
I've had the best results GPU-accelerating only the Poisson pressure equation on hardware that has limited memory. Solving the pressure equation is usually the largest bottleneck in the solution process, so it makes sense to GPU-accelerate this portion first. If there's leftover memory then you can indeed solve other equations on the GPU: just choose the amgx solver in fvSolution and define the appropriate files, e.g. amgxUxOptions, amgxUyOptions, etc. You can see how much performance you get, but I'd expect the improvement to be more modest than with the pressure equation.
Hi @WinstonMechanics, your comments were very helpful.
I am having trouble using the L1_SCALED norm of OpenFOAM with AmgX 2.2.0. Do I also need prerelease access?
The L2 norm straight away results in a floating point error, while the L1 norm results in a Thrust error on the GPU side after a few timesteps. Did you encounter a similar problem?
I am not sure why I get the Thrust error with the L1 norm, because GPU memory never goes above 20/80 GB with the 3D cavity problem and the residuals seem fine throughout the analysis. Thanks!
- Author
I think the L1_SCALED norm was introduced into the main branch a few versions ago; it's possible version 2.2.0 didn't yet have it. I'm currently using version 2.4.0 with CUDA 12.2 without any major issues. It seems that using the L2 norm in the lid-driven flow indeed causes a crash on the 100x100x100 mesh; the L1 norm didn't result in a crash for me.
You can try updating to the newest version and see if the errors still persist. It sounds like some kind of runtime memory error is happening that could very well have been sorted out in the newest version.
Hi @WinstonMechanics, thanks so much for your information. With the configuration file "amgxpOptions", the icoFoam solver runs successfully with petsc4foam + AmgX.
- Diego Mayer mentioned in issue #33