Commit 09f834dd authored by Sergey Lesnik

Update DLRCJH description

- Use the efficiency metric instead of the confusing normalized
  time-to-solution. The efficiency is calculated as the ratio of the
  speedup to the ideal speedup.
parent 0d06b755
@@ -263,50 +263,50 @@ In the case of 3M, the MPI ranks were pinned in a uniform manner over a node for
![speedup_3M.png](figures/speedup_3M.png)
| nCores | nNodes | Ideal speedup | Speedup | Time per time step in s | Normalized time to solution in µs | Relative Standard Deviation in % |
| :----: | :----: | :-----------: | :-----: | :----------------------: | :-------------------------------: | :------------------------------: |
| 1 | 1 | 1 | 1 | 36.1 | 12 | 1.16 |
| 2 | 1 | 2 | 2.02 | 17.9 | 11.9 | 0.0563 |
| 4 | 1 | 4 | 3.88 | 9.31 | 12.4 | 0.063 |
| 8 | 1 | 8 | 7.65 | 4.73 | 12.6 | 0.0852 |
| 16 | 1 | 16 | 14.6 | 2.48 | 13.2 | 0.15 |
| 32 | 1 | 32 | 26 | 1.39 | 14.8 | 0.102 |
| 64 | 1 | 64 | 37.4 | 0.965 | 20.6 | 0.143 |
| 128 | 1 | 128 | 43.6 | 0.83 | 35.4 | 0.374 |
| 256 | 2 | 256 | 103 | 0.351 | 30 | 0.497 |
| 512 | 4 | 512 | 226 | 0.16 | 27.3 | 0.957 |
| 1024 | 8 | 1024 | 406 | 0.0891 | 30.4 | 1.05 |

| nCores | nNodes | Ideal speedup | Speedup | Time per time step in s | Efficiency | Relative Standard Deviation in % |
|:--------:|:--------:|:---------------:|:---------:|:--------------------------:|:------------:|:----------------------------------:|
| 1 | 1 | 1 | 1 | 36.1 | 1 | 1.16 |
| 2 | 1 | 2 | 2.02 | 17.9 | 1.01 | 0.0563 |
| 4 | 1 | 4 | 3.88 | 9.31 | 0.971 | 0.063 |
| 8 | 1 | 8 | 7.65 | 4.73 | 0.956 | 0.0852 |
| 16 | 1 | 16 | 14.6 | 2.48 | 0.91 | 0.15 |
| 32 | 1 | 32 | 26 | 1.39 | 0.814 | 0.102 |
| 64 | 1 | 64 | 37.4 | 0.965 | 0.585 | 0.143 |
| 128 | 1 | 128 | 43.6 | 0.83 | 0.34 | 0.374 |
| 256 | 2 | 256 | 103 | 0.351 | 0.402 | 0.497 |
| 512 | 4 | 512 | 226 | 0.16 | 0.441 | 0.957 |
| 1024 | 8 | 1024 | 406 | 0.0891 | 0.396 | 1.05 |
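
As a reference for how the Efficiency column is obtained, the following is a minimal Python sketch (not part of the original report) that recomputes the speedup and efficiency from the measured time per time step of the 3M case, assuming the single-core run as the baseline and an ideal speedup equal to nCores; small deviations from the tabulated values stem from rounding of the published times.

```python
# Minimal sketch: baseline is assumed to be the 1-core run, ideal speedup = nCores.
# Times per time step in seconds are copied from the 3M table above.
times_3m = {
    1: 36.1, 2: 17.9, 4: 9.31, 8: 4.73, 16: 2.48, 32: 1.39,
    64: 0.965, 128: 0.83, 256: 0.351, 512: 0.16, 1024: 0.0891,
}

t_ref = times_3m[1]  # reference time: single-core run
for n_cores, t in times_3m.items():
    speedup = t_ref / t             # measured speedup relative to 1 core
    efficiency = speedup / n_cores  # ratio of speedup to ideal speedup
    print(f"{n_cores:5d} cores: speedup = {speedup:6.3g}, efficiency = {efficiency:.3g}")
```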
### 24M
The behaviour of the 24M case is similar to that of the 3M case with multi-node decomposition: with a higher number of nodes, larger parts of the problem fit into the cache, resulting in superlinear scaling.
![speedup_24M.png](figures/speedup_24M.png)
| nCores | nNodes | Ideal speedup | Speedup | Time per time step in s | Normalized time to solution in µs | Relative Standard Deviation in % |
| :----: | :----: | :-----------: | :-----: | :----------------------: | :-------------------------------: | :------------------------------: |
| 128 | 1 | 1 | 1 | 11.5 | 61.2 | 0.134 |
| 256 | 2 | 2 | 2.26 | 5.07 | 54.1 | 0.148 |
| 512 | 4 | 4 | 5.4 | 2.12 | 45.3 | 0.164 |
| 1024 | 8 | 8 | 13.4 | 0.857 | 36.6 | 0.241 |
| 2048 | 16 | 16 | 30.9 | 0.372 | 31.7 | 0.963 |
| 4096 | 32 | 32 | 67.4 | 0.17 | 29.1 | 2.3 |
| 8192 | 64 | 64 | 116 | 0.0993 | 33.9 | 2.64 |
| 16384 | 128 | 128 | 148 | 0.0776 | 53 | 0.689 |

| nCores | nNodes | Ideal speedup | Speedup | Time per time step in s | Efficiency | Relative Standard Deviation in % |
|:--------:|:--------:|:---------------:|:---------:|:--------------------------:|:------------:|:----------------------------------:|
| 128 | 1 | 1 | 1 | 11.5 | 1 | 0.134 |
| 256 | 2 | 2 | 2.26 | 5.07 | 1.13 | 0.148 |
| 512 | 4 | 4 | 5.4 | 2.12 | 1.35 | 0.164 |
| 1024 | 8 | 8 | 13.4 | 0.857 | 1.68 | 0.241 |
| 2048 | 16 | 16 | 30.9 | 0.372 | 1.93 | 0.963 |
| 4096 | 32 | 32 | 67.4 | 0.17 | 2.11 | 2.3 |
| 8192 | 64 | 64 | 116 | 0.0993 | 1.81 | 2.64 |
| 16384 | 128 | 128 | 148 | 0.0776 | 1.16 | 0.689 |
### 489M
![speedup_489M_largeRunsOnly.png](figures/speedup_489M_largeRunsOnly.png)
| nCores | nNodes | Ideal speedup | Speedup | Time per time step in s | Normalized time to solution in µs | Relative Standard Deviation in % |
| :----: | :----: | :-----------: | :-----: | :----------------------: | :-------------------------------: | :------------------------------: |
| 1024 | 8 | 1 | 1 | 50.3 | 105 | 0.158 |
| 2048 | 16 | 2 | 2.54 | 19.8 | 83.1 | 0.208 |
| 4096 | 32 | 4 | 5.68 | 8.86 | 74.2 | 1.62 |
| 8192 | 64 | 8 | 16.9 | 2.98 | 50 | 0.246 |
| 16384 | 128 | 16 | 42.3 | 1.19 | 39.8 | 0.451 |
| 32768 | 256 | 32 | 97.8 | 0.515 | 34.5 | 1.1 |
| 65536 | 512 | 64 | 161 | 0.313 | 41.9 | 3.36 |
| 131072 | 1024 | 128 | 264 | 0.191 | 51.2 | 4.95 |

| nCores | nNodes | Ideal speedup | Speedup | Time per time step in s | Efficiency | Relative Standard Deviation in % |
|:--------:|:--------:|:---------------:|:---------:|:--------------------------:|:------------:|:----------------------------------:|
| 1024 | 8 | 1 | 1 | 50.3 | 1 | 0.158 |
| 2048 | 16 | 2 | 2.54 | 19.8 | 1.27 | 0.208 |
| 4096 | 32 | 4 | 5.68 | 8.86 | 1.42 | 1.62 |
| 8192 | 64 | 8 | 16.9 | 2.98 | 2.11 | 0.246 |
| 16384 | 128 | 16 | 42.3 | 1.19 | 2.64 | 0.451 |
| 32768 | 256 | 32 | 97.8 | 0.515 | 3.06 | 1.1 |
| 65536 | 512 | 64 | 161 | 0.313 | 2.52 | 3.36 |
| 131072 | 1024 | 128 | 264 | 0.191 | 2.06 | 4.95 |
### Superlinear speedup
The benchmark cases show highly superlinear speedup. A profiling analysis during the exaFOAM project identified the main cause to be the large L3 cache of modern CPUs. With an increasing number of partitions, the number of cells and thus the amount of data per core decreases. As a result, a larger portion of the data used in the calculations remains in cache for the complete duration of the computation. Since access to cache is orders of magnitude faster than access to RAM, performance increases dramatically.
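
To illustrate the mechanism with a rough back-of-the-envelope sketch (the bytes-per-cell figure and the per-core L3 share below are illustrative assumptions, not measured values from the project), one can estimate when the per-core working set of the 3M case shrinks to the order of the available cache:

```python
# Illustrative estimate only; both constants are assumptions, not project data.
N_CELLS = 3.0e6          # 3M benchmark case
BYTES_PER_CELL = 1.0e3   # assumed working set per cell (fields, matrix coefficients, ...)
L3_PER_CORE = 4.0e6      # assumed L3 cache share per core in bytes (~4 MB)

for n_cores in (1, 8, 32, 128, 512, 1024):
    working_set = N_CELLS * BYTES_PER_CELL / n_cores  # bytes per core
    print(f"{n_cores:5d} cores: ~{working_set / 1e6:7.1f} MB per core, "
          f"fits in assumed L3 share: {working_set <= L3_PER_CORE}")
```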
@@ -398,7 +398,7 @@ The resulting cost-to-solution is presented in the table below. The cost is give
| Reconstruct | 1; 600GB | – | 4.37h x 600GB / 2GB = 1311 | – |
| Total | – | – | 7843 | 252 |
The total costs of the pre- and post-processing with the collated and coherent formats are 7843 and 252 core-hours, respectively, leading to a factor of 31 improvement in cost-to-solution (CTSF=31). Note that reconstruction is often performed for more than one time step, which would give the coherent format an even larger benefit.
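
For completeness, the quoted cost-to-solution factor follows directly from the totals in the table above, as the short check below shows (the variable names are ours, not from the report).

```python
# Cost-to-solution factor (CTSF) computed from the table totals quoted above.
cost_collated_core_h = 7843  # total pre- and post-processing, collated format
cost_coherent_core_h = 252   # total pre- and post-processing, coherent format

ctsf = cost_collated_core_h / cost_coherent_core_h
print(f"CTSF = {ctsf:.1f}")  # ~31
```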
### Write Performance
A measurement campaign was set up to ensure that the write performance during the XiFoam run is not degraded when using the coherent format. The cases were set up in such a way that the writing of fields is triggered at intervals of at least 10 s of wall clock time in order to keep the stress on LUMI’s Lustre file system low and thus eliminate any impact of the file system on the measurements. Each time step written to storage consists of 32 volume fields (writing of the 3 surface fields was not supported by the implementation available at the time of benchmarking) and amounts to 148 GB of data.
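
A quick arithmetic check, derived only from the numbers quoted above rather than from additional measurements, gives the upper bound on the long-run average data rate implied by this setup:

```python
# With 148 GB written per output step and at least 10 s of wall clock time between
# writes, the long-run average data rate seen by the file system is bounded.
data_per_write_gb = 148.0
min_write_interval_s = 10.0

avg_rate_bound_gb_s = data_per_write_gb / min_write_interval_s
print(f"average write rate <= {avg_rate_bound_gb_s:.1f} GB/s")  # 14.8 GB/s
```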
@@ -425,9 +425,9 @@ Therefore, the hierarchical decomposition method is recommended in this case.
## Summary
The enhancements in the turbulence modelling algorithm and in the I/O demonstrated efficient performance on up to 131,072 cores. At high core counts, the multi-level decomposition method emerged as promising, reducing the inter-node communication bandwidth. Particularly noteworthy was the introduction of the coherent file format, which effectively addressed major pre- and post-processing bottlenecks, resulting in significantly faster decomposition times and notable cost reductions. Write performance analyses showed comparable results for the collated and coherent formats, hinting at further potential for optimization and refinement of the coherent format in the future. The following improvements were demonstrated:
- Case decomposition: up to 60 times lower time-to-solution (TTSF=60)
- Overall pre- and post-processing: 31 times lower cost-to-solution (CTSF=31)
- Solver execution: 30 times lower time-to-solution (TTSF=30)
# Acknowledgment
This application has been developed as part of the exaFOAM Project https://www.exafoam.eu, which has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 956416. The JU receives support from the European Union's Horizon 2020 research and innovation programme and France, Germany, Italy, Croatia, Spain, Greece, and Portugal.