Software Archive
Read-only legacy content

Finite Difference application speedup

Abhishek_S_
Beginner

I have a finite difference (FD) application for wave propagation. I executed it on KNC and got linear speedup up to 16 threads, sub-linear speedup from 16 up to 180 threads, and then a drop in speedup from 180 to 240 threads. Can someone explain the reason behind this behaviour?

Thanks,

Abhishek

jimdempseyatthecove
Honored Contributor III

Linear speedup up to 16 threads (scatter affinity) suggests you may then be running into a memory bandwidth limitation; a trail-off after 180 threads usually indicates hitting the throughput capacity of the vector unit's FPU pipeline. I suggest you try to optimize L1 and L2 cache utilization.
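As an illustration of the cache optimization Jim suggests, here is a minimal sketch (my addition, not from the original reply) of tiling a 2D 5-point stencil sweep so that each tile's working set stays resident in L1/L2 across the inner loops; the grid and tile sizes are illustrative assumptions to be tuned per CPU:

    #include <stddef.h>

    #define NX 4096
    #define NY 4096
    #define TI 64     /* tile height -- assumed value; tune so a tile fits in L2 */
    #define TJ 512    /* tile width  -- assumed value */

    /* One cache-blocked 5-point-stencil sweep: the ii/jj tile loops keep
       each tile's working set in cache while the inner i/j loops reuse it. */
    void stencil_blocked( const float (*in)[NY], float (*out)[NY] )
    {
        for ( size_t ii = 1; ii < NX - 1; ii += TI )
            for ( size_t jj = 1; jj < NY - 1; jj += TJ )
                for ( size_t i = ii; i < ii + TI && i < NX - 1; i++ )
                    for ( size_t j = jj; j < jj + TJ && j < NY - 1; j++ )
                        out[i][j] = 0.25f * ( in[i-1][j] + in[i+1][j]
                                            + in[i][j-1] + in[i][j+1] );
    }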

Jim Dempsey

McCalpinJohn
Honored Contributor III

It is not uncommon for high-bandwidth codes to experience performance losses when using "too many" threads.  This is primarily due to DRAM bank conflicts.

On KNC there are 16 32-bit GDDR5 channels, with a pair of 16-bit-wide GDDR5 chips on each channel.  The GDDR5 DRAM chips each have 16 banks, so there are a total of 256 DRAM banks available. (The two chips on each channel operate in lockstep, so their banks combine into a single set of wider banks instead of acting as independent banks.)

Using the STREAM benchmark, I get the best performance on KNC when using one thread per core.   For the STREAM Triad kernel, there are two read streams and one write stream for each thread, so a 61-thread job generates 183 memory access streams on the Xeon Phi SE10P.  These 183 memory access streams fit reasonably well into the 256 available banks.   When I go to 2 threads per core, the code is now generating 366 memory access streams, and these do not fit into the 256 available banks, so performance drops as the banks have to be repeatedly opened and closed to satisfy the competing/conflicting accesses.
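For reference, the Triad kernel being described is just the following loop (my sketch, consistent with the published STREAM source); each thread reads from two arrays (b and c) and writes to one (a), hence three memory access streams per thread:

    #include <stddef.h>

    #define N 10000000   /* array length -- illustrative; sized to exceed the caches */

    static double a[N], b[N], c[N];

    /* STREAM Triad: two read streams (b, c) plus one write stream (a)
       per thread when the loop is split across an OpenMP team. */
    void triad( double scalar )
    {
        #pragma omp parallel for
        for ( size_t j = 0; j < N; j++ )
            a[j] = b[j] + scalar * c[j];
    }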

The same sort of behavior occurs on mainstream Xeon processors (e.g., Xeon E5-2690 v3), and in these processors there are enough hardware performance counter events in the memory controllers to verify that too many memory access streams will cause large increases in the DRAM bank conflict rates.

Most finite difference codes generate a lot more memory access streams per thread than the STREAM benchmark.   If the number of memory access streams is too high, you may run into performance problems before you even reach one thread per core.   If I recall correctly, my preferred version of the "SWIM" 2D shallow water model reached its maximum performance on KNC with somewhere between 32 and 48 threads (spread across 32 to 48 cores).  Part of this was due to increased DRAM bank conflicts as the thread count increased and part was due to the increasing overhead of parallel synchronization for the short parallel loops handling the boundaries of the 2D grid.
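To make the stream counting concrete, here is a minimal sketch (my illustration, not Dr. McCalpin's code) of a second-order 2D wave-equation update of the kind the original poster is likely running. Per thread it reads three rows of u and one row of u_prev and writes one row of u_next, i.e. at least 5 concurrent access streams per thread versus 3 for STREAM Triad, so 240 threads would generate on the order of 1200 streams competing for the 256 banks:

    #define NX 1024
    #define NY 1024

    /* One 2D wave-equation time step, 2nd order in space. Each thread's
       slice of the i loop touches rows i-1, i, i+1 of u (3 read streams),
       one row of u_prev (1 read stream), and one row of u_next (1 write
       stream): at least 5 memory access streams per thread. */
    void wave_step( float (*u_next)[NY], const float (*u)[NY],
                    const float (*u_prev)[NY], float c2 )
    {
        #pragma omp parallel for
        for ( int i = 1; i < NX - 1; i++ )
            for ( int j = 1; j < NY - 1; j++ )
                u_next[i][j] = 2.0f * u[i][j] - u_prev[i][j]
                             + c2 * ( u[i-1][j] + u[i+1][j]
                                    + u[i][j-1] + u[i][j+1]
                                    - 4.0f * u[i][j] );
    }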

This reduction in performance with increasing thread counts is not a problem if the bandwidth obtained in the best cases is close to the expected maximum sustainable values.  For KNC, well-behaved codes using transparent huge pages could typically sustain bandwidths of over 140 GB/s using the best configurations.  STREAM did a bit better (about 175 GB/s), and some sparse matrix-vector codes (with more reads and fewer writes) delivered over 180 GB/s sustained bandwidth.

SergeyKostrov
Valued Contributor II
>>...Executed it on KNC and got linear speedup till 16 threads and after that got sub-linear speedup till 180 threads but
>>from 180 to 240 threads there was drop in speedup...

I've experienced similar performance problems on a KNL system, and this is how it looked:

[ KNL - KMP_AFFINITY=scatter - compiled for MIC-AVX512 ISA ]

[ Items: 32 ] [ OpenMP threads:   1 ] - Processing Completed in 73 secs
[ Items: 32 ] [ OpenMP threads:   2 ] - Processing Completed in 37 secs
[ Items: 32 ] [ OpenMP threads:   4 ] - Processing Completed in 19 secs
[ Items: 32 ] [ OpenMP threads:   8 ] - Processing Completed in 10 secs
[ Items: 32 ] [ OpenMP threads:  16 ] - Processing Completed in  5 secs
[ Items: 32 ] [ OpenMP threads:  32 ] - Processing Completed in  3 secs
[ Items: 32 ] [ OpenMP threads:  64 ] - Processing Completed in  1 sec
[ Items: 32 ] [ OpenMP threads: 128 ] - Processing Completed in 19 secs ( 2x oversubscription / ~19x slower than the 64-thread case )
[ Items: 32 ] [ OpenMP threads: 256 ] - Processing Completed in 23 secs ( 4x oversubscription / ~23x slower than the 64-thread case )

As you can see, performance degraded significantly when more than 64 OpenMP threads were used.
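A minimal sketch of the kind of scaling sweep that produces output like the above (my illustration; Sergey's actual harness and workload are not shown, so work() here is a hypothetical stand-in kernel):

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical stand-in for the real processing kernel: a fixed
       amount of floating-point work shared across the active threads. */
    static double work( void )
    {
        double sum = 0.0;
        #pragma omp parallel for reduction( + : sum )
        for ( long k = 0; k < 400000000L; k++ )
            sum += (double)k * 1e-9;
        return sum;
    }

    int main( void )
    {
        /* Sweep the thread count to expose the oversubscription knee. */
        for ( int n = 1; n <= 256; n *= 2 ) {
            omp_set_num_threads( n );
            double t0 = omp_get_wtime();
            double s  = work();
            printf( "[ OpenMP threads: %3d ] %.1f secs ( checksum %g )\n",
                    n, omp_get_wtime() - t0, s );
        }
        return 0;
    }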
SergeyKostrov
Valued Contributor II
On an Ivy Bridge system, processing reached a "saturation" plateau even with significantly more OpenMP threads ( the CPU has 4 cores and 2 hardware threads per core ):

[ IB - KMP_AFFINITY=scatter - compiled for AVX ISA ]

[ Items: 32 ] [ OpenMP threads:   1 ] - Processing Completed in 13 secs
[ Items: 32 ] [ OpenMP threads:   2 ] - Processing Completed in  6 secs
[ Items: 32 ] [ OpenMP threads:   4 ] - Processing Completed in  3 secs
[ Items: 32 ] [ OpenMP threads:   8 ] - Processing Completed in  3 secs ( 2x oversubscription )
[ Items: 32 ] [ OpenMP threads:  16 ] - Processing Completed in  3 secs ( 4x oversubscription )
[ Items: 32 ] [ OpenMP threads:  32 ] - Processing Completed in  3 secs ( 8x oversubscription )
[ Items: 32 ] [ OpenMP threads:  64 ] - Processing Completed in  3 secs ( 16x oversubscription )
[ Items: 32 ] [ OpenMP threads: 128 ] - Processing Completed in  4 secs ( 32x oversubscription )
[ Items: 32 ] [ OpenMP threads: 256 ] - Processing Completed in  4 secs ( 64x oversubscription )
SergeyKostrov
Valued Contributor II
In my case the processing was compute-bound ( not memory-bound ), and I solved the problem by re-implementing the main processing piece of the OpenMP code in order to optimize cache-line access and eliminate false-sharing problems. That is, I modified the code from:

    #pragma omp parallel for private( i ) num_threads( N )
    for ( i = 0; i < n; i += 1 )
    {
        // ...Processing...
    }

to:

    // Explicitly initialize a range of i-values for the N OpenMP threads
    size_t iRangeStart[ N ];
    size_t iRangeEnd[ N ];
    ...
    omp_set_num_threads( N );
    ...
    #pragma omp parallel
    {
        int num = omp_get_thread_num();
        for ( i = iRangeStart[ num ]; i < iRangeEnd[ num ]; i += 1 )
        {
            // ...processing...
        }
    }
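One way the iRangeStart/iRangeEnd tables above could be filled (a sketch under my own assumptions -- the exact partitioning in Sergey's code is not shown) is a block partition whose boundaries are rounded to whole cache lines, so that neighbouring threads never write into the same 64-byte line:

    #include <stddef.h>

    #define CACHE_LINE_BYTES 64
    #define ELEMS_PER_LINE   ( CACHE_LINE_BYTES / sizeof( float ) )   /* 16 floats */

    /* Block-partition n iterations across N threads, rounding each chunk
       down to a whole number of cache lines; the last thread absorbs the
       remainder. Assumes the data are float arrays starting on a
       cache-line boundary. */
    void make_ranges( size_t n, int N, size_t iRangeStart[], size_t iRangeEnd[] )
    {
        size_t chunk = ( n / N / ELEMS_PER_LINE ) * ELEMS_PER_LINE;
        for ( int t = 0; t < N; t++ ) {
            iRangeStart[t] = (size_t)t * chunk;
            iRangeEnd[t]   = ( t == N - 1 ) ? n : iRangeStart[t] + chunk;
        }
    }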
SergeyKostrov
Valued Contributor II
>>...@Sergey: "On an Ivy Bridge system processing had reached a "saturation" for significantly more OpenMP threads ( CPU has
>>4 cores and 4 hardware threads ):"

I've corrected the typing error in Post #5:

...On an Ivy Bridge system processing had reached a "saturation" for significantly more OpenMP threads ( CPU has 4 cores and 2 hardware threads )...

It was a typing error and it doesn't change the results of the processing.