Optimizing a nested loop using ipp

Bombey__Jasper · ‎09-09-2019

I have an application with a 3 dim nested loop where the innermost loop is performing: output += in * scalar, where i≈1000, j≈200, k≈4000 and the data type is a complex float (8 bytes). I'm running in Linux on an Intel CPU with 40 cores (20 w/ ht). The application can be configured to distribute the work in the i dimension across multiple threads.

Without using IPP the total calculation completes in: ~11.6sec for 1thread and ~0.48sec for 40 threads for a speedup of ~24x. This shows an acceptable speedup but the total time is too long.

With IPP, using ippsAddProduct_32fc() with length of k for each call, the total calculation completes in ~1.16sec for 1 thread and ~0.27sec for 40 threads for a speedup of only ~4x. This total time is much lower, but the speedup is nowhere near what I'd like.

Can anyone please explain the lack of speedup and/or suggest ways to improve the speedup when using multiple threads with IPP.

jimdempseyatthecove · ‎09-10-2019

You may be experiencing a memory bandwidth limitation. Run VTune to get the numbers.

Jim Dempsey

Bombey__Jasper · ‎09-11-2019

I don't have much experience with VTune, but I ran it for various thread counts and here's the results:

Threads: 1, Time: 1.2sec, DRAMBound: 0%, ipp call LLC Miss Count: 98%

Threads: 2, Time: 0.6sec, DRAMBound: 0%, ipp call LLC Miss Count: 97%

Threads: 4, Time: 0.43sec, DRAMBound: 4.3%, ipp call LLC Miss Count: 98.6%

Threads: 8, Time: 0.28sec, DRAMBound: 81%, ipp call LLC Miss Count: 98%

Thread: 10, Time: 0.26sec, DRAMBound: 92%, ipp call LLC Miss Count: 98%

Thread: 20, Time: 0.26sec, DRAMBound: 75%, ipp call LLC Miss Count: 98%

Thread: 40, Time: 0.23sec, DRAMBound: 90%, ipp call LLC Miss Count: 96%

Looks like once 4 or more threads are used it starts to become DRAMBound and the speedup doesn't scale optimally. Is there any way to optimize the nested loop (change index ordering/change array sizes) to utilize less memory (or utilize the memory more efficiently) across many threads?

jimdempseyatthecove · ‎09-13-2019

Look at the vectorization reports to see how well (bad) the program is vectorized. I usually use VTune, and examine the disassembly code of the hot spots. If they are not vectorized then figure out why and fix it. Most often poor vectorization is due to inefficient data organization.

The compile also has vectorization reports.

Jim Dempsey

Bombey__Jasper · ‎09-19-2019

Looking at assembly code in VTune as you recommended it seems the problem areas are vectorized, I think, but I'm not certain since I'm not too familiar with assembly code. Here are the instructions that were highlighted in red:

vmovshdupy (%rsi), %ymm3 CPI: 4.8

vmovsldupy (%rsi), %mm4 CPI: 6.2

vaddpsy 0x20(%rdx), %ymm9, %ymm11 CPI: 6.576

If they are already vectorized is there anyway to improve the CPI, and thus the overall performance?

McCalpinJohn · ‎09-19-2019

Complex code often runs significantly faster if it is stored as separate real and imaginary vectors and the complex arithmetic is inserted manually. The SIMD vector approach works a lot better with data that is contiguous in memory.

jimdempseyatthecove · ‎09-20-2019

>> it seems the problem areas are vectorized

There is vectorizaton, then there is vectorization. Not all are equal.

The instructions you listed indicates you are compiling for SSE3 instruction set on system with AVX capability. I cannot explain why the VTune report is listing the %mm4 register (MMX instruction portion of %xmm4, %ymm4, or %zmm4), it is likely because that instruction is not supported in the disassembly of the op code as an AVX instruction.

As John stated, when you have arrays of complex numbers, it is much more computationally efficient to split the "array" into two parts (two arrays). Then perform a little more programming to perform the operations.

Jim Dempsey