Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Slow performance on multi-CPU machine

iburda
Beginner

I suppose this question is similar to

https://community.intel.com/t5/Intel-Integrated-Performance/bad-performance-on-multi-CPU-server/m-p/812120

although I did not find a viable solution.

My code resamples and warps 3D arrays of unsigned shorts (Ipp16u) using ipprResize_16u_C1PV and ipprWarpAffine_16u_C1PV, respectively. It runs significantly slower on the machine with two CPUs than on the machine with one. Specifications for both machines are in the table below.

 

                            2CPU Server            1CPU Z6 workstation
Processor                   Dual Xeon Gold 6244    Xeon Gold 6244
Base Speed (GHz)            3.6                    3.6
Cores per CPU (non-HT)      8                      8
Total Cores                 16                     8
Turbo Boost (GHz)           4.4                    4.4
Memory Channels Supported   6                      6
Total RAM (GB)              384                    96
DIMMs Used                  12                     6
Max DIMMs                   24                     6
DIMM Density (GB)           32                     16
Memory Speed (MT/s)         2933                   2933
GPUs                        4x NVIDIA RTX Titan    1x Quadro P5000
VRAM                        4x 24 GB               1x 16 GB

 

I profiled both machines with the same data set, which runs 2262 resample-warp loop iterations. Call duration statistics (in ms) are in the tables below:

Resize (ms)
PC      CPUs  Average  Min     Max     StDev   Median  90th percentile
Z6      1     18.96    16.229  27.307  2.565   20.09   22.13
Server  2     47.88    39.671  69.411  10.224  40.479  64.176

Warp (ms)
PC      CPUs  Average  Min      Max      StDev   Median   90th percentile
Z6      1     131.473  106.695  176.012  10.998  128.339  147.579
Server  2     554.825  397.543  945.275  56.093  573.442  605.424

 

The IPP library version is 2020.0.0 Gold. I define _IPP_PARALLEL_STATIC before including ipp.h. The C++ project is built with Microsoft Visual Studio 2015.

During initialization I call ippSetNumThreads with the number of logical CPUs (32 for Server and 16 for Z6 workstation). Afterwards I call ippGetNumThreads and confirm that the actual value matches what I set.
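For completeness, a minimal sketch of that initialization (the only IPP calls are the two thread-control functions mentioned above; the logical-CPU count comes from the C++ standard library, and error checking is omitted):

// Minimal sketch of the initialization described above. Assumes the
// threaded/static-parallel IPP libraries are linked.
#define _IPP_PARALLEL_STATIC   // must come before ipp.h
#include <ipp.h>
#include <thread>
#include <cstdio>

int main()
{
    // 32 logical CPUs on the dual-Xeon server, 16 on the Z6 workstation
    const int logicalCpus = static_cast<int>(std::thread::hardware_concurrency());
    ippSetNumThreads(logicalCpus);

    int actual = 0;
    ippGetNumThreads(&actual);   // confirm the value the library actually uses
    std::printf("requested %d IPP threads, library reports %d\n", logicalCpus, actual);
    return 0;
}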

The full profiling data is attached. To view it, first extract the JSON data files from the attachments, then:
1. Open Google Chrome browser.
2. Type "chrome://tracing" in address bar and hit Enter.
3. Drag and drop json file on Chrome browser window.
4. Use Alt+scroll wheel to zoom in and out of the timeline.

ipprResize_16u_C1PV appears in the profiling data as ippResample.
ipprWarpAffine_16u_C1PV appears under the name ippWarp.

Thank you.

Ruqiu_C_Intel
Moderator

Hi iburda,


Thanks for raising your issue; we will investigate it internally and report back here if there is any update.


Thanks,

Ruqiu


iburda
Beginner

Hi Ruqiu,

Thank you for the quick acknowledgement.
I forgot to provide the dimensions of my 3D arrays (input, intermediate after resize,
and output after warp).
XYZ [149 x 607 x 335] -> Resize -> [167 x 678 x 335] -> Warp -> [592 x 678 x 118].
The warp transform matrix is
0.707106781186548   0.000000000000000   -1.414213562373095   473.761543394986973
0.000000000000000   1.000000000000000   0.000000000000000   0.000000000000000
0.707106781186548   0.000000000000000   -0.000000000000000   0.000000000000037
0.000000000000000   0.000000000000000   0.000000000000000   1.000000000000000
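As a sanity check on those dimensions, here is a small stand-alone C++ sketch (no IPP calls, just the affine arithmetic) that maps the corners of the resized volume through the upper three rows of that matrix; the resulting bounding box is consistent with the [592 x 678 x 118] output size.

// Sanity check (no IPP calls): apply the 3x4 affine part of the matrix above
// to the corners of the resized volume [167 x 678 x 335] and print the
// bounding box of the result. The transformed corners stay within roughly
// x [1.4, 591.1], y [0, 677], z [0, 117.4], which is why a destination
// volume of [592 x 678 x 118] (with its origin at 0) holds the full result.
#include <algorithm>
#include <cstdio>

int main()
{
    const double m[3][4] = {
        { 0.707106781186548, 0.0, -1.414213562373095, 473.761543394986973 },
        { 0.0,               1.0,  0.0,                 0.0               },
        { 0.707106781186548, 0.0, -0.0,                 0.000000000000037 } };

    const double maxIdx[3] = { 166.0, 677.0, 334.0 };   // last voxel index per axis

    double lo[3] = { 1e300, 1e300, 1e300 }, hi[3] = { -1e300, -1e300, -1e300 };
    for (int corner = 0; corner < 8; ++corner)
    {
        const double src[3] = { (corner & 1) ? maxIdx[0] : 0.0,
                                (corner & 2) ? maxIdx[1] : 0.0,
                                (corner & 4) ? maxIdx[2] : 0.0 };
        for (int r = 0; r < 3; ++r)
        {
            const double v = m[r][0] * src[0] + m[r][1] * src[1] + m[r][2] * src[2] + m[r][3];
            lo[r] = std::min(lo[r], v);
            hi[r] = std::max(hi[r], v);
        }
    }
    std::printf("x: [%.3f, %.3f]  y: [%.3f, %.3f]  z: [%.3f, %.3f]\n",
                lo[0], hi[0], lo[1], hi[1], lo[2], hi[2]);
    return 0;
}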


Thank you.

Ilya.

 

 

Ruqiu_C_Intel
Moderator

Hi Ilya,

It is hard to investigate from the JSON file alone. Could you send us a reproducer so we can investigate the issue further?

Thanks,

Ruqiu

iburda
Beginner

Hello Ruqiu,

Sorry, it took a while to make a bare-minimum reproducer. A Win32 console app is attached (AWReproducer.zip). It takes two command-line parameters (number of loops and interpolation method); usage is shown when it is run with the ? parameter. The easiest way to run it is by launching test.bat, which executes the program once for each interpolation method (0 through 5). The 3D data is loaded from the included Data.bin file, and the matrix for the 3D warp is loaded from matrix.ini.

There are three kinds of output. The duration of each loop is printed to the console; the batch script redirects the output from all runs into a single text file, log.txt. More detailed logging is sent to the debug output and can be captured with the DebugView utility (https://docs.microsoft.com/en-us/sysinternals/downloads/debugview). The program also produces profile.json files that can be examined with the Google Chrome browser as I described originally. The source code is included in the Source directory.

I ran the same reproducer to compare the two machines I originally described. The summary is in the attached "Performance Comparison Z6_vs_GPUServer.xlsx" spreadsheet; cell range C4:N13 contains the duration of each Resample+Warp attempt in milliseconds. Taking the single-CPU machine (Z6) as 100%, the speed of the dual-Xeon machine (GPUServer) is consistently below 80% for all interpolation modes.

Thank you.

Ilya.

Andrey_B_Intel
Employee

Hello Ilya.

Thanks for your reproducer, but it is still too complicated for analysis. I have prepared a very simple reproducer to compare single-threaded vs. threaded runs of ipprResize_16u_C1PV and ipprWarpAffine_16u_C1PV. The first function does not contain special threaded code, but the second does. On my Skylake system (Intel Xeon Silver 4116, 48 logical cores with HT enabled) I see that ipprResize_16u_C1PV gets no benefit from threading, while ipprWarpAffine_16u_C1PV scales well across threads. Could you please check the attached reproducer on your system? Also, please try setting KMP_AFFINITY=scatter and restricting the number of threads to the number of physical cores in your system (24 in my case). See the details in the attachment.
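In case it is useful, a rough sketch of those two settings applied from code (assuming the threaded IPP layer uses the Intel OpenMP runtime, so KMP_AFFINITY applies; the variable can equally be set in the shell before launching):

// Rough sketch: spread OpenMP threads across the sockets and limit IPP to the
// physical core count. Setting KMP_AFFINITY from inside the process only takes
// effect if it happens before the OpenMP runtime initializes, i.e. before the
// first threaded IPP call.
#include <ipp.h>
#include <cstdlib>

void ConfigureIppThreading(int physicalCores)   // e.g. 16 on the dual Gold 6244 machines, 24 on my system
{
    _putenv_s("KMP_AFFINITY", "scatter");       // place threads round-robin across the sockets
    ippSetNumThreads(physicalCores);            // physical cores, not logical (HT) CPUs
}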

Thanks, Andrey.

iburda
Beginner

Hello Andrey,

Thank you so much! Your reproducer is indeed much simpler. I ran it and confirmed what you described: Resize does not scale and Warp does. The summary spreadsheet is attached.

The non-scalable Resize is slightly slower on the dual-CPU machine than on the single-CPU one. I realized that my dual-CPU machine (called "GPU Server") has drastically different hardware (chipset, motherboard, BIOS, etc.) from the single-CPU one (Supermicro vs. HP, respectively), so I found another dual-CPU machine with hardware much closer to the single-CPU one and compared all three machines.

Name        CPU                    Motherboard
Z6          Single Xeon Gold 6244  HP
Z8          Dual Xeon Gold 6244    HP
GPU Server  Dual Xeon Gold 6244    Supermicro

The single-threaded Resize code runs consistently fastest on the single-CPU Z6 machine; the dual-CPU Z8 is a close second and the GPU Server is a distant third. The multithreaded Warp scales, but its performance is quite sensitive to the thread count: more threads can easily be slower than fewer, and 16 threads seem optimal. On the dual-CPU machines, 32 threads are only marginally faster than 16 (and sometimes even slower).
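For reference, a minimal sketch of how such a thread-count sweep can be timed (RunWarpOnce is a hypothetical placeholder for one warp of the test volume; only the timing and thread-control scaffolding is shown):

#include <ipp.h>
#include <chrono>
#include <cstdio>

// Hypothetical placeholder: one ipprWarpAffine_16u_C1PV pass over the test volume.
void RunWarpOnce()
{
    // ... warp the test volume here ...
}

int main()
{
    const int counts[] = { 4, 8, 16, 24, 32 };
    for (int threads : counts)
    {
        ippSetNumThreads(threads);
        const auto t0 = std::chrono::steady_clock::now();
        RunWarpOnce();
        const auto t1 = std::chrono::steady_clock::now();
        std::printf("%2d threads: %.1f ms\n", threads,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return 0;
}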

Thank you again for your effort.

Sincerely,

Ilya.

Ruqiu_C_Intel
Moderator

This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Ruqiu_C_Intel
Moderator

Since we have reproduced and confirmed the results, we are closing the thread now and will no longer respond to it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
