
Slow performance on multi-CPU machine

I suppose this question is similar to

https://community.intel.com/t5/Intel-Integrated-Performance/bad-performance-on-multi-CPU-server/m-p/...

although I did not find a viable solution.

My code resamples and warps 3D arrays of unsigned shorts (Ipp16u) using ipprResize_16u_C1PV and ipprWarpAffine_16u_C1PV, respectively. It runs significantly slower on the machine with two CPUs than on the one with a single CPU. Specifications for both machines are in the table below.

 

                           2CPU Server            1CPU Z6 w/s
Processor                  Dual Xeon Gold 6244    Xeon Gold 6244
Base Speed (GHz)           3.6                    3.6
Cores per CPU (non-HT)     8                      8
Total Cores                16                     8
Turbo Boost (GHz)          4.4                    4.4
Memory Channels Supported  6                      6
Total RAM (GB)             384                    96
# DIMMs Used               12                     6
Max DIMMs                  24                     6
DIMM Density (GB)          32                     16
Memory Speed (MT/s)        2933                   2933
GPUs                       4x NVIDIA RTX Titan    1x Quadro P5000
VRAM                       4x 24 GB               1x 16 GB
 

I profiled with the same data set, which runs 2262 resample-warp loop iterations. Call-duration statistics (in ms) are in the tables below:

Resize (ms)
PC      CPUs  Average  Min     Max     StDev   Median  90th %ile
Z6      1     18.96    16.229  27.307   2.565  20.09   22.13
Server  2     47.88    39.671  69.411  10.224  40.479  64.176

Warp (ms)
PC      CPUs  Average  Min      Max      StDev   Median   90th %ile
Z6      1     131.473  106.695  176.012  10.998  128.339  147.579
Server  2     554.825  397.543  945.275  56.093  573.442  605.424

 

The IPP library version is 2020.0.0 Gold. I defined _IPP_PARALLEL_STATIC before including ipp.h. The C++ project is built with Microsoft Visual Studio 2015.

During initialization I call ippSetNumThreads with the number of logical CPUs (32 for Server and 16 for Z6 workstation). Afterwards I call ippGetNumThreads and confirm that the actual value matches what I set.
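For reference, the initialization sequence described above looks roughly like the sketch below. The ippSetNumThreads/ippGetNumThreads stubs here are mocks standing in for the real legacy threaded-IPP calls (IppStatus ippSetNumThreads(int) and IppStatus ippGetNumThreads(int*)), so the sketch compiles without ipp.h; the fallback value of 16 is an assumption matching the Z6 workstation:

```cpp
#include <thread>
#include <utility>

// Mock stand-ins for the legacy threaded-IPP calls, so this sketch builds
// without ipp.h. In the real project these come from the IPP headers.
static int g_ippThreads = 1;
static int ippSetNumThreads(int numThr) { g_ippThreads = numThr; return 0; }
static int ippGetNumThreads(int* pNumThr) { *pNumThr = g_ippThreads; return 0; }

// Request one IPP thread per logical CPU and return {requested, actual}.
static std::pair<int, int> ConfigureIppThreads() {
    int requested = static_cast<int>(std::thread::hardware_concurrency());
    if (requested == 0) requested = 16;  // fallback if the count is unknown

    ippSetNumThreads(requested);

    // Read the value back to confirm IPP accepted the request.
    int actual = 0;
    ippGetNumThreads(&actual);
    return {requested, actual};
}
```

In the real code the stubs are replaced by the declarations in ipp.h (with _IPP_PARALLEL_STATIC defined before the include, as in the original project), and the two returned values are compared to confirm the setting took effect.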

The full profiling data is attached. To view it, first extract the JSON data files from the attachments, then:
1. Open the Google Chrome browser.
2. Type "chrome://tracing" in the address bar and press Enter.
3. Drag and drop a JSON file onto the Chrome browser window.
4. Use Alt+scroll wheel to zoom in and out of the timeline.

ipprResize_16u_C1PV appears in the profiling data as ippResample.
ipprWarpAffine_16u_C1PV appears under the name ippWarp.

Thank you.

6 Replies

Employee

Hi iburda,


Thanks for raising this issue; we will investigate it internally and report back here if there is any update.


Thanks,

Ruqiu


Beginner

Hi Ruqiu,

Thank you for the quick acknowledgment.
I forgot to provide the dimensions of my 3D arrays (input, intermediate after resize, and output after warp):
XYZ [149 x 607 x 335] -> Resize -> [167 x 678 x 335] -> Warp -> [592 x 678 x 118].
The warp transform matrix is
0.707106781186548   0.000000000000000   -1.414213562373095   473.761543394986973
0.000000000000000   1.000000000000000   0.000000000000000   0.000000000000000
0.707106781186548   0.000000000000000   -0.000000000000000   0.000000000000037
0.000000000000000   0.000000000000000   0.000000000000000   1.000000000000000


Thank you.

Ilya.

 

 

Employee

Hi Ilya,

It's hard to investigate the JSON file. Could you send a reproducer so we can investigate the issue further?

Thanks,

Ruqiu

Beginner

Hello Ruqiu,

Sorry, it took a while to make a bare-minimum reproducer. A Win32 console app is attached (AWReproducer.zip). It takes two command-line parameters (number of loops and interpolation method); usage is shown when it is run with the ? parameter. The easiest way to run it is to launch test.bat, which executes the program once for each interpolation method (0 through 5). The 3D data is loaded from the included Data.bin file, and the matrix for the 3D warp is loaded from matrix.ini.

There are three kinds of output:
1. The duration of each loop is printed to the console; the batch script redirects the output of all runs into a single text file, "log.txt".
2. More detailed logging is sent to the debug output and can be captured with the DebugView utility (https://docs.microsoft.com/en-us/sysinternals/downloads/debugview).
3. The program also produces profile.json files that can be examined with the Google Chrome browser as I described originally.

The source code is included in the Source directory.

I ran the same reproducer to compare the two machines I originally described. The summary is in the attached "Performance Comparison Z6_vs_GPUServer.xlsx" spreadsheet; cell range C4:N13 contains the duration of each Resample+Warp attempt in milliseconds. Taking the single-CPU machine (Z6) as 100%, the speed of the dual-Xeon machine (GPUServer) is consistently below 80% for all interpolation modes.

Thank you.

Ilya.

Employee

Hello Ilya.

Thanks for your reproducer, but it is still too complicated for analysis. I've prepared a very simple reproducer to compare the single-threaded vs. threaded versions of ipprResize_16u_C1PV and ipprWarpAffine_16u_C1PV. The first function does not contain special threaded code, but the second does. On my Skylake system (Intel Xeon Silver 4116, 48 logical cores with HT enabled) I see that ipprResize_16u_C1PV gets no benefit from threading, while ipprWarpAffine_16u_C1PV scales well across threads. Could you please check the attached reproducer on your system? Also, please try setting KMP_AFFINITY=scatter and restricting the number of threads to the number of physical cores in your system (24 in my case). See the details in the attachment.
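For reference, the suggested settings could be applied like this before launching the reproducer. This is a sketch for a Windows cmd shell; the executable name AWReproducer.exe and its arguments come from the earlier post, while the value 24 and the use of OMP_NUM_THREADS to cap the OpenMP thread count are assumptions matching the 24-physical-core system described above:

```shell
rem Spread OpenMP worker threads across sockets/cores instead of packing them
set KMP_AFFINITY=scatter
rem Restrict the OpenMP/threaded-IPP layer to the physical core count (assumed 24)
set OMP_NUM_THREADS=24
rem Then run the reproducer as before (10 loops, interpolation method 0)
AWReproducer.exe 10 0
```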

Thanks, Andrey.

Beginner

Hello Andrey,

Thank you so much! Your reproducer is indeed much simpler. I ran it and confirmed what you described: Resize does not scale and Warp does. The summary spreadsheet is attached.
The non-scalable Resize is slightly slower on the dual-CPU machine than on the single-CPU one. I realized that my dual-CPU machine (called "GPU Server") has drastically different hardware (chipset, motherboard, BIOS, etc.) from the single-CPU one (Supermicro vs. HP, respectively). So I found another dual-CPU machine with hardware much closer to the single-CPU one and compared all three machines.

Name        CPU                     Motherboard
Z6          Single Xeon Gold 6244   HP
Z8          Dual Xeon Gold 6244     HP
GPU Server  Dual Xeon Gold 6244     Supermicro

The single-threaded Resize code runs consistently fastest on the single-CPU Z6 machine; the dual-CPU Z8 is a close second, and the GPU Server is a distant third. The multithreaded Warp scales, but its performance is quite sensitive to the thread count: more threads can easily be slower than fewer. 16 threads seem optimal; 32 threads on the dual-CPU machines are only marginally faster than 16 (and sometimes even slower).

Thank you again for your effort.

Sincerely,

Ilya.
