
omp pragma different time

Hi,

I'm working on a complicated project that requires parallel calculations in order to achieve good run-time performance.

For this purpose our company bought an Intel Xeon Platinum 8168 system (96 cores, referred to below as "96"). We also have a computer with an Intel Core i9-7960X processor (16 cores, referred to below as "16").

I'm using the "#pragma omp for" directive, since all of the calculations happen in for loops, and I'm getting strange results.

I run my code on "16" with fewer than 16 iterations in the for loop. That means the number of threads, and so the number of cores in use, is less than 16. In this case I get almost the same run time for 5, 10, and 15 iterations. This is what I expect, since not all of the CPU is in use.

I then run the same code on "96" and see strange timings. With 40 iterations (that is, 40 threads), the run takes almost twice as long as with 1 iteration. With 90 iterations (still not full power!), it takes almost 4 times as long.

My question is: is there some issue specific to this processor (Intel Xeon Platinum 8168) when working with the IPP libraries?

What could be the reason for such an increase in run time? I am aware of the cost of dynamic memory allocation (we have some) and of the time needed to create a large number of threads, but those don't seem to be the real cause.

Thanks   

 

Gennady_F_Intel
Moderator

Moving this thread from the Intel IPP forum to the Intel C/C++ compiler forum.

McCalpinJohn
Black Belt

The Core i9-7960X processor has 16 physical cores in one chip. It is capable of supporting HyperThreading, which if enabled would give the appearance of 32 cores.

The Xeon Platinum 8168 processor has 24 physical cores in one chip.  It is capable of supporting HyperThreading, which if enabled would give the appearance of 48 cores.   The Xeon Platinum 8168 supports systems with up to 8 chips, so a system that reports 96 Logical Processors is either a 2-socket system with HyperThreading enabled, or a 4-socket system with HyperThreading disabled.  
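As a quick check of that arithmetic, here is a minimal C sketch. The function names are hypothetical, chosen only for illustration; the core counts are the ones quoted above. It shows why the same "96" can come from two different machines, and how many threads each can run before two threads have to share a physical core.

```c
#include <assert.h>

/* Logical processors reported by the OS:
   sockets x physical cores per socket x hardware threads per core. */
static int logical_processors(int sockets, int cores_per_socket,
                              int threads_per_core)
{
    assert(sockets > 0 && cores_per_socket > 0 && threads_per_core > 0);
    return sockets * cores_per_socket * threads_per_core;
}

/* Largest thread count that can still get one physical core per thread. */
static int max_threads_without_sharing(int sockets, int cores_per_socket)
{
    assert(sockets > 0 && cores_per_socket > 0);
    return sockets * cores_per_socket;
}
```

Both logical_processors(2, 24, 2) and logical_processors(4, 24, 1) give 96, but max_threads_without_sharing is 48 on the 2-socket machine and 96 on the 4-socket one, so a 90-thread run behaves very differently on the two.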

The easiest way to get ~1/2 performance is to run one thread per physical core on the Core i9-7960X, but to run two threads per physical core on the Xeon Platinum 8168.

The binding of threads to cores is difficult to control in a completely system-independent fashion.   If you want to compare single-socket performance, you will need to limit the code to running on a single socket of the Xeon Platinum 8168.  In Linux systems, this is typically done with a command like:

numactl --membind=0 --cpunodebind=0 a.out

This forces all memory and processors to be allocated from socket "0".   You should then set OMP_NUM_THREADS to the number of threads you want to test, and (MOST IMPORTANT) set OMP_PROC_BIND=spread.   The "spread" option will cause the OpenMP runtime to distribute the threads as far apart as possible.   Assuming that the Xeon Platinum 8168 system has HyperThreading enabled, the "spread" option will cause one thread to be bound to each *physical* core until all physical cores have been used.  For this single socket case you will be able to use up to 24 threads without "doubling up" on any physical core in socket 0.
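To confirm where the runtime actually puts the threads, something like the following sketch may help. It is not taken from the original code: record_places and print_places are hypothetical helper names, and it assumes an OpenMP 4.5 or later runtime, since it relies on omp_get_place_num(). Which hardware a "place" corresponds to depends on the place list (for example OMP_PLACES=cores, which is my assumption and not something set above); if no binding is in effect the call returns -1.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Stores the OpenMP place number of thread i in place_of_thread[i]
   (-1 if the thread is unbound, or if built without OpenMP).
   Returns the size of the thread team. */
static int record_places(int *place_of_thread, int capacity)
{
    int team = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid == 0)
            team = omp_get_num_threads();   /* written by one thread only */
        if (tid < capacity)
            place_of_thread[tid] = omp_get_place_num();
    }
#else
    if (capacity > 0)
        place_of_thread[0] = -1;            /* serial build: no places */
#endif
    return team;
}

/* Prints one line per recorded thread. */
static void print_places(const int *place_of_thread, int team, int capacity)
{
    int n = team < capacity ? team : capacity;
    for (int i = 0; i < n; ++i)
        printf("thread %d -> place %d\n", i, place_of_thread[i]);
}
```

Calling record_places and then print_places from main, and launching the binary under the numactl command above with OMP_NUM_THREADS and OMP_PROC_BIND=spread set, should show each thread on a distinct place as long as the thread count does not exceed the number of places.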

Some OpenMP jobs will run better if the threads are spread uniformly across the sockets.  Again assuming that the Xeon Platinum 8168 system has two sockets and HyperThreading enabled, you simply set OMP_PROC_BIND=spread, and then set OMP_NUM_THREADS to the desired number of threads.  Since there are two sockets, you probably want to set the number of threads to an even value.   The "spread" option will place 1/2 of the threads on separate physical cores in the first socket and will place the other half of the threads on separate physical cores in the second socket.


Thanks for your response.

I checked the HyperThreading option in the BIOS and it was disabled. The strange thing is that after enabling it, the timings got even worse.