
omp different time execution

Hi,

I am working on a complicated algorithm that requires a lot of computation. For this purpose, an Intel Xeon Platinum 8168 machine was purchased, with 96 cores (I'll call it the "96" PC). We also already have an Intel Core i9-7960X machine with 16 cores (the "16" PC).

I'm putting a "#pragma omp parallel for" directive on a FOR loop in order to get parallel calculation. First, I tried this approach on the 16 PC with 5, 10, and 15 iterations of that FOR loop, and got almost the same total time in all three cases (which is correct, since not all of the CPU's power was being used).

Next, I ran the same code on the 96 PC. I again tried different numbers of iterations, and I see the total execution time constantly increasing with the iteration count.

With 40 iterations the total time almost doubled, and with 90 iterations it increased 3.5 times (still NOT the full power of the CPU, as it has 96 cores!).

I'm aware of the thread pool and of the time needed to create that many threads, but even so, this does not seem to be working well at all. Does OpenMP have specific problems with Intel Xeon Platinum processors that I am not aware of? Maybe something about its architecture that does not work well with OpenMP?

* It is not a cooling problem, since the tests run for only about 3 minutes.

* It is not a problem of allocations or memory copies, since exactly the same amount of memory is allocated and copied in every case.

Can you think of any possible problems? I have run out of ideas.

Thanks

 

12 Replies

Employee

Hi Gregory,

Is there an IPP issue you've run into? Which IPP version? Which IPP functionality?

regards, Igor


Sorry, it seems this post would have been more appropriate in the Intel® C++ Compiler forum. No, I wasn't using IPP functions. Is there a way to move this post?

Moderator

Moved this thread to the Intel C/C++ compiler forum.


We need more information about your application.

In particular, a sketch of where you place the omp directives (and what they are), and what type of code is being called.

For example, if each iteration contains a loop that you parallelized, what is the iteration count of that loop (e.g. the number of particles)?

Further, do the functions called by each thread within this parallel loop themselves contain parallel loops (in other words, nested parallelism)?
*** Note: if the answer is NO because your called code contains no omp directives, but these functions make calls into the MKL library, then you must link in the single-threaded MKL library. Using the multi-threaded MKL library from a thread inside a parallel region will initiate nested parallel regions, in other words cause your program to generate 96*96 threads (and suffer oversubscription).

For a parallel program, use the single-threaded MKL library.
For a serial program, use the multi-threaded MKL library.
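For example, on Linux with the Intel compiler the difference is just in the link line (exact flags vary by compiler and MKL version; the Intel MKL Link Line Advisor gives the precise set for your configuration):

```shell
# Single-threaded (sequential) MKL -- link this when calling MKL from inside
# your own OpenMP parallel regions:
icpc app.cpp -qopenmp -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm

# Multi-threaded MKL -- link this when calling MKL from serial code:
icpc app.cpp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
```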

Jim Dempsey


The whole code is really complicated. I will give you a brief sketch here:

... // some code here

for LOOP (dynamic number of iterations)   // #pragma omp parallel for on this loop
{
    ... // some code here

    // call to a function, let's say MyFunc, that consumes ~85% of the CPU time:
    // a lot of calculations, matrix operations, etc.
    MyFunc(...);

    ... // some code here
}

... // some code here

* As for the number of iterations: as I mentioned in the original post, I have tested different iteration counts, even exceeding the number of cores twice over, just to check the stability of the code (in the real workload it will be around 20 iterations).

* About the code inside MyFunc: there are almost no memory allocations; almost all of the code is calculations and matrix operations, and by operations I mean only reading. Sometimes, I believe, several threads will hit the same matrix index simultaneously; I don't think that can interfere too much. Other functions are called inside, but there is no parallelism in them. This is the only parallel region.

* I wasn't using the MKL library before. OpenMP is not part of the MKL library, so MKL doesn't have to be installed.

* But the most interesting part is the different behavior of the SAME code (with the same configuration) on the different CPUs. I thought that maybe because the 16 PC has 1 physical CPU while the 96 PC has 4 physical CPUs, there is overhead in communication between the different CPUs.


Try using OpenMP tasking:

#pragma omp parallel
{
  #pragma omp master
  {
    for(int i=0; i<outerCount; ++i)
    {
      #pragma omp task
      {
         MyFunc(i);
      }
    }
  }
}
...
void MyFunc(int i)
{
   ...
   // at this point you have two independent sections A and B
   #pragma omp task
   {
      ... // section A
   }
   #pragma omp task
   {
      ... // section B
   }
   #pragma omp taskwait
   ...
}

The tasks that you launch have to have a sufficient amount of work to be effective.

I did not show any of the potential OpenMP clauses (e.g. private, shared, reduction, ...).

Jim Dempsey


I had written a nicer post prior to #7, but the IDZ forum logged me out (rebooted) while I was composing the message.

Recap of that message:

If your parallelization is only at this outer loop

   for LOOP (dynamic number of iterations)  // omp pragma parallel for  on  this  loop

then the degree of parallelism is at most the number of iterations of that loop.

For a maximum count of 20, only 20 of your 96 threads (or 192 threads) would be utilized. To achieve greater parallelism, you will need to dig deeper into your code, as outlined in post #7, or use nested OpenMP parallel regions (though using OpenMP tasks as in #7 may be more effective). Tuning nested parallel regions is an algorithmic art. A "best" strategy is to determine the largest outer loop thread count that produces the least value remainder of the modulus of the outer loop count with inner loop count (2 through outer loop count) *** with special case for outer loop count mod max threads == 0, and in which case inner loop is not parallelized.

Nested parallel regions will work best when the outer loop count does not vary within a run of the application.

Jim Dempsey


Thanks for your response. Unfortunately, I am using VS 2017, which doesn't support the task directive, it seems:

https://stackoverflow.com/questions/23545930/openmp-tasks-in-visual-studio
I got: "Error C3001 'task': expected an OpenMP directive name".

Can you please explain this part a bit? It's not clear to me:

"A "best" strategy is to determine the largest outer loop thread count that produces the least value remainder of the modulus of the outer loop count with inner loop count (2 through outer loop count) *** with special case for outer loop count mod max threads == 0, and in which case inner loop is not parallelized."

* I installed the MKL library and ran the new code; only a very slight improvement.

* I already tried nested omp pragmas (there is also a FOR loop inside MyFunc that does a lot of work); performance is still very low.


Your objective, provided that there is sufficient work for each thread used, is to put as many threads as possible to work. (There are cases where fewer threads are better, but ignore this for now.)

Assume for this example a machine with 16 threads, and measure time in units of one iteration executed in single-threaded mode.
Assume your outer loop has 20 iterations.
Assume the MyFunc code is almost entirely parallelizable by 2 threads.

Using 1 thread in MyFunc:
  16 iterations run in parallel
  +4 iterations run in parallel, 12 threads idle (not used)

Total time is 2/20 of serial.

Using 2 threads in MyFunc:
  8 iterations in parallel, each executed by 2 threads
  +8 iterations in parallel, each executed by 2 threads
  +4 iterations in parallel, each executed by 2 threads, with 8 threads idle (not used)

In the second case you've kept more threads working.

The strategy is: given the number of iterations (degree of parallelism) of the outer loop, the number of threads available, and the degree of parallelism of the inner section, determine the mix of threads used on the outer loop and in the inner code such that the most threads are kept busy.
*** When multiple combinations keep the same maximum number of threads busy, take the one with the most threads on the outer loop.

With your 96 cores and 20 iterations of the outer loop:

96 - 20 = 76 threads remaining
76 / 20 = 3.8
so MyFunc would use 3 threads and would need at least 3-way parallelism
20 * 3 = 60 threads utilized

Or consider using 10 threads on the outer loop:
96 - 10 = 86
86 / 10 = 8.6
so MyFunc would use 8 threads and would need at least 8-way parallelism
10 * 8 = 80 threads utilized

The choice depends on the degree to which the inner code can be parallelized.

Jim Dempsey

 

 

 


>> I already tried nested omp pragmas (there is also a FOR loop inside MyFunc that does a lot of work); performance is still very low.

In order to use nested parallel regions you must instruct OpenMP to enable them (they are disabled by default).

Use environment variable:

     OMP_NESTED=true

or function call:

    omp_set_nested(1); // *** prior to first parallel region

Then something like this **untested** sketch:

int outerCount = N; // number between, say, 1 and 20
int MaxThreads = omp_get_max_threads();
int OuterLoopThreadCount = std::min(MaxThreads, outerCount);
// each outer thread also becomes the master of its inner team,
// so OuterLoopThreadCount * InnerLoopThreadCount <= MaxThreads
int InnerLoopThreadCount = std::max(1, MaxThreads / OuterLoopThreadCount);

// threads left idle during the last batch of the outer loop
int remainder = outerCount % OuterLoopThreadCount;
int IdleThreadCount = remainder ? (OuterLoopThreadCount - remainder) * InnerLoopThreadCount : 0;

int bestOuterLoopThreadCount = OuterLoopThreadCount;
int bestInnerLoopThreadCount = InnerLoopThreadCount;
int bestIdleThreadCount = IdleThreadCount;

// find the minimum idle thread count for the last parallel batch of the outer loop
while(bestIdleThreadCount)
{
  if(OuterLoopThreadCount == 1)
    break;
  --OuterLoopThreadCount;
  InnerLoopThreadCount = std::max(1, MaxThreads / OuterLoopThreadCount);
  remainder = outerCount % OuterLoopThreadCount;
  IdleThreadCount = remainder ? (OuterLoopThreadCount - remainder) * InnerLoopThreadCount : 0;
  if(IdleThreadCount < bestIdleThreadCount)
  {
    bestOuterLoopThreadCount = OuterLoopThreadCount;
    bestInnerLoopThreadCount = InnerLoopThreadCount;
    bestIdleThreadCount = IdleThreadCount;
  }
}

OuterLoopThreadCount = bestOuterLoopThreadCount;
InnerLoopThreadCount = bestInnerLoopThreadCount;

...
if(outerCount == 1)
  MyFunc(0);
else
{
  #pragma omp parallel for num_threads(OuterLoopThreadCount)
  for(int i=0; i<outerCount; ++i)
  {
    MyFunc(i);
  }
}

...
void MyFunc(int i)
{
   ...
   #pragma omp parallel for num_threads(InnerLoopThreadCount)
   for(int j=0; j<innerLoopCount; ++j)
   {
      ...
   }
}

*** You must ensure the thread-safety of your code. This generally involves adding private(...), firstprivate(...), reduction(...), lastprivate(...), and/or other clauses.

Jim Dempsey


OK, I have tried it in different ways.

As for your #11 example: my inner loop has about 720 iterations.

So I tried running nested OpenMP, with omp_set_nested(1) enabled.

First I tried 20 threads in the outer loop and 3/4 threads in the inner loop.

Then I tried 10 threads in the outer loop and 8/9/10 threads in the inner loop. Not only was there no improvement, the time got 4.5 times worse!

Then I tried running parallelism only on the inner LOOP. Since that should make much better use of the CPU's power (the iteration count is greater than the number of cores), I expected even better results than parallelizing only the outer loop with 20 threads. And once again, the time was 5 times worse.

Thanks for your help, but it seems there is just no solution.

BTW, there are optimization settings for Intel Xeon Platinum CPUs in the BIOS that I wasn't aware of. Worth noting for the future if someone gets stuck like me, though they didn't help.

 

I will copy Dr. Bandwidth's answer here:

"The Core i9-7960x processor has 16 physical cores in one chip.  It is capable of supporting HyperThreading, which if enabled would give the appearance of 32 cores.

The Xeon Platinum 8168 processor has 24 physical cores in one chip.  It is capable of supporting HyperThreading, which if enabled would give the appearance of 48 cores.   The Xeon Platinum 8168 supports systems with up to 8 chips, so a system that reports 96 Logical Processors is either a 2-socket system with HyperThreading enabled, or a 4-socket system with HyperThreading disabled.  

The easiest way to get ~1/2 performance is to run one thread per physical core on the Core i9-7960x, but to run two threads per physical core on the Xeon Platinum 8168.   

The binding of threads to cores is difficult to control in a completely system-independent fashion.   If you want to compare single-socket performance, you will need to limit the code to running on a single socket of the Xeon Platinum 8168.  In Linux systems, this is typically done with a command like:

numactl --membind=0 --cpunodebind=0 a.out

This forces all memory and processors to be allocated from socket "0".   You should then set OMP_NUM_THREADS to the number of threads you want to test, and (MOST IMPORTANT) set OMP_PROC_BIND=spread.   The "spread" option will cause the OpenMP runtime to distribute the threads as far apart as possible.   Assuming that the Xeon Platinum 8168 system has HyperThreading enabled, the "spread" option will cause one thread to be bound to each *physical* core until all physical cores have been used.  For this single socket case you will be able to use up to 24 threads without "doubling up" on any physical core in socket 0.

Some OpenMP jobs will run better if the threads are spread uniformly across the sockets.  Again assuming that the Xeon Platinum 8168 system has two sockets and HyperThreading enabled, you simply set OMP_PROC_BIND=spread, and then set OMP_NUM_THREADS to the desired number of threads.  Since there are two sockets, you probably want to set the number of threads to an even value.   The "spread" option will place 1/2 of the threads on separate physical cores in the first socket and will place the other half of the threads on separate physical cores in the second socket.

"Dr. Bandwidth"
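Putting that advice together, the runs would look something like this (core counts assume the 2-socket, HyperThreading-enabled configuration described above):

```shell
# Whole machine: one thread per physical core, spread across both sockets
OMP_NUM_THREADS=48 OMP_PROC_BIND=spread ./a.out

# Single socket only: bind memory and threads to socket 0
OMP_NUM_THREADS=24 OMP_PROC_BIND=spread numactl --membind=0 --cpunodebind=0 ./a.out
```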


Do you have Intel's VTune? If so, I suggest you learn how to use it. While it may be the case that your program is incapable of scaling to 96 threads, it is often the case that a simplistic approach is not the best way to parallelize an application. Have you considered seeking professional help?

Jim Dempsey
