Hi,
I am working with a complicated algorithm that requires a lot of computation. For this purpose an Intel Xeon Platinum 8168 CPU was purchased, with 96 cores (I'll call this machine "96"). We also already have an Intel Core i9-7960X CPU with 16 cores ("16").
I'm running a "#pragma omp parallel for" directive on a FOR loop in order to parallelize the calculations. First I tried this approach on the 16 PC, with 5, 10 and 15 iterations of that FOR loop, and got almost the same total time in all three cases (which is correct, since not all of the CPU's power was being used).
Next, I ran the same code on the 96 PC. I tried different iteration counts there as well, and I see a constantly increasing total execution time. With 40 iterations the total time almost doubled, and with 90 iterations it increased by 3.5 times (still NOT the full power of the CPU, as it has 96 cores!!).
I'm aware of thread pools and the time needed to create that many threads, but even so, this does not seem to be working well at all. Does OpenMP have specific problems with Intel Xeon Platinum processors that I am not aware of? Maybe something about its architecture that doesn't work well with OpenMP?
* It is not a cooling/hardware problem, since the tests only run for about 3 minutes.
* It is not a problem of allocations or memory copies, since exactly the same amount of memory is allocated and copied in every case.
Can you think of any possible problems? I have run out of ideas.
Thanks
- Tags:
- C/C++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
Hi Gregory,
Have you run into an IPP issue? If so, which IPP version, and which IPP functionality?
regards, Igor
Sorry, it seems this post would have been more appropriate in the Intel® C++ Compiler forum. No, I wasn't using IPP functions. Is there a way to move this post?
Moved this thread to the Intel C/C++ Compiler forum.
We need more information about your application.
In particular, a sketch of where you place the OpenMP directives (and what they are), and what type of code is being called.
For example, for each iteration you have a loop that you parallelized. What is the iteration count of that loop (e.g. number of particles)?
Further, do the functions called by each thread within this parallel loop themselves contain parallel loops (IOW, nested parallelism)?
*** Note: if the answer is NO, because your called code contains no OpenMP directives, then, should these functions call the MKL library, you must link in the single-threaded MKL library. Using the multi-threaded MKL library from a thread inside a parallel region will initiate nested parallel regions, IOW cause your program to generate 96*96 threads (and suffer oversubscription).
For a parallel program - use the single-threaded MKL library.
For a serial program - use the multi-threaded MKL library.
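As an illustration of the linking advice above (library names follow the classic Intel 64/LP64 MKL layout and may differ by MKL version and platform; this is a sketch, not a definitive link line):

```shell
# Sequential MKL -- for code that is already OpenMP-parallel at a higher level:
icpc -qopenmp myapp.cpp -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -o myapp_seq

# Threaded MKL -- for serial code that lets MKL do the threading internally:
icpc -qopenmp myapp.cpp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -o myapp_thr

# Alternatively, cap MKL's internal threading at run time:
export MKL_NUM_THREADS=1
```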
Jim Dempsey
The whole code is really complicated, so here is a brief sketch:

... // some code here
for LOOP (dynamic number of iterations)   // #pragma omp parallel for on this loop
{
    ... // some code here
    // call a function, let's say MyFunc, that consumes 85% of the CPU time
    // and does a lot of calculations, matrix operations, etc.
    MyFunc(...);
    ... // some code here
}
... // some code here
* As for the number of iterations: as I mentioned in the original post, I have tested different iteration counts, even exceeding the number of cores twice over, just to check the stability of the code (in the real world it will be around 20 iterations).
* About the code inside MyFunc: there are almost no memory allocations; almost all of the code is calculations and matrix operations, and by operations I mean only reads. Sometimes, I believe, some threads will hit the same matrix index simultaneously, but I don't think that can interfere too much. Other functions are called inside, of course, but there is no parallelism in them. This is the only place.
* I wasn't using the MKL library before. (I mean, OpenMP is not part of the MKL library, so MKL doesn't have to be installed.)
* But the most interesting part is the very different behavior of the SAME code (with the same configuration) on the different CPUs. I thought that maybe, because there is 1 physical CPU in the 16 PC and 4 physical CPUs in the 96 PC, there is overhead in communication between the different CPUs.
Try using OpenMP tasking:
#pragma omp parallel
{
    #pragma omp master
    {
        for (int i = 0; i < outerCount; ++i)
        {
            #pragma omp task
            {
                MyFunc(i);
            }
        }
    }
}
...
void MyFunc(int i)
{
    ...
    // at this point you have two independent sections A and B
    #pragma omp task
    {
        ... // section A
    }
    #pragma omp task
    {
        ... // section B
    }
    #pragma omp taskwait
    ...
}
The tasks that you launch have to contain a sufficient amount of work to be effective.
I did not show any of the potential OpenMP clauses (e.g. private, shared, reduction...).
Jim Dempsey
I had written a nicer post prior to #7, but the IDZ forum logged me out (rebooted) while I was composing the message.
Recap of that message:
If your parallelization is only at this outer loop
for LOOP (dynamic number of iterations) // #pragma omp parallel for on this loop
then the degree of parallelization is at most the number of iterations of that loop.
For a maximum count of 20, only 20 of your 96 threads (or 192 threads) would be utilized. To achieve greater parallelization you will need to dig deeper into your code, as outlined in post #7, or use nested OpenMP parallel regions, though using OpenMP tasks as in #7 may be more effective. Tuning nested parallel regions is an algorithmic art. A "best" strategy is to determine the largest outer-loop thread count that produces the smallest remainder of the outer loop count modulo the candidate count (trying candidates from 2 through the outer loop count), with a special case: when the outer loop count mod max threads == 0, the inner loop is not parallelized.
Nested parallel regions will work best when the outer loop count does not vary within a run of the application.
Jim Dempsey
Thanks for your response. Unfortunately, it seems the VS 2017 compiler I'm using doesn't support the task directive:
https://stackoverflow.com/questions/23545930/openmp-tasks-in-visual-studio
I got "Error C3001 'task': expected an OpenMP directive name".
Can you please explain this part a bit? It's not clear to me:
"A "best" strategy is to determine the largest outer loop thread count that produces the least value remainder of the modulus of the outer loop count with inner loop count (2 through outer loop count) *** with special case for outer loop count mod max threads == 0, and in which case inner loop is not parallelized."
* I installed the MKL library and ran the new code; only a very slight improvement.
* I already tried a nested omp pragma (there is also a FOR loop inside MyFunc that does a lot of work); performance is still very low.
Your objective, provided there is sufficient work for each thread used, is to put the most threads possible to work. (There are cases where fewer threads is better, but ignore this for now.)
Assume for this example a 16-thread machine, and that each iteration of your loop executes in single-thread mode.
Assume your outer loop has 20 iterations.
Assume the MyFunc code is dominated by code that can be processed by 2 threads.
Using 1 thread in MyFunc:
  16 iterations run in parallel
  + 4 iterations run in parallel, 12 threads idle (not used)
  Total time is 2 passes, i.e. ~2/20 of serial time.
Using 2 threads in MyFunc:
  8 iterations in parallel, each executed by 2 threads
  + 8 iterations in parallel, each executed by 2 threads
  + 4 iterations in parallel, each executed by 2 threads, with 8 threads idle (not used)
In the second case you've kept more threads working.
The strategy is: given the number of iterations (degree of parallelization) of the outer loop, the number of threads available, and the degree of parallelization of the inner section, determine the mix of threads used on the outer loop and the inner code such that the most threads are kept busy.
*** When you have multiple equally good combinations, take the one with the most threads on the outer loop.
With your 96 cores and 20 iterations of the outer loop:
  96 - 20 = 76 threads remaining
  76 / 20 = 3.8
so MyFunc would use 3 threads, and would have to have at least 3 degrees of parallelization:
  20 * 3 = 60 threads utilized
Or consider using 10 threads on the outer loop:
  96 - 10 = 86
  86 / 10 = 8.6
so MyFunc would use 8 threads, and would have to have at least 8 degrees of parallelization:
  10 * 8 = 80 threads utilized
The choice depends on the degree to which the inner code can be parallelized.
Jim Dempsey
>> I already tried a nested omp pragma (there is also a FOR loop inside MyFunc that does a lot of work); performance is still very low.
In order to use nested parallel regions, you must instruct OpenMP to enable them (they are disabled by default).
Use the environment variable:
OMP_NESTED=true
or the function call:
omp_set_nested(1); // *** prior to the first parallel region
Then something like this **untested** sketch:
int outerCount = N; // number between say 1 and 20
int MaxThreads = omp_get_max_threads();
int OuterLoopThreadCount = std::min(MaxThreads, N);
int InnerLoopThreadCount = std::max(1, N / OuterLoopThreadCount);
int bestOuterLoopThreadCount = OuterLoopThreadCount;
int bestInnerLoopThreadCount = InnerLoopThreadCount;
int bestIdleThreadCount = (N % OuterLoopThreadCount) * InnerLoopThreadCount;
int IdleThreadCount;
// find minimum idle thread count for last parallel iteration of outer loop
while (bestIdleThreadCount)
{
    if (OuterLoopThreadCount == 1)
        break;
    --OuterLoopThreadCount;
    InnerLoopThreadCount = std::max(1, N / OuterLoopThreadCount);
    IdleThreadCount = (N % OuterLoopThreadCount) * InnerLoopThreadCount;
    if (IdleThreadCount < bestIdleThreadCount)
    {
        bestOuterLoopThreadCount = OuterLoopThreadCount;
        bestInnerLoopThreadCount = InnerLoopThreadCount;
        bestIdleThreadCount = IdleThreadCount;
    }
}
OuterLoopThreadCount = bestOuterLoopThreadCount;
InnerLoopThreadCount = bestInnerLoopThreadCount;
IdleThreadCount = bestIdleThreadCount;
...
if (outerCount == 1)
    MyFunc(0);
else
{
    #pragma omp parallel for num_threads(OuterLoopThreadCount)
    for (int i = 0; i < outerCount; ++i)
    {
        MyFunc(i);
    }
}
...
void MyFunc(int i)
{
    ...
    #pragma omp parallel for num_threads(InnerLoopThreadCount)
    for (int j = 0; j < innerLoopCount; ++j)
    {
        ...
    }
}
*** You must assure the thread-safety of your code. This generally involves adding private(...), firstprivate(...), reduction(...), lastprivate(...) and/or other clauses.
Jim Dempsey
OK, I have tried it different ways.
As for your #11 example: my inner loop runs for about 720 iterations.
So I tried running nested OpenMP, with omp_set_nested(1) enabled.
First I tried 20 threads in the outer loop and 3/4 threads in the inner loop.
Then I tried 10 threads in the outer loop and 8/9/10 threads in the inner loop. Not only was there no improvement, the time got 4.5 times worse!!!
Then I tried running parallelism only on the inner LOOP. Since that should make much better use of the CPU's power (the number of iterations is greater than the number of cores), I expected even better results than parallelizing only the outer loop with 20 threads. And once again, I got a time 5 times worse (((.
Thanks for your help, but it seems there is just no solution.
BTW - there are configuration options for optimizing Intel Xeon Platinum CPUs in the BIOS that I wasn't aware of. Worth noting for the future if someone gets stuck like me, though they didn't help.
I will copy his answer here:
"The Core i9-7960x processor has 16 physical cores in one chip. It is capable of supporting HyperThreading, which if enabled would give the appearance of 32 cores.
The Xeon Platinum 8168 processor has 24 physical cores in one chip. It is capable of supporting HyperThreading, which if enabled would give the appearance of 48 cores. The Xeon Platinum 8168 supports systems with up to 8 chips, so a system that reports 96 Logical Processors is either a 2-socket system with HyperThreading enabled, or a 4-socket system with HyperThreading disabled.
The easiest way to get ~1/2 performance is to run one thread per physical core on the Core i9-7960x, but to run two threads per physical core on the Xeon Platinum 8168.
The binding of threads to cores is difficult to control in a completely system-independent fashion. If you want to compare single-socket performance, you will need to limit the code to running on a single socket of the Xeon Platinum 8168. In Linux systems, this is typically done with a command like:
numactl --membind=0 --cpunodebind=0 a.out
This forces all memory and processors to be allocated from socket "0". You should then set OMP_NUM_THREADS to the number of threads you want to test, and (MOST IMPORTANT) set OMP_PROC_BIND=spread. The "spread" option will cause the OpenMP runtime to distribute the threads as far apart as possible. Assuming that the Xeon Platinum 8168 system has HyperThreading enabled, the "spread" option will cause one thread to be bound to each *physical* core until all physical cores have been used. For this single socket case you will be able to use up to 24 threads without "doubling up" on any physical core in socket 0.
Some OpenMP jobs will run better if the threads are spread uniformly across the sockets. Again assuming that the Xeon Platinum 8168 system has two sockets and HyperThreading enabled, you simply set OMP_PROC_BIND=spread, and then set OMP_NUM_THREADS to the desired number of threads. Since there are two sockets, you probably want to set the number of threads to an even value. The "spread" option will place 1/2 of the threads on separate physical cores in the first socket and will place the other half of the threads on separate physical cores in the second socket.
"Dr. Bandwidth"
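The environment settings Dr. Bandwidth describes can be sketched as follows (illustrative values for a 2-socket, HyperThreaded Xeon Platinum 8168 system on Linux; adjust the thread count and node numbers for your machine):

```shell
# Run on socket 0 only, one thread per physical core (24 cores per socket):
export OMP_NUM_THREADS=24
export OMP_PROC_BIND=spread    # spread threads across physical cores first
export OMP_PLACES=cores
numactl --membind=0 --cpunodebind=0 ./a.out

# Or spread an even number of threads uniformly across both sockets:
export OMP_NUM_THREADS=48
export OMP_PROC_BIND=spread
./a.out
```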
Do you have Intel's VTune? If so, I suggest you learn how to use it. While it may be the case that your program is incapable of threading to 96 threads, it is often the case that a simplified approach is not the best way to parallelize your application. Have you considered seeking professional help?
Jim Dempsey