Software Archive
Read-only legacy content

Xeon Phi Performance / Energy Tradeoff Issues

Rishad_S_
Beginner

Hi

I have been doing experiments on Xeon Phi and ran a fibonacci(40) application with a varying number of core allocations (i.e., number of cores). Energy consumption was measured through /sys/class/micras/power, and performance (i.e., execution time) was measured as elapsed wall-clock time.
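
For reference, below is a rough sketch of how such a power-sampling loop could look. This is not my exact measurement code; it assumes the first whitespace-separated field of /sys/class/micras/power is total power in microwatts (the field layout may differ between MPSS versions) and it simply integrates the sampled power over time:

------------

/* Sketch: sample /sys/class/micras/power periodically and integrate to
 * approximate energy in Joules. Assumes the first field is total power
 * in microwatts -- check the field layout on your MPSS version. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

static double read_power_watts(void)
{
  FILE *f = fopen("/sys/class/micras/power", "r");
  long uw = 0;
  if (f) {
    if (fscanf(f, "%ld", &uw) != 1)   /* first field only (assumption) */
      uw = 0;
    fclose(f);
  }
  return uw / 1e6;                    /* microwatts -> watts */
}

static double now_ms(void)
{
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
  const double period_ms = 50.0;      /* sampling period */
  double t0 = now_ms(), energy_j = 0.0;
  /* Sample for a fixed window here as an illustration; in practice the
   * loop runs for as long as the workload does. */
  for (int i = 0; i < 200; ++i) {
    energy_j += read_power_watts() * (period_ms / 1000.0);
    usleep((useconds_t)(period_ms * 1000));
  }
  printf("elapsed = %.1f ms, energy ~= %.2f J\n", now_ms() - t0, energy_j);
  return 0;
}

------------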

I get the following trade-off:

Cores    Execution_time (ms)    Energy (J)
1        142450.897             11123.35
2        312780.938             24676.25
4        172721.295             13849
8        104500.9               8527.75
16       60941.152              5176.8
24       43881.031              3854.75
32       31231.054              2828.1
40       25801.364              2450.9
48       23231.324              2267.2
56       19880.978              1997.7
61       17380.948              1784.5

As you can see, there is a sudden jump in execution time when the core allocation goes from 1 to 2, which also causes the energy to be much higher. The other core allocations look normal to me, but I cannot explain this trade-off when the allocation increases from 1 to 2.

Here is the application if you'd like to try it out:

------------

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
  int i, j;
  if (n < 2)
    return n;
  else
    {
      //omp_set_num_threads(NUM_CPUS);

      #pragma omp task shared(i) firstprivate(n)
      i = fib(n - 1);

      #pragma omp task shared(j) firstprivate(n)
      j = fib(n - 2);

      #pragma omp taskwait
      return i + j;
    }
}

int main()
{
  int n = 40;
  omp_set_dynamic(1);
  #pragma omp parallel shared(n)
  {
    #pragma omp single
    printf("\033[37;1mfib(%d) = %d\033[0m\n", n, fib(n));
    #pragma omp single
    printf("CPUs in Parallel = %d\n", omp_get_num_threads());
  }
  return 0;
}

------------

10 Replies
jimdempseyatthecove
Honored Contributor III

How are you defining your KMP_AFFINITY?

Was the one core run made with OpenMP disabled, or with OpenMP enabled and one thread specified for the parallel region?

Jim Dempsey

Rishad_S_
Beginner

Hi Jim Dempsey

Thank you for your response. KMP_AFFINITY was set to compact, and all core allocations were run with the OpenMP library, setting the number of threads via OMP_NUM_THREADS (from 1 to 61).

Thanks

Rishad

TimP
Honored Contributor III

So you are using just 1 core up to 4 threads?

Frances_R_Intel
Employee

So when you say you are using 1 core, you mean you are using OMP_NUM_THREADS=1, and when you say you are using 2 cores, you mean OMP_NUM_THREADS=2? If you are setting KMP_AFFINITY to compact, I don't think you're doing what you think you're doing.

According to the C++ user guide, setting KMP_AFFINITY to compact means "sequentially distribute the threads among the cores that share the same cache." Instead of OMP_NUM_THREADS, you should probably use KMP_PLACE_THREADS. Again from the C++ reference guide, "The KMP_PLACE_THREADS variable controls the hardware threads that will be used by the program. This variable specifies the number of cores to use and how many threads to assign per core." Without this variable and using a compact affinity, the operating system will try to use all 4 threads on each core, keeping the first 4 OMP threads on the first core, the second 4 OMP threads on the second core and so on. By the way, you shouldn't explicitly set OMP_NUM_THREADS when you use KMP_PLACE_THREADS; let the operating system figure out how many OMP threads it needs.
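
If you want to check where your threads actually end up, a quick sketch like the one below will print the logical CPU each OpenMP thread runs on (Linux-specific sched_getcpu(); compile with the OpenMP flag and run it under the same KMP_AFFINITY / KMP_PLACE_THREADS settings as your benchmark):

------------

/* Placement check (sketch): print which logical CPU each OpenMP thread
 * runs on. With compact affinity, threads sharing a physical core tend to
 * show up as runs of adjacent logical CPU numbers (numbering details vary). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
  #pragma omp parallel
  {
    #pragma omp critical
    printf("OMP thread %2d of %2d on logical CPU %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}

------------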

As for the jump when you go from 1 thread to 2 threads: 1 thread is a special case. You never need to wait when you write to i and j, and you never get stuck waiting at the '#pragma omp taskwait'. The rest of the numbers, from 2 threads to 61 threads, follow a nice curve downward, flattening out when you have more threads than work - above 40 threads, you have some threads whose only job is to return a 1 or a 0.

Loc_N_Intel
Employee

FYI, you can refer to this article ( https://software.intel.com/en-us/articles/best-known-methods-for-using-openmp-on-intel-many-integrated-core-intel-mic-architecture ) to set the variable KMP_PLACE_THREADS accordingly. In your case, you can set KMP_PLACE_THREADS=60c,1t to run on 60 cores with one thread per core.

jimdempseyatthecove
Honored Contributor III

Or you can use KMP_AFFINITY=scatter

It appears that the OpenMP implementation is optimized for the 1-thread case, whereby none of the atomic (interlocked) instructions or critical sections are used. Consider not using the 1-thread run data when calculating scaling versus power.

If you were going to run code with 1 thread, you would certainly run it on the host.

Jim Dempsey

TimP
Honored Contributor III
If you set KMP_PLACE_THREADS=60 or less, you will get 1 thread per core up to that number. The micsmc GUI is useful for seeing how your work is distributed across the cores. By avoiding SIMD, your test will ensure uncompetitive performance.
Rishad_S_
Beginner

Thank you for your responses.

I have now used KMP_AFFINITY=scatter,verbose and KMP_PLACE_THREADS=Nc,1t (where N is a variable). I have run the experiments again to check the trade-offs -- this time with the CPU frequency fixed at 0.6 GHz. These are the results I got:

Cores    Execution_time (ms)    Energy (J)
1        284972.988             16422.1
2        604034.468             35007.55
4        337416.392             19764.45
8        205400.781             11764.8
16       116997.936             6944.8
24       81933.451              4964.6
32       64192.39               3964.25
40       49451.153              3125.35
48       40838.395              2628.85
56       37278.482              2422.7
61       33192.005              2192.7

As you can see, the execution time with 1 core is about the same as with 4+ cores, and the energy is also lower than with 4+ cores. As Frances pointed out, this might be a thread synchronization issue specific to this application. I will see if I can get different trade-offs using other applications.

jimdempseyatthecove
Honored Contributor III

With a perfect system, you'd expect the energy to be flat. It's not.

Real systems and algorithms won't produce a flat energy curve. For example:

If the single-threaded version of the code is completely optimized with respect to the cache and memory systems, then adding threads might increase the energy consumed.

Looking at the execution times of 1 thread versus 2 threads, you see that the 2-thread run takes a bit more than 2x longer. This will occur under the following circumstances:

a) The amount of work per parallel region is too small (too fine-grained), and

b) The work per thread per parallel region is not equal. This results in the threads with lesser work having nothing to do but spin-wait waiting for the thread with the most work to finish the parallel region.

Fortunately, in most cases, both a) and b) can be corrected with programming changes.
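
As an illustration of point a): a common fix is to stop spawning tasks below a cutoff, so each task carries enough work to pay for its scheduling overhead. A sketch, meant as a drop-in replacement for the fib() above (the cutoff value of 20 is an arbitrary starting point to tune):

------------

/* Task cutoff sketch: below CUTOFF the recursion runs serially, keeping
 * the tasks coarse enough to be worth their overhead. CUTOFF = 20 is an
 * arbitrary value -- tune it for your system. */
#define CUTOFF 20

int fib_serial(int n)
{
  return (n < 2) ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

int fib(int n)
{
  int i, j;
  if (n < CUTOFF)
    return fib_serial(n);

  #pragma omp task shared(i) firstprivate(n)
  i = fib(n - 1);

  #pragma omp task shared(j) firstprivate(n)
  j = fib(n - 2);

  #pragma omp taskwait
  return i + j;
}

------------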

The run times between 1 core and 2 cores are strongly indicative of inefficient use of OpenMP.

This is not meant as an insult to your programming in general; rather, it means you have some things to learn about how to use OpenMP with your program.

Looking at your data (1 core vs. 2 cores): in a perfectly scaling world, 2 cores would have finished in half the time, say, for the sake of argument, an execution time on the order of 150000 ms. You are observing about 600000 ms, or 4x longer than expected. What this means to me is that you have some "low hanging fruit" (easy pickings) among your optimization opportunities. It will be relatively easy for you to identify these opportunities; you just have to learn how.

The Fibonacci program is a great example of nested parallelism. However, it is the absolute worst way to solve this problem: a simple single-threaded iterative method is much faster, and there are closed-form expressions that produce the result without any iteration at all.
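
For what it is worth, the iterative version is just a handful of lines (a sketch):

------------

/* Single-threaded iterative Fibonacci: O(n) additions, no recursion and
 * no tasking; for n = 40 this finishes almost instantly. */
long long fib_iter(int n)
{
  long long a = 0, b = 1;            /* fib(0), fib(1) */
  for (int k = 0; k < n; ++k) {
    long long next = a + b;
    a = b;
    b = next;
  }
  return a;                          /* a == fib(n) */
}

------------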

Jim Dempsey

Rishad_S_
Beginner

Hi Jim Dempsey

I do not get involved in application programming; I am more interested in developing application-independent runtime management approaches (which include controlling DVFS and DPM features). What you mentioned about Fibonacci is spot on -- I have been running other (more representative) benchmark applications too, and I see that the trend with core scaling is as expected:

(Results from the NAS Parallel Benchmarks, BT, class C)

Cores    Execution_time (ms)    Energy (J)
1        18235099.351           1197526.1
2        8504594.628            563522.25
4        4575615.503            308273.4
8        2377755.737            165943
16       1240535.059            90906.3
24       860205.422             65783.35
32       616474.848             49306.8
40       499815.346             41564
48       465785.544             39947.9
56       366635.297             32812.45
61       367135.344             33231.15

Thank you for all your help.
