Software Archive
Read-only legacy content

Xeon Phi Performance / Energy Tradeoff Issues

Rishad_S_
Beginner

Hi

I have been doing experiments on Xeon Phi and ran a fibonacci(40) application with a varying number of core allocations (i.e., number of cores). Energy consumption was measured through /sys/class/micras/power, and performance (i.e., execution time) was measured as elapsed wall-clock time.
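
For reference, below is a rough sketch of how such a power-sampling loop could look. This is not my exact measurement code; it assumes the first whitespace-separated field of /sys/class/micras/power is total power in microwatts (the field layout may differ between MPSS versions) and it simply integrates the sampled power over time:

------------

/* Sketch: sample /sys/class/micras/power periodically and integrate to
 * approximate energy in Joules. Assumes the first field is total power
 * in microwatts -- check the field layout on your MPSS version. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

static double read_power_watts(void)
{
  FILE *f = fopen("/sys/class/micras/power", "r");
  long uw = 0;
  if (f) {
    if (fscanf(f, "%ld", &uw) != 1)   /* first field only (assumption) */
      uw = 0;
    fclose(f);
  }
  return uw / 1e6;                    /* microwatts -> watts */
}

static double now_ms(void)
{
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
  const double period_ms = 50.0;      /* sampling period */
  double t0 = now_ms(), energy_j = 0.0;
  /* Sample for a fixed window here as an illustration; in practice the
   * loop runs for as long as the workload does. */
  for (int i = 0; i < 200; ++i) {
    energy_j += read_power_watts() * (period_ms / 1000.0);
    usleep((useconds_t)(period_ms * 1000));
  }
  printf("elapsed = %.1f ms, energy ~= %.2f J\n", now_ms() - t0, energy_j);
  return 0;
}

------------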

I get the following trade-off:

Cores    Execution_time (ms)    Energy (J)
1        142450.897             11123.35
2        312780.938             24676.25
4        172721.295             13849
8        104500.9               8527.75
16       60941.152              5176.8
24       43881.031              3854.75
32       31231.054              2828.1
40       25801.364              2450.9
48       23231.324              2267.2
56       19880.978              1997.7
61       17380.948              1784.5

As you can see, there is a sudden jump in execution time when the core allocation goes from 1 to 2, which also causes the energy to be much higher. The other core allocations look normal to me, but I cannot explain this trade-off when the allocation increases from 1 to 2.

Here is the application if you'd like to try it out:

------------

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
  int i, j;
  if (n < 2)
    return n;
  else
    {
      //omp_set_num_threads(NUM_CPUS);

      #pragma omp task shared(i) firstprivate(n)
      i = fib(n - 1);

      #pragma omp task shared(j) firstprivate(n)
      j = fib(n - 2);

      #pragma omp taskwait
      return i + j;
    }
}

int main()
{
  int n = 40;
  omp_set_dynamic(1);
  #pragma omp parallel shared(n)
  {
    #pragma omp single
    printf("\033[37;1mfib(%d) = %d\033[0m\n", n, fib(n));
    #pragma omp single
    printf("CPUs in Parallel = %d\n", omp_get_num_threads());
  }
  return 0;
}

------------

10 Replies
jimdempseyatthecove
Honored Contributor III

How are you defining your KMP_AFFINITY?

Was the one core run made with OpenMP disabled, or with OpenMP enabled and one thread specified for the parallel region?

Jim Dempsey

Rishad_S_
Beginner

Hi Jim Dempsey

Thank you for your response. KMP_AFFINITY was set to compact, and all core allocations were run with the OpenMP library, setting the number of threads via OMP_NUM_THREADS (from 1 to 61).

Thanks

Rishad

TimP
Honored Contributor III

So you are using just 1 core up to 4 threads?

Frances_R_Intel
Employee

So when you say you are using 1 core, you mean you are using OMP_NUM_THREADS=1, and when you say you are using 2 cores, you mean OMP_NUM_THREADS=2? If you are setting KMP_AFFINITY to compact, I don't think you're doing what you think you're doing.

According to the C++ user guide, setting KMP_AFFINITY to compact means "sequentially distribute the threads among the cores that share the same cache." Instead of OMP_NUM_THREADS, you should probably use KMP_PLACE_THREADS. Again from the C++ reference guide, "The KMP_PLACE_THREADS variable controls the hardware threads that will be used by the program. This variable specifies the number of cores to use and how many threads to assign per core." Without this variable and using a compact affinity, the operating system will try to use all 4 threads on each core, keeping the first 4 OMP threads on the first core, the second 4 OMP threads on the second core and so on. By the way, you shouldn't explicitly set OMP_NUM_THREADS when you use KMP_PLACE_THREADS; let the operating system figure out how many OMP threads it needs.
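
If you want to check where your threads actually end up, a quick sketch like the one below will print the logical CPU each OpenMP thread runs on (Linux-specific sched_getcpu(); compile with the OpenMP flag and run it under the same KMP_AFFINITY / KMP_PLACE_THREADS settings as your benchmark):

------------

/* Placement check (sketch): print which logical CPU each OpenMP thread
 * runs on. With compact affinity, threads sharing a physical core tend to
 * show up as runs of adjacent logical CPU numbers (numbering details vary). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
  #pragma omp parallel
  {
    #pragma omp critical
    printf("OMP thread %2d of %2d on logical CPU %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}

------------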

As for the jump when you go from 1 thread to 2 threads: 1 thread is a special case. You never need to wait when you write to i and j, and you never get stuck waiting at the '#pragma omp taskwait'. The rest of the numbers, from 2 threads to 61 threads, follow a nice curve downward, flattening out when you have more threads than work - above 40 threads, you have some threads whose only job is to return a 1 or a 0.

Loc_N_Intel
Employee

FYI, you can refer to this article ( https://software.intel.com/en-us/articles/best-known-methods-for-using-openmp-on-intel-many-integrated-core-intel-mic-architecture ) to set the variable KMP_PLACE_THREADS accordingly. In your case, you can set KMP_PLACE_THREADS=60c,1t to run on 60 cores with one thread per core.

jimdempseyatthecove
Honored Contributor III

Or you can use KMP_AFFINITY=scatter

It appears that the OpenMP implementation is optimized for the 1-thread case, whereby none of the atomic (interlocked) instructions or critical sections are used. Consider not using the 1-thread run data when calculating scaling versus power.

If you were going to run code with 1 thread, you would certainly run it on the host.

Jim Dempsey

TimP
Honored Contributor III
If you set KMP_PLACE_THREADS=60 or less, you will get 1 thread per core up to that number. The micsmc GUI is useful for seeing how your work is distributed across the cores. By avoiding SIMD, your test will ensure uncompetitive performance.
Rishad_S_
Beginner

Thank you for your responses.

I have now used KMP_AFFINITY=scatter,verbose and KMP_PLACE_THREADS=Nc,1t (where N is a variable). I have run the experiments again to check the trade-offs -- this time with the CPU frequency fixed at 0.6 GHz. These are the results I got:

Cores    Execution_time (ms)    Energy (J)
1        284972.988             16422.1
2        604034.468             35007.55
4        337416.392             19764.45
8        205400.781             11764.8
16       116997.936             6944.8
24       81933.451              4964.6
32       64192.39               3964.25
40       49451.153              3125.35
48       40838.395              2628.85
56       37278.482              2422.7
61       33192.005              2192.7

As you can see, the execution time with 1 core is about the same as with 4+ cores, and the energy is also lower than with 4+ cores. As Frances pointed out, this might be a thread synchronization issue specific to this application. I will see if I can get different trade-offs using other applications.

jimdempseyatthecove
Honored Contributor III

With a perfect system, you'd expect the energy to be flat. It's not.

Real systems and algorithms won't produce a flat energy curve. For example:

If the single-threaded version of the code is completely optimized with respect to the cache and memory systems, then adding threads might increase the energy consumed.

Looking at the execution times of 1 thread versus 2 threads, you see that the 2-thread run takes a bit more than 2x longer. This will occur under the following circumstances:

a) The amount of work per parallel region is too small (too fine-grained), and

b) The work per thread per parallel region is not equal. This results in the threads with lesser work having nothing to do but spin-wait waiting for the thread with the most work to finish the parallel region.

Fortunately, in most cases, both a) and b) can be corrected with programming changes.
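
As an illustration of point a): a common fix is to stop spawning tasks below a cutoff, so each task carries enough work to pay for its scheduling overhead. A sketch, meant as a drop-in replacement for the fib() above (the cutoff value of 20 is an arbitrary starting point to tune):

------------

/* Task cutoff sketch: below CUTOFF the recursion runs serially, keeping
 * the tasks coarse enough to be worth their overhead. CUTOFF = 20 is an
 * arbitrary value -- tune it for your system. */
#define CUTOFF 20

int fib_serial(int n)
{
  return (n < 2) ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

int fib(int n)
{
  int i, j;
  if (n < CUTOFF)
    return fib_serial(n);

  #pragma omp task shared(i) firstprivate(n)
  i = fib(n - 1);

  #pragma omp task shared(j) firstprivate(n)
  j = fib(n - 2);

  #pragma omp taskwait
  return i + j;
}

------------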

The run times between 1 core and 2 cores are strongly indicative of inefficient use of OpenMP.

This is not meant as an insult to your programming in general; rather, it means you have some things to learn about how to use OpenMP with your program.

Looking at your data (1 core vs. 2 cores): in a perfectly scaling world, 2 cores would have finished in half the time, say, for the sake of argument, an execution time on the order of 150000 ms. You are observing about 600000 ms, or 4x longer than expected. What this means to me is that you have some "low hanging fruit" (easy pickings) among your optimization opportunities. It will be relatively easy for you to identify these opportunities; you just have to learn how.

The Fibonacci program is a great example of nested parallelism. However, it is the absolute worst way to solve this problem: a simple single-threaded iterative method is much faster, and there are closed-form expressions that produce the result without any iteration at all.
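
For what it is worth, the iterative version is just a handful of lines (a sketch):

------------

/* Single-threaded iterative Fibonacci: O(n) additions, no recursion and
 * no tasking; for n = 40 this finishes almost instantly. */
long long fib_iter(int n)
{
  long long a = 0, b = 1;            /* fib(0), fib(1) */
  for (int k = 0; k < n; ++k) {
    long long next = a + b;
    a = b;
    b = next;
  }
  return a;                          /* a == fib(n) */
}

------------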

Jim Dempsey

Rishad_S_
Beginner

Hi Jim Dempsey

I do not get involved in application programming; I am more interested in developing application-independent runtime management approaches (which include controlling DVFS and DPM features). What you mentioned about Fibonacci is spot on -- I have been running other (more representative) benchmark applications too, and I see that the trend with core scaling is as expected:

(Results from the NAS Parallel Benchmarks, BT, class C)

Cores    Execution_time (ms)    Energy (J)
1        18235099.351           1197526.1
2        8504594.628            563522.25
4        4575615.503            308273.4
8        2377755.737            165943
16       1240535.059            90906.3
24       860205.422             65783.35
32       616474.848             49306.8
40       499815.346             41564
48       465785.544             39947.9
56       366635.297             32812.45
61       367135.344             33231.15

Thank you for all your help.
