Hi
I have been doing experiments on Xeon Phi and ran a fibonacci(40) application with a varying number of allocated cores. Energy consumption was measured through /sys/class/micras/power, and performance (i.e. execution time) was measured as elapsed time.
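(For reference, here is a minimal sketch of one way such a measurement could be taken; the placeholder workload() and the unparsed dump of the power file are assumptions for illustration, not the code actually used.)
------------
#include <stdio.h>
#include <omp.h>

/* Dump the raw contents of the MIC power sysfs node; the field layout is not
   parsed here -- interpret it according to the micras documentation. */
static void dump_power(const char *tag)
{
    FILE *f = fopen("/sys/class/micras/power", "r");
    if (!f) { perror("/sys/class/micras/power"); return; }
    char line[256];
    printf("-- power (%s) --\n", tag);
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
}

/* Placeholder workload -- substitute the real benchmark kernel here. */
static void workload(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;
    (void)x;
}

int main(void)
{
    dump_power("before");
    double t0 = omp_get_wtime();
    workload();
    double t1 = omp_get_wtime();
    dump_power("after");
    printf("elapsed: %.3f ms\n", (t1 - t0) * 1e3);
    return 0;
}
------------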
I get the following trade-off:
Cores Execution_time(ms) Energy(J)
1 142450.897 11123.35
2 312780.938 24676.25
4 172721.295 13849
8 104500.9 8527.75
16 60941.152 5176.8
24 43881.031 3854.75
32 31231.054 2828.1
40 25801.364 2450.9
48 23231.324 2267.2
56 19880.978 1997.7
61 17380.948 1784.5
As you can see, there is a sudden jump in execution time when the core allocation goes from 1 to 2, which also makes the energy much higher. The other core allocations look normal to me, but I cannot explain this trade-off when the allocation increases from 1 to 2.
Here is the application if you'd like to try it out:
------------
#include <stdio.h>
#include <omp.h>

/* Recursive Fibonacci: each call spawns two OpenMP tasks. */
int fib(int n)
{
    int i, j;
    if (n < 2)
        return n;
    else
    {
        #pragma omp task shared(i) firstprivate(n)
        i = fib(n - 1);
        #pragma omp task shared(j) firstprivate(n)
        j = fib(n - 2);
        #pragma omp taskwait   /* wait for both child tasks */
        return i + j;
    }
}

int main()
{
    int n = 40;
    omp_set_dynamic(1);
    #pragma omp parallel shared(n)
    {
        #pragma omp single
        printf("\033[37;1mfib(%d) = %d\033[0m\n", n, fib(n));
        #pragma omp single
        printf("CPUs in Parallel = %d\n", omp_get_num_threads());
    }
    return 0;
}
------------
So, when you say you are using 1 core, do you mean OMP_NUM_THREADS=1, and when you say you are using 2 cores, OMP_NUM_THREADS=2? If so, with KMP_AFFINITY set to compact, I don't think you're doing what you think you're doing.
According to the C++ user guide, setting KMP_AFFINITY to compact means "sequentially distribute the threads among the cores that share the same cache." Instead of OMP_NUM_THREADS, you should probably use KMP_PLACE_THREADS. Again from the C++ reference guide, "The KMP_PLACE_THREADS variable controls the hardware threads that will be used by the program. This variable specifies the number of cores to use and how many threads to assign per core." Without this variable and using a compact affinity, the operating system will try to use all 4 threads on each core, keeping the first 4 OMP threads on the first core, the second 4 OMP threads on the second core and so on. By the way, you shouldn't explicitly set OMP_NUM_THREADS when you use KMP_PLACE_THREADS; let the operating system figure out how many OMP threads it needs.
As for the jump when you go from 1 thread to 2 threads: 1 is a special case. You never need to wait when you write to i and j, and you never get stuck waiting at the '#pragma omp taskwait'. The rest of the numbers, from 2 threads to 61 threads, follow a nice curve downward, flattening out when you have more threads than work -- above 40 threads, you have some threads whose only job is to return a 1 or a 0.
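A quick way to check where the OpenMP threads actually land is a small diagnostic like the one below (a sketch, assuming a Linux/glibc target where sched_getcpu() is available); running it under the same KMP_AFFINITY / KMP_PLACE_THREADS settings as the benchmark shows whether the threads share a core or are spread across cores.
------------
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu(), a glibc extension */
#include <omp.h>

int main(void)
{
    /* Each OpenMP thread reports which hardware CPU it is running on,
       so the effect of the affinity settings can be verified. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("OMP thread %d of %d on CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
------------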
How are you defining your KMP_AFFINITY?
Was the one core run made with OpenMP disabled, or with OpenMP enabled and one thread specified for the parallel region?
Jim Dempsey
Hi Jim Dempsey
Thank you for your response. KMP_AFFINITY was set to compact, and all core allocations were run with the OpenMP library, setting the number of threads via OMP_NUM_THREADS (from 1 to 61).
Thanks
Rishad
So you are using just 1 core up to 4 threads?
FYI, you can refer to this article ( https://software.intel.com/en-us/articles/best-known-methods-for-using-openmp-on-intel-many-integrated-core-intel-mic-architecture ) to set the variable KMP_PLACE_THREADS accordingly. In your case, you can set KMP_PLACE_THREADS=60c,1t to run on 60 cores with one thread per core.
Or you can use KMP_AFFINITY=scatter
It appears that the OpenMP implementation is optimized for the 1-thread case, whereby none of the atomic (interlocked) instructions or critical sections are used. Consider not using the 1-thread run data when calculating scaling versus power.
If you were going to run code with 1 thread, you would certainly run it on the host.
Jim Dempsey
Thank you for your responses.
I have now used KMP_AFFINITY=scatter,verbose and KMP_PLACE_THREADS=Nc,1t (where N is a variable). I have run the experiments to check the trade-offs again -- this time with the CPU frequency fixed at 0.6 GHz. These are the results I got:
Cores Execution_time(ms) Energy(Joules)
1 284972.988 16422.1
2 604034.468 35007.55
4 337416.392 19764.45
8 205400.781 11764.8
16 116997.936 6944.8
24 81933.451 4964.6
32 64192.39 3964.25
40 49451.153 3125.35
48 40838.395 2628.85
56 37278.482 2422.7
61 33192.005 2192.7
As you can see, the execution time with 1 core is about the same as with 4+ cores, and the energy is also lower than with 4+ cores. As Frances pointed out, this might be a thread synchronization issue specific to this application. I will see if I can get different trade-offs using other applications.
With a perfect system, you'd expect the energy to be flat. It's not.
Real systems and algorithms won't produce a flat energy curve. For example:
If the single-thread version of the code is completely optimized with respect to the cache and memory systems, then adding additional threads might increase the energy consumed.
Looking at the execution times of 1 thread versus 2 threads, you see it is taking a bit more than 2x longer. This will occur under the following circumstances:
a) The amount of work per parallel region is too small (too fine grained), and
b) The work per thread per parallel region is not equal. This results in the threads with lesser work having nothing to do but spin-wait waiting for the thread with the most work to finish the parallel region.
Fortunately, in most cases, both a) and b) can be corrected with programming changes; one common approach to a) is sketched below.
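This sketch, not taken from this thread, stops spawning tasks below a cutoff so that each task carries enough work (the cutoff value of 20 is an arbitrary assumption to be tuned for the target machine):
------------
#include <stdio.h>
#include <omp.h>

/* Serial version used once subproblems are too small to be worth a task. */
static int fib_serial(int n)
{
    return (n < 2) ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

#define CUTOFF 20   /* assumed value; tune for the target machine */

/* Only spawn tasks while n is large; below CUTOFF, fall back to serial code. */
static int fib_task(int n)
{
    int i, j;
    if (n < CUTOFF)
        return fib_serial(n);
    #pragma omp task shared(i) firstprivate(n)
    i = fib_task(n - 1);
    #pragma omp task shared(j) firstprivate(n)
    j = fib_task(n - 2);
    #pragma omp taskwait
    return i + j;
}

int main(void)
{
    int result;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(40);
    }
    printf("fib(40) = %d\n", result);
    return 0;
}
------------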
The run times between 1 core and 2 cores are strongly indicative of inefficient use of OpenMP.
This is not meant as an insult to your programming in general; rather, it means you have some things to learn about how to use OpenMP in your program.
Looking at your data (1 core vs. 2 cores): in a perfect scaling world, 2 cores would have finished in half the time, say, for the sake of argument, on the order of 150000 ms. You are observing 600000 ms, or 4x longer than expected. What this means to me is that you have some "low hanging fruit" (easy pickings) among your optimization opportunities, and they will be relatively easy for you to identify. You just have to learn how.
The Fibonacci program is a great example of nested parallelism. However, it is the absolute worst way to solve this problem. A simple single-thread iterative method is much faster, and there are closed-form expressions that produce the result without iteration.
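For comparison, a minimal single-thread iterative version (again just a sketch, not code from this thread):
------------
#include <stdio.h>

/* Iterative Fibonacci: O(n) additions, no recursion and no tasking overhead.
   fib(40) fits in 32 bits, but a wider type is used for safety with larger n. */
static long long fib_iter(int n)
{
    long long a = 0, b = 1;
    for (int i = 0; i < n; i++) {
        long long next = a + b;
        a = b;
        b = next;
    }
    return a;
}

int main(void)
{
    printf("fib(40) = %lld\n", fib_iter(40));
    return 0;
}
------------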
Jim Dempsey
Hi Jim Dempsey
I do not get involved in application programming; I am more interested in developing application-independent runtime management approaches (including controlling DVFS and DPM features). What you mentioned about Fibonacci is spot on -- I have been running other (more representative) benchmark applications too, and I see that the trend with core scaling is as expected:
(Results from NAS parallel benchmark - bt - class C)
Cores Execution_time(ms) Energy(Joules)
1 18235099.351 1197526.1
2 8504594.628 563522.25
4 4575615.503 308273.4
8 2377755.737 165943
16 1240535.059 90906.3
24 860205.422 65783.35
32 616474.848 49306.8
40 499815.346 41564
48 465785.544 39947.9
56 366635.297 32812.45
61 367135.344 33231.15
Thank you for all your help.
