Hi Andreas,
I can answer a few of your questions.
1) installing an OpenMP-aware compiler
OpenMP is an organization (and specification) that defines a set of compiler directives: #pragmas for C/C++ and structured comments for Fortran. So if you are not using an OpenMP-aware compiler, you are, by definition, not using OpenMP.
My OpenMP environment is Windows XP Pro and Intel Visual Fortran. The systems I run on are two Intel HT-only systems and one dual-processor/4-core AMD Opteron system. Initially I had problems on the HT-only systems, where I observed a decrease in performance. Now I see up to 30% improvement on the HT-only systems. On the 4-core Opteron I see about 300% over a single core. Some sections of code experience ~400% improvement, some no improvement.
2) How can I verify that my system indeed supports OpenMP after I have installed such a compiler?
In your test program:
#pragma omp parallel sections num_threads(4)
{
    #pragma omp section
    f(0);
    #pragma omp section
    f(1);
    #pragma omp section
    f(2);
    #pragma omp section
    f(3);
}
Have the function f print its entry argument and the OpenMP thread number.
printf should be thread-safe, but if you have problems with multiple threads using printf, then use the entry argument of f(int i) as an index into a global table (int fTable[4];) and store the thread number into the array:
fTable[i] = omp_get_thread_num();
Then, after the close of the test section, print out the contents of fTable.
If you do not see four different threads then your code might not be running with OpenMP.
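Putting that together, here is a minimal sketch of the whole test, assuming the fTable approach above (the file name and loop are just illustrative, not your code):

/* omptest.c -- minimal sketch of the test described above;
 * compile with OpenMP enabled (e.g. -openmp on the Intel compiler). */
#include <stdio.h>
#include <omp.h>

int fTable[4];                          /* one slot per section */

void f(int i)
{
    fTable[i] = omp_get_thread_num();   /* record which thread ran us */
}

int main(void)
{
    int i;
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        f(0);
        #pragma omp section
        f(1);
        #pragma omp section
        f(2);
        #pragma omp section
        f(3);
    }
    for (i = 0; i < 4; i++)
        printf("section %d ran on OpenMP thread %d\n", i, fTable[i]);
    return 0;
}

If the four printed thread numbers are all 0, the sections ran serially.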
You do realize that OpenMP-capable compilers ship with no-op stub routines that can be linked into the program in lieu of the OpenMP runtime system. Have you verified that you are using the OpenMP runtime library instead of the no-op stubs?
3) On an OpenMP-aware compiler, is it enough to use the #pragmas or do I also need to add the -openmp option? If I get a message such as "OpenMP DEFINED SECTION WAS PARALLELIZED" during compilation, can I rest assured that the program will execute using multiple threads?
Get your test example working with multiple threads while using -openmp. Then try without -openmp. As I do not use Intel C++, I cannot try this for you. There may be an issue as to what is pulled from the runtime library, i.e. without -openmp the no-op stubs get linked in. It may be that -openmp is required on main. You can experiment either way.
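One portable check here, assuming nothing beyond the OpenMP specification itself: a conforming compiler defines the _OPENMP macro when OpenMP compilation is enabled, so a snippet like this tells you whether the flag was in effect:

#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    /* _OPENMP holds the spec release date as an integer (yyyymm) */
    printf("Compiled with OpenMP support (_OPENMP = %d)\n", _OPENMP);
#else
    printf("Compiled WITHOUT OpenMP support\n");
#endif
    return 0;
}

Build it twice, with and without -openmp, and compare the output.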
4) --- I cannot help you here
5) --- see answer for 2)
6) - threading tools.
As things stand now, it is a case of "Are you running with parallel sections or not".
For this, modify the test case such that your function f() takes 2 seconds to run using 1 processor. Make this a compute-bound test. The program should run 8 seconds with 4 serial calls to f(). If running configured for parallel, it should take much less.
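A sketch of that timing test, assuming a busy-loop spin() standing in for your real f() (volatile keeps the compiler from optimizing the work away; tune the iteration count so one call takes about 2 seconds on your machine):

#include <stdio.h>
#include <omp.h>

/* Stand-in for f(): pure compute, essentially no memory traffic. */
void spin(int n)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 400000000L; i++)
        x += (double)n;
}

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        spin(0);
        #pragma omp section
        spin(1);
        #pragma omp section
        spin(2);
        #pragma omp section
        spin(3);
    }
    printf("4 sections took %.2f s (serial would take ~4x one call)\n",
           omp_get_wtime() - t0);
    return 0;
}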
Good luck
I am quite pleased with OpenMP in Intel Visual Fortran (on Windoz). I am sure you will be too once you get the test configuration working. I think you may have a linking problem (using the no-op stubs).
Jim Dempsey
The fact that you see four thread numbers is a good starting point.
Part of the OpenMP specification requires that no-op stub routines be shipped so that they can be linked into the application for environments where multiple threads are detrimental. Example: you ship your application out in .OBJ (.LIB) format compiled with OpenMP, but it is linked at customer discretion with or without the OpenMP runtime.
If you are not seeing a speedup then either a) not much code is being executed in f(n), b) your application has multiple threads but your operating system is restricting the application to one processor, or c) the code is saturating the memory bandwidth.
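If you suspect case b), the OpenMP runtime can report what it sees; a quick sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Processors the OS makes available vs. threads OpenMP will use.
     * With the no-op stub library linked in, both calls return 1,
     * so this doubles as a check for the wrong-library problem. */
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}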
Your parallel section code should be fine. Make a simple f(n) function that is compute-intensive and uses little memory bandwidth (or has high cache reuse). Note, if f(n) simply populates memory then the test is limited by memory bandwidth, and each processor may be flushing the other processor's cache when performing writes.
Below is a simple compute intensive application written in C that I found on the internet.
Change main(... to your f(...) and appropriately edit the printf calls.
--------- begin ------------
/*--- pi.c PROGRAM RANPI
*
* Program to compute PI by probability.
* By Mark Riordan 24-DEC-1986;
* Original version apparently by Don Shull.
* To be used as a CPU benchmark.
*
* Translated to C from FORTRAN 20 Nov 1993
*/
#include <stdio.h>

void myadd(float *sum, float *addend);

int main(int argc, char *argv[])
{
    float ztot, yran, ymult, ymod, x, y, z, pi, prod;
    long int low, ixran, itot, j, iprod;

    printf("Starting PI...\n");
    ztot = 0.0;
    low = 1;
    ixran = 1907;
    yran = 5813.0;
    ymult = 1307.0;
    ymod = 5471.0;
    itot = 1200000;
    for (j = 1; j <= itot; j++) {
        /*
         * X and Y are two uniform random numbers between 0 and 1.
         * They are computed using two linear congruential generators.
         * A mix of integer and real arithmetic is used to simulate a
         * real program. Magnitudes are kept small to prevent 32-bit
         * integer overflow and to allow full precision even with a
         * 23-bit mantissa.
         */
        iprod = 27611 * ixran;
        ixran = iprod - 74383 * (long int)(iprod / 74383);
        x = (float)ixran / 74383.0;
        prod = ymult * yran;
        yran = (prod - ymod * (long int)(prod / ymod));
        y = yran / ymod;
        z = x * x + y * y;
        myadd(&ztot, &z);
        if (z <= 1.0) {
            low = low + 1;
        }
    }
    printf(" x=%8.5f y=%8.5f low=%7ld j=%7ld\n", x, y, low, j);
    pi = 4.0 * (float)low / (float)itot;
    printf("Pi = %9.6f ztot=%12.2f itot=%8ld\n", pi, ztot, itot);
    return 0;
}

void myadd(float *sum, float *addend)
{
    /*
     * Simple adding subroutine thrown in to allow subroutine
     * calls/returns to be factored in as part of the benchmark.
     */
    *sum = *sum + *addend;
}
-------------------------- end --------------
If you see no speedup then you may be linking in the wrong library, or the kernel is inhibiting the application from using the multiple processors.
Jim Dempsey
Andreas,
Once you resolve the issue of running on multiple processors your next job is to determine how best to divide up your code using OpenMP. There are many ways to do this. The two predominant ways are
1) Parallelize the large inner loops
2) Parallelize the outer master control loop(s)
For the code I work on (Finite Element Analysis of tension structures) I found number 2 works best. This is because I can set up each component to advance its state independently. Then, at the end of component state advancement, I reconcile the component-to-component interaction. What this did for me was to reduce the number of transitions between serial sections and parallel sections. Starting and stopping threads does introduce overhead. The loop size must be large enough to overcome this overhead to yield some payback.
Other applications (yours?) might do best by parallelizing the inner loops.
The best method will depend on the application as well as the size of the data sets handled by the application.
For my application, when using coarse granulation (fewer nodes), method 2) works best. However, as I increase the number of nodes, at some point method 1) will work best. Eventually, I expect this application to include both methods and to determine from the input data set which is best.
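To make method 1) concrete, here is a minimal sketch of inner-loop parallelization on a hypothetical node-update loop (the arrays and the advance_nodes function are placeholders, not your code or mine):

#include <omp.h>

#define NNODES 100000

double force[NNODES], vel[NNODES], pos[NNODES];

/* Method 1: parallelize the large inner loop. Each iteration is
 * independent of the others, so the iteration space is simply
 * split among the available threads. */
void advance_nodes(double dt)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < NNODES; i++) {
        vel[i] += force[i] * dt;   /* assume unit mass for the sketch */
        pos[i] += vel[i] * dt;
    }
}

Method 2) would instead wrap the outer time-step loop in one parallel region, with each thread owning a set of components and the component-to-component reconciliation done in a serial section between steps.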
To help optimize your code you should consider using a profiler. Intel has a product (for sale) called VTune. If you are on a tight budget you might consider joining the AMD Developer Center (free) and looking for a free tool called CodeAnalyst. This profiler is developed for AMD processors; I know you have Intel XEON processors. However, instead of refusing to work on Intel processors, CodeAnalyst simply disables the features it cannot use. What this means is you are left with a functional subset called Time Based Profiling (TBP). With TBP you can locate the bottlenecks in your code, which is usually sufficient to help you tune it. The features that don't work are those called Event Based Profiling (EBP), which requires processor-dependent control register access. This means you won't get reports as to where and why your application is experiencing memory latency problems.
TBP will get you 90% of the way to optimized code; EBP will get you a bit more. I chose CodeAnalyst because my main simulation system uses AMD Opteron processors. My other two development systems use Intel processors. I found running CodeAnalyst on the Intel processors quite satisfactory.
I use the Windows version, but the site has a Linux version too.
Jim Dempsey