Hi Andreas,
I can answer a few of your questions.
1) installing an OpenMP-aware compiler
OpenMP is an organization (and specification) that defines a set of compiler directives: #pragmas for C/C++ and structured comments for Fortran. So if you are not using an OpenMP-aware compiler, you are, by definition, not using OpenMP.
My OpenMP environment is Windows XP Pro and Intel Visual Fortran. The systems I run on are two Intel HT-only systems and one dual-processor/4-core AMD Opteron system. Initially I had problems on the HT-only systems, where I observed a decrease in performance. Now I see up to 30% improvement on the HT-only systems. On the 4-core Opteron I see about 300% over a single core. Some sections of code experience ~400% improvement, some no improvement.
2) How can I verify that my system indeed supports OpenMP after I have installed such a compiler?
In your test program:
#pragma omp parallel sections num_threads(4)
{
    #pragma omp section
    f(0);
    #pragma omp section
    f(1);
    #pragma omp section
    f(2);
    #pragma omp section
    f(3);
}
Have the function f print its entry argument and the OpenMP thread number.
printf should be thread-safe, but if you have problems with multiple threads using printf, then use the entry argument of f(int i) as an index into a global table (int fTable[4];) and store the thread number into the array:
fTable[i] = omp_get_thread_num();
Then, after the close of the test section, print out the contents of fTable.
If you do not see four different threads then your code might not be running with OpenMP.
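Putting that together, here is a minimal sketch of the whole test, assuming the fTable approach above (the file name and loop are just illustrative, not your code):

/* omptest.c -- minimal sketch of the test described above;
 * compile with OpenMP enabled (e.g. -openmp on the Intel compiler). */
#include <stdio.h>
#include <omp.h>

int fTable[4];                          /* one slot per section */

void f(int i)
{
    fTable[i] = omp_get_thread_num();   /* record which thread ran us */
}

int main(void)
{
    int i;
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        f(0);
        #pragma omp section
        f(1);
        #pragma omp section
        f(2);
        #pragma omp section
        f(3);
    }
    for (i = 0; i < 4; i++)
        printf("section %d ran on OpenMP thread %d\n", i, fTable[i]);
    return 0;
}

If the four printed thread numbers are all 0, the sections ran serially.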
You do realize that OpenMP-capable compilers ship with no-op stub routines that can be linked into the program in lieu of the OpenMP runtime system. Have you verified that you are using the OpenMP runtime library instead of the no-op stubs?
3) On an OpenMP-aware compiler, is it enough to use the #pragmas or do I also need to add the -openmp option? If I get a message such as "OpenMP DEFINED SECTION WAS PARALLELIZED" during compilation, can I rest assured that the program will execute using multiple threads?
Get your test example working with multiple threads while using -openmp. Then try without -openmp. As I do not use Intel C++, I cannot try this for you. There may be an issue as to what is pulled from the runtime library, i.e. without -openmp the no-op stubs get linked in. It may be that -openmp is required on main. You can experiment either way.
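One portable check here, assuming nothing beyond the OpenMP specification itself: a conforming compiler defines the _OPENMP macro when OpenMP compilation is enabled, so a snippet like this tells you whether the flag was in effect:

#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    /* _OPENMP holds the spec release date as an integer (yyyymm) */
    printf("Compiled with OpenMP support (_OPENMP = %d)\n", _OPENMP);
#else
    printf("Compiled WITHOUT OpenMP support\n");
#endif
    return 0;
}

Build it twice, with and without -openmp, and compare the output.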
4) --- I cannot help you here
5) --- see answer for 2)
6) - threading tools.
As things stand now, it is a case of "Are you running with parallel sections or not".
For this, modify the test case such that your function f() takes 2 seconds to run using 1 processor. Make this a compute-bound test. The program should run 8 seconds with 4 serial calls to f(). If running configured for parallel, it should take much less.
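A sketch of that timing test, assuming a busy-loop spin() standing in for your real f() (volatile keeps the compiler from optimizing the work away; tune the iteration count so one call takes about 2 seconds on your machine):

#include <stdio.h>
#include <omp.h>

/* Stand-in for f(): pure compute, essentially no memory traffic. */
void spin(int n)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 400000000L; i++)
        x += (double)n;
}

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        spin(0);
        #pragma omp section
        spin(1);
        #pragma omp section
        spin(2);
        #pragma omp section
        spin(3);
    }
    printf("4 sections took %.2f s (serial would take ~4x one call)\n",
           omp_get_wtime() - t0);
    return 0;
}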
Good luck
I am quite pleased with OpenMP in Intel Visual Fortran (on Windoz). I am sure you will be too once you get the test configuration working. I think you may have a linking problem (using the no-op stubs).
Jim Dempsey
The fact that you see four thread numbers is a good starting point.
Part of the OpenMP specification requires that no-op stub routines be shipped so that they can be linked into the application for environments where multiple threads are detrimental. Example: you ship your application out in .OBJ (.LIB) format compiled with OpenMP, but it is linked at customer discretion with or without the OpenMP runtime.
If you are not seeing a speedup then either a) not much code is being executed in f(n), b) your application has multiple threads but your operating system is restricting the application to one processor, or c) the code is saturating the memory bandwidth.
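If you suspect case b), the OpenMP runtime can report what it sees; a quick sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Processors the OS makes available vs. threads OpenMP will use.
     * With the no-op stub library linked in, both calls return 1,
     * so this doubles as a check for the wrong-library problem. */
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}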
Your parallel section code should be fine. Make a simple f(n) function that is compute-intensive and uses little memory bandwidth (or has high cache reuse). Note, if f(n) simply populates memory then the test is limited by memory bandwidth, and each processor may be flushing the other processor's cache when performing writes.
Below is a simple compute intensive application written in C that I found on the internet.
Change main(... to your f(...) and appropriately edit the printf calls.
--------- begin ------------
/*--- pi.c PROGRAM RANPI
*
* Program to compute PI by probability.
* By Mark Riordan 24-DEC-1986;
* Original version apparently by Don Shull.
* To be used as a CPU benchmark.
*
* Translated to C from FORTRAN 20 Nov 1993
*/
#include <stdio.h>

void myadd(float *sum, float *addend);

int main(int argc, char *argv[])
{
    float ztot, yran, ymult, ymod, x, y, z, pi, prod;
    long int low, ixran, itot, j, iprod;

    printf("Starting PI...\n");
    ztot = 0.0;
    low = 1;
    ixran = 1907;
    yran = 5813.0;
    ymult = 1307.0;
    ymod = 5471.0;
    itot = 1200000;
    for (j = 1; j <= itot; j++) {
        /*
         * X and Y are two uniform random numbers between 0 and 1.
         * They are computed using two linear congruential generators.
         * A mix of integer and real arithmetic is used to simulate a
         * real program. Magnitudes are kept small to prevent 32-bit
         * integer overflow and to allow full precision even with a
         * 23-bit mantissa.
         */
        iprod = 27611 * ixran;
        ixran = iprod - 74383 * (long int)(iprod / 74383);
        x = (float)ixran / 74383.0;
        prod = ymult * yran;
        yran = (prod - ymod * (long int)(prod / ymod));
        y = yran / ymod;
        z = x * x + y * y;
        myadd(&ztot, &z);
        if (z <= 1.0) {
            low = low + 1;
        }
    }
    printf(" x=%8.5f y=%8.5f low=%7ld j=%7ld\n", x, y, low, j);
    pi = 4.0 * (float)low / (float)itot;
    printf("Pi = %9.6f ztot=%12.2f itot=%8ld\n", pi, ztot, itot);
    return 0;
}

void myadd(float *sum, float *addend)
{
    /*
     * Simple adding subroutine thrown in to allow subroutine
     * calls/returns to be factored in as part of the benchmark.
     */
    *sum = *sum + *addend;
}
-------------------------- end --------------
If you see no speedup then you may be linking in the wrong library, or the kernel is inhibiting the application from using the multiple processors.
Jim Dempsey
Andreas,
Once you resolve the issue of running on multiple processors your next job is to determine how best to divide up your code using OpenMP. There are many ways to do this. The two predominant ways are
1) Parallelize the large inner loops
2) Parallelize the outer master control loop(s)
For the code I work on (Finite Element Analysis of tension structures) I found number 2 works best. This is because I can set up each component to advance its state independently. Then, at the end of component state advancement, I reconcile the component-to-component interaction. What this did for me was to reduce the number of transitions between serial sections and parallel sections. Starting and stopping threads does introduce overhead. The loop size must be large enough to overcome this overhead to yield some payback.
Other applications (yours?) might do best by parallelizing the inner loops.
The best method will depend on the application as well as the size of the data sets handled by the application.
For my application, when using coarse granulation (fewer nodes), method 2) works best. However, as I increase the number of nodes, at some point method 1) will work best. Eventually, I expect this application to include both methods and to determine from the input data set which is best.
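To make method 1) concrete, here is a minimal sketch of inner-loop parallelization on a hypothetical node-update loop (the arrays and the advance_nodes function are placeholders, not your code or mine):

#include <omp.h>

#define NNODES 100000

double force[NNODES], vel[NNODES], pos[NNODES];

/* Method 1: parallelize the large inner loop. Each iteration is
 * independent of the others, so the iteration space is simply
 * split among the available threads. */
void advance_nodes(double dt)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < NNODES; i++) {
        vel[i] += force[i] * dt;   /* assume unit mass for the sketch */
        pos[i] += vel[i] * dt;
    }
}

Method 2) would instead wrap the outer time-step loop in one parallel region, with each thread owning a set of components and the component-to-component reconciliation done in a serial section between steps.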
To help optimize your code you should consider using a profiler. Intel has a product (for sale) called VTune. If you are on a tight budget you might consider joining the AMD Developer Center (free) and looking for a free tool called CodeAnalyst. This profiler is developed for AMD processors; I know you have Intel XEON processors. However, instead of refusing to work on Intel processors, CodeAnalyst simply disables the features it cannot use. What this means is you are left with a functional subset called Time Based Profiling (TBP). With TBP you can locate the bottlenecks in your code, which is usually sufficient to help you tune it. The features that don't work are those called Event Based Profiling (EBP), which requires processor-dependent control register access. This means you won't get reports as to where and why your application is experiencing memory latency problems.
TBP will get you 90% of the way to optimized code; EBP will get you a bit more. I chose CodeAnalyst because my main simulation system uses AMD Opteron processors. My other two development systems use Intel processors. I found running CodeAnalyst on the Intel processors quite satisfactory.
I use the Windows version, but the site has a Linux version too.
Jim Dempsey