Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

performance decrease using openmp

Hi, my name is Feike Boomstra.
I have bought a new computer with a Pentium D 915. I am developing (for my own fun) a draughts-playing program in C. I wanted to incorporate parallel processing, because I have two cores now. I am using the Intel 9.1 compiler icpc on Ubuntu 6.10.

I observed a significant decrease in performance. I tried a lot of things, but ended up writing a very simple test program:



#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 1000000
#define NUM_PASSES 100          /* outer repeat count; original value lost in posting */

long a[ARRAY_SIZE];
long b[ARRAY_SIZE];
long c[ARRAY_SIZE];

void fill_array(long x[])
{
    int i;
    for (i = 0; i < ARRAY_SIZE; i++)
        x[i] = i;               /* fill with some data; original values lost in posting */
}

void calc(void)
{
    int i, k;
#pragma omp parallel for private(i)
    for (k = 0; k < NUM_PASSES; k++)
        for (i = 0; i < ARRAY_SIZE; i++)
            c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    double start_time;
    double stop_time;
    double duration;

    fill_array(a);
    fill_array(b);

    printf("Additions started\n");
    start_time = clock();
    calc();
    stop_time = clock();
    duration = (double)(stop_time - start_time) / CLOCKS_PER_SEC;
    printf("Nr of additions: %d, time: %5.2f\n", ARRAY_SIZE, duration);

    return 0;
}

If I compile it without the -openmp option, the elapsed time is 10 sec.
Compiled with the -openmp option, the elapsed time becomes 22 seconds.

What is happening???

(The program doesn't show up correctly; after the "less than" sign everything disappears.)
Black Belt

For a memory bandwidth limited application, you can expect the total CPU time used by 2 cores to increase significantly over the time used in a serial run. Even in an ideal parallel situation, you are looking for a decrease in elapsed time, not a decrease in total CPU time. In this case, a small increase in elapsed time is possible, as you may be increasing the number of times some of the data pass between memory and cache.

If your outer k loop is such that the compiler could recognize it as a do-nothing loop, it might be optimized away to a different extent in an openmp vs non-openmp compilation. Normally, you would go through a non-optimizable function after each execution of the inner loop, to avoid that effect.

OK, let's talk about my real application.
It is a draught playing program. The engine is written in c (c++).
I am using the alpha-beta-with-memory algorithm as described by Aske van der Plaats.
Basically it is alpha-beta with a hash table to lookup positions seen in a previous run.
The effect of the hash-table on the number of positions evaluated is enormous.

The algorithm is highly recursive. To parallelize the outermost loop, I copied the alpha-beta procedure and renamed it first_alpha_beta. This procedure calls the recursive one. In first_alpha_beta I tried to parallelize.

The problem is the shared hash-table. First I tried to create a lock for each entry.
The size of the hash-table is 0x800000.
Trying to initialize this amount of locks created an unpredictable result.
Once the program hung forever, and one time it finished after 5 minutes, but then my sockets were closed (?).
So I had to use one lock for the entire hash table. It worked, but the elapsed time increased. I removed the lock on reads, accepting some errors in "best move" retrieval, but this did not help.
I then tried to use two hash tables, one for each thread. The number of positions evaluated increased by 70%, and the elapsed time also increased compared to the serial program.
So I am stuck.
Black Belt
I wouldn't consider a hash table a beginner's threading project, but I'm no authority on that. There is a hash table implementation in Intel Threaded Building Blocks, in case you're interested.
Black Belt

The following comments are based on my assumptions of what you are trying to do. My lack of understanding of your code may make my comments meaningless....

A good way to handle any multi-threaded problem is to eliminate (as much as possible) lock conditions.

Think of each of your hash table entries as having 3 states:

Never used
Has value previously computed
Computation in progress

Have each thread have a queue of pending operations (initialized to none pending)

Currently, when your code "ponders" a position, it computes the hash code of the position under consideration and consults the hash table. If the hash table entry for the position indicates no value, a call is made to produce the value (currently you are locking, or wanting to lock, this entry until the computation is complete), and the entry is inserted into the hash table. If/when the hash table has a result from a prior computation, the hash table entry is used in some function to advance towards the solution.

In your first attempt (lock per hash table cell), the 2nd thread attempting to consider the same position would block until the thread producing the result completes. But you ran out of locks in the first attempt. In the second attempt it appears both threads saw the entry for the hash code as empty, and both threads computed the result.

Try seeing if you can change the code such that the hash table entries have three states (Never used, Has result, Pending). If Has result, use it. If Never used, use InterlockedCompareExchange to set the state to Pending (and to detect whether the other thread beat you to setting Pending), then compute the result. If Pending, create an entry in the queue I mentioned, used to reconsider the position at a later time, then proceed to consider a new position. Periodically, or as you unwind your recursion, look at the queue to see if there are any positions to reconsider.

If at all possible, code without using Mutex, critical sections, and other blocking code. Use the InterlockedXxxx collection of function calls.

Jim Dempsey

Thanks for your detailed suggestions. The first thing I had to do was to analyze the nature of the bad performance. So I downloaded VTune on my Linux system, but unfortunately I am using Kubuntu, which is Debian-based, and that is not supported.

Feike Boomstra
Black Belt

Try looking for CodeAnalyst. Timer-based sampling will run on Intel processors. They may have a Linux version (I use the Windows version).

You should be able to locate hot spots.


I tried oprofile in timer mode. My suspicion is that I have a cache problem, so I need a profiler that is able to access the internal counters in the CPU.

Just an update. I migrated to Fedora FC6 and got VTune installed. There is a problem with vtlec; I assume it is somewhere in the Java area. (Eclipse on EM64T works with compiled Java, as there is no Sun 64-bit JRE.)

But in the meantime I found a reference to the implementation of the clock() function on FC6: it gives the total CPU time for all threads together! I now use omp_get_wtime() for wall-clock time, and the results are as expected: nearly a doubling of performance going from one thread to two.

Unfortunately there seems to be a memory race condition somewhere, as the output is not stable. So I downloaded the Thread Checker, but I have a problem compiling with -tcheck. I have opened a new thread on that.

Feike Boomstra