Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP: No speedup with heap allocation

mohspt
Beginner

Hello!

Can anyone on this forum explain why the speedup is lost?

I took openmp_sample.c, which is found in the directory Intel\Compiler\C++\10.1.013\samples\openmp_samples, and modified it slightly into the following:

/*
 * Copyright (C) 2006-2007 Intel Corporation. All Rights Reserved.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdbool.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Be careful to set your shell's stacksize limit to a high value if you
// wish to increase the SIZE.
#define SIZE 4800 // Must be a multiple of 8.
#define M SIZE/8
#define N SIZE/4
#define P SIZE/2
#define NTIMES 5 // product matrix calculations

// 2D access to 1D data structures
#define A(i,j) (a[i*N+j])
#define B(i,j) (b[i*P+j])
#define C(i,j) (c[i*P+j])

int main(void)
{
#ifndef USE_NEW
    double a[M*N], b[N*P], c[M*P];
#else
    double *a, *b, *c;
#endif
    double walltime;
    bool nthr_checked = false;
    time_t start;
    int i, j, k, l, i1, i2, i3, k1, k2, k3, nthr = 1;

#ifdef USE_NEW
    a = new double[M*N];
    b = new double[N*P];
    c = new double[M*P];
#endif

    printf("Using time() for wall clock time\n");
    printf("Problem size: c(%d,%d) = a(%d,%d) * b(%d,%d)\n",
           M, P, M, N, N, P);
    printf("Calculating product %d time(s)\n", NTIMES);

    // a is identity matrix
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            A(i,j) = 1.0;

    // each column of b is the sequence 1,2,...,N
    for (i=0; i<N; i++)
        for (j=0; j<P; j++)
            B(i,j) = i+1.;

    start = time(NULL);

#pragma omp parallel private(i,j,k)
    {
        for (l=0; l<NTIMES; l++) {
#pragma omp single nowait
            if (!nthr_checked) {
#ifdef _OPENMP
                nthr = omp_get_num_threads();
#endif
                printf("\nWe are using %d thread(s)\n", nthr);
                nthr_checked = true;
            }

            // Initialize product matrix
#pragma omp for nowait
            for (i=0; i<M; i++)
                for (j=0; j<P; j++)
                    C(i,j) = 0.0;

            // Parallelize by row. The threads don't need to synchronize at
            // loop end, so "nowait" can be used.
#pragma omp for nowait
            for (i=0; i<M; i++) {
                for (k=0; k<N; k++) {
                    // Each element of the product is just the sum 1+2+...+n
                    for (j=0; j<P; j++) {
                        C(i,j) += A(i,k) * B(k,j);
                    }
                }
            }
        } // l=0,...NTIMES-1
    } // #pragma omp parallel private(i,j,k)

    walltime = time(NULL) - start;

    printf("\nFinished calculations.\n");
    printf("Matmul kernel wall clock time = %.2f sec\n", walltime);
    printf("Wall clock time/thread = %.2f sec\n", walltime/nthr);
    printf("MFlops = %f\n",
           (double)(NTIMES)*(double)(N*M*2)*(double)(P)/walltime/1.0e6);

    return 0;
}

As can be seen, defining USE_NEW allocates the arrays on the heap with C++'s new; leaving USE_NEW undefined puts the arrays on the stack.

I compile the C++ program with

icl /Qstd=c99 /Qopenmp /F256000000 openmp_sample.cpp

The best timing result is 2880 MFlops and 3.0 sec wall clock time/thread

(2 threads).

I compile with

icl /Qstd=c99 /Qopenmp /F256000000 /DUSE_NEW openmp_sample.cpp

The best timing result is 1016 MFlops and 8.5 sec wall clock time/thread

(2 threads).

The result when USE_NEW is defined is comparable to not using OpenMP at all. Why is the speedup lost when dynamically allocated arrays are used? Is OpenMP really only useful when array sizes are known at compile time (I can hardly believe this)? Is there something I ought to set while compiling?

I hope you can help explain this.

Greetings

Martin

3 Replies
jimdempseyatthecove
Honored Contributor III

Martin,

The heap is a single resource shared by all the threads of the process (application). As implemented by the C runtime library, it contains critical sections (letting only one thread through at a time). If you want new to be (more) multi-threaded, I suggest you consider overloading the low-level new such that it maintains separate heaps, one for each thread. Initialize each thread's heap from the shared heap with a reasonable working space (this allocation will be serialized). At run time, if a local heap contains sufficient memory, the allocation can occur in parallel. Should a thread's heap run out of memory, it obtains additional memory from the common heap (serialized). You can make the scheme simple or complex.
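A minimal sketch of this per-thread-heap idea (the names, slab size, and alignment are illustrative, not the CRT's actual implementation): each thread bump-allocates out of its own thread-local slab, touching the shared heap only when the slab needs refilling.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical per-thread bump allocator. Each thread carves small
// allocations out of its own slab, so the fast path needs no lock;
// only the slab refill (malloc) goes through the shared heap.
struct ThreadSlab {
    static const std::size_t kSlabBytes = 1 << 20;  // 1 MiB per thread (arbitrary)
    char*       base = nullptr;
    std::size_t used = 0;

    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);            // round up to 16-byte alignment
        if (base == nullptr || used + n > kSlabBytes) {
            // Serialized refill from the common heap. A real implementation
            // would keep a list of slabs so they can be freed; omitted here.
            base = static_cast<char*>(std::malloc(kSlabBytes));
            used = 0;
        }
        void* p = base + used;
        used += n;
        return p;                                    // lock-free fast path
    }
};

thread_local ThreadSlab g_slab;  // one slab per thread, never shared

void* thread_local_alloc(std::size_t n) { return g_slab.allocate(n); }
```

An overloaded operator new could forward to thread_local_alloc; freeing is where such schemes get complex, since a block may be released by a different thread than the one that allocated it.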

Jim Dempsey

rbarron
New Contributor I
There are only three heap allocations. After that, the program accesses the memory blocks directly without going through the CRT. So the big slowdown remains unexplained.
jimdempseyatthecove
Honored Contributor III

Martin,

A few comments on your code

First, use omp_get_wtime() to obtain the time as a high-precision double for timing your application.

Second, move the report of the number of threads outside the timed loop (remember to remove the #pragma omp single too).

Third, there is a bug in your code. The initialization C(i,j) = 0.0; is distributed over for (i=0; ...) with nowait, and the subsequent computation loop C(i,j) += ... is also distributed over for (i=0; ...), under the ASSUMPTION that the thread given a particular i in the first loop is given the same i in the second loop. That is not necessarily the case: the same i may be processed by different threads in the two loops. Because the second loop iterates more slowly, you would normally never see it reach a given i before the first loop has finished that i, but there is a non-zero probability that the thread (or, more precisely, core) running the first loop is interrupted to run another application or O/S overhead, letting the second loop complete a given i before the first loop does. A suggestion is to either:

a) remove the nowaits from all loops

or

b) keep the nowaits and explicitly slice up the distribution of the outer loop by thread (team member) number

Note that if you slice up with nowait, then as you iterate through your master loop (NTIMES) you may begin working on different iterations of the master loop at the same time. When you rework this test case into an application, you will find that you must place a barrier at the end of the master loop (or remove the nowait from the last loop).

I would suggest you begin with a) then experiment with slice and dice.

Jim Dempsey
