Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP: No speedup with heap allocation

mohspt
Beginner

Hello!

Can anyone on this forum explain why the speedup is lost?

I took openmp_sample.c, which is found in the directory Intel\Compiler\C++\10.1.013\samples\openmp_samples, and modified it slightly into the following:

/*
 * Copyright (C) 2006-2007 Intel Corporation. All Rights Reserved.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdbool.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Be careful to set your shell's stacksize limit to a high value if you
// wish to increase the SIZE.
#define SIZE 4800 // Must be a multiple of 8.
#define M SIZE/8
#define N SIZE/4
#define P SIZE/2
#define NTIMES 5 // product matrix calculations

// 2D access to 1D data structures
#define A(i,j) (a[i*N+j])
#define B(i,j) (b[i*P+j])
#define C(i,j) (c[i*P+j])

int main(void)
{
#ifndef USE_NEW
    double a[M*N], b[N*P], c[M*P];
#else
    double *a, *b, *c;
#endif
    double walltime;
    bool nthr_checked = false;
    time_t start;
    int i, j, k, l, i1, i2, i3, k1, k2, k3, nthr = 1;

#ifdef USE_NEW
    a = new double[M*N];
    b = new double[N*P];
    c = new double[M*P];
#endif

    printf("Using time() for wall clock time\n");
    printf("Problem size: c(%d,%d) = a(%d,%d) * b(%d,%d)\n",
           M, P, M, N, N, P);
    printf("Calculating product %d time(s)\n", NTIMES);

    // a is identity matrix
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            A(i,j) = 1.0;

    // each column of b is the sequence 1,2,...,N
    for (i=0; i<N; i++)
        for (j=0; j<P; j++)
            B(i,j) = i+1.;

    start = time(NULL);

#pragma omp parallel private(i,j,k)
    {
        for (l=0; l<NTIMES; l++) {
#pragma omp single nowait
            if (!nthr_checked) {
#ifdef _OPENMP
                nthr = omp_get_num_threads();
#endif
                printf("\nWe are using %d thread(s)\n", nthr);
                nthr_checked = true;
            }

            // Initialize product matrix
#pragma omp for nowait
            for (i=0; i<M; i++)
                for (j=0; j<P; j++)
                    C(i,j) = 0.0;

            // Parallelize by row. The threads don't need to synchronize at
            // loop end, so "nowait" can be used.
#pragma omp for nowait
            for (i=0; i<M; i++) {
                for (k=0; k<N; k++) {
                    // Each element of the product is just the sum 1+2+...+n
                    for (j=0; j<P; j++) {
                        C(i,j) += A(i,k) * B(k,j);
                    }
                }
            }
        } // l=0,...NTIMES-1
    } // #pragma omp parallel private(i,j,k)

    walltime = time(NULL) - start;

    printf("\nFinished calculations.\n");
    printf("Matmul kernel wall clock time = %.2f sec\n", walltime);
    printf("Wall clock time/thread = %.2f sec\n", walltime/nthr);
    printf("MFlops = %f\n",
           (double)(NTIMES)*(double)(N*M*2)*(double)(P)/walltime/1.0e6);

    return 0;
}

As can be seen, defining USE_NEW allocates the arrays on the heap with C++'s new; leaving USE_NEW undefined puts the arrays on the stack.

I compile the C++ program with

icl /Qstd=c99 /Qopenmp /F256000000 openmp_sample.cpp

The best timing result is 2880 MFlops and 3.0 sec wall clock time/thread

(2 threads).

I compile with

icl /Qstd=c99 /Qopenmp /F256000000 /DUSE_NEW openmp_sample.cpp

The best timing result is 1016 MFlops and 8.5 sec wall clock time/thread

(2 threads).

The result when USE_NEW is defined is comparable to not using OpenMP at all. Why is the speedup lost when dynamically allocated arrays are used? Is OpenMP really only useful when array sizes are known at compile time (I can hardly believe this)? Is there something I ought to set while compiling?

I hope you can help explain this.

Greetings

Martin

3 Replies
jimdempseyatthecove
Honored Contributor III

Martin,

The heap is a single resource shared by all the threads of the process (application). As implemented by the C runtime library, it contains critical sections (letting only one thread through at a time). If you want new to be (more) multi-threaded, I suggest you consider overloading the low-level new such that it maintains separate heaps, one for each thread. Initialize each thread's heap from the shared heap with a reasonable working space (this allocation will be serialized). At run time, if a local heap contains sufficient memory, the allocation can occur in parallel. Should a thread's heap run out of memory, it obtains additional memory from the common heap (serialized). You can make the scheme simple or complex.
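A minimal sketch of this per-thread-heap idea (the names, slab size, and alignment are illustrative, not the CRT's actual implementation): each thread bump-allocates out of its own thread-local slab, touching the shared heap only when the slab needs refilling.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical per-thread bump allocator. Each thread carves small
// allocations out of its own slab, so the fast path needs no lock;
// only the slab refill (malloc) goes through the shared heap.
struct ThreadSlab {
    static const std::size_t kSlabBytes = 1 << 20;  // 1 MiB per thread (arbitrary)
    char*       base = nullptr;
    std::size_t used = 0;

    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);            // round up to 16-byte alignment
        if (base == nullptr || used + n > kSlabBytes) {
            // Serialized refill from the common heap. A real implementation
            // would keep a list of slabs so they can be freed; omitted here.
            base = static_cast<char*>(std::malloc(kSlabBytes));
            used = 0;
        }
        void* p = base + used;
        used += n;
        return p;                                    // lock-free fast path
    }
};

thread_local ThreadSlab g_slab;  // one slab per thread, never shared

void* thread_local_alloc(std::size_t n) { return g_slab.allocate(n); }
```

An overloaded operator new could forward to thread_local_alloc; freeing is where such schemes get complex, since a block may be released by a different thread than the one that allocated it.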

Jim Dempsey

rbarron
New Contributor I
There are only three heap allocations. After that, the program accesses the memory blocks directly without going through the CRT. So the big slowdown remains unexplained.
jimdempseyatthecove
Honored Contributor III

Martin,

A few comments on your code

First, use omp_get_wtime() to obtain the time as a high-precision double for timing your application.

Second, move the report of the number of threads outside the timed loop (remember to remove the #pragma omp single too).

Third, there is a bug in your code. The initialization C(i,j) = 0.0; is distributed over for (i=0; ...) with nowait, and the subsequent computation loop C(i,j) += ... is also distributed over for (i=0; ...), under the ASSUMPTION that the thread given a particular i in the first loop is given the same i in the second loop. That is not necessarily the case: the same i may be processed by different threads in the two loops. Because the second loop iterates more slowly, you would normally never see it reach a given i before the first loop has finished that i, but there is a non-zero probability that the thread (or, more precisely, core) running the first loop is interrupted to run another application or O/S overhead, letting the second loop complete a given i before the first loop does. A suggestion is to either:

a) remove the nowaits from all loops

or

b) keep the nowaits and explicitly slice up the distribution of the outer loop by thread (team member) number

Note that if you slice up with nowait, then as you iterate through your master loop (NTIMES) you may begin working on different iterations of the master loop at the same time. When you rework this test case into an application, you will find that you must place a barrier at the end of the master loop (or remove the nowait from the last loop).

I would suggest you begin with a) then experiment with slice and dice.

Jim Dempsey
