Hello!
Is anyone in this forum able to explain why the speedup is lost?
I took openmp_sample.c, found in the directory IntelCompiler\C++\10.1.013\samples\openmp_samples, and modified it slightly into the following:
/*
 * Copyright (C) 2006-2007 Intel Corporation. All Rights Reserved.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdbool.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Be careful to set your shell's stacksize limit to a high value if you
// wish to increase the SIZE.
#define SIZE 4800       // Must be a multiple of 8.
#define M SIZE/8
#define N SIZE/4
#define P SIZE/2
#define NTIMES 5        // product matrix calculations

// 2D access to 1D data structures
#define A(i,j) (a[i*N+j])
#define B(i,j) (b[i*P+j])
#define C(i,j) (c[i*P+j])

int main(void)
{
#ifndef USE_NEW
    double a[M*N], b[N*P], c[M*P];
#else
    double *a, *b, *c;
#endif
    double walltime;
    bool nthr_checked = false;
    time_t start;
    int i, j, k, l, i1, i2, i3, k1, k2, k3, nthr = 1;
#ifdef USE_NEW
    a = new double[M*N];
    b = new double[N*P];
    c = new double[M*P];
#endif
    printf("Using time() for wall clock time\n");
    printf("Problem size: c(%d,%d) = a(%d,%d) * b(%d,%d)\n",
           M, P, M, N, N, P);
    printf("Calculating product %d time(s)\n", NTIMES);

    // a is identity matrix
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            A(i,j) = 1.0;

    // each column of b is the sequence 1,2,...,N
    for (i=0; i<N; i++)
        for (j=0; j<P; j++)
            B(i,j) = i+1.;

    start = time(NULL);
#pragma omp parallel private(i,j,k)
    {
        for (l=0; l<NTIMES; l++) {
#pragma omp single nowait
            if (!nthr_checked) {
#ifdef _OPENMP
                nthr = omp_get_num_threads();
#endif
                printf("\nWe are using %d thread(s)\n", nthr);
                nthr_checked = true;
            }

            // Initialize product matrix
#pragma omp for nowait
            for (i=0; i<M; i++)
                for (j=0; j<P; j++)
                    C(i,j) = 0.0;

            // Parallelize by row. The threads don't need to synchronize at
            // loop end, so "nowait" can be used.
#pragma omp for nowait
            for (i=0; i<M; i++) {
                for (k=0; k<N; k++) {
                    // Each element of the product is just the sum 1+2+...+n
                    for (j=0; j<P; j++) {
                        C(i,j) += A(i,k) * B(k,j);
                    }
                }
            }
        } // l=0,...,NTIMES-1
    } // #pragma omp parallel private(i,j,k)
    walltime = time(NULL) - start;
    printf("\nFinished calculations.\n");
    printf("Matmul kernel wall clock time = %.2f sec\n", walltime);
    printf("Wall clock time/thread = %.2f sec\n", walltime/nthr);
    printf("MFlops = %f\n",
           (double)(NTIMES)*(double)(N*M*2)*(double)(P)/walltime/1.0e6);
    return 0;
}
As can be seen, defining USE_NEW allocates the arrays on the heap
with C++'s new; leaving USE_NEW undefined puts the arrays
on the stack.
I compile the C++ program with
icl /Qstd=c99 /Qopenmp /F256000000 openmp_sample.cpp
The best timing result is 2880 MFlops and 3.0 sec wall clock time/thread
(2 threads).
I compile with
icl /Qstd=c99 /Qopenmp /F256000000 /DUSE_NEW openmp_sample.cpp
The best timing result is 1016 MFlops and 8.5 sec wall clock time/thread
(2 threads).
The result when USE_NEW is defined is comparable to not using OpenMP at all. Why is
the speedup lost when dynamically allocated arrays are used? Is OpenMP really only useful
when array sizes are known at compile time (I can hardly believe that)? Is there something
I ought to set while compiling?
I hope you can help explain this.
Greetings
Martin
Martin,
The heap is a single resource shared by all the threads of the process (application). As implemented by the C runtime library, it contains critical sections (it lets only one thread through at a time). If you want new to be (more) multi-threaded, I suggest you consider overloading the low-level new such that it maintains separate heaps, one for each thread. Initialize each thread heap from the shared heap with a reasonable amount of working space (this allocation will be serialized). At run time, if a local heap contains sufficient memory, then the allocation can occur in parallel. Should a thread heap run out of memory, obtain additional memory from the common heap (serialized). You can make the scheme simple or complex.
Jim Dempsey
Martin,
A few comments on your code
First, use omp_get_wtime() to obtain the time as a high-precision double for use in timing your application.
Second, move the report of the number of threads outside the timed loop (remember to remove the #pragma omp single too).
Third, there is a bug in your code. The initialization C(i,j) = 0.0; is distributed over for (i=0; ...) with nowait, and the subsequent computation loop C(i,j) += ... is also distributed over for (i=0; ...), under the ASSUMPTION that the thread given a particular i in the first loop is given the same i in the second loop. That is not necessarily the case: the same i may be handled by different threads in the two loops. The second loop iterates more slowly, so you would normally never see it reach a given i before the first loop has finished that i, but there is a non-zero probability that the thread (or, more appropriately, core) running the first loop is interrupted to run another application or an O/S overhead operation, causing the second loop to reach a given i before the first loop has zeroed it. A suggestion is to either:
a) remove the nowaits from all loops
or
b) keep the nowaits and explicitly slice up the distribution of the outer loop by thread (team member) number
Note that if you slice up with nowait, then as you iterate through your master loop (NTIMES) threads may be working on different iterations of the master loop at the same time. When you rework this test case into an application, you will find that you must place a barrier at the end of the master loop (or remove the nowait from the last loop).
I would suggest you begin with a) then experiment with slice and dice.
Jim Dempsey