Hi,
I was wondering if someone could help me understand why I am seeing a performance difference between two machines with my OpenMP application. My application is an LDPC decoder; it's very simple in that each thread runs a copy of the decoder on a block of data.
Up until now, I have been writing my code for the Q6600. My first assumption is that a core of the Intel Xeon E7330 is as fast as a core of the Core 2 Quad Q6600, and that is what I see when running my code in a single thread.
With OpenMP, I definitely see a ~4x improvement with multithreading on the Q6600 machine. However, on the Xeon, the performance is only around 2x better. The Xeon machine has 4 processors (16 cores total).
I'm using the Intel C++ compiler with flags -fast -openmp under Red Hat Linux.
I could be doing something really dumb. Here's my code:
---------------------------------------------------------------------
#include <cstdio>    // printf, fopen, fread, fwrite, fclose
#include <cstdlib>   // atoi
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <omp.h>     // omp_get_max_threads, omp_set_num_threads
using namespace std;
#include "ldpc.h"
#include "ldpc_dec.h"

int main(int argc, char *argv[])
{
    int i, j, k, ii;
    int rate;
    int frames;
    unsigned int yOutputSize = 0;
    clock_t timer;
    int n_thds = 0;
    int N = OUTPUT_SIZE;
    FILE *fpInputData, *fpOutputData;
    clx_ldpc_dec * ldpc_decoder1;
#ifdef MULTI_THREADED
    // One input buffer holding four consecutive slices, one per decoder
    unsigned char xInput[OUTPUT_SIZE*NUMBER_OF_BUFFERS*4];
    unsigned char * xInput2;
    unsigned char * xInput3;
    unsigned char * xInput4;
    clx_ldpc_dec * ldpc_decoder2;
    clx_ldpc_dec * ldpc_decoder3;
    clx_ldpc_dec * ldpc_decoder4;
#else
    unsigned char xInput[OUTPUT_SIZE*NUMBER_OF_BUFFERS];
#endif

    rate = atoi(argv[3]);
    ldpc_decoder1 = new clx_ldpc_dec(rate);
#ifdef MULTI_THREADED
    ldpc_decoder2 = new clx_ldpc_dec(rate);
    ldpc_decoder3 = new clx_ldpc_dec(rate);
    ldpc_decoder4 = new clx_ldpc_dec(rate);
#endif

    fpInputData = fopen(argv[1], "rb");
    if(fpInputData == NULL){
        printf("Failed to open input file %s\n", argv[1]);
        return -1;
    }
    fpOutputData = fopen(argv[2], "wb");
    if(fpOutputData == NULL){
        printf("Failed to open output file %s\n", argv[2]);
        return -1;
    }

    n_thds = omp_get_max_threads();
    printf("Max Number of Threads = %d\n", n_thds);
    if(n_thds > 4){
        printf("Setting Max Number of Threads to 4\n");
        omp_set_num_threads(4);
    }

    timer = clock();
#ifdef MULTI_THREADED
    for(k = 0; k < (NUMBDER_OF_BLOCKS/4)/NUMBER_OF_BUFFERS; k++){
        // A single thread reads the data for all four decoders
        fread(xInput, sizeof(unsigned char), BYTES_TO_READ*NUMBER_OF_BUFFERS*4, fpInputData);
        xInput2 = xInput+BYTES_TO_READ*NUMBER_OF_BUFFERS;
        xInput3 = xInput+BYTES_TO_READ*NUMBER_OF_BUFFERS*2;
        xInput4 = xInput+BYTES_TO_READ*NUMBER_OF_BUFFERS*3;
        // Each section decodes its own slice in place
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                ldpc_decoder1->decode(xInput);
            }
            #pragma omp section
            {
                ldpc_decoder2->decode(xInput2);
            }
            #pragma omp section
            {
                ldpc_decoder3->decode(xInput3);
            }
            #pragma omp section
            {
                ldpc_decoder4->decode(xInput4);
            }
        }
        fwrite(xInput, sizeof(unsigned char), BYTES_TO_READ*NUMBER_OF_BUFFERS*4, fpOutputData);
    }
#else
    for(k = 0; k < NUMBDER_OF_BLOCKS/NUMBER_OF_BUFFERS; k++){
        fread(xInput, sizeof(unsigned char), BYTES_TO_READ*NUMBER_OF_BUFFERS, fpInputData);
        ldpc_decoder1->decode(xInput);
        fwrite(xInput, sizeof(unsigned char), BYTES_TO_READ*NUMBER_OF_BUFFERS, fpOutputData);
    }
#endif
    timer = clock()-timer;

    printf("\nTime: %f\n", ((double)(timer) / (double)CLOCKS_PER_SEC));
    printf("%f mb/s\n", ((double)(OUTPUT_SIZE*NUMBER_OF_BUFFERS) / ((double)(timer) / (double)CLOCKS_PER_SEC) * 1e-6 ));
    fclose(fpInputData);
    fclose(fpOutputData);
    return 0;
}
Thanks,
James
The 1 x Q6600 is one socket with 4 cores, but internally it is 2 processors, each with 2 cores and its own cache. As such, the Q6600 has 2 separate cache systems.
The 4 x E7330 is four sockets with 4 cores each (x4 = 16 cores), but internally each E7330 has 2 processors (x4 = 8 processors), each with 2 cores (x8 = 16 cores), and as such there are 8 separate cache systems.
The test run on the Q6600 may have tended to keep each of the 4 threads more or less within the core on which it was started. Thus the cache would tend to contain reusable data.
The test run on the 4 x E7330 may have tended to move the threads about the cores, and thus the test may tend to experience loss of cached data.
To make the test fair, set the affinities to force the test application to run within one socket. Note that the "processor" numbering scheme depends on how the BIOS and OS perform initialization. You may have to do some experimentation (or run a diagnostic) to determine the (affinity) processor number mapping for the E7330. Once it is determined, set the affinity bit map for the application so that it runs within the 4 cores of one socket.
The above settings permit the threads to float amongst cores, but only within one socket. This would approximate the test environment of the Q6600.
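As a rough illustration of the affinity bit map described above, here is a minimal sketch for Linux using glibc's `sched_getaffinity` and `sched_setaffinity`. The helper names `allowed_cpus` and `restrict_to_cpus` are hypothetical, and which CPU numbers belong to one socket is an assumption that has to be checked on the actual machine.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for the CPU_* macros if the compiler does not define it
#endif
#include <sched.h>    // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux/glibc)
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: list the logical CPUs the calling thread may run on.
// Returns an empty vector if the mask cannot be read.
std::vector<int> allowed_cpus() {
    std::vector<int> cpus;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return cpus;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) cpus.push_back(cpu);
    }
    return cpus;
}

// Hypothetical helper: restrict the calling thread to the given logical CPUs.
// Threads created afterwards inherit this mask, so calling it at the top of
// main(), before the first parallel region, lets the worker threads float
// within the chosen CPUs only.
bool restrict_to_cpus(const std::vector<int>& cpus) {
    assert(!cpus.empty());  // an empty mask is not a valid affinity
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (std::size_t i = 0; i < cpus.size(); ++i) {
        if (cpus[i] >= 0 && cpus[i] < CPU_SETSIZE) CPU_SET(cpus[i], &mask);
    }
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

If the four cores of one socket turned out to be CPUs 0 to 3 (a placeholder, not a known mapping for the E7330), the call would be `restrict_to_cpus` with `{0, 1, 2, 3}`. Launching the unmodified program under `taskset -c 0-3` should have a similar effect without code changes.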
A separate test would be to lock each thread to a specific processor within one socket. Do this for both the Q6600 and the E7330.
A third test would be for the E7330 to lock each thread to a processor within a different socket.
A fourth test would be for the E7330 to lock each thread in sequence to each processor within the same socket, then the next socket.
E7330 socket to thread associations:
Test 1
|-F-||-F-| |---||---|
|-F-||-F-| |---||---|
|---||---| |---||---|
|---||---| |---||---|
Test 2
|-L-||-L-| |---||---|
|-L-||-L-| |---||---|
|---||---| |---||---|
|---||---| |---||---|
Test 3
|-L-||---| |-L-||---|
|---||---| |---||---|
|-L-||---| |-L-||---|
|---||---| |---||---|
Test 4
|-L-||-L-| |-L-||-L-|
|---||---| |---||---|
|---||---| |---||---|
|---||---| |---||---|
Where:
F=Float within selected processors
L=Locked to specific processor
Socket:
|---||---|
|---||---|
Processor within socket: (2 cores sharing same cache)
|---|
|---|
Core within processor within socket
|---|
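The locked layouts in the diagrams (Tests 2, 3 and 4) can be written down as a small lookup, sketched here under a loudly stated assumption: logical CPU number = socket * 4 + processor * 2 + core. That numbering is hypothetical and has to be verified on the real machine, as noted above; the names `cpu_id`, `Layout` and `placement` are made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// ASSUMED numbering (hypothetical, verify on the real machine):
// 4 sockets, 2 dual-core processors per socket,
// logical CPU = socket * 4 + processor * 2 + core.
int cpu_id(int socket, int processor, int core) {
    assert(socket >= 0 && socket < 4);
    assert(processor >= 0 && processor < 2);
    assert(core >= 0 && core < 2);
    return socket * 4 + processor * 2 + core;
}

enum Layout {
    SameSocket,       // Test 2: all four cores of socket 0
    OnePerSocket,     // Test 3: one core in each of the four sockets
    OnePerProcessor   // Test 4: one core per processor, filling socket 0 then socket 1
};

// Returns, for decoder threads 0..3, the CPU each one would be locked to.
std::vector<int> placement(Layout layout) {
    std::vector<int> cpus;
    for (int t = 0; t < 4; ++t) {
        switch (layout) {
        case SameSocket:      cpus.push_back(cpu_id(0, t / 2, t % 2)); break;
        case OnePerSocket:    cpus.push_back(cpu_id(t, 0, 0));         break;
        case OnePerProcessor: cpus.push_back(cpu_id(t / 2, t % 2, 0)); break;
        }
    }
    return cpus;
}
```

Under the assumed numbering this gives CPUs 0, 1, 2, 3 for Test 2, CPUs 0, 4, 8, 12 for Test 3, and CPUs 0, 2, 4, 6 for Test 4. The thread executing section t would then set its own affinity to the single CPU `placement(layout)[t]` at the start of the section, for example with `sched_setaffinity` on Linux.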
Jim Dempsey
Oh, no! Desktop versus Server wars! Guess I haven't been checking this forum as often as I might.
I think you've already demonstrated that the E7330 cores run as fast as the Q6600 core in your single threaded test. And looking over the code sample you provided, nothing jumps out at me, but there's one thing that the code doesn't answer: how big are these blocks of data being fed to the decoder instantiations? The only difference in the specs I saw is that the Q6600 has a full 8 MB L2 cache while the E7330 only has a 6 MB L2. If we were just talking one processor versus the other, big enough buffers could result in cache thrashing on the server part with four cores that you might not see on the Q6600. The affinity tests Jim suggested should explore that configuration more thoroughly.
Beyond the processor, there are several other factors that come into play. The quad-socket E7330 uses FBDIMM memory versus DDR2 on the Q6600. The E7330 has 16 cores contending for memory on two buses versus only 4 cores on the Q6600. The code limits processing to 4 threads, but those four threads could land on any of those 16 cores, as Jim suggested. And the way the code sample works, data for all four cores is read into memory by a single thread. If the buffer is small enough, it may still all reside in a single L2 cache, where two cores have immediate access and the other two pay a small penalty, using Intel Advanced Smart Cache to get to the other L2. On the E7330, in the worst-case scenario, the 4 threads land on four separate sockets and three of them will have to wait for the data to be read from memory.
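One way to see where the four threads actually land is to log the CPU number from inside each section. The following is a minimal sketch for Linux, assuming glibc's `sched_getcpu`; the helper names `current_cpu` and `placement_line` are hypothetical.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // sched_getcpu is a GNU extension
#endif
#include <sched.h>    // sched_getcpu (Linux/glibc)
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: logical CPU the calling thread is running on right now,
// or -1 if the call fails. The value can change at any time unless the thread
// is locked to a single CPU.
int current_cpu() {
    return sched_getcpu();
}

// Hypothetical helper: format one log line for a given section and CPU.
std::string placement_line(int section, int cpu) {
    char buf[64];
    int n = std::snprintf(buf, sizeof(buf), "section %d ran on cpu %d", section, cpu);
    assert(n > 0 && n < static_cast<int>(sizeof(buf)));
    (void)n;  // keeps the variable used when asserts are compiled out
    return std::string(buf);
}
```

Calling `placement_line(1, current_cpu())` at the start and end of the first section, and likewise for the others, and printing the result would show whether the threads stay put or migrate between cores and sockets during a run.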
