Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP performance: Core 2 Quad Q6600 versus Xeon E7330?

jpgolab
Beginner

Hi,

I was wondering if someone could help me out with the performance issue I am seeing with my OpenMP application. My application is an LDPC decoder; it's very simple in that each thread runs a copy of the decoder on a block of data.

Up until now, I have been writing my code for the Q6600. My first assumption is that the Intel Xeon E7330 is as fast as the Core 2 Quad Q6600, and that is what I see when running my code in a single thread.

With OpenMP, I definitely see a ~4x improvement with multithreading on the Q6600 machine. However, on the Xeon, the performance is only around 2x better. The Xeon machine has 4 processors (16 cores total).

I'm using the Intel C++ compiler with the flags -fast -openmp under Red Hat Linux.

I could be doing something really dumb. Here's my code:

---------------------------------------------------------------------

/* header names inferred from usage (the angle-bracket names were
   stripped from the original post) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#include <omp.h>

using namespace std;

#include "ldpc.h"
#include "ldpc_dec.h"

int main(int argc, char *argv[])
{
    int i, j, k, ii;
    int rate;
    int frames;
    unsigned int yOutputSize = 0;
    clock_t timer;
    int n_thds = 0;
    int N = OUTPUT_SIZE;
    FILE *fpInputData, *fpOutputData;
    clx_ldpc_dec *ldpc_decoder1;

#ifdef MULTI_THREADED
    unsigned char xInput[OUTPUT_SIZE*NUMBER_OF_BUFFERS*4];
    unsigned char *xInput2;
    unsigned char *xInput3;
    unsigned char *xInput4;
    clx_ldpc_dec *ldpc_decoder2;
    clx_ldpc_dec *ldpc_decoder3;
    clx_ldpc_dec *ldpc_decoder4;
#else
    unsigned char xInput[OUTPUT_SIZE*NUMBER_OF_BUFFERS];
#endif

    rate = atoi(argv[3]);

    ldpc_decoder1 = new clx_ldpc_dec(rate);
#ifdef MULTI_THREADED
    ldpc_decoder2 = new clx_ldpc_dec(rate);
    ldpc_decoder3 = new clx_ldpc_dec(rate);
    ldpc_decoder4 = new clx_ldpc_dec(rate);
#endif

    fpInputData = fopen(argv[1], "rb");
    if (fpInputData == NULL) {
        printf("Failed to open input file %s\n", argv[1]);
        return -1;
    }

    fpOutputData = fopen(argv[2], "wb");
    if (fpOutputData == NULL) {
        printf("Failed to open output file %s\n", argv[2]);
        return -1;
    }

    n_thds = omp_get_max_threads();
    printf("Max Number of Threads = %d\n", n_thds);
    if (n_thds > 4) {
        printf("Setting Max Number of Threads to 4\n");
        omp_set_num_threads(4);
    }

    /* caution: on Linux clock() measures CPU time summed across all
       threads; omp_get_wtime() gives wall-clock time */
    timer = clock();

#ifdef MULTI_THREADED
    /* read four buffers at once, then decode them in parallel */
    for (k = 0; k < (NUMBDER_OF_BLOCKS/4)/NUMBER_OF_BUFFERS; k++) {
        fread(xInput, sizeof(unsigned char),
              BYTES_TO_READ*NUMBER_OF_BUFFERS*4, fpInputData);
        xInput2 = xInput + BYTES_TO_READ*NUMBER_OF_BUFFERS;
        xInput3 = xInput + BYTES_TO_READ*NUMBER_OF_BUFFERS*2;
        xInput4 = xInput + BYTES_TO_READ*NUMBER_OF_BUFFERS*3;

        #pragma omp parallel sections
        {
            #pragma omp section
            { ldpc_decoder1->decode(xInput); }
            #pragma omp section
            { ldpc_decoder2->decode(xInput2); }
            #pragma omp section
            { ldpc_decoder3->decode(xInput3); }
            #pragma omp section
            { ldpc_decoder4->decode(xInput4); }
        }

        fwrite(xInput, sizeof(unsigned char),
               BYTES_TO_READ*NUMBER_OF_BUFFERS*4, fpOutputData);
    }
#else
    for (k = 0; k < NUMBDER_OF_BLOCKS/NUMBER_OF_BUFFERS; k++) {
        fread(xInput, sizeof(unsigned char),
              BYTES_TO_READ*NUMBER_OF_BUFFERS, fpInputData);
        ldpc_decoder1->decode(xInput);
        fwrite(xInput, sizeof(unsigned char),
               BYTES_TO_READ*NUMBER_OF_BUFFERS, fpOutputData);
    }
#endif

    timer = clock() - timer;
    printf("Time: %f\n", (double)timer / (double)CLOCKS_PER_SEC);
    printf("%f mb/s\n", (double)(OUTPUT_SIZE*NUMBER_OF_BUFFERS) /
           ((double)timer / (double)CLOCKS_PER_SEC) * 1e-6);

    fclose(fpInputData);
    fclose(fpOutputData);
    return 0;
}

Thanks,

James

TimP
Honored Contributor III
I think we're mystified about what you're getting at. It seems logical to affinitize your threads to a single CPU, although you haven't mentioned doing so. If KMP_AFFINITY=compact doesn't work, specifying 4 cores on a single socket, by number, might. In the latter case you would need to open a separate environment for each job if your intent is to run more than one of these 4-thread jobs at a time. Still, a job which runs well on a single socket doesn't appear to be one for which a 4-socket machine could be recommended.
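For example (a sketch; the binary name and the core IDs here are hypothetical, and the ID-to-socket mapping depends on how the BIOS/OS numbers the cores):

export KMP_AFFINITY=compact
./ldpc_dec input.bin output.bin 1

# or name four cores on one socket explicitly (machine-dependent IDs):
export KMP_AFFINITY="proclist=[0,1,2,3],explicit"
./ldpc_dec input.bin output.bin 1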
jimdempseyatthecove
Honored Contributor III

The 1 x Q6600 is one socket of 4 cores, but internally it is 2 processors, each of 2 cores and each processor with its own cache. As such, the Q6600 has 2 separate cache systems.

The 4 x E7330 is four sockets, each socket 4 cores (x4 = 16 cores), but internally each E7330 has 2 processors (x4 = 8 processors), each of 2 cores (x8 = 16 cores), and as such there are 8 separate cache systems.

The test run on the Q6600 may have tended to keep each of the 4 threads more or less on the core on which it started. Thus the caches would tend to contain reusable data.

The test run on the 4 x E7330 may have tended to move the threads about the cores, and thus the test may tend to experience loss of cached data.

To make the test fair, set the affinities to force the test application to run within one socket. Note, the "processor" numbering scheme depends on how the BIOS and OS perform initialization. You may have to do some experimentation (or run a diagnostic) to determine the (affinity) processor number mapping for the E7330. Once determined, set the affinity bit map for the application to run within the 4 cores of one socket.
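A minimal sketch of that on Linux (the assumption that cores 0-3 share one socket is hypothetical; substitute whatever mapping you determine):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Restrict the whole process to one socket's cores; the OpenMP threads
   can still float, but only among these four cores. */
int pin_process_to_one_socket(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int core = 0; core < 4; core++)   /* assumes cores 0-3 = one socket */
        CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* pid 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Call it once at the top of main(), before the timed loop.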

The above settings permit the threads to float amongst cores, but only within one socket. This would approximate the test environment of the Q6600.

A separate test would be to lock each thread to a specific processor within one socket. Do this for both the Q6600 and the E7330.

A third test, for the E7330, would be to lock each thread to a processor in a different socket.

A fourth test, for the E7330, would be to lock each thread in sequence to each processor within the same socket, then the next socket (see the diagrams below).
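For the locked variants (tests 2 through 4), each OpenMP thread can pin itself from inside a parallel region; a sketch, again with the core ID list left as a hypothetical parameter:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Lock each of the 4 OpenMP threads to one core from the list.
   With pid 0, sched_setaffinity applies to the calling thread. */
void pin_each_thread(const int core_ids[4])
{
    #pragma omp parallel num_threads(4)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core_ids[omp_get_thread_num()], &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
    }
}

Pass in the four core IDs that match whichever pattern below you want to test.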

E7330 socket to thread associations:

Test 1
|-F-||-F-| |---||---|
|-F-||-F-| |---||---|
|---||---| |---||---|
|---||---| |---||---|
Test 2
|-L-||-L-| |---||---|
|-L-||-L-| |---||---|
|---||---| |---||---|
|---||---| |---||---|
Test 3
|-L-||---| |-L-||---|
|---||---| |---||---|
|-L-||---| |-L-||---|
|---||---| |---||---|
Test 4
|-L-||-L-| |-L-||-L-|
|---||---| |---||---|
|---||---| |---||---|
|---||---| |---||---|
Where:
F=Float within selected processors
L=Locked to specific processor
Socket:
|---||---|
|---||---|
Processor within socket (2 cores sharing the same cache):
|---|
|---|
Core within processor within socket:
|---|
Jim Dempsey

robert-reed
Valued Contributor II

Oh, no! Desktop versus Server wars! Guess I haven't been checking this forum as often as I might.

I think you've already demonstrated that an E7330 core runs as fast as a Q6600 core in your single-threaded test. And looking over the code sample you provided, nothing jumps out at me, but there's one thing the code doesn't answer: how big are these blocks of data being fed to the decoder instantiations? The only difference in the specs I saw is that the Q6600 has a full 8 MB of L2 cache while the E7330 has only 6 MB. If we were just talking one processor versus the other, big enough buffers could cause cache thrashing on the server part with four cores that you might not see on the Q6600. The affinity tests Jim suggested should explore that configuration more thoroughly.

Beyond the processor, several other factors come into play. The quad-socket E7330 uses FB-DIMM memory versus DDR2 on the Q6600. The E7330 has 16 cores contending for memory on two buses versus only 4 cores on the Q6600. The code limits processing to 4 threads, but those four threads could land on any of the 16 cores, as Jim suggested. And the way the code sample works, data for all four decoders is read into memory by a single thread. If the buffer is small enough, it may still all reside in a single L2 cache, where two cores have immediate access and the other two pay a small penalty, using Intel Advanced Smart Cache to get to the other L2. Whereas on the E7330, in the worst case, the 4 threads land on four separate sockets and three of them will have to wait for the data to be read from memory.
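One way to probe that last effect (a sketch only: it assumes POSIX pread/fileno and reuses the buffers and macros from James's code, with xInput2..4 set up as before) is to have each section read its own quarter of the block, so the data lands in the cache of the core that will decode it:

#include <unistd.h>   /* pread */

/* inside the outer loop, replacing the single fread */
size_t chunk = BYTES_TO_READ * NUMBER_OF_BUFFERS;
int fd = fileno(fpInputData);
off_t base = (off_t)k * 4 * chunk;

#pragma omp parallel sections
{
    #pragma omp section
    { pread(fd, xInput,  chunk, base);            ldpc_decoder1->decode(xInput);  }
    #pragma omp section
    { pread(fd, xInput2, chunk, base + chunk);    ldpc_decoder2->decode(xInput2); }
    #pragma omp section
    { pread(fd, xInput3, chunk, base + 2*chunk);  ldpc_decoder3->decode(xInput3); }
    #pragma omp section
    { pread(fd, xInput4, chunk, base + 3*chunk);  ldpc_decoder4->decode(xInput4); }
}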
