Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

About Bus Transaction

To improve performance, I am trying to minimize bus transactions.
I read the passage below from "IA-32 Intel Architecture Optimization", Chapter 7,
and tried to measure its effect, but I am not sure how to demonstrate the performance difference.
When I turned on Hyper-Threading, performance went down.
Is there anything incorrect in my test?
I would also like to know which factors affect performance.
The test source below is my code. Please review it.
System spec:
H/W: IBM xSeries 225
OS: Red Hat Linux 9
Compiler: icc 8.0
From "IA-32 Intel Architecture Optimization", Chapter 7:
Minimize Sharing of Data between Physical Processors
When two threads are executing on two physical processors and sharing
data, reading from or writing to shared data usually involves several bus
transactions (including snooping, request for ownership changes, and
sometimes fetching data across the bus). A thread accessing a large
amount of shared memory is not likely to scale with processor clock rates.
User/Source Coding Rule 31. (H impact, M generality) Minimize the
sharing of data between threads that execute on different physical processors
sharing a common bus.
One technique to minimize sharing of data is to copy data to local stack
variables if it is to be accessed repeatedly over an extended period. If
necessary, results from multiple threads can be combined later by
writing them back to a shared memory location. This approach can also
minimize time spent to synchronize access to shared data.
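The technique quoted above can be sketched roughly as follows. This is only a minimal illustration, not code from the manual: each thread accumulates into a stack variable and touches the shared result once, under a lock, when it is finished. The names `parallel_sum`, `sum_slice`, `struct slice`, and `MAX_THREADS` are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16  /* hypothetical cap for this sketch */

/* Half-open range [begin, end) of the read-only input owned by one thread. */
struct slice {
    const int *data;
    size_t begin, end;
};

static long total;  /* shared result, written once per thread */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *sum_slice(void *arg)
{
    const struct slice *s = arg;
    long local = 0;  /* stack variable: no other thread ever touches it */

    for (size_t i = s->begin; i < s->end; i++)
        local += s->data[i];

    /* Combine the private result into shared memory exactly once. */
    pthread_mutex_lock(&total_lock);
    total += local;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

/* Sums data[0..n) using nthreads threads. Not reentrant: it uses one global total. */
long parallel_sum(const int *data, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    int started[MAX_THREADS];
    struct slice s[MAX_THREADS];

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    total = 0;
    for (int t = 0; t < nthreads; t++) {
        s[t].data = data;
        s[t].begin = n * (size_t)t / (size_t)nthreads;
        s[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        started[t] = (pthread_create(&tid[t], NULL, sum_slice, &s[t]) == 0);
        if (!started[t])
            sum_slice(&s[t]);  /* fall back to doing the work inline */
    }
    for (int t = 0; t < nthreads; t++)
        if (started[t])
            pthread_join(tid[t], NULL);
    return total;
}
```

With this layout the lock is taken once per thread rather than once per element, so the shared cache line holding `total` changes owner only a handful of times. Build with `-pthread`.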

Test Source
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

// For Debug
#ifdef DEBUG
#define DPRINTF(arg) printf arg
#else
#define DPRINTF(arg)
#endif

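#define NUM_PROC 4
#define MAXLEN (1024*1024)
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int A[MAXLEN];
int B[MAXLEN];
int C[MAXLEN];
int full_cnt = 1;

/* Note: the outer loop bounds and the unlock calls were lost when the code
   was posted; full_cnt passes with one lock/unlock per pass are assumed. */
void* thread_fn(void *arg) {
    int *t1, *t2, *t3;
    long count = 0;
    int i;
#ifdef NORMAL
    /* Every pass reads the shared arrays A and B directly. */
    for (i=0; i<full_cnt; i++) {
        pthread_mutex_lock(&mutex);
        for (count=0; count<MAXLEN; count++)
            C[count] += A[count] + B[count];
        pthread_mutex_unlock(&mutex);
    }
#endif
#ifdef FAST
    /* Copy the shared inputs into thread-private buffers once. */
    t1 = (int*)malloc(sizeof(int)*MAXLEN);
    t2 = (int*)malloc(sizeof(int)*MAXLEN);
    t3 = (int*)malloc(sizeof(int)*MAXLEN);
    if (!t1 || !t2 || !t3) {
        free(t1); free(t2); free(t3);
        return NULL;
    }
    for (count=0; count<MAXLEN; count++) {
        t1[count] = A[count];
        t2[count] = B[count];
        t3[count] = t1[count] + t2[count];
    }
    /* Only the private sums are added to the shared array C. */
    for (i=0; i<full_cnt; i++) {
        pthread_mutex_lock(&mutex);
        for (count=0; count<MAXLEN; count++)
            C[count] += t3[count];
        pthread_mutex_unlock(&mutex);
    }
    free(t1); free(t2); free(t3);
#endif
    return NULL;
}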
int main(int argc, char *argv[])
{
    pthread_t tid[NUM_PROC];
    struct timeval start, end, result;
    long i;
    long j;
    if (argc < 2) {
        printf("usage: false_none count\n");
        return 0;
    }
    full_cnt = atoi(argv[1]);
    for (j=0; j< MAXLEN; j++) {
        A[j] = 1;
        B[j] = 1;
        C[j] = 0;
    }
    for (i=0; i<NUM_PROC; i++)
        pthread_create(&tid[i], NULL, thread_fn, NULL);

    gettimeofday(&start, NULL);
    for (i=0; i<NUM_PROC; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&end, NULL);
    timersub(&end, &start, &result);
    printf("%ld sec, %ld usec\n", (long)result.tv_sec, (long)result.tv_usec);
#ifdef DEBUG
    /* The original format string was lost in the post; printing C[i] is assumed. */
    for (i=255; i< 275; i++)
        printf("%d ", C[i]);
    printf("\n");
#endif
    return 0;
}
1 Reply

Persepone -

Are you running the same number of threads as before? They could be scheduled on the same physical processor (in a dual HT platform), which is exactly what you are trying to avoid with your division of data. Even with four threads on a dual HT-enabled system, you will have two threads assigned to the same physical processor, sharing or even splitting its cache and other processor resources.
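One way to take thread placement out of the scheduler's hands is to pin each thread to a chosen logical CPU. The following is only a rough sketch, assuming Linux with a glibc that provides `pthread_setaffinity_np` and `pthread_getaffinity_np` (GNU extensions that may not be available on older distributions). The helper names `first_allowed_cpu` and `pin_self_to_cpu` are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* needed for cpu_set_t macros and pthread_*affinity_np */
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Returns the lowest-numbered logical CPU the calling thread may run on,
   or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restricts the calling thread to a single logical CPU.
   Returns 0 on success or an error number on failure. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Each worker could call `pin_self_to_cpu` with a different CPU number at the start of its thread function. Which logical CPU numbers belong to the same physical processor is system-specific, so the mapping has to be checked on the machine under test before choosing the numbers. Build with `-pthread`.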

-- clay
