Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Pause Instruction

vishaln
Beginner

Hi,

I was trying to pass data between two threads:

- a producer thread affinitized to run on one core (say 10) [PRODUCER]
- a consumer thread affinitized to run on another core (say 11), spinning in a busy loop [CONSUMER]

The processor is an Intel Xeon CPU X7350 @ 2.93GHz.

The producer thread writes the tick count to a cache-aligned variable and goes to sleep for a second. The consumer thread, which is busy-waiting for the shared variable to change, reads the tick count written by the producer and computes the difference from the current tick count. This process is repeated 10 times.

(Code at end of post)

A sample result is as follows:

count,ticks,Usec(ticks/3000)
0,2310,0.77
1,2882,0.960667
2,1694,0.564667
3,1551,0.517
4,2288,0.762667
5,1826,0.608667
6,1958,0.652667
7,2618,0.872667
8,2541,0.847
9,1870,0.623333
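
The third column above is just the tick count divided by an assumed 3000 ticks per microsecond (a 3 GHz approximation of the 2.93GHz clock). A minimal sketch of that conversion follows; the function name `ticks_to_usec` and its default divisor are illustrative and not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Convert a TSC tick count to microseconds.
// ticks_per_usec is an assumption: 3000 matches the divisor used in the
// table; 2930 would correspond to the nominal 2.93 GHz clock instead.
double ticks_to_usec(std::int64_t ticks, double ticks_per_usec = 3000.0) {
    assert(ticks_per_usec > 0.0);  // guard against a zero or negative rate
    return static_cast<double>(ticks) / ticks_per_usec;
}
```

Calling `ticks_to_usec(2310)` reproduces the first row of the table; passing `2930.0` as the second argument gives slightly larger values, which does not change the overall picture.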

There are a couple of things that I find disturbing:

1. It seems to take ~2000+ cycles to notify the consumer of a data change.
2. Adding the PAUSE instruction seems to have no effect on the numbers (http://software.intel.com/file/27087).

It would be great if anyone could help me with this. I'm still feeling my way around parallel programming. Guidance is greatly appreciated.

Thanks

======================================================

#include <stdint.h>   // int64_t
#include <unistd.h>   // sleep
#include <iostream>   // std::cout

#define CACHE_LINE_SIZE (64)
// Shared tick stamp on its own cache line; 0 means "no new data".
volatile int64_t g_t __attribute__ ((aligned (CACHE_LINE_SIZE))) = 0;

const static int COUNT=10;
int64_t lat_recv[COUNT];

// Read the time-stamp counter.
int64_t ticks()
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

void producer() {
    for( int i = 0; i < COUNT; ++i ) {
        sleep(1);
        g_t = ticks();   // publish the current tick count
    }
}

void receiver() {
    for( int i = 0; i < COUNT; ++i ) {
        while( !g_t ) __asm__ __volatile__ ( "pause" ) ;   // busy-wait for new data
        lat_recv[ i ] = ticks() - g_t;
        g_t = 0;         // mark the value as consumed
    }
}

// results
for( long i = 0; i < COUNT; ++i )
{
    std::cout << i << "," << lat_recv[ i ] << ","
              << static_cast<double>( lat_recv[ i ] )/(3*1000) << std::endl;
}


jimdempseyatthecove
Honored Contributor III
What happens when you affinitize one thread to (01) and the second thread to (10)?
Affinity is a bit mask. In your example, the thread with an affinity of 11 could run on either 01 or 10, which means it could run on the same core as your other thread with an affinity of 10.
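
To make the bit-mask point concrete, here is a minimal sketch that treats an affinity value as a plain 64-bit mask, one bit per logical CPU. The helper names (`single_cpu_mask`, `cpus_in_mask`, `masks_overlap`) are hypothetical and are not part of any OS API; how the mask is actually handed to the scheduler depends on the platform and is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build a mask that permits exactly one logical CPU (only bit `cpu` set).
std::uint64_t single_cpu_mask(unsigned cpu) {
    assert(cpu < 64);  // this sketch only models up to 64 logical CPUs
    return std::uint64_t{1} << cpu;
}

// List the logical CPU indices that a mask permits.
std::vector<unsigned> cpus_in_mask(std::uint64_t mask) {
    std::vector<unsigned> cpus;
    for (unsigned cpu = 0; cpu < 64; ++cpu) {
        if (mask & (std::uint64_t{1} << cpu)) cpus.push_back(cpu);
    }
    return cpus;
}

// Two threads can end up on the same CPU if their masks share any set bit.
bool masks_overlap(std::uint64_t a, std::uint64_t b) {
    return (a & b) != 0;
}
```

With these helpers, `masks_overlap(0x3, 0x2)` is true (binary 11 and 10 share CPU 1), while `masks_overlap(single_cpu_mask(10), single_cpu_mask(11))` is false, which is the situation wanted when the two threads must sit on different cores.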

Second experiment: replace pause with nop.

Third experiment: replace pause/nop with a monitor of the address of g_t followed by mwait (which waits for a memory change at an address in proximity of the monitored address).
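
One way to run the pause versus nop comparison without editing the loop each time is to make the relax instruction a template parameter. The sketch below assumes GCC/Clang inline assembly on x86; `PauseRelax`, `NopRelax`, and `spin_until_nonzero` are hypothetical names, and the `max_spins` bound is an addition so the loop cannot hang. The monitor/mwait variant is not sketched here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Relax policies for the body of the busy-wait loop.
// On non-x86 or non-GCC/Clang builds they do nothing (an assumption made
// only so that the sketch still compiles there).
struct PauseRelax {
    static void relax() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        __asm__ __volatile__("pause");
#endif
    }
};

struct NopRelax {
    static void relax() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        __asm__ __volatile__("nop");
#endif
    }
};

// Spin until `flag` becomes nonzero or `max_spins` iterations have elapsed.
// Returns the observed nonzero value, or 0 on timeout.
template <typename Relax>
std::int64_t spin_until_nonzero(const std::atomic<std::int64_t>& flag,
                                std::uint64_t max_spins) {
    assert(max_spins > 0);
    for (std::uint64_t i = 0; i < max_spins; ++i) {
        const std::int64_t value = flag.load(std::memory_order_acquire);
        if (value != 0) return value;
        Relax::relax();  // the instruction under test
    }
    return 0;
}
```

The receiver would call `spin_until_nonzero<PauseRelax>(...)` in one run and `spin_until_nonzero<NopRelax>(...)` in the next, keeping everything else identical so that any latency difference can be attributed to the instruction.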

Jim Dempsey
