Which thread on which processor - can it be controlled (scheduled)?

bj_cw · ‎01-17-2007

I have an Intel Pentium D (dual core) running Fedora with kernel 2.6.15-1.2054_FC5SMP.

Is the following code good enough for checking which thread is running on which core?
When I run it, I see that all the threads run on only one core.
How can I set the affinity of 25 threads to one core and the other 25 threads to the other core?

The code is here -
/***************************************************/
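#define _GNU_SOURCE /* for pthread_getaffinity_np() and the CPU_* macros */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

pthread_mutex_t mutex_printf;

void *fillmem(void *arg)
{
    int i;
    volatile int j; /* volatile keeps the delay loop from being optimized away */
    cpu_set_t mask;

    (void) arg;
    for(i=0; i<200; i++)
    {
        pthread_mutex_lock (&mutex_printf);
        pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask);
        /* Print the affinity bits for CPU 1 and CPU 0: the cores this thread
           is allowed to run on, not the core it is running on right now */
        printf("%d%d ", CPU_ISSET(1, &mask) != 0, CPU_ISSET(0, &mask) != 0);
        pthread_mutex_unlock (&mutex_printf);

        for(j=0; j<20000; j++); /* Delay */
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    int j;
    pthread_t my_thread[50];
    cpu_set_t mask;

    pthread_mutex_init(&mutex_printf, NULL);

    for(j=0;j<50;j++) /* Loop to fire threads */
    {
        pthread_create(&my_thread[j], NULL, fillmem, NULL);
    }

    for(j=0;j<50;j++)
        pthread_join(my_thread[j], NULL);

    pthread_mutex_destroy(&mutex_printf);

    pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask);
    printf("%d%d ", CPU_ISSET(1, &mask) != 0, CPU_ISSET(0, &mask) != 0);
    return 0;
}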
/*********************************************************/

How can I set the affinity of 25 threads to one core and the other 25 threads to the other core?
(I tried pthread_setaffinity_np(), sched_setaffinity() ...)
Could you give me some sample code?

Thanks. :-)
BJ_CW

dpotages · ‎01-17-2007

Hi there,

Well, I'm not an expert, but I'd suggest you have a look at the "Detecting Multi-Core Processor Topology in an IA-32 Platform" document. In their sample, they show how to retrieve the number of cores, and they also get the thread affinity related to each core. Then it's just a matter of using the proper affinity mask for each group of threads.

/david

bj_cw · ‎01-18-2007

(Thanks David)

But that code uses native assembly instructions, and I'm developing a portable app. Please consider the following code:

/******************/
#include <pthread.h>
#include <stdio.h>
#include <time.h>

void *th1(void *arg)
{
    int i;
    int j;
    int k;
    for(i=0; i<10000;i++)
        for(j=0; j<10000;j++)
            for(k=0; k<100;k++);
    return NULL;
}

void *th2(void *arg)
{
    int i;
    int j;
    int k;
    for(i=0; i<10000;i++)
        for(j=0; j<10000;j++)
            for(k=0; k<100;k++);
    return NULL;
}

int main()
{
    clock_t initial, final;
    pthread_t my_thread1, my_thread2;

    initial = clock ();

    pthread_create(&my_thread1, NULL, th1, NULL);

    pthread_create(&my_thread2, NULL, th2, NULL);

    pthread_join(my_thread2, NULL);
    pthread_join(my_thread1, NULL);

    final = clock();
    printf("time = %lf ", (final-initial)/(double)CLOCKS_PER_SEC );
    return 0;
}
/******************/

This code reports a time of about 46 on the single-core machine and about 96 on the dual core. (Sometimes on the dual core it reports about 46.) Why doesn't it execute in about 23? Or how can it be tweaked to run in 23?

Thanks, :-)

BJ_CW

TimP · ‎01-18-2007

If those threads run that fast, it indicates your compiler has optimized away those loops. You would need some operations in the loops which the compiler doesn't recognize as do-nothing. As it stands, you are measuring only the time required to set up the threads, which necessarily increases with the number of threads.

bj_cw · ‎01-18-2007

(Hi tim18)

I assumed that the clock() calls do not interfere with the thread create, execute, and join phases. Is that assumption wrong?

Are both threads (th1 and th2 in the code above) executing on two different cores of the Pentium D? If so, what is the best way to measure the execution time?

Or how can I confirm that th1 and th2 are running on two different cores?

(The empty for loop should be good enough as do-nothing work that is not removed, since no optimization level is set.)

-BJ_CW.

bj_cw · ‎01-22-2007

Hi there,

Could anyone try the code above and check whether it takes twice as long on a dual core as on a single core? (My setup: gcc, Fedora, and x86, as mentioned in the first post.)

Wouldn't you expect it to take half the single-core time rather than double?

Are there any settings to take care of (related to gcc, machine BIOS, NUMA, etc.)?

:-)

-BJ_CW

bj_cw · ‎01-24-2007

Hi

One more point to update:

I claim that the single-core and dual-core setups I'm using have exactly the same features.
How?

Machine is: Intel Pentium D (dual core).

Fedora kernel: 2.6.15-1.2054_FC5 for Single core.

Fedora kernel: 2.6.15-1.2054_FC5SMP for Dual core.

Now, when I take one of the programs above and run it on the single-core setup, it takes "x" seconds, and on the dual-core setup it takes "2x" seconds. (Shouldn't it take "x/2" seconds?)
Why this performance degradation?
Do you get similar results on your machine?

What do you suggest I do to get "x/2", if "x/2" is the right thing to expect?

-BJ_CW

jimdempseyatthecove · ‎01-24-2007

BJ_CW,

I took the liberty of modifying your source to use OpenMP in lieu of pthreads and ran it on my Windows Server 2003 system with 4 processors. Your pthread results should be similar, assuming your threads are actually starting as you intend them to.

#include <stdio.h>
#include <time.h>
#include <omp.h>

void th1()
{
int i;
int j;
int k;
for(i=0; i<10000;i++)
 for(j=0; j<10000;j++)
 for(k=0; k<100;k++);
}

void th2()
{
int i;
int j;
int k;
for(i=0; i<10000;i++)
 for(j=0; j<10000;j++)
 for(k=0; k<100;k++);
}

void th1P4();// forward reference to variant on th1
void th2P4();// forward reference to variant on th2

int main()
{
clock_t initial, final, seconds;
// Single Thread
printf("Begin single thread test...\n");
initial = clock ();
th1();
th2();
final = clock();
printf("time = %lf\n\n", (final-initial)/(double)CLOCKS_PER_SEC );

printf("Begin two thread test...\n");
initial = clock ();
#pragma omp parallel sections num_threads(2)
{
 #pragma omp section
 {
 th1();
 }
 #pragma omp section
 {
 th2();
 }
}
final = clock();
printf("time = %lf\n\n", (final-initial)/(double)CLOCKS_PER_SEC );

printf("Begin Four thread test...\n");
initial = clock ();
th1P4();
th2P4();
final = clock();
printf("time = %lf\n\n", (final-initial)/(double)CLOCKS_PER_SEC );
}

void th1P4()
{
int i;
int j;
int k;
#pragma omp parallel for num_threads(4) private(i, j, k)
for(i=0; i<10000;i++)
 for(j=0; j<10000;j++)
 for(k=0; k<100;k++);
}

void th2P4()
{
int i;
int j;
int k;
#pragma omp parallel for num_threads(4) private(i, j, k)
for(i=0; i<10000;i++)
 for(j=0; j<10000;j++)
 for(k=0; k<100;k++);
}

------------- Output --------------------

Begin single thread test...
time = 53.343000

Begin two thread test...
time = 26.703000

Begin Four thread test...
time = 13.375000

Jim Dempsey

pbkenned1 · ‎01-24-2007

Use the "time" command and check the "user" and "real" times reported. On 2.6.x Redhat Linux kernels, clock() reports "user" time. This is the sum of CPU time used by all threads and can be greater than "real" time -- which is the wallclock time.

On 2.4.x Redhat Linux kernels, clock() reports "real" time -- i.e., true wallclock time.

Here's an example that should allow you to reclaim your sanity. "Matmul kernel wall clock time" is just the delta from reading clock() before and after entering the parallel region.

On my Pentium D box (2 threads) running a Redhat 2.6 kernel, notice that what clock() reports is very close to the "user" value, as reported by "time" ---

$ cat /proc/version
Linux version 2.6.9-11.ELsmp ([email protected]) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:25:30 EDT 2005

$ icc -openmp matmul_clock.cpp && time ./a.out

Using clock() for wall clock time
Problem size: c(900,3600) = a(900,1800)*b(1800,3600)
Calculating product 5 time(s)
We are using 2 thread(s)...

Matmul kernel wall clock time = 30.45 sec
Wall clock time/thread = 15.225 sec
Expected value for each matrix element is 1620900
Checking that all 3240000 elements of c = 1620900...done

===>>> Solution Validates <<<===

real 0m15.346s
user 0m30.454s
sys 0m0.055s
$

Now, compare that to my hyperthreaded DP Xeon server (4 threads), running a Redhat 2.4 kernel -- you will see that clock() (Matmul kernel wall clock time) is very close to time's "real" time -- and that the "user" time is about 4x the "real" time:

$ cat /proc/version
Linux version 2.4.21-20.EL ([email protected]) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-42)) #1 SMP Wed Aug 18 20:34:58 EDT 2004

$ icc -openmp matmul_clock.cpp && time ./a.out

Using clock() for wall clock time
Problem size: c(900,3600) = a(900,1800)*b(1800,3600)
Calculating product 5 time(s)
We are using 4 thread(s)...

Matmul kernel wall clock time = 17.64 sec
Wall clock time/thread = 4.41 sec
Expected value for each matrix element is 1620900
Checking that all 3240000 elements of c = 1620900...done

===>>> Solution Validates <<<===

real 0m18.089s
user 1m10.160s
sys 0m0.090s
$

Best Regards,

Patrick Kennedy

Intel Developer Support

bj_cw · ‎01-30-2007

(Thanks Patrick Kennedy)

Please consider the following source:
/* http://www.csce.uark.edu/~aapon/courses/os/examples/another.c */
/* I've modified slightly - BJ_CW */

/**************************************************/
/* Another thread example. This one shows that */
/* pthreads in Linux can use both processors in */
/* a dual-processor Pentium. */
/* */
/* Usage: a.out */
/* */
/* To compile me in Linux type: */
/* gcc -o another another.c -lpthread */
/**************************************************/

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_THREADS 10

int sum; /* this data is shared by the thread(s) */
void *runner(void * param);

int main(int argc, char *argv[])
{
int num_threads, i;
pthread_t tid[MAX_THREADS]; /* the thread identifiers */
pthread_attr_t attr; /* set of thread attributes */

if (argc != 2) {
fprintf(stderr, "usage: a.out ");
exit(3);
}

if (atoi(argv[1]) <= 0) {
fprintf(stderr,"%d must be > 0 ", atoi(argv[1]));
exit(1);
}

if (atoi(argv[1]) > MAX_THREADS) {
fprintf(stderr,"%d must be <= %d ", atoi(argv[1]), MAX_THREADS);
exit(2);
}

num_threads = atoi(argv[1]);
printf("The number of threads is %d ", num_threads);

/* get the default attributes */
pthread_attr_init(&attr);

/* create the threads */
for (i=0; i<num_threads; i++) {
pthread_create(&(tid[i]), &attr, runner, (void *)(long) i);
printf("Creating thread number %d, tid=%lu ", i, (unsigned long) tid[i]);
}

/* now wait for the threads to exit */
for (i=0; i<num_threads; i++) {
pthread_join(tid[i],NULL);
}

return 0;
}

/* The thread will begin control in this function */
void *runner(void * param)
{
int i;
int threadnumber = (int)(long) param;
for (i=0; i<1000; i++) printf("Thread number=%d, i=%d ", threadnumber, i);
pthread_exit(0);
}
/*************************************/

When I used the command:
$ time ./a.out 10 > a.txt

the result was worse for dual core than single core -
single core result:

real 0m0.008s
user 0m0.004s
sys 0m0.004s

or

real 0m0.008s
user 0m0.008s
sys 0m0.000s

dual core result:

real 0m0.016s
user 0m0.008s
sys 0m0.020s

or

real 0m0.009s
user 0m0.008s
sys 0m0.000s

The 'real' time for dual core is more than that for single core.
Performance seems to degrade on the dual core instead of improving.
Is something wrong?
Or is something missing?
Do you get similar results on your machine?
Can you explain why?

(The configuration of the system is same as above
Machine is: Intel Pentium D (dual core).
Fedora kernel: 2.6.15-1.2054_FC5 for Single core.
Fedora kernel: 2.6.15-1.2054_FC5SMP for Dual core
)

-BJ_CW

jimdempseyatthecove · ‎01-30-2007

printf is a serialized function (on MP it enters a critical section, prints, then exits the critical section), and the display output rate isn't infinite. With more threads butting into each other, the code will take longer to run.

Insert some compute-only code in your runner function:

void *runner(void * param)
{
 int i;
 int threadnumber = (int)(long) param;
 printf("Begin Thread number=%d\n", threadnumber);
 for (i=0; i<1000000; i++)
  if((double)i == 0.5) break;
 printf("End Thread number=%d, i=%d\n", threadnumber, i);
 pthread_exit(0);
}

Jim Dempsey