Is the following code good enough for checking which thread is working on which core?
Here, I see that all the threads run on only one core.
How can I set the affinity of 25 threads to one core and the other 25 threads to another core?
The code is here -
/***************************************************/
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>

pthread_mutex_t mutex_printf;

void *fillmem(void *arg)
{
    int i, j, cpu;
    cpu_set_t cpuset;
    for (i = 0; i < 200; i++)
    {
        pthread_mutex_lock(&mutex_printf);
        pthread_getaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)   /* print the cores this thread is allowed to run on */
            if (CPU_ISSET(cpu, &cpuset))
                printf("%d ", cpu);
        pthread_mutex_unlock(&mutex_printf);
        for (j = 0; j < 20000; j++);              /* delay */
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    int j, cpu;
    pthread_t my_thread[50];
    cpu_set_t cpuset;

    pthread_mutex_init(&mutex_printf, NULL);
    for (j = 0; j < 50; j++)                      /* loop to fire the threads */
        pthread_create(&my_thread[j], NULL, fillmem, NULL);
    for (j = 0; j < 50; j++)
        pthread_join(my_thread[j], NULL);
    pthread_mutex_destroy(&mutex_printf);

    pthread_getaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)       /* main thread's allowed cores */
        if (CPU_ISSET(cpu, &cpuset))
            printf("%d ", cpu);
    printf("\n");
    return 0;
}
/*********************************************************/
How can I set the affinity of 25 threads to one core and the other 25 threads to another core?
(I tried pthread_setaffinity_np(), sched_setaffinity() ...)
Could you show me some sample code?
Thanks. :-)
BJ_CW
Hi there,
Well, I'm not an expert, but I'd suggest you have a look at the "Detecting Multi-Core Processor Topology in an IA-32 Platform" document. In their sample, they show how to retrieve the number of cores, and they also get the thread affinity related to each core. Then it's just a matter of using the proper affinity mask for each group of threads, as in the sketch below.
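For that last step, a minimal (untested) sketch, assuming Linux and glibc's non-portable pthread_setaffinity_np(), could look like this -- worker() is just a placeholder body:
/*********************************************************/
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NTHREADS 50

void *worker(void *arg)                   /* placeholder thread body */
{
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    cpu_set_t cpuset;
    int i;

    for (i = 0; i < NTHREADS; i++)
    {
        pthread_create(&tid[i], NULL, worker, NULL);
        CPU_ZERO(&cpuset);
        CPU_SET(i < 25 ? 0 : 1, &cpuset); /* core 0 for the first 25 threads, core 1 for the rest */
        pthread_setaffinity_np(tid[i], sizeof(cpuset), &cpuset);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
/*********************************************************/
You could also set the affinity from inside each thread with pthread_setaffinity_np(pthread_self(), ...); the idea is the same.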
/david
(Thanks David)
But in that code, native assembly instructions are used. I'm developing a portable app. Please consider the following code:
/******************/
#include <stdio.h>
#include <pthread.h>
#include <time.h>

void *th1(void *arg)
{
    int i, j, k;
    for (i = 0; i < 10000; i++)
        for (j = 0; j < 10000; j++)
            for (k = 0; k < 100; k++);
    return NULL;
}

void *th2(void *arg)
{
    int i, j, k;
    for (i = 0; i < 10000; i++)
        for (j = 0; j < 10000; j++)
            for (k = 0; k < 100; k++);
    return NULL;
}

int main()
{
    clock_t initial, final;
    pthread_t my_thread1, my_thread2;

    initial = clock();
    pthread_create(&my_thread1, NULL, th1, NULL);
    pthread_create(&my_thread2, NULL, th2, NULL);
    pthread_join(my_thread2, NULL);
    pthread_join(my_thread1, NULL);
    final = clock();
    printf("time = %lf\n", (final - initial) / (double)CLOCKS_PER_SEC);
    return 0;
}
/******************/
This code takes about 46 seconds (as reported by clock()) on a single-core machine, and about 96 on a dual-core machine (sometimes on the dual core it finishes in about 46). Why doesn't it execute in about 23 seconds? Or how can it be tweaked to run in 23?
Thanks, :-)
BJ_CW
(Hi tim18)
The clock() function used here is assumed not to interfere with the thread create, execute, and join phases. Is this assumption wrong?
Are both threads (th1 and th2 in the code above) executing on two different cores of the Pentium D? If yes, what is the best way to measure the length of execution?
Or how can I confirm that th1 and th2 are running on two different cores? For instance, would having each thread report sched_getcpu() be a reasonable check, as in the sketch below?
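A minimal sketch of that check, assuming a glibc that provides the non-portable sched_getcpu():
/******************/
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>

/* Each thread reports the core it is currently running on. */
void *report(void *arg)
{
    printf("thread %ld is on core %d\n", (long)arg, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, report, (void *)1L);
    pthread_create(&t2, NULL, report, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
/******************/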
(The for loop is good enough as a do-nothing delay, since no optimization level is set.)
-BJ_CW.
Hi there,
Could anyone play around with the code above and check whether it takes double the time on a dual core compared to a single core? (On my side: gcc, Fedora, x86 -- as mentioned in the first post.)
Wouldn't you expect it to take half the time on the dual core, rather than double?
Are there any settings to be taken care of (related to gcc, the machine BIOS, NUMA, etc.)?
:-)
-BJ_CW
One more point to update:
I claim that the single-core and dual-core setups I'm using are otherwise exactly the same.
How?
Machine is: Intel Pentium D (dual core).
Fedora kernel: 2.6.15-1.2054_FC5 for Single core.
Fedora kernel: 2.6.15-1.2054_FC5SMP for Dual core.
Now, when I take one of the codes above and run it on the single core, it takes "x" seconds, and on the dual core it takes "2x" seconds. (Shouldn't it take "x/2" seconds?)
Why this performance degradation?
Do you get similar result on your machine?
What do you suggest I do to get a performance of "x/2", if "x/2" is the right thing to expect?
-BJ_CW
BJ_CW,
I took the liberty of modifying your source to use OpenMP in lieu of pthreads; the results below are from my Windows Server 2003 system with 4 processors. Your pthread results should be similar, assuming your threads are actually starting as you intend them to start.
#include <stdio.h>
#include <time.h>
#include <omp.h>
void th1()
{
int i;
int j;
int k;
for(i=0; i<10000;i++)
for(j=0; j<10000;j++)
for(k=0; k<100;k++);
}
void th2()
{
int i;
int j;
int k;
for(i=0; i<10000;i++)
for(j=0; j<10000;j++)
for(k=0; k<100;k++);
}
void th1P4();// forward reference to variant on th1
void th2P4();// forward reference to variant on th2
int main()
{
clock_t initial, final;
// Single Thread
printf("Begin single thread test...\n");
initial = clock ();
th1();
th2();
final = clock();
printf("time = %lf\n", (final-initial)/(double)CLOCKS_PER_SEC );
printf("Begin two thread test...\n");
initial = clock ();
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
{
th1();
}
#pragma omp section
{
th2();
}
}
final = clock();
printf("time = %lf\n", (final-initial)/(double)CLOCKS_PER_SEC );
printf("Begin Four thread test...\n");
initial = clock ();
th1P4();
th2P4();
final = clock();
printf("time = %lf\n", (final-initial)/(double)CLOCKS_PER_SEC );
}
void th1P4()
{
int i;
int j;
int k;
#pragma omp parallel for num_threads(4) private(i, j, k)
for(i=0; i<10000;i++)
for(j=0; j<10000;j++)
for(k=0; k<100;k++);
}
void th2P4()
{
int i;
int j;
int k;
#pragma omp parallel for num_threads(4) private(i, j, k)
for(i=0; i<10000;i++)
for(j=0; j<10000;j++)
for(k=0; k<100;k++);
}
------------- Output --------------------
Begin single thread test...
time = 53.343000
Begin two thread test...
time = 26.703000
Begin Four thread test...
time = 13.375000
Jim Dempsey
Use the "time" command and check the "user" and "real" times reported. On 2.6.x Redhat Linux kernels, clock() reports "user" time. This is the sum of CPU time used by all threads and can be greater than "real" time -- which is the wallclock time.
On 2.4.x Redhat Linux kernels, clock() reports "real" time -- ie, true wallclock time.
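(If you want to measure wall-clock time from inside the program itself, gettimeofday() is one option; this is just a sketch, separate from the matmul example below:)
/*********************************************************/
#include <stdio.h>
#include <sys/time.h>

/* Wall-clock seconds via gettimeofday(); unlike clock(), which on 2.6
   kernels accumulates CPU time across all of the process's threads. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    double t0 = wall_seconds();
    /* ... create, run, and join the threads here ... */
    double t1 = wall_seconds();
    printf("wall time = %f sec\n", t1 - t0);
    return 0;
}
/*********************************************************/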
Here's an example that should allow you to reclaim your sanity. "Matmul kernel wall clock time" is just the delta from reading clock() before and after entering the parallel region.
On my Pentium D box (2 threads) running a Redhat 2.6 kernel, notice that what clock() reports is very close to the "user" value, as reported by "time" ---
$ cat /proc/version
Linux version 2.6.9-11.ELsmp (bhcompile@crowe.devel.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:25:30 EDT 2005
$ icc -V
Intel C Compiler for Intel EM64T-based applications, Version 9.1 Build 20070109 Package ID: l_cc_c_9.1.046
Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
$ icc -openmp matmul_clock.cpp && time ./a.out
Using clock() for wall clock time
Problem size: c(900,3600) = a(900,1800)*b(1800,3600)
Calculating product 5 time(s)
We are using 2 thread(s)...
Matmul kernel wall clock time = 30.45 sec
Wall clock time/thread = 15.225 sec
Expected value for each matrix element is 1620900
Checking that all 3240000 elements of c
===>>> Solution Validates <<<===
real 0m15.346s
user 0m30.454s
sys 0m0.055s
$
Now, compare that to my hyperthreaded DP Xeon server (4 threads), running a Redhat 2.4 kernel -- you will see that clock() (Matmul kernel wall clock time) is very close to time's "real" time -- and that the "user" time is about 4x the "real" time:
$ cat /proc/version
Linux version 2.4.21-20.EL (bhcompile@dolly.build.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-42)) #1 SMP Wed Aug 18 20:34:58 EDT 2004
$ icc -V
Intel C Compiler for Intel EM64T-based applications, Version 9.1 Build 20070109 Package ID: l_cc_c_9.1.046
Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
$ icc -openmp matmul_clock.cpp && time ./a.out
Using clock() for wall clock time
Problem size: c(900,3600) = a(900,1800)*b(1800,3600)
Calculating product 5 time(s)
We are using 4 thread(s)...
Matmul kernel wall clock time = 17.64 sec
Wall clock time/thread = 4.41 sec
Expected value for each matrix element is 1620900
Checking that all 3240000 elements of c
===>>> Solution Validates <<<===
real 0m18.089s
user 1m10.160s
sys 0m0.090s
$
Best Regards,
Patrick Kennedy
Intel Developer Support
(Thanks Patrick Kennedy)
Please consider the following source:
/* http://www.csce.uark.edu/~aapon/courses/os/examples/another.c */
/* I've modified slightly - BJ_CW */
/**************************************************/
/* Another thread example. This one shows that */
/* pthreads in Linux can use both processors in */
/* a dual-processor Pentium. */
/* */
/* Usage: a.out <number of threads>               */
/* */
/* To compile me in Linux type: */
/* gcc -o another another.c -lpthread */
/**************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define MAX_THREADS 10
int sum; /* this data is shared by the thread(s) */
void *runner(void * param);
int main(int argc, char *argv[])
{
    int num_threads, i;
    pthread_t tid[MAX_THREADS]; /* the thread identifiers */
    pthread_attr_t attr;        /* set of thread attributes */
    if (argc != 2) {
        fprintf(stderr, "usage: a.out <number of threads>\n");
        exit(3);
    }
    if (atoi(argv[1]) <= 0) {
        fprintf(stderr, "%d must be > 0\n", atoi(argv[1]));
        exit(1);
    }
    if (atoi(argv[1]) > MAX_THREADS) {
        fprintf(stderr, "%d must be <= %d\n", atoi(argv[1]), MAX_THREADS);
        exit(2);
    }
    num_threads = atoi(argv[1]);
    printf("The number of threads is %d\n", num_threads);
    /* get the default attributes */
    pthread_attr_init(&attr);
    /* create the threads */
    for (i = 0; i < num_threads; i++) {
        pthread_create(&tid[i], &attr, runner, (void *)(long)i);
        printf("Creating thread number %d, tid=%lu\n", i, (unsigned long)tid[i]);
    }
    /* now wait for the threads to exit */
    for (i = 0; i < num_threads; i++) {
        pthread_join(tid[i], NULL);
    }
    return 0;
}
/* The thread will begin control in this function */
void *runner(void * param)
{
    int i;
    int threadnumber = (int)(long) param;
    for (i = 0; i < 1000; i++) printf("Thread number=%d, i=%d\n", threadnumber, i);
    pthread_exit(0);
}
/*************************************/
When I used the command:
$ time ./a.out 10 > a.txt
the result was worse on the dual core than on the single core -
single core result:
real 0m0.008s
user 0m0.004s
sys 0m0.004s
or
real 0m0.008s
user 0m0.008s
sys 0m0.000s
dual core result:
real 0m0.016s
user 0m0.008s
sys 0m0.020s
or
real 0m0.009s
user 0m0.008s
sys 0m0.000s
The 'real' time for dual core is more than that for single core.
Performance seems to degrade on the dual core instead of improving.
Is there something wrong?
Or is something missing?
Do you get a similar result on your machine?
Can you explain why?
(The configuration of the system is the same as above:
Machine is: Intel Pentium D (dual core).
Fedora kernel: 2.6.15-1.2054_FC5 for Single core.
Fedora kernel: 2.6.15-1.2054_FC5SMP for Dual core
)
-BJ_CW
printf is a serialized function (on an MP system it enters a critical section, prints, and exits the critical section), and the display output rate isn't infinite. With more threads butting into each other, the code will take longer to run.
Insert some compute-only code in your runner function:
void *runner(void * param)
{
    int i;
    int threadnumber = (int)(long) param;
    printf("Begin Thread number=%d\n", threadnumber);
    for (i = 0; i < 1000000; i++)
        if ((double)i == 0.5) break;   /* never true; keeps the compiler from optimizing the loop away */
    printf("End Thread number=%d\n", threadnumber);
    pthread_exit(0);
}
Jim Dempsey