hi,
I run a big program on the CPU and the Phi in symmetric mode, but I find that more threads decrease performance on the Phi. The structure of the program is very simple; the outline follows.
#define move_threads 57
#define compute_threads 57
int a[1000000]; // a big array
int compute_count[compute_threads];
inline void * move()
{
}
inline void * status_count(){
}
inline void *compute(void * arg){
int * id=(int *)arg;
int count=0;
for(int i=*id;i<1000000;i+=compute_threads){
count+=a[i];
}
compute_count[*id]=count;
return (void *)0;
}
int main(){
for(int it=0;it<20;it++){
..............
for(.........){
pthread_create(&move_tid,NULL,move,NULL);
}
.............
for(int i=0;i<move_threads;i++){
pthread_join(move_tid[i],NULL);
}
..............
for(int i=0;i<compute_threads;i++){
ids[i]=i; // per-thread id; passing the address of a single shared id variable here would race
pthread_create(&compute_tid[i],NULL,compute,&ids[i]);
}
double before_compute=rtclock();
for(int i=0;i<compute_threads;i++){
pthread_join(compute_tid[i],NULL);
}
double after_compute=rtclock();
double t_compute=after_compute-before_compute;
...........................
total_compute+= t_compute;
}
cout<<total_compute<<endl;
}
There are many functions like compute and move in the program. Varying compute_threads gives:
compute_threads   total_compute (s)
 18               0.0100641
 57               0.0289173
114               0.0645232
228               0.11988
It seems that more threads decrease performance on the Phi. What's wrong?
If I strip out every function except compute, then more threads do improve performance.
By the way, there are 57 cores in my Phi.
Is rtclock() measuring elapsed time or CPU time?
rtclock() measures elapsed (wall-clock) time:
#include "util.h"
#include <stdio.h>
#include <sys/time.h>
double rtclock() {
struct timezone Tzp;
struct timeval Tp;
int stat;
stat = gettimeofday (&Tp, &Tzp);
if (stat != 0) printf("Error return from gettimeofday: %d",stat);
return(Tp.tv_sec + Tp.tv_usec*1.0e-6);
}
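As an aside, gettimeofday() returns wall-clock time, which can jump if the system clock is adjusted (e.g. by NTP); a monotonic clock is safer for interval timing. A minimal sketch of a drop-in alternative, assuming C++11 is available (the name rtclock_monotonic is illustrative, not from the original code):

```cpp
#include <chrono>

// Monotonic elapsed-time clock: returns seconds as a double, like rtclock(),
// but immune to wall-clock adjustments. The epoch is the first call, which
// is fine because the caller only ever takes differences of two readings.
double rtclock_monotonic() {
    using clock = std::chrono::steady_clock;
    static const clock::time_point start = clock::now();
    std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count();
}
```

Usage is the same as rtclock(): take one reading before the region, one after, and subtract.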
It is easy for more threads to reduce overall performance...
Three common cases are:
- The overhead of parallel synchronization constructs increases with thread count, so if the amount of work per synchronization is too small, total execution time can increase.
- Running more than one thread per physical core will result in less effective cache capacity per thread. This can increase the overall cache miss rate and the overall memory traffic and thereby increase execution time.
- Running more than one thread per physical core will increase the number of memory access streams. If the number of memory access streams exceeds the number of DRAM banks, bank conflict rates will increase. This leads to more DRAM stall cycles and (if ECC is enabled) to more DRAM reads/writes for the ECC data, again leading to increased execution time.
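The first point can be demonstrated directly: when the per-thread work is trivial, thread creation and join dominate the measured time, and adding threads only adds overhead. A hedged sketch, using std::thread for portability (the function name and thread counts are illustrative):

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Time how long it takes to create and join n threads that each do a
// trivial amount of work. As n grows, total time is dominated by
// thread-management overhead, not by the (empty) work itself -- the same
// effect seen when pthread_create/pthread_join wrap a tiny compute loop.
double spawn_join_seconds(int n) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    pool.reserve(n);
    for (int i = 0; i < n; ++i)
        pool.emplace_back([] { /* trivial work */ });
    for (auto& t : pool)
        t.join();
    std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
    return d.count();
}
```

Comparing spawn_join_seconds(18) against spawn_join_seconds(228) on any machine shows the overhead scaling roughly with the thread count, mirroring the timings reported above.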
Your runtime is very short, and each pthread_create/pthread_join pair incurs overhead. It is much better to use one of the thread-pooling paradigms (OpenMP, TBB, Cilk++, ...).
Also, your code is not using any affinity pinning (available with OpenMP via environment variables, by API with the others).
#define move_threads 57
#define compute_threads 57
int a[1000000]; // a big array
int compute_count[compute_threads];
inline void * move() { }
inline void * status_count(){ }
inline void *compute(void * arg){
  int * id=(int *)arg;
  int count=0;
  for(int i=*id;i<1000000;i+=compute_threads){
    count+=a[i];
  }
  compute_count[*id]=count;
  return (void *)0;
}
int main() {
  const int repCount = 20;
  double iteration_compute_time[repCount];
  double total_compute=0;
  for(int it=0;it<repCount;it++){
    #pragma omp parallel for num_threads(move_threads) schedule(static,1)
    for(int i=0;i<move_threads;i++){
      move();
    }
    // implicit barrier here
    double before_compute=rtclock(); // you had this placed in the wrong location
    #pragma omp parallel for num_threads(compute_threads) schedule(static,1)
    for(int i=0;i<compute_threads;i++){
      int id=i;
      compute(&id);
    }
    // implicit barrier here
    double after_compute=rtclock();
    iteration_compute_time[it]=after_compute-before_compute;
    total_compute+=iteration_compute_time[it];
  }
  for(int i=0;i<repCount;i++){
    cout<<iteration_compute_time[i]<<endl;
  }
  cout<<total_compute<<endl;
}
The above is untested code.
The above assumes all the moves are performed independently of (and completely before) the compute. You might find it beneficial to keep move_threads == compute_threads (always), then set up the move section (partitioning of data) the same as the compute section, assuming there are no temporal dependencies between partitions. This way you can use one #pragma omp parallel region, with each thread performing a move of a section followeded by a compute of that same section. You can then construct the partitions so that each partition's data usage fits within the L1 cache, and the number of partitions is independent of the number of threads (in the code above, the partition size is inversely proportional to the number of threads).
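A rough sketch of that fused scheme, under the stated assumptions (all names, the partition size, and the "move" initialization are illustrative; the pragma degrades to a serial loop if OpenMP is not enabled):

```cpp
#include <algorithm>
#include <vector>

// One parallel region: each thread "moves" (initializes) a partition and
// then immediately computes on it while the data is still hot in cache.
// The partition size is chosen by the caller to fit in L1; the number of
// partitions is independent of the thread count.
long long move_then_compute(std::vector<int>& a, int partition_elems) {
    long long total = 0;
    int n = (int)a.size();
    int num_parts = (n + partition_elems - 1) / partition_elems;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int p = 0; p < num_parts; ++p) {
        int lo = p * partition_elems;
        int hi = std::min(lo + partition_elems, n);
        for (int i = lo; i < hi; ++i)   // "move": initialize this partition
            a[i] = (i & 15) + 2;
        long long count = 0;
        for (int i = lo; i < hi; ++i)   // "compute": consume while cached
            count += a[i];
        total += count;
    }
    return total;
}
```

The design point is that each element is touched by the same thread for both phases, so the compute phase reads from cache instead of re-streaming the array from memory.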
Jim Dempsey
Hi Jim,
I changed the pthreads to OpenMP and set KMP_AFFINITY=scatter, but I see the same problem:
compute_threads   total_compute (s)
 18               0.00459003
 57               0.00944662
114               0.0506935
228               0.234504
Try performing some actual work:
#define move_threads 57
#define compute_threads 57
const int size_a = 1000000;
int a[size_a]; // a big array
int compute_count[compute_threads];
inline void * move() {
  for(int i=0;i<size_a;++i){
    a[i] = (i&15)+2; // some number between 2 and 17 (parenthesized: + binds tighter than &)
  }
}
inline void * status_count(){ }
// Serial recursive method to calculate Fibonacci series
// do not inline this (cannot be inlined because it is a recursive function)
int SerialFib( int n ) {
  if( n<2 )
    return n;
  else
    return SerialFib(n-1)+SerialFib(n-2);
}
inline void *compute(void * arg){
  int * id=(int *)arg;
  int count=0;
  for(int i=*id;i<size_a;i+=compute_threads){
    count+=a[i]; // sum of (some number between 2 and 17)
    // perform some actual work
    a[i] = SerialFib(a[i]);
  }
  compute_count[*id]=count;
  return (void *)0;
}
int main() {
  const int repCount = 20;
  double iteration_compute_time[repCount];
  double total_compute=0;
  for(int it=0;it<repCount;it++){
    #pragma omp parallel for num_threads(move_threads) schedule(static,1)
    for(int i=0;i<move_threads;i++){
      move(); // re-initialize array a
    }
    // implicit barrier here
    double before_compute=rtclock(); // you had this placed in the wrong location
    #pragma omp parallel for num_threads(compute_threads) schedule(static,1)
    for(int i=0;i<compute_threads;i++){
      int id=i;
      compute(&id);
    }
    // implicit barrier here
    double after_compute=rtclock();
    iteration_compute_time[it]=after_compute-before_compute;
    total_compute+=iteration_compute_time[it];
  }
  for(int i=0;i<repCount;i++){
    cout<<iteration_compute_time[i]<<endl;
  }
  cout<<total_compute<<endl;
}
The above is untested.
Jim Dempsey
Yes,
my post #9 is not correct. My intention was to have your compute function perform significantly more computation than memory R/W. Unfortunately, the recursive SerialFib is mostly memory I/O (pushing and popping return addresses).
Replace the SerialFib function with something that does some computation.
Also, bear in mind the strongest feature of the Xeon Phi is its wide vectors. See if you can construct a representative computation section that is both computationally intensive (without a lot of non-cachable memory references) and is amenable to vectorization.
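One kernel that fits this description is per-element polynomial evaluation by Horner's rule: many multiply-adds per load and store, no branches, and an inner loop the compiler can vectorize. A sketch, with arbitrary illustrative coefficients (not a kernel from the thread):

```cpp
#include <cstddef>
#include <vector>

// Compute-bound, vectorizable kernel: evaluate a degree-7 polynomial at
// every element via Horner's rule. One load and one store per element with
// seven multiply-adds in between -- a high arithmetic-to-memory ratio that
// wide vector units reward. Power-of-two coefficients are used so results
// are exactly representable in float.
void horner_kernel(std::vector<float>& x) {
    static const float c[8] = {1.0f, 0.5f, 0.25f, 0.125f,
                               0.0625f, 0.03125f, 0.015625f, 0.0078125f};
    for (std::size_t i = 0; i < x.size(); ++i) {
        float v = x[i];
        float acc = c[0];
        for (int k = 1; k < 8; ++k)
            acc = acc * v + c[k];   // multiply-add chain
        x[i] = acc;
    }
}
```

Raising the polynomial degree scales the arithmetic per element without touching any more memory, which makes it easy to dial in a compute-to-bandwidth ratio that exposes the vector units rather than the memory system.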
Jim Dempsey