Software Archive
Read-only legacy content
17061 Discussions

More threads decrease performance on Phi

9 Replies
ye_f_1
Beginner
574 Views

Hi,
    I run a big program on the CPU and the Phi in symmetric mode, but I find that more threads decrease performance on the Phi. The structure of the program is very simple; the following is an outline of the program.


#define move_threads 57
#define compute_threads 57

int a[1000000];  // a big array
int compute_count[compute_threads];
inline void * move()
{
}

inline void * status_count(){
}

inline void *compute(void * arg){
    int * id=(int *)arg;
    
    int count=0;
    for(int i=*id;i<1000000;i+=compute_threads){
        count+=a[i];
    }
    compute_count[*id]=count;
    return (void *)0;
}

 

int
main(){
    
  for(int it=0;it<20;it++){
    
    ..............
    for(.........){
        pthread_create(&move_tid[i],NULL,move,NULL);
    }
    
    .............
    for(int i=0;i<move_threads;i++){    
        pthread_join(move_tid[i],NULL);
    }
    ..............    

 

    for(int i=0;i<compute_threads;i++){
        id[i]=i;
        pthread_create(&compute_tid[i],NULL,compute,&id[i]);
    }

    double before_compute=rtclock();
    for(int i=0;i<compute_threads;i++){    
        pthread_join(compute_tid[i],NULL);
    }
    double after_compute=rtclock();
    double t_compute=after_compute-before_compute;

    ...........................
        
    total_compute += t_compute;
  }    
  cout<<total_compute<<endl;
}


There are many functions like compute and move in the program. Varying compute_threads gives:

    compute_threads    total_compute
         18            0.0100641 s
         57            0.0289173 s
        114            0.0645232 s
        228            0.11988 s

It seems that more threads decrease performance on the Phi. What's wrong?

If I remove all the functions except compute, performance does improve with more threads.

By the way, there are 57 cores in my Phi.

0 Kudos
Gregg_S_Intel
Employee
574 Views

Is rtclock() measuring elapsed time or CPU time?

0 Kudos
ye_f_1
Beginner
574 Views

rtclock() measures elapsed (wall-clock) time:

#include "util.h"
#include <stdio.h>
#include <sys/time.h>

double rtclock() {
        struct timezone Tzp;
        struct timeval Tp;
        int stat;
        stat = gettimeofday (&Tp, &Tzp);
        if (stat != 0) printf("Error return from gettimeofday: %d",stat);
        return(Tp.tv_sec + Tp.tv_usec*1.0e-6);
}

 

0 Kudos
McCalpinJohn
Honored Contributor III
574 Views

It is easy for more threads to reduce overall performance...

Three common cases are:

  1. The overhead of parallel synchronization constructs increases with thread count, so if the amount of work per synchronization is too small, total execution time can increase (a back-of-envelope estimate for this case follows below).
  2. Running more than one thread per physical core will result in less effective cache capacity per thread.  This can increase the overall cache miss rate and the overall memory traffic and thereby increase execution time.
  3. Running more than one thread per physical core will increase the number of memory access streams.  If the number of memory access streams exceeds the number of DRAM banks, bank conflict rates will increase.  This leads to more DRAM stall cycles and (if ECC is enabled) to more DRAM reads/writes for the ECC data, again leading to increased execution time.
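
A back-of-envelope check of point 1 against the code above (an editor's estimate, not part of the original reply): the compute loops perform about 1,000,000 additions in total no matter how many threads share them, so with 57 threads each thread executes only about 17,500 iterations. At the roughly 1 GHz clock of a Phi core that is on the order of tens of microseconds of work per thread, which is comparable to the cost of creating and joining a pthread. Adding threads therefore shrinks the useful work per thread while the per-thread management cost stays fixed, so total time goes up.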
0 Kudos
jimdempseyatthecove
Honored Contributor III
574 Views

Your runtime is very short, and each time you issue pthread_create/pthread_join you incur some overhead. It is much better to use one of the thread-pooling paradigms (OpenMP, TBB, Cilk++, ...).
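
To get a feel for how large that create/join cost is by itself, a minimal microbenchmark along the following lines can measure it (an illustrative sketch added in editing, not from the original post; the timer is the same gettimeofday approach shown earlier in the thread):

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define NTHREADS 57

// Wall-clock timer, same approach as rtclock() earlier in the thread.
double rtclock() {
    struct timeval Tp;
    gettimeofday(&Tp, NULL);
    return Tp.tv_sec + Tp.tv_usec * 1.0e-6;
}

// Thread function that does no work at all.
void *noop(void *arg) { return NULL; }

int main() {
    pthread_t tid[NTHREADS];
    double t0 = rtclock();
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, noop, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    double t1 = rtclock();
    // Everything measured here is pure thread-management overhead.
    printf("create/join of %d no-op threads: %f s\n", NTHREADS, t1 - t0);
    return 0;
}

Whatever this prints is a cost your loop pays on every one of the 20 iterations before any useful work is done.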

Also, your code is not using any affinity pinning (available with OpenMP via environment variables, and by API with the others).

#define move_threads 57
#define compute_threads 57

int a[1000000];  // a big array
 int compute_count[compute_threads];
 inline void * move()
 {
 }

inline void * status_count(){
 }

inline void *compute(void * arg){
     int * id=(int *)arg;
     
     int count=0;
     for(int i=*id;i<1000000;i+=compute_threads){
         count+=a[i];
     }
     compute_count[*id]=count;
     return (void *)0;
 }

 

int main()
{
   const int repCount = 20;
   double iteration_compute_time[repCount];
   double total_compute = 0.0;
     
   for(int it=0;it<repCount;it++){
     #pragma omp parallel for num_threads(move_threads) schedule(static,1)
     for(int i=0;i<move_threads;i++){
        move();
     } // implicit barrier here
     

 

    double before_compute=rtclock(); // you had this placed in the wrong location
    #pragma omp parallel for num_threads(compute_threads) schedule(static,1)
    for(int i=0;i<compute_threads;i++){
         int id=i;
         compute(&id);
     } // implicit barrier here

     double after_compute=rtclock();
     iteration_compute_time[it]=after_compute-before_compute;

          
     total_compute += iteration_compute_time[it];
   }    
   for(int i=0;i<repCount;i++){
     cout<<iteration_compute_time[i]<<endl;
   }
   cout<<total_compute<<endl;
 }


The above is untested code.

The above assumes all the moves are performed independently of (and completely before) the compute. You might find it beneficial to keep move_threads == compute_threads (always), then set up the move section (the partitioning of data) the same as the compute section, assuming there are no temporal dependencies between partitions. This way you can use one #pragma omp parallel region, with each thread performing the move of a section followeded by its compute. In this manner you can construct the partitions such that each partition's data fits within the L1 cache, and the number of partitions is independent of the number of threads (the code above has the partition size inversely proportional to the number of threads).
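
As an illustration of that suggestion (an editor's sketch, not Jim's code; move_section and compute_section are hypothetical helpers, each operating on the slice of data owned by one thread):

#include <omp.h>

// Hypothetical per-slice helpers: each touches only thread tid's partition.
void move_section(int tid);
void compute_section(int tid);

void move_then_compute()
{
    // One parallel region instead of two: threads are created once, and each
    // thread's compute reuses the data its own move just brought into cache.
    #pragma omp parallel num_threads(compute_threads)
    {
        int tid = omp_get_thread_num();
        move_section(tid);      // move this thread's partition
        #pragma omp barrier     // all moves finish before any compute starts
        compute_section(tid);   // compute on the same, cache-warm partition
    }
}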

Jim Dempsey

0 Kudos
ye_f_1
Beginner
574 Views

Hi, Jim,
    I changed pthread to OpenMP and set KMP_AFFINITY=scatter, but I see the same problem:
    
    compute_threads    total_compute
         18            0.00459003
         57            0.00944662
        114            0.0506935
        228            0.234504

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
574 Views

Try performing some actual work:

#define move_threads 57
#define compute_threads 57

const int size_a = 1000000;
int a[size_a];  // a big array

int compute_count[compute_threads];

inline void * move()
{
  for(int i=0;i<size_a;++i){
    a[i] = (i&15)+2; // some number between 2 and 17
  }
}

inline void * status_count(){
 }


// Serial recursive method to calculate Fibonacci series
// do not inline this (cannot be inlined because it is a recursive function)
int SerialFib( int n )
{
 if( n<2 )
  return n;
 else
  return SerialFib(n-1)+SerialFib(n-2);
}

inline void *compute(void * arg){
     int * id=(int *)arg;
     
     int count=0;
     for(int i=*id;i<size_a;i+=compute_threads){
         count+=a[i]; // sum of (some number between 2 and 17)
         // perform some actual work
         a[i] = SerialFib(a[i]);
     }
     compute_count[*id]=count;
     return (void *)0;
 }

 

int main()
{
   const int repCount = 20;
   double iteration_compute_time[repCount];
   double total_compute = 0.0;
     
   for(int it=0;it<repCount;it++){
     #pragma omp parallel for num_threads(move_threads) schedule(static,1)
     for(int i=0;i<move_threads;i++){
        move();  // re-initialize array a
     } // implicit barrier here
     

 

    double before_compute=rtclock(); // you had this placed in the wrong location
    #pragma omp parallel for num_threads(compute_threads) schedule(static,1)
    for(int i=0;i<compute_threads;i++){
         int id=i;
         compute(&id);
     } // implicit barrier here

     double after_compute=rtclock();
     iteration_compute_time[it]=after_compute-before_compute;

          
     total_compute += iteration_compute_time[it];
   }    
   for(int i=0;i<repCount;i++){
     cout<<iteration_compute_time[i]<<endl;
   }
   cout<<total_compute<<endl;
 }

The above is untested.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
574 Views

ye,

My post #9 is not correct. My intention was to have your compute function perform significantly more computation than memory R/W. Unfortunately, the recursive SerialFib is mostly memory I/O (pushing and popping return addresses).

Replace the SerialFib function with something that does some computation.

Also, bear in mind that the strongest feature of the Xeon Phi is its wide vector units. See if you can construct a representative computation section that is both computationally intensive (without a lot of non-cacheable memory references) and amenable to vectorization.
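
For example (an editor's sketch of the kind of kernel Jim describes, not code from the thread; the constants are arbitrary, and the simd clause assumes an OpenMP 4 compiler):

// Compute-bound, vectorizable test kernel: many register-only multiply-adds
// per element, with contiguous accesses so the i loop can use the Phi's
// 512-bit vector units.
#define FLOP_ITERS 200

void heavy_kernel(const float *x, float *y, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        float v = x[i];
        for (int k = 0; k < FLOP_ITERS; ++k)  // pure arithmetic, no memory traffic
            v = v * 1.000001f + 0.000001f;
        y[i] = v;
    }
}

With hundreds of multiply-adds per element and only two memory accesses, the runtime should be dominated by vector arithmetic rather than by memory traffic or threading overhead.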

Jim Dempsey

0 Kudos
Reply