
Qiang_L_ · ‎05-21-2015

Hi,

I am trying to use MKL to generate a lot of random data each time on Xeon Phi, but the performance is very bad compared with the performance on a Xeon CPU (E5620).

The attachment is the original code. The compile options for Xeon Phi are -O3 -mkl -mmic, and the run takes about 115 seconds; when I run it on the Xeon CPU, it takes only 3.5 seconds. I do not know why the difference is so large. Am I using the Xeon Phi incorrectly, or is the real performance on Xeon Phi simply this bad?

Thank you!

Qiang

TimP · ‎05-21-2015

You would need parallel random number generators to get a useful comparison.

Qiang_L_ · ‎05-21-2015

Hi Tim,

In my tests, both runs use one core with MKL. How do I use a parallel random number generator? I am currently calling the MKL function vdRngUniform(...); is MKL already parallelized? I thought the MKL implementation already made use of vectorization.

Thank you!

Qiang

JJK · ‎05-21-2015

Nope, it is not - MKL is parallel-ready, that is, its functions can be used in a parallel application, but *you* need to write the parallel application. A very dumb approach to your sample would be:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#include <sys/time.h>

#include "mkl_vsl.h"

#define ALIGN 64
#define RANDN (23*10000)


int main (int argc, char *argv[])
{
    int nthreads, tid;
    int seed = 0;
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream,VSL_BRNG_MCG31,0);

    __declspec(align(4096)) double l_Random[RANDN];
    double t_start,t=0.0;

    double res=0;
    int timestep = 10000;

    t_start = omp_get_wtime();
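    /* NOTE: Randomstream, l_Random and res are shared by all threads here,
       so this naive version races; each thread really needs its own copies. */
    #pragma omp parallel for
    for(int i=0;i<timestep;i++)
    {
        vdRngUniform(0, Randomstream, RANDN, l_Random, 0.0, 1.0-(1.2e-12));

        for(int j=0;j<RANDN;j++)
            res+=l_Random[j];
    }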
    t = omp_get_wtime() - t_start;

    printf("res = %lf with time consuming %lf \n",res,t);

    //_mm_free(l_Random);

    vslDeleteStream(&Randomstream);
}

Qiang_L_ · ‎05-22-2015

@JJK,

Hi, thank you for your reply. Sorry, I should not have put the time step loop in the code here. This code only simplifies my complex application; in my real application the time steps depend on each other, so it is not possible to use OpenMP over the time step loop. What I wonder is why the performance of the vdRngUniform function on one core differs so much between the Xeon CPU and the Xeon Phi without using OpenMP. It is almost 40 times slower on one core of the Xeon Phi.

However, as you said, I can use OpenMP. I think that if I use OpenMP to generate random data, I should do it like this:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#include <sys/time.h>

#include "mkl_vsl.h"

#define ALIGN 64

int main (int argc, char *argv[])
{
    int nthreads, tid;
    #pragma omp parallel
    {
        nthreads = omp_get_num_threads();
    }
    double *ress = (double *)_mm_malloc(nthreads*sizeof(double),ALIGN);
    for(int k=0;k<nthreads;k++)
        ress[k] = 0.0;   /* accumulators must start at zero */

    /* one stream per thread, seeded with the thread index */
    VSLStreamStatePtr *StreaMPtr;
    StreaMPtr = (VSLStreamStatePtr *)_mm_malloc(nthreads*sizeof(VSLStreamStatePtr),ALIGN);
    for(int i=0;i<nthreads;i++)
        vslNewStream(&StreaMPtr[i],VSL_BRNG_MCG31,i);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int N = 1024;
        double *rand = (double *)_mm_malloc(N*sizeof(double),ALIGN);
        #pragma omp for
        for(int i=0;i<nthreads;i++)
            vdRngUniform(0, StreaMPtr[i], N, rand, 0.0, 1.0-(1.2e-12));
        for(int j=0;j<N;j++)
            ress[tid]+=rand[j];
        _mm_free(rand);
    }

    for(int k = 0;k<nthreads;k++)
        printf("ress[%d] = %lf \n",k,ress[k]);

    for(int i=0;i<nthreads;i++)
        vslDeleteStream(&StreaMPtr[i]);
    _mm_free(StreaMPtr);
    _mm_free(ress);

    return 0;
}

Each thread has its own random stream, so the random data generated differ between threads.
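The one-stream-per-thread idea can be sketched without MKL as follows. This is only an illustration under stated assumptions: toy_stream, toy_stream_init, toy_uniform_fill and sum_with_per_thread_streams are hypothetical names, and the small 64-bit LCG is a stand-in for a VSL stream and vdRngUniform, not the MKL generator. The OpenMP pragma is ignored when the file is compiled without OpenMP support.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an MKL VSL stream: the state of a 64-bit LCG.
   In the real code this role is played by one VSLStreamStatePtr per thread. */
typedef struct { uint64_t state; } toy_stream;

static void toy_stream_init(toy_stream *s, uint64_t seed)
{
    /* Spread consecutive seeds (0, 1, 2, ...) across the state space. */
    s->state = (seed + 1) * 0x9E3779B97F4A7C15ULL;
}

/* Fill r[0..n-1] with uniform doubles in [0, 1); stands in for vdRngUniform. */
static void toy_uniform_fill(toy_stream *s, int n, double *r)
{
    for (int j = 0; j < n; j++) {
        s->state = s->state * 6364136223846793005ULL + 1442695040888963407ULL;
        /* keep the top 53 bits and scale by 2^-53 */
        r[j] = (double)(s->state >> 11) * (1.0 / 9007199254740992.0);
    }
}

/* Every iteration owns its stream and its buffer, so nothing is shared
   between threads except the reduction variable, which OpenMP privatizes. */
static double sum_with_per_thread_streams(int nstreams, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int t = 0; t < nstreams; t++) {
        toy_stream s;
        double *buf = malloc((size_t)n * sizeof *buf);
        if (buf == NULL)
            continue;                 /* skip this stream on allocation failure */
        toy_stream_init(&s, (uint64_t)t);
        toy_uniform_fill(&s, n, buf);
        double local = 0.0;
        for (int j = 0; j < n; j++)
            local += buf[j];
        total += local;
        free(buf);
    }
    return total;
}
```

Giving each stream a different seed makes the sequences differ, but it does not by itself guarantee that they are statistically independent. MKL VSL offers vslSkipAheadStream and vslLeapfrogStream for splitting one sequence between threads, which may be a better fit when that matters.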

Thank you!

Qiang

JJK · ‎05-22-2015

Part of the performance difference is simple clock speed - the Phi runs at ~1 GHz vs 2+ GHz for the CPU. That does not explain the factor of 40, however. I guess it has to do with the way vdRngUniform is implemented for the Xeon and for the Xeon Phi.

I'm no OpenMP expert, but there are some superfluous #pragma omp sections in there - it does not make a lot of sense to put OpenMP pragmas around stuff like "omp_get_*". Use icc's -qopt-report flag to find out how well your code is parallelized.

Charles_C_Intel1 · ‎05-26-2015

In addition to running at a slower clock speed, the Xeon Phi is a simpler architecture than your Xeon, so *serial* performance can be on the order of 10x slower. That said, I can't tell whether you are benchmarking the parallel code or the entire program. You are also generating random numbers on nthreads streams at once (so we had better hope it doesn't do any locking) and then having every thread update ress at once (you didn't put a #pragma omp for before the for(int j=0;j<N;j++) loop, which is a race condition, and even with it you may get false sharing; plus tid isn't set to anything). That causes the same memory cells to get bounced back and forth between all 60 cores, resulting in severe cache thrashing. The Xeon has fewer cores, so it sees less of this effect.

So 40x slower isn't entirely unreasonable. :-)
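One way to sidestep both the shared updates and the false sharing described above is to let each thread accumulate into a private variable and write its result once into a slot that occupies its own cache line. The following is a minimal sketch, not the original poster's code: padded_sum, thread_id and sum_per_thread are hypothetical names, and the 64-byte cache line is an assumption. It also builds without OpenMP, in which case a single thread does all the work.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* One accumulator per thread, padded to a full cache line so that the
   values of two different threads never land in the same line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum;

static int thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Sum x[0..n-1] using at most nslots threads; slots must hold nslots entries. */
static double sum_per_thread(const double *x, int n, padded_sum *slots, int nslots)
{
    for (int t = 0; t < nslots; t++)
        slots[t].value = 0.0;

    #pragma omp parallel num_threads(nslots)
    {
        int tid = thread_id();
        double local = 0.0;           /* private to this thread */
        #pragma omp for
        for (int j = 0; j < n; j++)
            local += x[j];
        slots[tid].value = local;     /* a single write per thread */
    }

    double total = 0.0;
    for (int t = 0; t < nslots; t++)
        total += slots[t].value;
    return total;
}
```

When only the grand total is needed, an OpenMP reduction(+:total) clause on the loop achieves the same effect with less code; the padded slots are mainly useful when the per-thread values have to be kept.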

Charles

Using MKL to generate random data on Xeon Phi