Hi,
I am trying to use MKL to generate a large amount of random data on a Xeon Phi, but the performance is very poor compared with a Xeon CPU (E5620).
The original code is attached. Compiled for the Xeon Phi with -O3 -mkl -mmic, it takes about 115 seconds; on the Xeon CPU it takes only 3.5 seconds. I do not understand why the difference is so large. Am I using the Xeon Phi incorrectly, or is its real performance this poor?
Thank you!
Qiang
You would need parallel random number generators to get a useful comparison.
Hi Tim,
In my tests, both runs use a single core with MKL. How do I use a parallel random number generator? I am currently calling the MKL function vdRngUniform(...); is MKL already parallelized? I assumed the MKL implementation already makes use of vectorization.
Thank you!
Qiang
Nope, it is not. MKL is parallel-ready, that is, its functions can be used in a parallel application, but *you* need to write the parallel application. A very dumb approach to your sample would be:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mkl_vsl.h"
#define ALIGN 64
#define RANDN (23*10000)
int main (int argc, char *argv[])
{
    int nthreads, tid;
    int seed = 0;
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream,VSL_BRNG_MCG31,0);
    __declspec(align(4096)) double l_Random[RANDN];
    double t_start,t=0.0;
    double res=0;
    int timestep = 10000;
    t_start = omp_get_wtime();
    /* Naive on purpose: Randomstream and l_Random are shared by all threads,
       so this loop is not thread-safe as written. A real version needs one
       stream and one buffer per thread. */
    #pragma omp parallel for reduction(+:res)
    for(int i=0;i<timestep;i++)
    {
        vdRngUniform(0, Randomstream, RANDN, l_Random, 0.0, 1.0-(1.2e-12));
        for(int j=0;j<RANDN;j++)
            res+=l_Random[j];
    }
    t = omp_get_wtime() - t_start;
    printf("res = %lf with time consuming %lf \n",res,t);
    //_mm_free(l_Random);
    vslDeleteStream(&Randomstream);
    return 0;
}
@JJK,
Hi, thank you for your reply. Sorry, I should not have put the time step loop in the code here. This code is only a simplification of my complex application; in the real application the time steps depend on each other, so it is not possible to use OpenMP over the time step loop. What I wonder is why the performance of the vdRngUniform function on one core differs so much between the Xeon CPU and the Xeon Phi, without using OpenMP. It is almost 40 times slower on one core of the Xeon Phi.
However, as you said, I can use OpenMP. I think that if I use OpenMP to generate random data, I should do it like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <immintrin.h>   /* _mm_malloc, _mm_free */
#include "mkl_vsl.h"
#define ALIGN 64
int main (int argc, char *argv[])
{
    int nthreads = 1;
    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();
    }
    double *ress = (double *)_mm_malloc(nthreads*sizeof(double),ALIGN);
    VSLStreamStatePtr *StreaMPtr;
    StreaMPtr = (VSLStreamStatePtr *)_mm_malloc(nthreads*sizeof(VSLStreamStatePtr),ALIGN);
    for(int i=0;i<nthreads;i++)
    {
        ress[i] = 0.0;
        /* one stream per thread, seeded with the thread index */
        vslNewStream(&StreaMPtr[i],VSL_BRNG_MCG31,i);
    }
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int N = 1024;
        /* private buffer for this thread */
        double *rand = (double *)_mm_malloc(N*sizeof(double),ALIGN);
        /* each thread draws from its own stream */
        vdRngUniform(0, StreaMPtr[tid], N, rand, 0.0, 1.0-(1.2e-12));
        for(int j=0;j<N;j++)
            ress[tid]+=rand[j];
        _mm_free(rand);
    }
    for(int k = 0;k<nthreads;k++)
        printf("ress[%d] = %lf \n",k,ress[k]);
    for(int i=0;i<nthreads;i++)
        vslDeleteStream(&StreaMPtr[i]);
    _mm_free(StreaMPtr);
    _mm_free(ress);
    return 0;
}
Each thread has its own random stream, so the random data generated differ between threads.
Thank you!
Qiang
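An alternative to seeding each thread's stream differently is to give every thread a non-overlapping block of one sequence; VSL offers vslSkipAheadStream for this. The plain-C sketch below only illustrates the idea for a multiplicative congruential generator and is not MKL's implementation. The constants are the ones I recall being documented for MCG31m1 (treat them as an assumption), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative generator: x[n+1] = A * x[n] mod M.
   Constants assumed to match the documented MCG31m1 parameters. */
#define MCG_A 1132489760ULL
#define MCG_M 2147483647ULL   /* 2^31 - 1 */

/* Advance the state by one step. */
uint64_t mcg_next(uint64_t x)
{
    return (MCG_A * x) % MCG_M;
}

/* Compute base^exp mod M by square-and-multiply.
   All operands stay below 2^31, so products fit in 64 bits. */
uint64_t mcg_powmod(uint64_t base, uint64_t exp)
{
    uint64_t result = 1;
    base %= MCG_M;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % MCG_M;
        base = (base * base) % MCG_M;
        exp >>= 1;
    }
    return result;
}

/* State reached after nskip steps, computed without generating
   the intermediate values: x[n+k] = A^k * x[n] mod M. */
uint64_t mcg_skip_ahead(uint64_t x, uint64_t nskip)
{
    assert(x % MCG_M != 0);   /* a zero state would stay zero forever */
    return (mcg_powmod(MCG_A, nskip) * x) % MCG_M;
}

/* Starting state for thread tid when every thread consumes
   block_len consecutive values of the same sequence. */
uint64_t mcg_thread_start(uint64_t seed, int tid, uint64_t block_len)
{
    return mcg_skip_ahead(seed, (uint64_t)tid * block_len);
}
```

With this scheme, thread tid would start from mcg_thread_start(seed, tid, block_len) and call mcg_next at most block_len times, so the blocks never overlap and the combined output matches a single serial stream.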
Part of the performance difference is simply clock speed: the Phi runs at ~1 GHz vs 2+ GHz for the CPU. That does not explain the factor of 40, however. I guess it has to do with the way vdRngUniform is implemented for the Xeon and for the Xeon Phi.
I'm no OpenMP expert, but there are some superfluous #pragma omp sections in there; it does not make a lot of sense to put OpenMP pragmas around stuff like "omp_get_*". Use icc's -qopt-report flag to find out how well your code is parallelized.
In addition to using a slower clock speed, the Xeon Phi is a simpler architecture than your Xeon, so *serial* performance can be on the order of 10x slower. That said, I can't tell if you are benchmarking the parallel code, or the entire program. You are also generating nthread random numbers at once using streams (so we had better hope it doesn't do any locking) and then having every thread update ress.
So 40x slower isn't entirely unreasonable. :-)
Charles
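One possible cost of having every thread update its own element of ress is false sharing: adjacent doubles sit on the same cache line, so writes from different threads can invalidate each other's caches. Whether that matters in this benchmark is not established in the thread. A minimal sketch of the usual remedy follows, assuming a 64-byte cache line; the type and function names are hypothetical, and the pragma is simply ignored when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* One accumulator per thread, padded to a full cache line so that
   the values of two neighbouring slots never share a line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum;

/* Sum nslots chunks of data (each chunk_len long) into separate
   padded slots, then combine the partial sums serially. */
double chunked_sum(const double *data, int nslots, size_t chunk_len,
                   padded_sum *slots)
{
    assert(nslots > 0);
    #pragma omp parallel for
    for (int t = 0; t < nslots; t++) {
        double local = 0.0;   /* accumulate privately first */
        for (size_t j = 0; j < chunk_len; j++)
            local += data[(size_t)t * chunk_len + j];
        slots[t].value = local;   /* single write to the shared array */
    }
    double total = 0.0;
    for (int t = 0; t < nslots; t++)
        total += slots[t].value;
    return total;
}
```

Accumulating into a private variable and writing the slot once already removes most of the shared-line traffic; the padding covers the remaining writes. For a plain sum, an OpenMP reduction clause is the simpler choice.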