Hi All,
I am a rookie in the Intel Developer Zone, but I was fortunate to have early access to Intel Xeon Phi coprocessors for some time. Now that Intel has lifted the Xeon Phi coprocessor's veil, I am happy to share my user experience and demonstrate some examples.
I have written a white paper investigating a toy model, the N-body simulation, on an Intel Xeon Phi coprocessor. The PDF file can be downloaded here: http://research.colfaxinternational.com/file.axd?file=2013%2f1%2fColfax_Nbody_Xeon_Phi.pdf. The paper contains code samples, assembly listings and performance results, and discusses some programming practices for the efficient utilization of the coprocessor. Most importantly, it demonstrates that a single C/C++ code can run efficiently on both Xeon processors and Xeon Phi coprocessors. Note that the performance results were obtained on a pre-production sample, and the specifications and performance of the final product may differ.
If this paper has some utility to fellow developers and scientists, I will be thrilled. I am also looking forward to discussions on this forum and learning about the experience of other people with Xeon Phi programming.
Cheers,
Andrey.
Andrey,
Thank you for contributing to the community.
Regards
--
Taylor
Hi Andrey,
May I get the source code that runs on Xeon processors and Xeon Phi coprocessors?
Regards,
zhu.j
@zhu.j
Part of the source code is in the N-body simulation white paper.
@Andrey Vladimirov
Thank you for publishing your white paper about the N-body simulation. It helped me a lot to understand how to implement an N-body simulation.
The part of the code that is missing is the calls to the particle initializers.
thanks
iliyapolak wrote:
The part of the code which is missing are calls to particles initializers.
thanks
iliyapolak wrote:
The part of the code which is missing are calls to particles initializers.
Thank you!
I managed to add Windows multithreading; now I am trying to properly synchronize the threads. I run it on a Haswell CPU.
Hmm, I thought that we included a complete code listing in the paper. The particle initialization step uses a random number generator from MKL. It is not a very physically meaningful setup, just some numbers to crunch.
In any case, here is a code that can be run (sorry if spacing is off):
[cpp]
#include <math.h>
#include <mkl_vsl.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h> // for malloc()/free()
struct ParticleSystemType {
float *x, *y, *z, *vx, *vy, *vz;
};
int main(const int argc, const char** argv) {
const int nParticles = 30000, nSteps = 10; // Simulation parameters
const float dt = 0.01f; // Particle propagation time step
ParticleSystemType p; // Particle system stored as a structure of arrays
float *buf = (float*) malloc(6*nParticles*sizeof(float)); // Malloc all data
p.x = buf+0*nParticles; p.y = buf+1*nParticles; p.z = buf+2*nParticles;
p.vx = buf+3*nParticles; p.vy = buf+4*nParticles; p.vz = buf+5*nParticles;
// Initialize particles
VSLStreamStatePtr rnStream; vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );
vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, 6*nParticles, buf, -1.0f, 1.0f);
// Propagate particles
printf("Propagating particles using %d threads...\n", omp_get_max_threads());
double rate = 0.0, dRate = 0.0; // Benchmarking data
const int skipSteps = 1; // Skip the first iteration as warm-up on the Xeon Phi coprocessor
for (int step = 1; step <= nSteps; step++) {
const double tStart = omp_get_wtime(); // Start timing
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < nParticles; i++) { // Parallel loop over particles that experience force
float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; // Components of force on particle i
for (int j = 0; j < nParticles; j++) { // Vectorized loop over particles that exert force
if (j != i) {
// Newton's law of universal gravity calculation.
const float dx = p.x[j] - p.x[i]; // Distance components between particles i and j
const float dy = p.y[j] - p.y[i];
const float dz = p.z[j] - p.z[i];
const float drSquared = dx*dx + dy*dy + dz*dz;
const float drPowerN32 = 1.0f/(drSquared*sqrtf(drSquared));
// Reduction to calculate the net force
Fx += dx * drPowerN32; Fy += dy * drPowerN32; Fz += dz * drPowerN32;
}
}
// Move particles in response to the gravitational force
p.vx[i] += dt*Fx; p.vy[i] += dt*Fy; p.vz[i] += dt*Fz;
}
for (int i = 0; i < nParticles; i++) { // Not much work, serial loop
p.x[i] += p.vx[i]*dt; p.y[i] += p.vy[i]*dt; p.z[i] += p.vz[i]*dt;
}
const double tEnd = omp_get_wtime(); // End timing
if (step > skipSteps) // Collect statistics
{ rate += 1.0/(tEnd - tStart); dRate += 1.0/((tEnd - tStart)*(tEnd-tStart)); }
printf("Step %d: %.3f seconds\n", step, (tEnd-tStart));
}
rate/=(double)(nSteps-skipSteps); dRate=sqrt(dRate/(double)(nSteps-skipSteps)-rate*rate);
printf("Average rate for iterations %d through %d: %.3f +- %.3f steps per second.\n",
skipSteps+1, nSteps, rate, dRate);
free(buf);
}
[/cpp]
The compilation commands on Linux are:
[bash]
icpc -xhost -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody nbody.c
icpc -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody-mic nbody.c -mmic
[/bash]
The flag "-fimf-domain-exclusion=8" disables denormal number handling in math functions.
Hi!
For information, with version 14 of the Intel compiler, the -fimf-domain-exclusion=8 option no longer improves the code speed.
Thanks for the interesting comments on the code optimization.