Hi All,
I am a rookie in the Intel Developer Zone, but I was fortunate to have early access to Intel Xeon Phi coprocessors for some time. Now that Intel has lifted the Xeon Phi coprocessor's veil, I am happy to share my user experience and demonstrate some examples.
I have written a white paper investigating a toy model, the N-body simulation, on an Intel Xeon Phi coprocessor. The PDF file can be downloaded here: http://research.colfaxinternational.com/file.axd?file=2013%2f1%2fColfax_Nbody_Xeon_Phi.pdf. The paper contains code samples, assembly listings and performance results, and discusses some programming practices for the efficient utilization of the coprocessor. Most importantly, it demonstrates that a single C/C++ code can run efficiently on both Xeon processors and Xeon Phi coprocessors. Note that the performance results were obtained on a pre-production sample, and the specifications and performance of the final product may differ.
If this paper has some utility to fellow developers and scientists, I will be thrilled. I am also looking forward to discussions on this forum and learning about the experience of other people with Xeon Phi programming.
Cheers,
Andrey.
Andrey,
Thank you for contributing to the community.
Regards
--
Taylor
Hi Andrey,
May I get the source code that runs on Xeon processors and Xeon Phi coprocessors?
Regards,
zhu.j
@zhu.j
Part of the source code is in the N-body simulation white paper.
@Andrey Vladimirov
Thank you for publishing your white paper about the N-body simulation. It helped me a lot to understand how to implement an N-body simulation.
The part of the code that is missing is the calls to the particle initializers.
thanks
iliyapolak wrote:
The part of the code which is missing are calls to particles initializers.
thanks
iliyapolak wrote:
The part of the code which is missing are calls to particles initializers.
Thank you!
I managed to add Windows multithreading; now I am trying to properly synchronize the threads. I run it on a Haswell CPU.
Hmm, I thought that we included a complete code listing in the paper. The particle initialization step uses a random number generator from MKL. It is not a very physically meaningful setup, just some numbers to crunch.
In any case, here is a code that can be run (sorry if spacing is off):
[cpp]
#include <math.h>
#include <mkl_vsl.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h> // for malloc()/free()
struct ParticleSystemType {
float *x, *y, *z, *vx, *vy, *vz;
};
int main(const int argc, const char** argv) {
const int nParticles = 30000, nSteps = 10; // Simulation parameters
const float dt = 0.01f; // Particle propagation time step
ParticleSystemType p; // Particle system stored as a structure of arrays
float *buf = (float*) malloc(6*nParticles*sizeof(float)); // Malloc all data
p.x = buf+0*nParticles; p.y = buf+1*nParticles; p.z = buf+2*nParticles;
p.vx = buf+3*nParticles; p.vy = buf+4*nParticles; p.vz = buf+5*nParticles;
// Initialize particles
VSLStreamStatePtr rnStream; vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );
vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, 6*nParticles, buf, -1.0f, 1.0f);
// Propagate particles
printf("Propagating particles using %d threads...\n", omp_get_max_threads());
double rate = 0.0, dRate = 0.0; // Benchmarking data
const int skipSteps = 1; // Skip the first iteration as warm-up on the Xeon Phi coprocessor
for (int step = 1; step <= nSteps; step++) {
const double tStart = omp_get_wtime(); // Start timing
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < nParticles; i++) { // Parallel loop over particles that experience force
float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; // Components of force on particle i
for (int j = 0; j < nParticles; j++) { // Vectorized loop over particles that exert force
if (j != i) {
// Newton's law of universal gravity calculation.
const float dx = p.x[j] - p.x[i]; // Distance components between particles i and j
const float dy = p.y[j] - p.y[i];
const float dz = p.z[j] - p.z[i];
const float drSquared = dx*dx + dy*dy + dz*dz;
const float drPowerN32 = 1.0f/(drSquared*sqrtf(drSquared));
// Reduction to calculate the net force
Fx += dx * drPowerN32; Fy += dy * drPowerN32; Fz += dz * drPowerN32;
}
}
// Move particles in response to the gravitational force
p.vx[i] += dt*Fx; p.vy[i] += dt*Fy; p.vz[i] += dt*Fz;
}
for (int i = 0; i < nParticles; i++) { // Not much work, serial loop
p.x[i] += p.vx[i]*dt; p.y[i] += p.vy[i]*dt; p.z[i] += p.vz[i]*dt;
}
const double tEnd = omp_get_wtime(); // End timing
if (step > skipSteps) // Collect statistics
{ rate += 1.0/(tEnd - tStart); dRate += 1.0/((tEnd - tStart)*(tEnd-tStart)); }
printf("Step %d: %.3f seconds\n", step, (tEnd-tStart));
}
rate/=(double)(nSteps-skipSteps); dRate=sqrt(dRate/(double)(nSteps-skipSteps)-rate*rate);
printf("Average rate for iterations %d through %d: %.3f +- %.3f steps per second.\n",
skipSteps+1, nSteps, rate, dRate);
free(buf);
}
[/cpp]
The compilation commands on Linux are:
[bash]
icpc -xhost -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody nbody.c
icpc -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody-mic nbody.c -mmic
[/bash]
The flag "-fimf-domain-exclusion=8" disables denormal number handling in math functions.
Hi!
For information, with version 14 of the Intel compiler, the -fimf-domain-exclusion=8 option no longer improves the code speed.
Thanks for the interesting comments on the code optimization.