Software Archive
Read-only legacy content
17061 Discussions

Poor Performance with function calls

Bo_W_3
Beginner
625 Views

Hello Everyone,

I am doing a small test on Xeon Phi that calculates pi with an infinite series ("Calculate Pi Using an Infinite Series", see http://www.wikihow.com/Calculate-Pi). In my implementation a small function is called in each iteration, i.e. there are lots of function calls. This function is declared for target. It surprises me that my program runs so slowly.

After I inlined this function, it ran much better, about 20 times faster...

I know function calls are expensive, but they shouldn't be this expensive.
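The computation I am describing looks roughly like this (a minimal serial sketch with illustrative names; it uses the standard integrand 4/(1+x*x) for pi, with the per-term work kept in a separate function to mirror the many-small-calls pattern):

```c
/* Per-term work, kept in its own function to mirror the
 * one-call-per-iteration pattern described above. */
static double term(double x)
{
    return 4.0 / (1.0 + x * x);  /* standard integrand for pi */
}

/* Midpoint-rule approximation of pi over [0,1] with n subintervals. */
double calc_pi_midpoint(int n)
{
    const double h = 1.0 / (double)n;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += term(h * ((double)i + 0.5));
    return h * sum;
}
```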

Best Regards,

Bo

1 Solution
jimdempseyatthecove
Honored Contributor III
624 Views

loc-nguyen,

Do the following changes help?

__declspec(vector) double f(double a);

double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi (int n, int iRank, int iNumProcs)
{
    const double fH   = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    double factor = iRank + 0.5;
    double skip = iNumProcs;
    #pragma simd reduction(+:fSum)
    for (i = iRank; i < n; i += iNumProcs, factor += skip)
    {
        fX = fH * factor;
        fSum += f(fX);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}

or:

...
    {
        fSum += f(fH * factor);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
...

These forms help the compiler make use of FMA (fused multiply-add) instructions.

Jim Dempsey

View solution in original post

6 Replies
Loc_N_Intel
Employee
624 Views

Hi Bo,

If you use all the hardware threads available on the coprocessor and vectorize the code where possible, you will improve performance significantly. Your approach also matters: whether you use the offload model or run entirely on the coprocessor will affect performance too.

 

Charles_C_Intel1
Employee
624 Views

Could you share some sample code?  If your loop is running on the host, and your little function is running on the coprocessor, then yes, you are spending all your time in communication for every iteration and it will run slowly.   If the function inlines, then it is likely running entirely on the host (check OFFLOAD_DEBUG to be sure).

A better approach might be to offload the entire pi calculation, fire up an OpenMP loop on the coprocessor, and then call your functions there. Then the only communication is to start the calculation and return the result. The time will be shortened even more if you warm up the offload by doing a small offload and OpenMP region before the offload for the pi calculation. This ensures that you aren't waiting for OpenMP to fire up 240 threads before it starts the computation, which would inflate your timings.
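A minimal warm-up sketch along those lines (illustrative; the offload pragma is only meaningful to an offload-capable compiler and is guarded here, so the sketch also builds as plain host code):

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Warm up the offload runtime and the OpenMP thread pool once,
 * outside the timed region, so thread creation and device
 * initialization don't inflate the measurement. Returns the
 * number of threads that were spun up. */
int warm_up(void)
{
    int nthreads = 1;
#ifdef OFFLOAD
    #pragma offload target(mic:0)
#endif
    #pragma omp parallel
    {
#ifdef _OPENMP
        #pragma omp single
        nthreads = omp_get_num_threads();
#endif
    }
    return nthreads;
}
```

Call `warm_up()` once before starting the wall-clock timer around the real offload.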

Charles

Bo_W_3
Beginner
624 Views
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>
#include <sched.h>

#ifdef OFFLOAD
__declspec(target(mic))  double CalcPi (int n, int iRank, int iNumProcs);
__declspec(target(mic)) double f(double a);
#else
double CalcPi (int n, int iRank, int iNumProcs);
#endif


int main(int argc, char **argv)
{
    int n = 200000000;
    int iMyRank, iNumProcs, nTimes, i;
    const double fPi25DT = 3.141592653589793238462643;
    double fPi = 0;
    double fTimeStart, fTimeEnd;
    int sv;

    iMyRank = 0;
    iNumProcs = 1;
    //nTimes = omp_get_max_threads();
    nTimes = 480;
    
    fTimeStart = omp_get_wtime();
    
    if (n <= 0 || n > 2147483647 ) 
    {
        printf("\ngiven value has to be between 0 and 2147483647\n");
        return 1;
    }

#ifdef OFFLOAD
    printf("before offload : %d \n", sched_getcpu());
    #pragma offload target(mic:0) in(iMyRank, iNumProcs, n, i) signal(sv)
#endif
    //calculate pi multiple times
    #pragma omp parallel for reduction(+:fPi)
    for (i = 0; i < nTimes; i++) {
        fPi += CalcPi(n+i, iMyRank, iNumProcs);
    }

#ifdef OFFLOAD
    printf("offloaded : %d \n", sched_getcpu());
    #pragma offload_wait target(mic:0) wait(sv)
#endif
    fTimeEnd = omp_get_wtime();

    if (iMyRank == 0)
    {
        printf("\npi is approximately = %.20f \nError               = %.20f\n",
               fPi, fabs(fPi - fPi25DT));
        printf(  "wall clock time     = %.20f\n", fTimeEnd - fTimeStart);
    }
    return 0;
}


double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi (int n, int iRank, int iNumProcs)
{
    const double fH   = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

    for (i = iRank; i < n; i += iNumProcs)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}

The functions are declared for offload in lines 8 and 9: CalcPi(...) as well as f(...).

The two different runs can be seen in lines 79 and 80.

Actually, my code doesn't calculate pi. Anyway, you know what I'm trying to do.

Loc_N_Intel
Employee
624 Views

Hi Bo,

I couldn't reproduce the problem you see. Running on my system, the inline version improves the running time only slightly, from 2.1305 s to 2.1256 s, as shown in the following:

First I compiled and ran your program:

# icc -DOFFLOAD -openmp offload-parallel.c -o offload.out

# ./offload.out

before offload : 16
offloaded : 16

pi is approximately = 2559.99999999999818101060
Error               = 2556.85840734640851223958
wall clock time     = 2.13059401512145996094

Then I modified the program to inline the function; the new program is called offload-parallel-inline.c:

# diff offload-parallel-inline.c offload-parallel.c
68c68
< inline double CalcPi (int n, int iRank, int iNumProcs)
---
> double CalcPi (int n, int iRank, int iNumProcs)

I compiled and ran the new program:

# icc -DOFFLOAD -openmp offload-parallel-inline.c -o offload-inline.out

# ./offload-inline.out

before offload : 1
offloaded : 2

pi is approximately = 2559.99999999999818101060
Error               = 2556.85840734640851223958
wall clock time     = 2.12561416625976562500

What MPSS version and compiler version are you using?

jimdempseyatthecove
Honored Contributor III
625 Views

loc-nguyen,

Do the following changes help?

__declspec(vector) double f(double a);

double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi (int n, int iRank, int iNumProcs)
{
    const double fH   = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    double factor = iRank + 0.5;
    double skip = iNumProcs;
    #pragma simd reduction(+:fSum)
    for (i = iRank; i < n; i += iNumProcs, factor += skip)
    {
        fX = fH * factor;
        fSum += f(fX);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}

or:

...
    {
        fSum += f(fH * factor);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
...

These forms help the compiler make use of FMA (fused multiply-add) instructions.

Jim Dempsey
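As an aside for readers on other compilers: the portable OpenMP 4.0 counterpart of `__declspec(vector)` is `#pragma omp declare simd` (a minimal sketch with an illustrative driver loop):

```c
/* Portable OpenMP 4.0 counterpart of __declspec(vector):
 * declare simd asks the compiler to emit a vector variant of f()
 * that vectorized loops can call elementwise, instead of
 * serializing on scalar calls. */
#pragma omp declare simd
static double f(double a)
{
    return 4.0 * (1.0 + a * a);  /* same integrand as in the thread */
}

/* Illustrative driver: a reducible loop the compiler can
 * vectorize, calling the vector variant of f(). */
double sum_terms(int n)
{
    double fSum = 0.0;
    #pragma omp simd reduction(+:fSum)
    for (int i = 0; i < n; i++)
        fSum += f((double)i);
    return fSum;
}
```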

Bo_W_3
Beginner
624 Views

Dear Mr. Dempsey,

thank you for your answer; it works, especially with

__declspec(vector) double f(double a);

Now I get a performance improvement of about 8x. The vectorization works well.

#pragma simd reduction(+:fSum)

doesn't add much by itself. The compiler had probably already recognized that this loop could be vectorized, except for the call to f().

There is still a 2x performance difference between the inline and non-inline versions. It is probably due to FMA in the function f(). Is there a non-vectorized (scalar) FMA instruction available? Smiling...

Anyway, I have made this simple thing complicated enough.

Thanks a lot for sharing your almost 50 years of experience! Respect!

Thank you all.

Best Regards,

Bo Wang

 

jimdempseyatthecove wrote:

[solution quoted in full above]
