Hello Everyone,
I am doing a small test on Xeon Phi that calculates "pi" using an infinite series ("Calculate Pi Using an Infinite Series", see http://www.wikihow.com/Calculate-Pi ). In my implementation a small function is called in each iteration, i.e. lots of function calls. This function is declared for the target. It surprises me that my program is so slow.
After I inlined this function, it ran much better, about 20 times faster.
I know function calls are expensive, but they couldn't be that expensive.
Best Regards,
Bo
loc-nguyen,
Do the following changes help?
__declspec(vector) double f(double a);

double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi(int n, int iRank, int iNumProcs)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    double factor = iRank + 0.5;
    double skip = iNumProcs;
    #pragma simd reduction(+:fSum)
    for (i = iRank; i < n; i += iNumProcs, factor += skip) {
        fX = fH * factor;
        fSum += f(fX);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}
or:
... {
    fSum += f(fH * factor);
    //fSum += 4.0 * (1.0 + fX * fX);
} ...

to assist the use of FMA (fused multiply-add).
Jim Dempsey
Hi Bo,
Using all the hardware threads available on the coprocessor and vectorizing the code where possible will improve performance significantly. Your approach also matters: whether you use the offload model or run on the coprocessor only will impact performance too.
Could you share some sample code? If your loop is running on the host, and your little function is running on the coprocessor, then yes, you are spending all your time in communication for every iteration and it will run slowly. If the function inlines, then it is likely running entirely on the host (check OFFLOAD_DEBUG to be sure).
A better approach might be to offload the entire pi calculation, fire up an OpenMP loop on the coprocessor, and call your functions there. Then the only communication is starting the calculation and returning the result. This time can be shortened even more if you warm up the offload by doing a small offload and OpenMP region before the offload for the pi calculation. This ensures you aren't waiting for OpenMP to fire up 240 threads before the computation starts, which would inflate your timings.
Charles
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>
#include <sched.h>

#ifdef OFFLOAD
__declspec(target(mic)) double CalcPi(int n, int iRank, int iNumProcs);
__declspec(target(mic)) double f(double a);
#else
double CalcPi(int n, int iRank, int iNumProcs);
#endif

int main(int argc, char **argv)
{
    int n = 200000000;
    int iMyRank, iNumProcs, nTimes, i;
    const double fPi25DT = 3.141592653589793238462643;
    double fPi = 0;
    double fTimeStart, fTimeEnd;
    int sv;

    iMyRank = 0;
    iNumProcs = 1;
    //nTimes = omp_get_max_threads();
    nTimes = 480;

    fTimeStart = omp_get_wtime();

    if (n <= 0 || n > 2147483647) {
        printf("\ngiven value has to be between 0 and 2147483647\n");
        return 1;
    }

#ifdef OFFLOAD
    printf("before offload : %d \n", sched_getcpu());
    #pragma offload target(mic:0) in(iMyRank, iNumProcs, n, i) signal(sv)
#endif
    //calculate pi multiple times
    #pragma omp parallel for reduction(+:fPi)
    for (i = 0; i < nTimes; i++) {
        fPi += CalcPi(n+i, iMyRank, iNumProcs);
    }
#ifdef OFFLOAD
    printf("offloaded : %d \n", sched_getcpu());
    #pragma offload_wait target(mic:0) wait(sv)
#endif

    fTimeEnd = omp_get_wtime();

    if (iMyRank == 0) {
        printf("\npi is approximately = %.20f \nError = %.20f\n", fPi, fabs(fPi - fPi25DT));
        printf("wall clock time = %.20f\n", fTimeEnd - fTimeStart);
    }
    return 0;
}

double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi(int n, int iRank, int iNumProcs)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    for (i = iRank; i < n; i += iNumProcs) {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}
The functions are declared for offload in lines 8 and 9: CalcPi(...) as well as f(...).
The two different variants (function call vs. inlined expression) can be seen in CalcPi(): fSum += f(fX); versus the commented-out fSum += 4.0 * (1.0 + fX * fX);.
Actually, my code doesn't calculate pi, but you know what I'm trying to do.
Hi Bo,
I couldn't reproduce the problem you are seeing. On my system, the inline version improves the running time from 2.1305 s to 2.1256 s, as shown in the following:
First I compiled and ran your program:
# icc -DOFFLOAD -openmp offload-parallel.c -o offload.out
# ./offload.out
before offload : 16
offloaded : 16
pi is approximately = 2559.99999999999818101060
Error = 2556.85840734640851223958
wall clock time = 2.13059401512145996094
Then I modified the program to inline the function; the new program is called offload-parallel-inline.c:
# diff offload-parallel-inline.c offload-parallel.c
68c68
< inline double CalcPi (int n, int iRank, int iNumProcs)
---
> double CalcPi (int n, int iRank, int iNumProcs)
I compiled and ran the new program:
# icc -DOFFLOAD -openmp offload-parallel-inline.c -o offload-inline.out
# ./offload-inline.out
before offload : 1
offloaded : 2
pi is approximately = 2559.99999999999818101060
Error = 2556.85840734640851223958
wall clock time = 2.12561416625976562500
What MPSS version and compiler version are you using?
Dear Mr. Dempsey,
thanks for your answer; it works, especially with
__declspec(vector) double f(double a);
Now I get a performance improvement of 8 times. The vectorization works well.
#pragma simd reduction()
doesn't add much; the compiler had probably already recognized that this loop can be vectorized, except for the function f().
There is still a 2x performance difference between inline and no-inline. It should be due to FMA in the function f(). Is there a non-vectorized (scalar) FMA instruction available? Smiling...
Anyway, I have made this simple thing complicated enough.
Thanks a lot for your almost 50 years of experience! Respect!
Thank you all.
Best Regards,
Bo Wang
jimdempseyatthecove wrote:
loc-nguyen,
Do the following changes help?
__declspec(vector) double f(double a);

double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi(int n, int iRank, int iNumProcs)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    double factor = iRank + 0.5;
    double skip = iNumProcs;
    #pragma simd reduction(+:fSum)
    for (i = iRank; i < n; i += iNumProcs, factor += skip) {
        fX = fH * factor;
        fSum += f(fX);
        //fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}

or:

... {
    fSum += f(fH * factor);
    //fSum += 4.0 * (1.0 + fX * fX);
} ...

to assist the use of FMA.
Jim Dempsey