<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dear Mr. Dempsey, in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005859#M32102</link>
    <description>&lt;P&gt;Dear Mr. Dempsey,&lt;/P&gt;

&lt;P&gt;Thanks for your answer. It works, especially with&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__declspec(vector) double f(double a);&lt;/PRE&gt;

&lt;P&gt;Now I get an 8x performance improvement. The vectorization works well.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#pragma simd reduction()&lt;/PRE&gt;

&lt;P&gt;does not add much. The compiler had probably already recognized that this loop can be vectorized, apart from the call to f().&lt;/P&gt;

&lt;P&gt;There is still a 2x performance difference between the inlined and non-inlined versions. I suspect it is due to FMA in the function f(). Is there a non-vectorized FMA instruction available? (smiling)&lt;/P&gt;

&lt;P&gt;Anyway, I have made this simple thing complicated enough.&lt;/P&gt;

&lt;P&gt;Thanks a lot for sharing your almost 50 years of experience! Respect!&lt;/P&gt;

&lt;P&gt;Thank you all.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Bo Wang&lt;/P&gt;

</description>
    <pubDate>Fri, 13 Feb 2015 19:06:06 GMT</pubDate>
    <dc:creator>Bo_W_3</dc:creator>
    <dc:date>2015-02-13T19:06:06Z</dc:date>
    <item>
      <title>Poor Performance with function calls</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005853#M32096</link>
      <description>&lt;P&gt;Hello Everyone,&lt;/P&gt;

&lt;P&gt;I am running a small test on Xeon Phi that calculates pi using the method described under "&lt;SPAN class="mw-headline" id="Calculate_Pi_Using_an_Infinite_Series"&gt;Calculate Pi Using an Infinite Series"; see &lt;/SPAN&gt;&lt;A href="http://www.wikihow.com/Calculate-Pi" target="_blank"&gt;http://www.wikihow.com/Calculate-Pi&lt;/A&gt;. In my implementation, a small function is called in each iteration, so there are a lot of function calls. This function is declared for the target. I am surprised at how slowly my program runs.&lt;/P&gt;

&lt;P&gt;After I inlined this function, the program runs much better, about 20 times faster.&lt;/P&gt;

&lt;P&gt;I know function calls are expensive, but they surely cannot be that expensive.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Bo&lt;/P&gt;</description>
      <pubDate>Tue, 10 Feb 2015 14:51:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005853#M32096</guid>
      <dc:creator>Bo_W_3</dc:creator>
      <dc:date>2015-02-10T14:51:06Z</dc:date>
    </item>
    <item>
      <title>Hi Bo,</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005854#M32097</link>
      <description>&lt;P&gt;Hi Bo,&lt;/P&gt;

&lt;P&gt;You will improve performance significantly by using all the hardware threads available on the coprocessor and by vectorizing the code where possible. Your choice of approach, either the offload model or running on the coprocessor only, will also affect performance.&lt;/P&gt;

</description>
      <pubDate>Tue, 10 Feb 2015 17:44:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005854#M32097</guid>
      <dc:creator>Loc_N_Intel</dc:creator>
      <dc:date>2015-02-10T17:44:08Z</dc:date>
    </item>
    <item>
      <title>Could you share some sample</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005855#M32098</link>
      <description>&lt;P&gt;Could you share some sample code?&amp;nbsp; If your loop is running on the host, and your little function is running on the coprocessor, then yes, you are spending all your time in communication for every iteration and it will run slowly.&amp;nbsp;&amp;nbsp; If the function inlines, then it is likely running entirely on the host (check OFFLOAD_DEBUG to be sure).&lt;/P&gt;
&lt;P&gt;A better approach might be to offload the entire pi calculation, start an OpenMP loop on the coprocessor, and call your functions from there. Then the only communication is starting the calculation and returning the result. You can shorten the measured time even further by warming up the offload, that is, by doing a small offload with some OpenMP work before the offload for the pi calculation. This ensures that your timings do not include waiting for OpenMP to start 240 threads before the computation begins.&lt;/P&gt;
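&lt;P&gt;The warm-up step could look roughly like the sketch below. It assumes the Intel compiler's "#pragma offload target(mic:0)" syntax and an OFFLOAD macro that selects the offload build; WarmUpOffload is a hypothetical helper name, not part of any library. Built without OFFLOAD or without OpenMP, it simply runs on the host.&lt;/P&gt;

```c
/* Sketch of an offload warm-up; call it once before starting the timer.
   WarmUpOffload is a hypothetical helper, not a library function. */
int WarmUpOffload(void)
{
    int nThreads = 0;

#ifdef OFFLOAD
    /* Assumed Intel offload syntax: the first offload pays the one-time
       cost of initializing the coprocessor. */
    #pragma offload target(mic:0) inout(nThreads)
#endif
    {
        /* The first parallel region makes the OpenMP runtime create its
           worker threads; each thread adds 1 to the count. */
        #pragma omp parallel reduction(+:nThreads)
        {
            nThreads += 1;
        }
    }

    /* Number of threads that took part (1 when built without OpenMP). */
    return nThreads;
}
```

&lt;P&gt;Calling WarmUpOffload() once before the timer is started keeps the one-time startup cost out of the measured interval.&lt;/P&gt;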
&lt;P&gt;Charles&lt;/P&gt;</description>
      <pubDate>Tue, 10 Feb 2015 20:35:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005855#M32098</guid>
      <dc:creator>Charles_C_Intel1</dc:creator>
      <dc:date>2015-02-10T20:35:37Z</dc:date>
    </item>
    <item>
      <title>#include &lt;stdio.h&gt;</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005856#M32099</link>
      <description>&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;sched.h&amp;gt;

#ifdef OFFLOAD
__declspec(target(mic))  double CalcPi (int n, int iRank, int iNumProcs);
__declspec(target(mic)) double f(double a);
#else
double CalcPi (int n, int iRank, int iNumProcs);
#endif


int main(int argc, char **argv)
{
    int n = 200000000;
    int iMyRank, iNumProcs, nTimes, i;
    const double fPi25DT = 3.141592653589793238462643;
    double fPi = 0;
    double fTimeStart, fTimeEnd;
    int sv;

    iMyRank = 0;
    iNumProcs = 1;
    //nTimes = omp_get_max_threads();
    nTimes = 480;
    
    fTimeStart = omp_get_wtime();
    
    if (n &amp;lt;= 0 || n &amp;gt; 2147483647 ) 
    {
        printf("\ngiven value has to be between 0 and 2147483647\n");
        return 1;
    }

#ifdef OFFLOAD
	printf("before offload : %d \n", sched_getcpu());
    	#pragma offload target(mic:0) in(iMyRank, iNumProcs, n, i) signal(sv)
#endif
        //calculate pi multiple times
	#pragma omp parallel for reduction(+:fPi)
    for( i=0; i&amp;lt;nTimes; i++) {
    	fPi += CalcPi(n+i, iMyRank, iNumProcs);
	}

#ifdef OFFLOAD
	printf("offloaded : %d \n", sched_getcpu());
	#pragma offload_wait target(mic:0) wait(sv)
#endif
    fTimeEnd = omp_get_wtime();

    if (iMyRank == 0)
    {
        printf("\npi is approximately = %.20f \nError               = %.20f\n",
               fPi, fabs(fPi - fPi25DT));
        printf(  "wall clock time     = %.20f\n", fTimeEnd - fTimeStart);
    }
    return 0;
}


double f(double a)
{
    return (4.0 * (1.0 + a*a));
}

double CalcPi (int n, int iRank, int iNumProcs)
{
    const double fH   = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

    for (i = iRank; i &amp;lt; n; i += iNumProcs)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
	//fSum += 4.0 * (1.0 + fX * fX);
    }
    return fH * fSum;
}
&lt;/PRE&gt;

&lt;P&gt;Both functions, CalcPi(...) and f(...), are declared for offload in lines 8 and 9.&lt;/P&gt;

&lt;P&gt;The two variants I compared (calling f() versus the inlined expression) can be seen in lines 78 and 79.&lt;/P&gt;

&lt;P&gt;Admittedly, my code doesn't actually calculate pi, because f() multiplies by (1.0 + a*a) instead of dividing by it. Anyway, you know what I'm trying to do.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Feb 2015 08:08:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005856#M32099</guid>
      <dc:creator>Bo_W_3</dc:creator>
      <dc:date>2015-02-11T08:08:00Z</dc:date>
    </item>
    <item>
      <title>Hi Bo,</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005857#M32100</link>
      <description>&lt;P&gt;Hi Bo,&lt;/P&gt;

&lt;P&gt;I couldn't reproduce the problem you see. On my system, the inline version only improves the running time from 2.1305 to 2.1256 seconds, as shown in the following:&lt;/P&gt;

&lt;P&gt;First I compiled and ran your program:&lt;/P&gt;

&lt;P&gt;# icc -DOFFLOAD -openmp offload-parallel.c -o offload.out&lt;/P&gt;

&lt;P&gt;# ./offload.out&lt;/P&gt;

&lt;P&gt;before offload : 16&lt;BR /&gt;
	offloaded : 16&lt;/P&gt;

&lt;P&gt;pi is approximately = 2559.99999999999818101060&lt;BR /&gt;
	Error&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; = 2556.85840734640851223958&lt;BR /&gt;
	wall clock time&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; = 2.13059401512145996094&lt;/P&gt;

&lt;P&gt;Then I modified the program to declare the function inline. The new program is called offload-parallel-inline.c:&lt;/P&gt;

&lt;P&gt;# diff offload-parallel-inline.c offload-parallel.c&lt;BR /&gt;
	68c68&lt;BR /&gt;
	&amp;lt; inline double CalcPi (int n, int iRank, int iNumProcs)&lt;BR /&gt;
	---&lt;BR /&gt;
	&amp;gt; double CalcPi (int n, int iRank, int iNumProcs)&lt;/P&gt;

&lt;P&gt;I compiled and ran the new program:&lt;/P&gt;

&lt;P&gt;# icc -DOFFLOAD -openmp offload-parallel-inline.c -o offload-inline.out&lt;/P&gt;

&lt;P&gt;# ./offload-inline.out&lt;/P&gt;

&lt;P&gt;before offload : 1&lt;BR /&gt;
	offloaded : 2&lt;/P&gt;

&lt;P&gt;pi is approximately = 2559.99999999999818101060&lt;BR /&gt;
	Error&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; = 2556.85840734640851223958&lt;BR /&gt;
	wall clock time&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; = 2.12561416625976562500&lt;/P&gt;

&lt;P&gt;What MPSS version and compiler version are you using?&lt;/P&gt;</description>
      <pubDate>Fri, 13 Feb 2015 01:16:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005857#M32100</guid>
      <dc:creator>Loc_N_Intel</dc:creator>
      <dc:date>2015-02-13T01:16:19Z</dc:date>
    </item>
    <item>
      <title>loc-nguyen,</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005858#M32101</link>
      <description>&lt;P&gt;loc-nguyen,&lt;/P&gt;

&lt;P&gt;Do the following changes help?&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__declspec(vector) double f(double a);

double f(double a)
{
&amp;nbsp;&amp;nbsp;&amp;nbsp; return (4.0 * (1.0 + a*a));
}

double CalcPi (int n, int iRank, int iNumProcs)
{
&amp;nbsp;&amp;nbsp;&amp;nbsp; const double fH&amp;nbsp;&amp;nbsp; = 1.0 / (double) n;
&amp;nbsp;&amp;nbsp;&amp;nbsp; double fSum = 0.0;
&amp;nbsp;&amp;nbsp;&amp;nbsp; double fX;
&amp;nbsp;&amp;nbsp;&amp;nbsp; int i;
&amp;nbsp;&amp;nbsp;&amp;nbsp; double factor = iRank + 0.5;
&amp;nbsp;&amp;nbsp;&amp;nbsp; double skip = iNumProcs;
&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma simd reduction(+:fSum)
&amp;nbsp;&amp;nbsp;&amp;nbsp; for (i = iRank; i &amp;lt; n; i += iNumProcs, factor += skip)
&amp;nbsp;&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; fX = fH * factor;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; fSum += f(fX);
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; //fSum += 4.0 * (1.0 + fX * fX);
&amp;nbsp;&amp;nbsp;&amp;nbsp; }
&amp;nbsp;&amp;nbsp;&amp;nbsp; return fH * fSum;
}
&lt;/PRE&gt;

&lt;P&gt;or:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;...
&amp;nbsp;&amp;nbsp;&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; fSum += f(fH * factor);
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; //fSum += 4.0 * (1.0 + fX * fX);
&amp;nbsp;&amp;nbsp;&amp;nbsp; }
...&lt;/PRE&gt;

&lt;P&gt;These changes are meant to assist the use of FMA (fused multiply-add).&lt;/P&gt;
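
&lt;P&gt;For compilers without the Intel-specific annotations, the same idea can be sketched with the OpenMP 4.0 SIMD directives: "#pragma omp declare simd" plays the role of __declspec(vector), and "#pragma omp simd" the role of "#pragma simd". This is only a sketch under assumptions: CalcPiSimd is a hypothetical single-rank variant of CalcPi (iRank = 0, iNumProcs = 1), its loop runs from n down to 1 while evaluating the same midpoints, and the compiler needs an OpenMP SIMD option (for example -fopenmp-simd with GCC) for the pragmas to take effect. Without that option the pragmas are ignored and the code runs as scalar C.&lt;/P&gt;

```c
/* Sketch: OpenMP 4.0 spellings of the Intel-specific annotations.
   Without an OpenMP SIMD option the pragmas are ignored and this
   still compiles and runs as plain scalar C. */

/* Ask the compiler to generate a vector variant of f() in addition to
   the scalar one, so the call can stay inside a vectorized loop. */
#pragma omp declare simd
double f(double a)
{
    return 4.0 * (1.0 + a * a);
}

/* Hypothetical single-rank variant of CalcPi (iRank = 0, iNumProcs = 1). */
double CalcPiSimd(int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    int i;

    /* i runs from n down to 1, so fH * (i - 0.5) visits the same
       interval midpoints as the original loop. */
    #pragma omp simd reduction(+:fSum)
    for (i = n; i > 0; i--)
    {
        fSum += f(fH * ((double) i - 0.5));
    }
    return fH * fSum;
}
```

&lt;P&gt;With the integrand used in this thread, CalcPiSimd(n) approaches 16/3 (about 5.333) as n grows, which is consistent with the total of about 2560 over 480 calls in the output posted earlier.&lt;/P&gt;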

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 13 Feb 2015 17:38:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005858#M32101</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2015-02-13T17:38:00Z</dc:date>
    </item>
    <item>
      <title>Dear Mr. Dempsey,</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005859#M32102</link>
      <description>&lt;P&gt;Dear Mr. Dempsey,&lt;/P&gt;

&lt;P&gt;Thanks for your answer. It works, especially with&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__declspec(vector) double f(double a);&lt;/PRE&gt;

&lt;P&gt;Now I get an 8x performance improvement. The vectorization works well.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#pragma simd reduction()&lt;/PRE&gt;

&lt;P&gt;does not add much. The compiler had probably already recognized that this loop can be vectorized, apart from the call to f().&lt;/P&gt;

&lt;P&gt;There is still a 2x performance difference between the inlined and non-inlined versions. I suspect it is due to FMA in the function f(). Is there a non-vectorized FMA instruction available? (smiling)&lt;/P&gt;

&lt;P&gt;Anyway, I have made this simple thing complicated enough.&lt;/P&gt;

&lt;P&gt;Thanks a lot for sharing your almost 50 years of experience! Respect!&lt;/P&gt;

&lt;P&gt;Thank you all.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Bo Wang&lt;/P&gt;

</description>
      <pubDate>Fri, 13 Feb 2015 19:06:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-Performance-with-function-calls/m-p/1005859#M32102</guid>
      <dc:creator>Bo_W_3</dc:creator>
      <dc:date>2015-02-13T19:06:06Z</dc:date>
    </item>
  </channel>
</rss>

