<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Fiori,  in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122283#M25036</link>
    <description>&lt;P&gt;Hi Fiori,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Do you have other Intel software installed on your development machine, such as the Intel C/C++ compiler or Intel Integrated Performance Primitives (Intel IPP)?&lt;/P&gt;

&lt;P&gt;If you have the Intel compiler, you can build your original code with it; it should be able to speed up the summation code automatically (you don't need to rewrite the original code).&lt;/P&gt;

&lt;P&gt;If you have Intel IPP, you can call an IPP function (much as you would call an MKL function):&lt;/P&gt;

&lt;P&gt;ippsSum_32f(const Ipp32f* pSrc, int len, Ipp32f* pSum, IppHintAlgorithm&amp;nbsp;hint);&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;BR /&gt;
	Ying&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Example&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;The example below shows how to use the function ippsSum.&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	void sum(void) {&lt;BR /&gt;
	Ipp16s x[4] = {-32768, 32767, 32767, 32767}, sm;&lt;BR /&gt;
	ippsSum_16s_Sfs(x, 4, &amp;amp;sm, 1);&lt;BR /&gt;
	printf_16s("sum =", &amp;amp;sm, 1, ippStsNoErr);&lt;BR /&gt;
	}&lt;BR /&gt;
	Output:&lt;BR /&gt;
	sum = 32766&lt;BR /&gt;
	Matlab* Analog:&lt;BR /&gt;
	&amp;gt;&amp;gt; x = [-32768, 32767, 32767, 32767]; sum(x)/2&lt;/P&gt;</description>
    <pubDate>Mon, 26 Dec 2016 08:53:21 GMT</pubDate>
    <dc:creator>Ying_H_Intel</dc:creator>
    <dc:date>2016-12-26T08:53:21Z</dc:date>
    <item>
      <title>Better way to sum the elements of a vector?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122278#M25031</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;

&lt;P&gt;I want to sum the elements of a vector y. In order to do that I do:&lt;/P&gt;

&lt;P&gt;1) create a vector x(i)=1, for all i&lt;/P&gt;

&lt;P&gt;2) use the function&amp;nbsp;cblas_?dot&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="kwd" style="box-sizing: border-box; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;cblas_ddot&lt;/SPAN&gt;&lt;SPAN class="delim" style="box-sizing: border-box; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;(&lt;/SPAN&gt;&lt;SPAN class="var" style="box-sizing: border-box; font-style: italic; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;n&lt;/SPAN&gt;&lt;SPAN class="sep" style="box-sizing: border-box; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;,&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 13px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="var" style="box-sizing: border-box; font-style: italic; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;x&lt;/SPAN&gt;&lt;SPAN class="sep" style="box-sizing: border-box; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;,&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 13px;"&gt;&amp;nbsp;1&lt;/SPAN&gt;&lt;SPAN class="sep" style="box-sizing: border-box; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;,&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 13px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="var" style="box-sizing: border-box; font-style: italic; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;y&lt;/SPAN&gt;&lt;SPAN class="sep" style="box-sizing: border-box; font-family: 
&amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;,&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 13px;"&gt;&amp;nbsp;1&lt;/SPAN&gt;&lt;SPAN class="delim" style="box-sizing: border-box; font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace; color: rgb(102, 102, 102); font-size: 13px;"&gt;)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I would like to ask if there is a better way to do this.&lt;/P&gt;

&lt;P&gt;Thank you very much.&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 10:28:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122278#M25031</guid>
      <dc:creator>Fiori</dc:creator>
      <dc:date>2016-12-24T10:28:34Z</dc:date>
    </item>
    <item>
      <title>The primary alternative to</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122279#M25032</link>
      <description>&lt;P&gt;The primary alternative to MKL BLAS for simd optimization of sum reduction is your compiler's auto-vectorization.&amp;nbsp; Unfortunately, the actions of gcc vs. icc are slightly different, which may deter you from considering them as "best."&amp;nbsp; MSVC++ will not perform simd optimized reduction, so if your unstated ground rules involve that, you may consider the BLAS to be "best."&lt;/P&gt;

&lt;P&gt;MKL ?dot will switch over automatically to combined simd and threaded optimization at some large operand length (&amp;gt; 4000 ?), and may select a simd variant automatically at run time. The additional overhead may be significant if your operand is of moderate length (&amp;lt; 400 ?). If you use compiler auto-vectorization and wish such a combination, you may need to write it out in nested loops.&lt;/P&gt;
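&lt;P&gt;As a rough sketch of what "writing it out in nested loops" could look like: the outer loop is shared among threads and the inner loop is a simd reduction over one block. The function name sum_nested and the block length BLOCK_LEN are hypothetical and untuned, and the threaded part only takes effect when OpenMP is enabled (for example -fopenmp with gcc or -qopenmp with Intel compilers); without it the pragmas are ignored and the code runs serially.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block length for the inner loop; an assumption, not a tuned value. */
enum { BLOCK_LEN = 4096 };

/* Hypothetical helper: returns x[0] + ... + x[n-1].
   Outer loop: blocks are distributed over threads (if OpenMP is enabled).
   Inner loop: each block is summed with a simd reduction. */
double sum_nested(const double *x, long n)
{
    assert(n >= 0);
    assert(x != NULL || n == 0);

    long nblocks = (n + BLOCK_LEN - 1) / BLOCK_LEN;
    double total = 0.0;

#pragma omp parallel for reduction(+:total)
    for (long b = 0; b < nblocks; ++b) {
        long begin = b * BLOCK_LEN;
        long end = (begin + BLOCK_LEN < n) ? begin + BLOCK_LEN : n;
        double partial = 0.0;

#pragma omp simd reduction(+:partial)
        for (long i = begin; i < end; ++i) {
            partial += x[i];
        }
        total += partial;
    }
    return total;
}
```
]]>

&lt;P&gt;Because the additions are reordered, the result can differ in the last bits from a strictly sequential sum.&lt;/P&gt;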

&lt;P&gt;Intel compilers implement both OpenMP simd reduction, via #pragma omp simd reduction(+: ...), and parallel threaded reduction, via #pragma omp parallel reduction(+: ...).&amp;nbsp; If the code is in the form required by omp simd reduction, the optimization should occur anyway at default compiler flags (preferably with an appropriate architecture option) when the pragma is omitted.&amp;nbsp; gcc should perform the simd optimization without pragma omp when -ffast-math -O3 and a suitable -march are set (and will not perform it without -ffast-math even under pragma omp simd reduction), but that can't be recommended without qualification.&lt;/P&gt;
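&lt;P&gt;A minimal sketch of a loop in the form that omp simd reduction expects, assuming a contiguous array of doubles. The function name sum_simd is hypothetical and not part of MKL. The pragma is only honoured when OpenMP simd support is switched on (for example -fopenmp-simd with gcc or -qopenmp-simd with Intel compilers); otherwise it is ignored and the loop is left to the auto-vectorizer.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns x[0] + ... + x[n-1].
   The reduction clause allows the compiler to keep several partial
   sums in vector lanes and combine them after the loop. */
double sum_simd(const double *x, size_t n)
{
    assert(x != NULL || n == 0);

    double sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}
```
]]>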

&lt;P&gt;gcc will not "riffle" the sum reduction for best performance of a single large sum, but compiler simd optimization is likely to perform better than multiple ?dot function calls.&lt;/P&gt;

&lt;P&gt;Combinations of operations bring in further considerations on what may be the best choice.&lt;/P&gt;

&lt;P&gt;Sorry no one has guessed your parameters well enough to give a simple answer.&lt;/P&gt;

&lt;P&gt;If you are summing a vector of all 1's, obviously dot is not the efficient way.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 12:30:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122279#M25032</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-12-24T12:30:00Z</dc:date>
    </item>
    <item>
      <title>Thank you for your help, but</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122280#M25033</link>
      <description>&lt;P&gt;Thank you for your help, but to be honest I haven't understood the answer. I would appreciate it if you could give me a simpler one.&lt;/P&gt;

&lt;P&gt;At some point in my code I want to compute the sum of B where B has elements from 5e+3 to 1e+5. I have searched for an appropriate function "sum" but I have found only the function&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;cblas_ddot. So, I have assumed that a way to do that is to create a vector of "ones" and then c&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 13px;"&gt;ompute a vector-vector dot product. For example:&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;niters = 1e+6;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;for(I=1;I&amp;lt;=niters;I++)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;{&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; do some steps ...&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; sumLog = 0.0;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; for (i = 0; i&amp;lt;nDays; i++)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; {&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; inda = cuma[i];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; cblas_daxpy(DailySize[i], -phi1, &amp;amp;a[inda], 1, &amp;nbsp;&amp;amp;B[inda+1], 1);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; partial = (1.0/sigma2)*(1.0-phi1*phi1)*a[inda] -&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; (1.0/sigma2)*(phi1 - 1.0)*cblas_ddot(DailySize[i],&amp;amp;B[inda+1],1,&amp;amp;ones[inda+1],1);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sumLog = sumLog + partial;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; }&lt;/P&gt;

&lt;P&gt;some other steps&lt;/P&gt;

&lt;P&gt;}&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 13:21:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122280#M25033</guid>
      <dc:creator>Fiori</dc:creator>
      <dc:date>2016-12-24T13:21:00Z</dc:date>
    </item>
    <item>
      <title>How about the functions cblas</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122281#M25034</link>
      <description>&lt;P&gt;How about the functions cblas_dasum(), cblas_sasum(), etc.?&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 15:28:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122281#M25034</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2016-12-24T15:28:17Z</dc:date>
    </item>
    <item>
      <title>cblas_dasum(), cblas_sasum()</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122282#M25035</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;cblas_dasum(), cblas_sasum() &amp;nbsp;don't work for me because they c&lt;/SPAN&gt;&lt;SPAN style="color: rgb(84, 84, 84); font-family: arial, sans-serif; font-size: small;"&gt;ompute the sum of the absolute values of elements. But my vector have both positive and negative values.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thank you for your reply.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 17:14:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122282#M25035</guid>
      <dc:creator>Fiori</dc:creator>
      <dc:date>2016-12-24T17:14:44Z</dc:date>
    </item>
    <item>
      <title>Hi Fiori, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122283#M25036</link>
      <description>&lt;P&gt;Hi Fiori,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Do you have other Intel software installed on your development machine, such as the Intel C/C++ compiler or Intel Integrated Performance Primitives (Intel IPP)?&lt;/P&gt;

&lt;P&gt;If you have the Intel compiler, you can build your original code with it; it should be able to speed up the summation code automatically (you don't need to rewrite the original code).&lt;/P&gt;

&lt;P&gt;If you have Intel IPP, you can call an IPP function (much as you would call an MKL function):&lt;/P&gt;

&lt;P&gt;ippsSum_32f(const Ipp32f* pSrc, int len, Ipp32f* pSum, IppHintAlgorithm&amp;nbsp;hint);&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;BR /&gt;
	Ying&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Example&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;The example below shows how to use the function ippsSum.&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	void sum(void) {&lt;BR /&gt;
	Ipp16s x[4] = {-32768, 32767, 32767, 32767}, sm;&lt;BR /&gt;
	ippsSum_16s_Sfs(x, 4, &amp;amp;sm, 1);&lt;BR /&gt;
	printf_16s("sum =", &amp;amp;sm, 1, ippStsNoErr);&lt;BR /&gt;
	}&lt;BR /&gt;
	Output:&lt;BR /&gt;
	sum = 32766&lt;BR /&gt;
	Matlab* Analog:&lt;BR /&gt;
	&amp;gt;&amp;gt; x = [-32768, 32767, 32767, 32767]; sum(x)/2&lt;/P&gt;</description>
      <pubDate>Mon, 26 Dec 2016 08:53:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122283#M25036</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2016-12-26T08:53:21Z</dc:date>
    </item>
    <item>
      <title>Several proprietary</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122284#M25037</link>
      <description>&lt;P&gt;Several proprietary implementations of BLAS include ?sum, but MKL doesn't include this extension, presumably because normally it's more efficient simply to write an omp reduction loop or equivalent.&amp;nbsp; Sorry I overlooked this aspect of your subject.&amp;nbsp; You could write your own and achieve more efficiency than a substitution of ?dot with multiplication by a vector of 1's, which certainly isn't popular as a "best" alternative.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Dec 2016 14:01:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122284#M25037</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-12-26T14:01:00Z</dc:date>
    </item>
    <item>
      <title>The Intel compiler is capable</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122285#M25038</link>
      <description>&lt;P&gt;The Intel compiler is capable of doing an excellent job of code generation for a simple summation loop in C, but there are a few things to look out for.&amp;nbsp; So start by rewriting the code to compute the sum with an explicit loop.&lt;/P&gt;
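&lt;P&gt;A minimal sketch of that first step, assuming the arrays from the loop posted earlier in the thread (B, cuma, DailySize): an explicit loop over one segment of B that could stand in for the cblas_ddot call against the vector of ones. The helper name segment_sum and its signature are hypothetical. Any of the pragmas discussed in the list that follows would go immediately before the for loop.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns B[start] + ... + B[start + len - 1].
   No vector of ones and no multiplications are needed. */
double segment_sum(const double *B, size_t start, size_t len)
{
    assert(B != NULL || len == 0);

    double sum = 0.0;
    for (size_t k = 0; k < len; ++k) {
        sum += B[start + k];
    }
    return sum;
}
```
]]>

&lt;P&gt;In the posted code the call would then read segment_sum(B, inda + 1, DailySize[i]) in place of the cblas_ddot call.&lt;/P&gt;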

&lt;OL&gt;
	&lt;LI&gt;Vectorization is most often inhibited by the potential for aliasing.&amp;nbsp;
		&lt;OL&gt;
			&lt;LI&gt;Use the "-qopt-report=5" compiler option and search the optimization report for messages relating to the new explicit sum loop.&lt;/LI&gt;
			&lt;LI&gt;I have not tested the code above, but the use of indirect addressing (i.e., starting the summation at "cuma[i]") will make it harder for the compiler to do a thorough aliasing analysis.&amp;nbsp; You should be able to force vectorization using "#pragma simd" immediately before the summation loop.&lt;/LI&gt;
			&lt;LI&gt;I have found that the fastest code often comes from using "#pragma omp parallel for reduction (+:sum)", where "sum" is the name of the variable used for summation.&amp;nbsp; Even if you only use one thread, the use of the OpenMP pragma seems to allow the compiler to be more aggressive about re-ordering the computations to improve vectorization.&amp;nbsp;
				&lt;OL&gt;
					&lt;LI&gt;I have not re-tested this with the most recent (2017) compilers -- it may not be required any more to get best performance, but it is probably still a useful option to test.&lt;/LI&gt;
					&lt;LI&gt;I have not tested the "#pragma omp parallel for reduction (+:...)" clause against the "#pragma omp simd reduction(+:..)" clause.&amp;nbsp;&lt;/LI&gt;
				&lt;/OL&gt;
			&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Alignment is a potential secondary performance issue.
		&lt;OL&gt;
			&lt;LI&gt;The performance impact of alignment depends fairly strongly on the processor generation, with a general trend toward better performance of unaligned loads and stores over time.&lt;/LI&gt;
			&lt;LI&gt;The worst performance problems are with unaligned stores.&amp;nbsp; The reduction operation only requires loads, so alignment is not likely to be a major issue.&lt;/LI&gt;
			&lt;LI&gt;You can expect to see "complaints" about alignment in the optimization report(s) because of the indirect access (i.e., starting the summation at index "cuma[i]"), but these are much less important than messages about vectorization.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Mon, 26 Dec 2016 16:48:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Better-way-to-sum-the-elements-of-a-vector/m-p/1122285#M25038</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-26T16:48:13Z</dc:date>
    </item>
  </channel>
</rss>

