<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Quote:Vamsi Sripathi (Intel) in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012129#M19285</link>
    <description>
&lt;P&gt;Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.&lt;/P&gt;</description>
    <pubDate>Fri, 05 Dec 2014 19:31:36 GMT</pubDate>
    <dc:creator>Kim_L_</dc:creator>
    <dc:date>2014-12-05T19:31:36Z</dc:date>
    <item>
      <title>about parallelism on BLAS level-1 routines and VML</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012127#M19283</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I am running BLAS routines in MKL with the Intel compiler (icpc). Following the example provided with the compiler, I set the number of threads from 1 to 10 while running the dgemm routine for matrix-matrix multiplication, and I saw a speedup as the number of threads increased. However, for level-1 routines (e.g. cblas_zcopy, cblas_zaxpby), I didn't see any speedup with multiple threads. Is there a multithreaded version of the level-1 routines? What about the VML routines? I also tried those routines (e.g. vzExp, vzMul), but saw no speedup at all in a multithreaded environment.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2014 04:19:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012127#M19283</guid>
      <dc:creator>Kim_L_</dc:creator>
      <dc:date>2014-12-05T04:19:27Z</dc:date>
    </item>
    <item>
      <title>Hi Kim,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012128#M19284</link>
      <description>&lt;P&gt;Hi Kim,&lt;/P&gt;

&lt;P&gt;Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.&lt;/P&gt;

&lt;P&gt;Could you please provide the following information:&lt;/P&gt;

&lt;P&gt;1. Vector dimensions used in invoking cblas_zcopy and cblas_zaxpby&lt;/P&gt;

&lt;P&gt;2. CPU architecture&lt;/P&gt;

&lt;P&gt;In MKL, even though the above level-1 functions are threaded, MKL may not always use multiple threads because the problem size may be too small to benefit from multi-threading.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2014 19:26:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012128#M19284</guid>
      <dc:creator>Vamsi_S_Intel</dc:creator>
      <dc:date>2014-12-05T19:26:14Z</dc:date>
    </item>
    <item>
      <title>Quote:Vamsi Sripathi (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012129#M19285</link>
      <description>
&lt;P&gt;Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2014 19:31:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012129#M19285</guid>
      <dc:creator>Kim_L_</dc:creator>
      <dc:date>2014-12-05T19:31:36Z</dc:date>
    </item>
    <item>
      <title>Quote:Vamsi Sripathi (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012130#M19286</link>
      <description>
&lt;P&gt;I am looking for a working multithreaded version so that I can speed up the code efficiently. My code contains many complicated calculations of the form&lt;/P&gt;

&lt;P&gt;alpha*x*conj(y)&lt;/P&gt;

&lt;P&gt;or&lt;/P&gt;

&lt;P&gt;exp(a*x + b*y)*z&lt;/P&gt;

&lt;P&gt;where alpha, a, and b are constants and x, y, and z are vectors. I am using vzExp and vzMul to implement the first operation, and cblas_zaxpby, vzExp, and vzMul for the second. Is there a better way to do this? Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2014 19:37:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012130#M19286</guid>
      <dc:creator>Kim_L_</dc:creator>
      <dc:date>2014-12-05T19:37:02Z</dc:date>
    </item>
    <item>
      <title>Quote:Kim L. wrote:</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012131#M19287</link>
      <description>
&lt;P&gt;Here I would recommend looking at https://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material (slide #7, Performance metric).&lt;/P&gt;</description>
      <pubDate>Sat, 06 Dec 2014 08:34:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/about-parallelism-on-BLAS-level-1-routines-and-VML/m-p/1012131#M19287</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2014-12-06T08:34:21Z</dc:date>
    </item>
  </channel>
</rss>

