<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: OpenMP: slow-down for matrix-vector product in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866308#M2722</link>
    <description>&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;In addition to my previous message, I should note that in all algorithms that use BLAS2, Intel MKL performs the calculations on one core.&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;Yurii&lt;/FONT&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 15 Aug 2007 06:23:53 GMT</pubDate>
    <dc:creator>abcd_qmost</dc:creator>
    <dc:date>2007-08-15T06:23:53Z</dc:date>
    <item>
      <title>OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866296#M2710</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I am an OpenMP beginner, and I am experimenting with simple examples on a quad-core CPU. When I parallelized a matrix-vector product, I got a small speed-up for 2 threads, but a slow-down for 4 threads. I would be very grateful if someone could explain why there is a slow-down.&lt;/P&gt;
&lt;P&gt;When the code below is executed with a matrix dimension of 18000, the elapsed time is:&lt;/P&gt;
&lt;P&gt;4.01 seconds for 1 thread, 3.19s for 2 threads and 7.75s for 4 threads.&lt;/P&gt;
&lt;P&gt;Here is the code:&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;PROGRAM&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; MVP&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;INTEGER&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; NRA, NCA, TID, NTHREADS, I, J, K, CHUNK,&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;OMP_GET_NUM_THREADS,OMP_GET_THREAD_NUM&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;integer&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; start, finish, rate&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;REAL*8&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; A(:,:), B(:,:), C(:,:)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;allocatable&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; a,b,c&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;c Input the number of columns/rows of the matrix&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(*,*)'matrix dimension?'&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;read&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(*,*)NRA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;NCA=NRA&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;allocate&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(A(NRA,NCA),stat=istat)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;allocate&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(B(NCA,1),stat=istat)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;allocate&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(C(NRA,1),stat=istat)&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;C Set loop iteration chunk size &lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;CHUNK = 10&lt;P&gt;&lt;/P&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;c Input the number of threads&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(*,*)'number of threads?'&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;read&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(*,*)kp&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;call&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; omp_set_num_threads(kp)&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;C Initialize matrices&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; 30 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; 30 J=1, NCA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;A(I,J) = (I-1)+(J-1)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;30 &lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; 40 I=1, NCA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;B(I,1) = (I-1)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;40 &lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; 50 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;C(I,1) = 0&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;50 &lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;cccccccccccccccccccccccccccccc
ccccccccccccccccccccccccccccccccccccc&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;call&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; &lt;FONT color="#0000ff"&gt;system_clock&lt;/FONT&gt; (COUNT_RATE = rate)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;call&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; &lt;FONT color="#0000ff"&gt;system_clock&lt;/FONT&gt; (COUNT = start)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK)PRIVATE(TID,I,J,K)&lt;/P&gt;
&lt;P&gt;C Do matrix-vector multiply sharing iterations on outer loop&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;TID = OMP_GET_THREAD_NUM()&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;PRINT&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; *, 'Thread', TID, 'starting matrix multiply...'&lt;P&gt;&lt;/P&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;!$OMP DO SCHEDULE(STATIC, CHUNK)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; 60 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; 60 K=1, NCA&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;C(I,1) = C(I,1) + A(I,K) * B(K,1)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;60 &lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#008000"&gt;
&lt;P&gt;!$OMP END DO&lt;/P&gt;
&lt;P&gt;C End of parallel region &lt;/P&gt;
&lt;P&gt;!$OMP END PARALLEL&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;call&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt; &lt;FONT color="#0000ff"&gt;system_clock&lt;/FONT&gt; (COUNT = finish)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;seconds = &lt;FONT color="#0000ff"&gt;float&lt;/FONT&gt; (finish - start) / &lt;FONT color="#0000ff"&gt;float&lt;/FONT&gt; (rate)&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;(*,*)'time elapsed [s]:',seconds&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;END&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;</description>
      <pubDate>Fri, 27 Apr 2007 08:18:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866296#M2710</guid>
      <dc:creator>Dusan_Z_</dc:creator>
      <dc:date>2007-04-27T08:18:24Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866297#M2711</link>
      <description>&lt;P&gt;Generally, you should try to conform to the old slogan "Concurrent Outer, Vector Inner"&lt;/P&gt;
&lt;P&gt;Most OpenMP compilers will follow your code literally and not attempt loop nest optimizations which differ from your specification of the loop index used for parallelization, even when you choose compiler flags which might do so when parallelization is turned off.&lt;/P&gt;
&lt;P&gt;It might be interesting if you would tell us the effect of thread placement (KMP_AFFINITY or GOMP_AFFINITY, for my 2 favorite compilers) as well as the effect of switching the loops. It looks as if you are testing mainly how well threading can deal with DTLB misses, and you are nearly up to full bus-bandwidth-limited performance with one thread. As you have not blocked for cache locality, you gain performance only when you spread your job across all cache.&lt;/P&gt;
&lt;P&gt;I believe that MKL doesn't attempt to thread ?GEMV, since the "vector inner" part of the slogan actually should come first. You should at least compare your performance with what you get with standard BLAS SGEMV vectorized, with and without threading. Some people like to brag about maximizing OpenMP scaling even when it is gained at the expense of total performance; for that purpose, you can sometimes leave out the vector inner part, as it is easiest to get good scaling by making the single thread performance as bad as possible.&lt;/P&gt;
&lt;P&gt;You would need to find an effective way to block the operation so as to enable vectorization and preserve cache locality in each thread. Assuming you are using an Intel quad core, you might be able to take advantage of the cache sharing between cores 0,2 and 1,3, and you certainly need to block so that the 2 threads on the same cache don't fight each other.&lt;/P&gt;
&lt;P&gt;So, in spite of the apparent simplicity of this operation, it already requires you to go beyond the "OpenMP beginner" stage.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Apr 2007 13:21:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866297#M2711</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-04-27T13:21:06Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866298#M2712</link>
      <description>Thank you for the detailed answer. I now see that the problem is much more complex than I thought.&lt;BR /&gt;&lt;BR /&gt;Do you happen to know where to find examples of OpenMP parallelization of linear algebra operations like this one?&lt;BR /&gt;&lt;BR /&gt;I know that MKL has the matrix-matrix product parallelized, but they didn't do *GEMV. I will compare my example with SGEMV and see if it also slows down when I run it with 4 threads. Otherwise, I know it doesn't offer any speed-up, I'm sure of that.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Drazen&lt;BR /&gt;</description>
      <pubDate>Mon, 30 Apr 2007 10:50:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866298#M2712</guid>
      <dc:creator>Dusan_Z_</dc:creator>
      <dc:date>2007-04-30T10:50:47Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866299#M2713</link>
      <description>&lt;P&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;I'd guess you might want to block it something like&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#555555"&gt;Ichunk=(NRA+3)/4&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#555555"&gt;!$OMP PARALLEL DO &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff"&gt;&lt;STRONG&gt;DO&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; J=0,3&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; K=1, NCA&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; I= J*Ichunk + 1, (J+1)*Ichunk&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#555555"&gt;C(I,1) = C(I,1) + A(I,K) * B(K,1)&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color="#ff0000"&gt;
&lt;P&gt;End do&lt;/P&gt;
&lt;P&gt;End do&lt;/P&gt;
&lt;P&gt;End do&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#555555"&gt;and try running with KMP_AFFINITY settings. Note that I didn't take care of remainders, in case NRA is not divisible by 4. This would work somewhat better if the arrays are 16-byte aligned (and the compiler knows it), and Ichunk is divisible by 8.&lt;/FONT&gt;&lt;/P&gt;
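&lt;P&gt;A minimal sketch of the same blocking with the remainder handled, assuming a hypothetical small test size (NRA = 10, which is not divisible by 4) and 1-D vectors B and C instead of N-by-1 arrays. The upper bound of each block is clamped with MIN so the last block stops at NRA. The SCHEDULE(STATIC,1) clause and the MATMUL check are additions for illustration, not part of the code above.&lt;/P&gt;

```fortran
program blocked_mvp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes chosen so that nra is NOT divisible by nblk.
  integer, parameter :: nra = 10, nca = 7, nblk = 4
  real(dp) :: a(nra,nca), b(nca), c(nra)
  integer :: i, j, k, ichunk, ilo, ihi

  ! Small integer-valued test data, so every sum is exact in double precision.
  do k = 1, nca
    b(k) = k - 1
    do i = 1, nra
      a(i,k) = (i - 1) + (k - 1)
    end do
  end do
  c = 0.0_dp

  ichunk = (nra + nblk - 1) / nblk   ! rows per block, rounded up

  ! Each J owns a disjoint row range of C, so no two threads write the
  ! same element. Without an OpenMP flag the directives are plain comments.
!$omp parallel do schedule(static,1) private(i, k, ilo, ihi)
  do j = 0, nblk - 1
    ilo = j*ichunk + 1
    ihi = min((j + 1)*ichunk, nra)   ! clamp the last block: remainder handling
    do k = 1, nca
      do i = ilo, ihi                ! unit stride down column K of A
        c(i) = c(i) + a(i,k)*b(k)
      end do
    end do
  end do
!$omp end parallel do

  if (any(c /= matmul(a, b))) error stop 'blocked result differs from matmul'
  print *, 'blocked product matches matmul'
end program blocked_mvp
```

&lt;P&gt;With NRA = 10 and 4 blocks, Ichunk is 3 and the row ranges are 1-3, 4-6, 7-9 and 10-10. Without the MIN the last block would run to row 12 and index past the end of the arrays.&lt;/P&gt;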
&lt;P&gt;&lt;FONT color="#555555"&gt;You could replace the loops on I and K with SGEMV calls on the same array sections. I agree with you, SGEMV probably isn't built with such threading.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;</description>
      <pubDate>Mon, 30 Apr 2007 12:58:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866299#M2713</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2007-04-30T12:58:04Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866300#M2714</link>
      <description>Thank you for your help. The code you posted doesn't experience any slow-down when the number of threads is increased from 2 to 4. It doesn't speed up either, but it's much better than the original one.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Drazen&lt;BR /&gt;</description>
      <pubDate>Fri, 04 May 2007 06:44:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866300#M2714</guid>
      <dc:creator>Dusan_Z_</dc:creator>
      <dc:date>2007-05-04T06:44:28Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866301#M2715</link>
      <description>&lt;P&gt;And on this example you might want to specify the OMP schedule as static with chunk size of 1 (not to be confused with Ichunk). &lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Fri, 04 May 2007 15:45:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866301#M2715</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-05-04T15:45:34Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866302#M2716</link>
      <description>&lt;P&gt;Drazen,&lt;/P&gt;
&lt;P&gt;I copied your program and compiled it for testing on my system. This system has 2 Opteron 270 Dual Core processors for a total of 4 cores. The system has 2GB of RAM. Using 18000 for array size would require 7.78GB so I could not run the test with arrays of that size. Using 10000 requires 2.4GB of RAM but since array C is only referenced as C(I,1) a size of 10000 seemed to fit on my system without excessive page swapping.&lt;/P&gt;
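&lt;P&gt;For reference, a small sketch of the memory arithmetic, assuming 8 bytes per REAL*8 element and decimal gigabytes (1 GB = 10**9 bytes). It prints two cases: the arrays as allocated in the posted program, A(N,N) plus the N-by-1 arrays B and C, and a hypothetical case in which all three arrays are N-by-N. The 7.78GB and 2.4GB figures quoted above appear to correspond to the second case.&lt;/P&gt;

```fortran
program footprint
  use iso_fortran_env, only: int64
  implicit none
  ! The two matrix dimensions discussed in this thread.
  integer(int64), parameter :: sizes(2) = [18000_int64, 10000_int64]
  integer(int64) :: n, as_posted, three_square
  integer :: s

  do s = 1, size(sizes)
    n = sizes(s)
    ! A(n,n) plus the two n-by-1 arrays B and C, 8 bytes per element.
    as_posted = 8_int64*(n*n + 2_int64*n)
    ! Hypothetical case: A, B and C all allocated as n-by-n.
    three_square = 8_int64*3_int64*n*n
    print '(a,i0)', 'n = ', n
    print '(a,f6.2,a)', '  one matrix, two vectors: ', as_posted/1.0d9, ' GB'
    print '(a,f6.2,a)', '  three n-by-n arrays:     ', three_square/1.0d9, ' GB'
  end do
end program footprint
```

&lt;P&gt;The counts are kept in 64-bit integers because 8*18000*18000 does not fit in a default 32-bit INTEGER.&lt;/P&gt;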
&lt;P&gt;Runtimes for my system&lt;/P&gt;
&lt;P&gt;Cores Time %1core %2core&lt;BR /&gt;1 6.86 100% 60.64%&lt;BR /&gt;2 4.16 164.9% 100%&lt;BR /&gt;3 2.969 231.1% 140.11%&lt;BR /&gt;4 2.656 258.3% 156.63%&lt;/P&gt;
&lt;P&gt;I also noticed that your loop indexing was not optimal. In Fortran, adjacent cells of the leftmost index are adjacent in memory.&lt;/P&gt;
&lt;P&gt;Your code&lt;/P&gt;
&lt;P&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; 30 I=1, NRA&lt;BR /&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; 30 J=1, NCA&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt;A(I,J) = (I-1)+(J-1)&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;30 &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;This jumps NRA elements ahead in memory on each access.&lt;BR /&gt;Changing the loop order to&lt;/P&gt;
&lt;P&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; 30 J=1, NCA&lt;BR /&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; 30 I=1, NRA&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt;A(I,J) = (I-1)+(J-1)&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;30 &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff"&gt;&lt;FONT color="#000000"&gt;references adjacent memory on each access and is much faster.&lt;BR /&gt;Additionally, the loop can now be vectorized (2 real(8) values written each iteration).&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff"&gt;&lt;FONT color="#000000"&gt;A similar thing can be done with your timing test loop. Original code:&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff"&gt;&lt;STRONG&gt;DO&lt;/STRONG&gt; 60 I=1, NRA&lt;BR /&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; 60 K=1, NCA&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt;C(I,1) = C(I,1) + A(I,K) * B(K,1)&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;60 &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;/FONT&gt;&lt;/B&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#0000ff"&gt;&lt;FONT color="#000000"&gt;Suggested change&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color="#0000ff"&gt;&lt;FONT color="#000000"&gt;
&lt;P&gt;&lt;FONT color="#0000ff"&gt;&lt;STRONG&gt;DO&lt;/STRONG&gt; 60 K=1, NCA&lt;BR /&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt; 60 I=1, NRA&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#555555"&gt;C(I,1) = C(I,1) + A(I,K) * B(K,1)&lt;BR /&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000"&gt;60 &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff"&gt;CONTINUE&lt;/FONT&gt;&lt;/B&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#000000"&gt;With the changes in place, my run times are too short to produce meaningful results using the timing routines of your choice.&lt;/FONT&gt;&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/FONT&gt;</description>
      <pubDate>Thu, 12 Jul 2007 14:51:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866302#M2716</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-07-12T14:51:24Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866303#M2717</link>
      <description>&lt;P&gt;Yurii,&lt;/P&gt;
&lt;P&gt;The index priority for Fortran is backwards from C++ for arrays with multiple indexes. For an array C(I,J):&lt;/P&gt;
&lt;P&gt;Fortran - leftmost index indexes adjacent memory: C(I,J), C(I+1,J) adjacent&lt;/P&gt;
&lt;P&gt;C++ - rightmost index indexes adjacent memory: C(I,J), C(I,J+1) adjacent&lt;/P&gt;
&lt;P&gt;The sample Fortran program from DRASKO was written as if indexed for C++. Therefore the loops could not be vectorized, and the memory references tended to require more cache lines (i.e. the cache would flush before accessing a variable co-resident with a prior variable). Loop nesting order depends on the language's array indexing precedence order.&lt;/P&gt;
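&lt;P&gt;The Fortran half of this can be checked with a short sketch. It assumes a hypothetical 3-by-4 integer array and relies on RESHAPE filling its result in array element order, so each element ends up holding its own 1-based position in that order.&lt;/P&gt;

```fortran
program column_major
  implicit none
  integer, parameter :: nr = 3, nc = 4   ! hypothetical small shape
  integer :: c(nr,nc), i, j, k

  ! RESHAPE fills in array element order, so c(i,j) records its own
  ! 1-based position in that order.
  c = reshape([(k, k = 1, nr*nc)], [nr, nc])

  ! Column-major layout: position = i + (j - 1)*nr.
  do j = 1, nc
    do i = 1, nr
      if (c(i,j) /= i + (j - 1)*nr) error stop 'unexpected layout'
    end do
  end do

  ! Stepping the left index moves by 1; stepping the right index moves by nr.
  print *, 'positions of c(1,1) and c(2,1):', c(1,1), c(2,1)
  print *, 'positions of c(1,1) and c(1,2):', c(1,1), c(1,2)
end program column_major
```

&lt;P&gt;Here c(2,1) sits at position 2, directly after c(1,1), while c(1,2) sits at position 4, a full column (nr elements) away.&lt;/P&gt;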
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jul 2007 19:19:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866303#M2717</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-07-12T19:19:37Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866304#M2718</link>
      <description>&lt;P&gt;&lt;FONT size="2"&gt;Jim,&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;I had no in view of a concrete example.&lt;BR /&gt;I had in view of the general principles.&lt;BR /&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;Yurii&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jul 2007 06:47:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866304#M2718</guid>
      <dc:creator>abc_qmost</dc:creator>
      <dc:date>2007-07-13T06:47:29Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866305#M2719</link>
      <description>&lt;P&gt;Yurii,&lt;/P&gt;
&lt;P&gt;RE: &lt;FONT size="2"&gt;&lt;FONT size="3"&gt;I had no in view of a concrete example&lt;/FONT&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;I took the liberty of modifying Drasko's original code. Please compile it and make your own test runs. The modified code uses a higher precision timer and runs the timed loops indexed in both manners (concrete example).&lt;/P&gt;
&lt;P&gt;Note that each timed loop is run twice and the second value is used. This is done so that the cache is preconditioned in a way that should not unfairly penalize or favor one run over the other.&lt;/P&gt;
&lt;P&gt;Loops 31 and 61 are the alternate forms of loops 30 and 60 with the suggestions made in my earlier post.&lt;/P&gt;
&lt;P&gt;Reordering loop 60 (the major function to be optimized) into loop 61 results in over 5x performance improvement on my system. The reordering had more impact than throwing more cores at the problem (~3x performance increase from 1 core to 4 cores). Using both more cores and the reordered loop gives the best results. Additional optimizations can be made that can yield a bit more performance.&lt;/P&gt;
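&lt;P&gt;A compact serial sketch of the run-twice, keep-the-second measurement, under some assumptions: a hypothetical size N = 400 so that it finishes quickly, 1-D vectors, and SYSTEM_CLOCK with 64-bit counters as a stand-in for OMP_GET_WTIME. It is not the program that follows, only an outline of the measurement pattern. N needs to be much larger before the timings mean anything.&lt;/P&gt;

```fortran
program time_loop_orders
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 400          ! hypothetical size; increase for real timing
  real(dp), allocatable :: a(:,:), b(:), c1(:), c2(:)
  integer(int64) :: t0, t1, rate
  real(dp) :: t_i_outer, t_k_outer
  integer :: i, k, pass

  allocate(a(n,n), b(n), c1(n), c2(n))
  ! Integer-valued data keeps every sum exact, so the results can be compared.
  do k = 1, n
    b(k) = k - 1
    do i = 1, n
      a(i,k) = (i - 1) + (k - 1)
    end do
  end do
  call system_clock(count_rate=rate)

  ! Each variant runs twice; only the second (warm) pass is kept.
  do pass = 1, 2
    c1 = 0.0_dp
    call system_clock(count=t0)
    do i = 1, n
      do k = 1, n                        ! inner loop strides across a row of A
        c1(i) = c1(i) + a(i,k)*b(k)
      end do
    end do
    call system_clock(count=t1)
    t_i_outer = real(t1 - t0, dp)/real(rate, dp)
  end do

  do pass = 1, 2
    c2 = 0.0_dp
    call system_clock(count=t0)
    do k = 1, n
      do i = 1, n                        ! inner loop walks down a column of A
        c2(i) = c2(i) + a(i,k)*b(k)
      end do
    end do
    call system_clock(count=t1)
    t_k_outer = real(t1 - t0, dp)/real(rate, dp)
  end do

  if (any(c1 /= c2)) error stop 'loop orders disagree'
  print *, 'I outer, K inner:', t_i_outer, 's'
  print *, 'K outer, I inner:', t_k_outer, 's'
end program time_loop_orders
```

&lt;P&gt;Resetting the result vector inside each pass matters: without it the second pass would accumulate on top of the first and the two variants could no longer be compared.&lt;/P&gt;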
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;On my 4-Core system&lt;/P&gt;&lt;PRE&gt;matrix dimension?&lt;BR /&gt;10000&lt;BR /&gt;number of threads?&lt;BR /&gt;4&lt;BR /&gt;Run time loop 30 8.49819253990427&lt;BR /&gt;Run time loop 31 0.485419438919052&lt;BR /&gt;Difference 17.5069061074858&lt;BR /&gt;Run time loop 60 2.46630702423863&lt;BR /&gt;Run time loop 61 0.466605992522091&lt;BR /&gt;Difference 5.28563084007514&lt;/PRE&gt;&lt;PRE&gt;Program with modifications follows&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;PROGRAM&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; MVP&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;INTEGER&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: NRA, NCA, TID, NTHREADS, I, J, K, CHUNK&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;INTEGER&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: OMP_GET_NUM_THREADS,OMP_GET_THREAD_NUM&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;real(8)&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: OMP_GET_WTIME, OMP_GET_WTICK&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;integer&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: start, finish, rate&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;REAL(8)&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;, &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;allocatable&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: A(:,:), B(:,:), C(:,:)&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;real(8)&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: Start_WTIME, End_WTIME&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;real(8)&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: RT30, RT31, RT60, RT61&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;integer&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; :: 
cacheFlush&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Input the number of columns/rows of the matrix&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*)'matrix dimension?'&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;read&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*)NRA&lt;P&gt;&lt;/P&gt;&lt;P&gt;NCA=NRA&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;allocate&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(A(NRA,NCA),stat=istat)&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;allocate&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(B(NCA,1),stat=istat)&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;allocate&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(C(NRA,1),stat=istat)&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Set loop iteration chunk size &lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;CHUNK = 10&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Input the number of threads&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*)'number of threads?'&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;read&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*)kp&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;call&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; omp_set_num_threads(kp)&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Initialize matrices&lt;/P&gt;&lt;P&gt;! Perform test twice&lt;/P&gt;&lt;P&gt;! First iteration used to normalize cache&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;do&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; cacheFlush=0,1&lt;P&gt;&lt;/P&gt;&lt;P&gt;Start_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 30 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 30 J=1, NCA&lt;P&gt;&lt;/P&gt;&lt;P&gt;A(I,J) = (I-1)+(J-1)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000" size="2"&gt;&lt;P&gt;30&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;End_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;RT30 = End_WTIME - Start_WTIME&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;end do&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Perform test twice&lt;/P&gt;&lt;P&gt;! 
First iteration used to normalize cache&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;do&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; cacheFlush=0,1&lt;P&gt;&lt;/P&gt;&lt;P&gt;Start_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 31 J=1, NCA&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 31 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;P&gt;A(I,J) = (I-1)+(J-1)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000" size="2"&gt;&lt;P&gt;31&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;End_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;RT31 = End_WTIME - Start_WTIME&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;end do&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*) 'Run time loop 30', RT30&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*) 'Run time loop 31', RT31&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*) 'Difference', RT30/RT31&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT 
size="2"&gt; 40 I=1, NCA&lt;P&gt;&lt;/P&gt;&lt;P&gt;B(I,1) = (I-1)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000" size="2"&gt;&lt;P&gt;40&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 50 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;P&gt;C(I,1) = 0&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000" size="2"&gt;&lt;P&gt;50&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Perform test twice&lt;/P&gt;&lt;P&gt;! First iteration used to normalize cache&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;do&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; cacheFlush=0,1&lt;P&gt;&lt;/P&gt;&lt;P&gt;Start_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! 
Do matrix-vector multiply sharing iterations on outer loop&lt;/P&gt;&lt;P&gt;!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 60 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 60 K=1, NCA&lt;P&gt;&lt;/P&gt;&lt;P&gt;C(I,1) = C(I,1) + A(I,K) * B(K,1)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000" size="2"&gt;&lt;P&gt;60&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;!$OMP END PARALLEL DO&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;End_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;RT60 = End_WTIME - Start_WTIME&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;end do&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;! Perform test twice&lt;/P&gt;&lt;P&gt;! 
First iteration used to normalize cache&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;do&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; cacheFlush=0,1&lt;P&gt;&lt;/P&gt;&lt;P&gt;Start_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;!$OMP PARALLEL DO SCHEDULE(STATIC, CHUNK) SHARED(A,B,C) PRIVATE(I,K)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 61 K=1, NCA&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;DO&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; 61 I=1, NRA&lt;P&gt;&lt;/P&gt;&lt;P&gt;C(I,1) = C(I,1) + A(I,K) * B(K,1)&lt;/P&gt;&lt;/FONT&gt;&lt;FONT color="#ff0000" size="2"&gt;&lt;P&gt;61&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt; &lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;CONTINUE&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT color="#008000" size="2"&gt;&lt;P&gt;!$OMP END PARALLEL DO&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;End_WTIME = OMP_GET_WTIME()&lt;/P&gt;&lt;P&gt;RT61 = End_WTIME - Start_WTIME&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;end do&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*) 'Run time loop 60', RT60&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*) 'Run time loop 61', RT61&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT 
color="#0000ff" size="2"&gt;write&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;(*,*) 'Difference', RT60/RT61&lt;P&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT color="#0000ff" size="2"&gt;&lt;P&gt;END&lt;/P&gt;&lt;/FONT&gt;&lt;/B&gt;&lt;FONT&gt;&lt;/FONT&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 13 Jul 2007 17:02:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866305#M2719</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-07-13T17:02:55Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866306#M2720</link>
      <description>&lt;P&gt;&lt;FONT style="BACKGROUND-COLOR: #d4d0c8" size="2"&gt;Jim,&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT style="BACKGROUND-COLOR: #d4d0c8" size="2"&gt;I probably explained my idea badly. &lt;BR /&gt;I'll try to explain it again.&lt;BR /&gt;Tridiagonalization of a matrix consists of BLAS2 and BLAS3 operations.&lt;BR /&gt;The BLAS2 part is handled by the function dsymv.&lt;BR /&gt;Try executing it on several cores.&lt;BR /&gt;The source code is in the LAPACK package.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT style="BACKGROUND-COLOR: #d4d0c8"&gt;&lt;FONT size="2"&gt;Yurii&lt;BR /&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jul 2007 20:13:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866306#M2720</guid>
      <dc:creator>abc_qmost</dc:creator>
      <dc:date>2007-07-13T20:13:29Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866307#M2721</link>
      <description>&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;My first message has been removed by someone. Therefore I repeat the idea. For BLAS2 the most important is a competent programming. Performance BLAS2 on several cores for model of theshared memory is a blatant ignorance.&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;See: &lt;/FONT&gt;&lt;A href="http://www.thesa-store.com/products/"&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.thesa-store.com/products" target="_blank"&gt;http://www.thesa-store.com/products&lt;/A&gt;/&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Yurii&lt;/DIV&gt;</description>
      <pubDate>Wed, 15 Aug 2007 05:39:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866307#M2721</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-15T05:39:04Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP: slow-down for matrix-vector product</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866308#M2722</link>
      <description>&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;In addition to previous my message I shall note, that in all algorithms where it is used BLAS2, Intel MKL conducts calculations on one core.&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;Yurii&lt;/FONT&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 15 Aug 2007 06:23:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-slow-down-for-matrix-vector-product/m-p/866308#M2722</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-15T06:23:53Z</dc:date>
    </item>
  </channel>
</rss>

