<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why (in my simple case) doesn't OpenMP provide a 2x perform in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915884#M4812</link>
    <description>&lt;P&gt;If you are adventuresome, open a disassembly window and see if the reduction operator +:s is being performed inside the loop or if a temp is used within the loop.&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 20 Aug 2007 12:59:29 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2007-08-20T12:59:29Z</dc:date>
    <item>
      <title>Why (in my simple case) doesn't OpenMP provide a 2x performance boost?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915882#M4810</link>
      <description>&lt;PRE&gt;Here is my simple program.&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;windows.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;int main()&lt;BR /&gt;{&lt;BR /&gt;    LARGE_INTEGER f, t1, t2;&lt;BR /&gt;&lt;BR /&gt;    int s = 0;&lt;BR /&gt;&lt;BR /&gt;    QueryPerformanceFrequency( &amp;amp;f );&lt;BR /&gt;    QueryPerformanceCounter( &amp;amp;t1 );&lt;BR /&gt;&lt;BR /&gt;    int i;&lt;BR /&gt;#pragma omp parallel for reduction(+:s)&lt;BR /&gt;    for ( i = 0; i &amp;lt; 1000000000; i++ )&lt;BR /&gt;    {&lt;BR /&gt;        s += i / 3178;&lt;BR /&gt;    }&lt;BR /&gt;&lt;BR /&gt;    QueryPerformanceCounter( &amp;amp;t2 );&lt;BR /&gt;&lt;BR /&gt;    printf( "%d %lf\n", s, static_cast&amp;lt;double&amp;gt;( t2.QuadPart - t1.QuadPart ) / f.QuadPart );&lt;BR /&gt;}&lt;BR /&gt;&lt;/PRE&gt;&lt;BR /&gt;I am using VS2005. When I compile it with/without the omp pragma I get times of 1.38/1.94 respectively. I wonder why it speeds up by only 40% and not by 100%. My CPU is a Core 2 Duo E6300. I have seen the same behavior on an AMD X2...&lt;BR /&gt;Thank you. 8 - )&lt;BR /&gt;</description>
      <pubDate>Thu, 19 Jul 2007 04:56:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915882#M4810</guid>
      <dc:creator>caa</dc:creator>
      <dc:date>2007-07-19T04:56:32Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915883#M4811</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;Hmm, it's not a bandwidth problem. My only guess is that the division you do does not take the same amount of time for each calculation. Division algorithms in hardware can vary in the time they take.&lt;/P&gt;
&lt;P&gt;To test this, try schedule(dynamic, 1000) and see if it improves. Of course, if you use the Thread Profiler you'll get a concrete view of what the problem is.&lt;/P&gt;
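&lt;P&gt;A minimal sketch of that suggestion, assuming the loop is moved into a helper function. The name sum_quotients_dynamic and the 64-bit accumulator are illustrative assumptions, not part of the original program. If OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.&lt;/P&gt;
<![CDATA[
```cpp
#include <cstdint>

// Sums i / divisor for i in [0, n).
// schedule(dynamic, 1000) hands out chunks of 1000 iterations to whichever
// thread becomes idle, which can even out iterations of uneven cost.
// Hypothetical helper: the name and the 64-bit accumulator are assumptions.
std::int64_t sum_quotients_dynamic(int n, int divisor) {
    std::int64_t s = 0;
    int i;
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:s)
    for (i = 0; i < n; i++) {
        s += i / divisor;
    }
    return s;
}
```
]]>
&lt;P&gt;Timing this against the same function with schedule(static) would show whether uneven iteration cost is really the issue.&lt;/P&gt;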
&lt;P&gt;Aaron&lt;/P&gt;</description>
      <pubDate>Mon, 20 Aug 2007 09:53:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915883#M4811</guid>
      <dc:creator>Aaron_C_Intel</dc:creator>
      <dc:date>2007-08-20T09:53:49Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915884#M4812</link>
      <description>&lt;P&gt;If you are adventuresome, open a disassembly window and see if the reduction operator +:s is being performed inside the loop or if a temp is used within the loop.&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 20 Aug 2007 12:59:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915884#M4812</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-08-20T12:59:29Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915885#M4813</link>
      <description>&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;For the shared-memory model and your type of loop, this is a good result.&lt;/FONT&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 21 Aug 2007 13:32:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915885#M4813</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-21T13:32:14Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915886#M4814</link>
      <description>&lt;DIV id="r_text"&gt;If you wish to improve results for this type of calculation on a shared-memory system, irrespective of the number of cores, increase the memory frequency (see: &lt;A href="http://www.thesa-store.com/products" target="_blank"&gt;http://www.thesa-store.com/products&lt;/A&gt;).&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;Yurii&lt;/DIV&gt;</description>
      <pubDate>Tue, 21 Aug 2007 13:47:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915886#M4814</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-21T13:47:54Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915887#M4815</link>
      <description>&lt;P&gt;Caa,&lt;/P&gt;
&lt;P&gt;Try adding schedule(static) to the #pragma.&lt;/P&gt;
&lt;P&gt;#pragma omp parallel for schedule(static) reduction(+:s)&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Aug 2007 14:52:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915887#M4815</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-08-21T14:52:23Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915888#M4816</link>
      <description>&lt;P&gt;&lt;FONT size="2"&gt;Jim, &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;I think that at the beginning the general principles matter most, and the details come afterwards.&lt;BR /&gt;For a parallel architecture with a shared-memory model, the size of the cache and its skillful use are important.&lt;BR /&gt;For BLAS3 this is achieved by careful programming. For example, I BLAS3 is much faster than BLAS3 from Intel MKL for IA32.&lt;BR /&gt;For BLAS2 and other computational methods where the cache cannot be used effectively, both careful programming (I BLAS2 is much faster than BLAS2 from Intel MKL for IA32 and EM64T) and the frequency of main memory are important. For example, Intel MKL uses only one core for BLAS2.&lt;BR /&gt;(see my page: &lt;/FONT&gt;&lt;A href="http://www.thesa-store.com/products" target="_blank"&gt;http://www.thesa-store.com/products&lt;/A&gt;&lt;FONT size="2"&gt;)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;Yurii&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Aug 2007 15:55:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915888#M4816</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-21T15:55:37Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915889#M4817</link>
      <description>&lt;P&gt;Yurii,&lt;/P&gt;
&lt;P&gt;Please take the time to look at the first message in this thread. It contains a sample program with a FOR loop holding one integer expression statement. The FOR loop is parallelized (I am assuming 2 threads) with a reduction operator. The index i, for each thread, should be registerized, the local copy of "s" should also be registerized, and the entire for loop should fit into the instruction cache (a few tens of bytes).&lt;/P&gt;
&lt;P&gt;Therefore the user's question of why not a 2x speedup was quite valid. The instruction streams, once in cache, should not interfere with each other, and all the rest of the code should be using registers. Therefore a ~2x speedup would be expected .... but was not observed.&lt;/P&gt;
&lt;P&gt;There are a few things to account for this.&lt;/P&gt;
&lt;P&gt;1) Something else is consuming processor resources (I think the user is savvy enough to have ruled this out).&lt;/P&gt;
&lt;P&gt;2) The threads interfere with one another. This could be due to how often the reduction occurs. If the #pragma contained "schedule(static,1)" then there would be 0.5E+9 potential collisions when performing the reduction on the sum s. However, if the #pragma contained "schedule(static)" then each thread's loop has only one potential collision when performing the reduction on s.&lt;/P&gt;
&lt;P&gt;3) The threads do not interfere with one another, but the computational load is not balanced between the threads. This can occur with an unfavorable choice of schedule. For example, with 2 threads and 1000000000 iterations, schedule(static,333333333) would result in 2 processors at 100% for 50% of the time (2/3 of the work) and one processor at 100% for the other 50% of the time (the remaining 1/3 of the work). The end result would be execution taking 66% of the single-thread time. Schedule(dynamic[,chunk]) may have similar characteristics, as may the other forms of schedule.&lt;/P&gt;
&lt;P&gt;The default schedule is implementation dependent and overridable with environment variables. The original post did not contain enough information to determine the schedule method.&lt;/P&gt;
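&lt;P&gt;The load-balance arithmetic in point 3 can be checked with a small model. This is only a sketch: it assumes every iteration costs the same, ignores scheduling overhead, and deals chunks to threads round-robin in thread order, as schedule(static, chunk) is specified to do. The function names static_chunk_load and relative_time are hypothetical.&lt;/P&gt;
<![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Iterations each thread receives under schedule(static, chunk):
// chunks are dealt round-robin in thread order.
std::vector<std::int64_t> static_chunk_load(std::int64_t n, std::int64_t chunk, int threads) {
    assert(n >= 0 && chunk > 0 && threads > 0);
    std::vector<std::int64_t> load(threads, 0);
    int t = 0;
    for (std::int64_t start = 0; start < n; start += chunk) {
        load[t] += std::min(chunk, n - start);  // last chunk may be short
        t = (t + 1) % threads;
    }
    return load;
}

// Elapsed time relative to a single thread, under the equal-cost assumption:
// the busiest thread determines when the loop finishes.
double relative_time(std::int64_t n, std::int64_t chunk, int threads) {
    assert(n > 0);
    std::vector<std::int64_t> load = static_chunk_load(n, chunk, threads);
    std::int64_t busiest = *std::max_element(load.begin(), load.end());
    return static_cast<double>(busiest) / static_cast<double>(n);
}
```
]]>
&lt;P&gt;For 1000000000 iterations, a chunk of 333333333 and 2 threads, the model gives thread 0 two full chunks and thread 1 one full chunk plus the single leftover iteration, so the relative time comes out at about 0.67. A chunk of 500000000 gives 0.5.&lt;/P&gt;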
&lt;P&gt;This thread has nothing to do with BLAS - please refrain from plugging your I BLAS3 product as it is not helping this user.&lt;/P&gt;
&lt;P&gt;Respectfully yours,&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Aug 2007 19:48:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915889#M4817</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-08-21T19:48:09Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915890#M4818</link>
      <description>&lt;P&gt;&lt;FONT style="BACKGROUND-COLOR: #d4d0c8" size="2"&gt;Jim,&lt;/FONT&gt;&lt;/P&gt;
&lt;DIV id="r_text"&gt;
&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;I agree that BLAS3 cannot help with this problem. I was comparing BLAS2 and BLAS3, and explaining that, unlike BLAS3, BLAS2 is not worth running on more than one core. Intel MKL follows this strategy. I was trying to explain that the original example fundamentally cannot approach 100%.&lt;/FONT&gt;&lt;/DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;Yurii&lt;/FONT&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 22 Aug 2007 10:18:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915890#M4818</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-22T10:18:10Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915891#M4819</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;So I ran the same code on my Dual Xeon 51xx (dual-core, 3.0 GHz) machine and I get perfect scalability, from 1.39 to 0.39. The only possible problem I could spot was your calculation of the time. I don't know whether what you wrote with the static_cast was a typo or not, but below I used (double). Anyway, can you try this code and see if you get the same problem?&lt;/P&gt;
&lt;P&gt;Here is the exact code I used:&lt;/P&gt;
&lt;PRE&gt;#include "stdafx.h"
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;windows.h&amp;gt;

#define MAX 1000000000
//#define MAX 1000000000

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER f, t1, t2;
    int s = 0;

    QueryPerformanceFrequency( &amp;amp;f );
    QueryPerformanceCounter( &amp;amp;t1 );

    int i;
#pragma omp parallel for reduction(+:s) schedule(static)
    for ( i = 0; i &amp;lt; MAX; i++ ) {
        s += i / 3178;
    }

    QueryPerformanceCounter( &amp;amp;t2 );

    printf( "%d %lf\n", s, (double)( t2.QuadPart - t1.QuadPart ) / (double)f.QuadPart );
}
&lt;/PRE&gt;</description>
      <pubDate>Wed, 22 Aug 2007 12:23:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915891#M4819</guid>
      <dc:creator>Aaron_C_Intel</dc:creator>
      <dc:date>2007-08-22T12:23:46Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915892#M4820</link>
      <description>&lt;P&gt;accoday,&lt;/P&gt;
&lt;P&gt;I took the liberty to modify your program to enclose the test in a loop to process from 1 to number of cores on my system. This gives me results for 1, 2, 3, 4 cores in one test run.&lt;/P&gt;
&lt;P&gt;Changing loop count to 2000000000 and running in 32-bit mode on my x64 server I read&lt;/P&gt;
&lt;PRE&gt;cores  time      multiplier
1      4.687090
2      2.360260  1.9858363
3      1.579181  2.96805116
4      1.192740  3.92968291&lt;/PRE&gt;
&lt;P&gt;Close to linear scaling.&lt;/P&gt;
&lt;P&gt;Using MS VC++. Except for an immediate constant (the inverse of 3178), the other variables in the compute loop were registerized, i.e. there should be no cache conflicts until the loop ends.&lt;/P&gt;
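&lt;P&gt;For reference, the multiplier column is the single-core time divided by the time on n cores. The sketch below shows that calculation. The helper names speedup and efficiency are illustrative, and efficiency (speedup divided by core count) is an added convenience, not a figure from the run above.&lt;/P&gt;
<![CDATA[
```cpp
#include <cassert>

// Speedup: how many times faster the parallel run is than the 1-core run.
double speedup(double t_single, double t_parallel) {
    assert(t_single > 0.0 && t_parallel > 0.0);
    return t_single / t_parallel;
}

// Parallel efficiency: speedup per core, where 1.0 means linear scaling.
double efficiency(double t_single, double t_parallel, int cores) {
    assert(cores > 0);
    return speedup(t_single, t_parallel) / cores;
}
```
]]>
&lt;P&gt;With the times above, speedup(4.687090, 1.192740) is about 3.93 on 4 cores, an efficiency of roughly 0.98.&lt;/P&gt;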
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Aug 2007 19:22:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915892#M4820</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2007-08-22T19:22:25Z</dc:date>
    </item>
    <item>
      <title>Re: Why (in my simple case) doesn't OpenMP provide a 2x perform</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915893#M4821</link>
      <description>&lt;P&gt;&lt;FONT style="BACKGROUND-COLOR: #d4d0c8" size="2"&gt;Caa,&lt;/FONT&gt;&lt;/P&gt;
&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;Your basic mistake: the number of cores (threads) is not specified.&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;
&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;One more mistake: the sum s grows larger than the maximum int value.&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;Therefore the result overflows and comes out negative.&lt;/FONT&gt;&lt;/DIV&gt;
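&lt;DIV&gt;&lt;FONT size="2"&gt;The overflow can be checked without running the full loop. The sketch below computes the exact sum in 64 bits from a closed form and then the value a 32-bit int would hold after wrapping. It assumes two's-complement wrap-around, which is what typical compilers produce here even though signed overflow is formally undefined behavior. The function names exact_sum and wrap_to_int32 are illustrative.&lt;/FONT&gt;&lt;/DIV&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstdint>

// Exact sum of i / d for i in [0, n), held in 64 bits.
// With q = n / d and r = n % d, every quotient k below q occurs d times and
// the quotient q occurs r times, so the sum is d * q * (q - 1) / 2 + q * r.
std::int64_t exact_sum(std::int64_t n, std::int64_t d) {
    assert(n >= 0 && d > 0);
    std::int64_t q = n / d;
    std::int64_t r = n % d;
    return d * (q * (q - 1) / 2) + q * r;  // q * (q - 1) is always even
}

// Value a 32-bit two's-complement int would hold after wrapping (assumption:
// the overflow wraps, as it does in practice on the compilers discussed here).
std::int64_t wrap_to_int32(std::int64_t value) {
    assert(value >= 0);
    std::int64_t low = value % 4294967296LL;  // keep the low 32 bits
    return low >= 2147483648LL ? low - 4294967296LL : low;
}
```
]]>
&lt;DIV&gt;&lt;FONT size="2"&gt;For 1000000000 iterations and a divisor of 3178 the exact sum is 157331155129352, which wraps to -2086857720, the value printed below. Declaring s as a 64-bit integer would avoid the overflow.&lt;/FONT&gt;&lt;/DIV&gt;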
&lt;DIV&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;windows.h&amp;gt;&lt;BR /&gt;#include &amp;lt;omp.h&amp;gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;#define NUM_THREADS 2&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;int main(){ &lt;BR /&gt;LARGE_INTEGER f, t1, t2; &lt;BR /&gt;int s=0;&lt;BR /&gt;QueryPerformanceFrequency(&amp;amp;f); &lt;BR /&gt;QueryPerformanceCounter(&amp;amp;t1); &lt;BR /&gt;int i;&lt;BR /&gt;omp_set_num_threads(NUM_THREADS);&lt;BR /&gt;#pragma omp parallel for reduction(+:s) &lt;BR /&gt;for ( i = 0; i &amp;lt; 1000000000; i++ ) { &lt;BR /&gt; s += i / 3178; &lt;BR /&gt;} &lt;BR /&gt;QueryPerformanceCounter(&amp;amp;t2); &lt;BR /&gt;printf( "%d %f\n", s, (double)(t2.QuadPart - t1.QuadPart) / (double)f.QuadPart);&lt;BR /&gt;}&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;NUM_THREADS=1:&lt;/FONT&gt;&lt;/DIV&gt;&lt;FONT size="2"&gt;-2086857720 1.583106&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;NUM_THREADS=2:&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;-2086857720 0.791723&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;
&lt;DIV id="r_text"&gt;&lt;FONT size="2"&gt;As you can see, the results are remarkable. &lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT size="2"&gt;Yurii&lt;/FONT&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 22 Aug 2007 20:17:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Why-in-my-simple-case-doesn-t-OpenMP-provide-a-2x-performance/m-p/915893#M4821</guid>
      <dc:creator>abcd_qmost</dc:creator>
      <dc:date>2007-08-22T20:17:42Z</dc:date>
    </item>
  </channel>
</rss>

