<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OpenMP no speedup in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901938#M4268</link>
    <description>&lt;P&gt;Hello, I've got several huge loops structured as follows:&lt;/P&gt;&lt;PRE&gt;for (unsigned int k = 1; k &amp;lt; ns_1; k++)&lt;BR /&gt;{&lt;BR /&gt;    for (unsigned int j = 1; j &amp;lt; ny_1; j++)&lt;BR /&gt;    {&lt;BR /&gt;        for (unsigned int i = 0; i &amp;lt; nx_1; i++)&lt;BR /&gt;        {&lt;BR /&gt;            *_C(UPtr) = quat_dtDivdx * (&lt;BR /&gt;                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));&lt;BR /&gt;            THROW_COURANT(*_C(UPtr)); UPtr++;&lt;BR /&gt;            uPtr++; u_Ptr++;&lt;BR /&gt;        }&lt;BR /&gt;        uPtr++; u_Ptr++;&lt;BR /&gt;    }&lt;BR /&gt;    uPtr += nx2; u_Ptr += nx2;&lt;BR /&gt;}&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Here _C( ) and _R( ) are macros that select points of the numerical pattern, i.e. the central point and the right point. The Ptr variables are sliders that move over one-dimensional arrays. So this is a very common kind of loop for a computational algorithm.&lt;/P&gt;
&lt;P&gt;Say I'd like to add OpenMP support here. I do the following:&lt;/P&gt;&lt;PRE&gt;#ifdef _OPENMP&lt;BR /&gt;#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)&lt;BR /&gt;for (int k = 0; k &amp;lt; ns_2; k++)&lt;BR /&gt;{&lt;BR /&gt;    // Thread-localize data sliders.&lt;BR /&gt;    double&lt;BR /&gt;        *loc_UPtr = UPtr + k * nx_1 * ny_2,&lt;BR /&gt;        *loc_uPtr = uPtr + k * np,&lt;BR /&gt;        *loc_u_Ptr = u_Ptr + k * np;&lt;BR /&gt;&lt;BR /&gt;    // Redefine data sliders.&lt;BR /&gt;#define UPtr loc_UPtr&lt;BR /&gt;#define uPtr loc_uPtr&lt;BR /&gt;#define u_Ptr loc_u_Ptr&lt;BR /&gt;#else&lt;BR /&gt;for (unsigned int k = 1; k &amp;lt; ns_1; k++)&lt;BR /&gt;{&lt;BR /&gt;#endif&lt;BR /&gt;    for (unsigned int j = 1; j &amp;lt; ny_1; j++)&lt;BR /&gt;    {&lt;BR /&gt;        for (unsigned int i = 0; i &amp;lt; nx_1; i++)&lt;BR /&gt;        {&lt;BR /&gt;            *_C(UPtr) = quat_dtDivdx * (&lt;BR /&gt;                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));&lt;BR /&gt;            THROW_COURANT(*_C(UPtr)); UPtr++;&lt;BR /&gt;            uPtr++; u_Ptr++;&lt;BR /&gt;        }&lt;BR /&gt;        uPtr++; u_Ptr++;&lt;BR /&gt;    }&lt;BR /&gt;    uPtr += nx2; u_Ptr += nx2;&lt;BR /&gt;&lt;BR /&gt;#ifdef _OPENMP&lt;BR /&gt;    // Restore the original slider names.&lt;BR /&gt;#undef UPtr&lt;BR /&gt;#undef uPtr&lt;BR /&gt;#undef u_Ptr&lt;BR /&gt;#endif&lt;BR /&gt;}&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;This simple idea came from looking at basic OpenMP examples:&lt;/P&gt;
&lt;P&gt;1) set the pragma on the outer loop&lt;/P&gt;
&lt;P&gt;2) for every slider, create an independent thread-local copy using preprocessor definitions&lt;/P&gt;
&lt;P&gt;OK, now let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function) are almost equal. However, in Task Manager I can see that with _OPENMP both cores are kept busy by the program's process. What is the reason?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
    <pubDate>Sun, 20 Apr 2008 22:02:29 GMT</pubDate>
    <dc:creator>maemarcus</dc:creator>
    <dc:date>2008-04-20T22:02:29Z</dc:date>
    <item>
      <title>OpenMP no speedup</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901938#M4268</link>
      <description>&lt;P&gt;Hello, I've got several huge loops structured as follows:&lt;/P&gt;&lt;PRE&gt;for (unsigned int k = 1; k &amp;lt; ns_1; k++)&lt;BR /&gt;{&lt;BR /&gt;    for (unsigned int j = 1; j &amp;lt; ny_1; j++)&lt;BR /&gt;    {&lt;BR /&gt;        for (unsigned int i = 0; i &amp;lt; nx_1; i++)&lt;BR /&gt;        {&lt;BR /&gt;            *_C(UPtr) = quat_dtDivdx * (&lt;BR /&gt;                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));&lt;BR /&gt;            THROW_COURANT(*_C(UPtr)); UPtr++;&lt;BR /&gt;            uPtr++; u_Ptr++;&lt;BR /&gt;        }&lt;BR /&gt;        uPtr++; u_Ptr++;&lt;BR /&gt;    }&lt;BR /&gt;    uPtr += nx2; u_Ptr += nx2;&lt;BR /&gt;}&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Here _C( ) and _R( ) are macros that select points of the numerical pattern, i.e. the central point and the right point. The Ptr variables are sliders that move over one-dimensional arrays. So this is a very common kind of loop for a computational algorithm.&lt;/P&gt;
&lt;P&gt;Say I'd like to add OpenMP support here. I do the following:&lt;/P&gt;&lt;PRE&gt;#ifdef _OPENMP&lt;BR /&gt;#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)&lt;BR /&gt;for (int k = 0; k &amp;lt; ns_2; k++)&lt;BR /&gt;{&lt;BR /&gt;    // Thread-localize data sliders.&lt;BR /&gt;    double&lt;BR /&gt;        *loc_UPtr = UPtr + k * nx_1 * ny_2,&lt;BR /&gt;        *loc_uPtr = uPtr + k * np,&lt;BR /&gt;        *loc_u_Ptr = u_Ptr + k * np;&lt;BR /&gt;&lt;BR /&gt;    // Redefine data sliders.&lt;BR /&gt;#define UPtr loc_UPtr&lt;BR /&gt;#define uPtr loc_uPtr&lt;BR /&gt;#define u_Ptr loc_u_Ptr&lt;BR /&gt;#else&lt;BR /&gt;for (unsigned int k = 1; k &amp;lt; ns_1; k++)&lt;BR /&gt;{&lt;BR /&gt;#endif&lt;BR /&gt;    for (unsigned int j = 1; j &amp;lt; ny_1; j++)&lt;BR /&gt;    {&lt;BR /&gt;        for (unsigned int i = 0; i &amp;lt; nx_1; i++)&lt;BR /&gt;        {&lt;BR /&gt;            *_C(UPtr) = quat_dtDivdx * (&lt;BR /&gt;                *_C(u_Ptr) + *_R(u_Ptr) + *_C(uPtr) + *_R(uPtr));&lt;BR /&gt;            THROW_COURANT(*_C(UPtr)); UPtr++;&lt;BR /&gt;            uPtr++; u_Ptr++;&lt;BR /&gt;        }&lt;BR /&gt;        uPtr++; u_Ptr++;&lt;BR /&gt;    }&lt;BR /&gt;    uPtr += nx2; u_Ptr += nx2;&lt;BR /&gt;&lt;BR /&gt;#ifdef _OPENMP&lt;BR /&gt;    // Restore the original slider names.&lt;BR /&gt;#undef UPtr&lt;BR /&gt;#undef uPtr&lt;BR /&gt;#undef u_Ptr&lt;BR /&gt;#endif&lt;BR /&gt;}&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;This simple idea came from looking at basic OpenMP examples:&lt;/P&gt;
&lt;P&gt;1) set the pragma on the outer loop&lt;/P&gt;
&lt;P&gt;2) for every slider, create an independent thread-local copy using preprocessor definitions&lt;/P&gt;
&lt;P&gt;OK, now let me ask the question: why does the threading extension described above bring absolutely NO benefit on a dual-core machine? I mean, there is no speedup; the timings (I use the clock() function) are almost equal. However, in Task Manager I can see that with _OPENMP both cores are kept busy by the program's process. What is the reason?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Sun, 20 Apr 2008 22:02:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901938#M4268</guid>
      <dc:creator>maemarcus</dc:creator>
      <dc:date>2008-04-20T22:02:29Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP no speedup</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901939#M4269</link>
      <description>clock() typically measures the total CPU time used by all threads, so you would have an excellent result if that time doesn't increase with threading. OpenMP provides the function omp_get_wtime() for measuring elapsed time; on Intel-compatible platforms, __rdtsc() may also be useful.&lt;BR /&gt;From your description, I'm not certain whether you have dependency problems, where one thread uses data that is updated by another.&lt;BR /&gt;</description>
      <pubDate>Mon, 21 Apr 2008 01:55:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901939#M4269</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-04-21T01:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP no speedup</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901940#M4270</link>
      <description>&lt;P&gt;Hello, Tim,&lt;/P&gt;
&lt;P&gt;Thanks for the reply.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt; From your description, I'm not certain if you have possible dependency problems, where one thread uses data which are updated by the other.&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;I suppose my threads are not data-dependent. The general formula is U = F(u, u_), where u and u_ are read-only and U does not depend on itself. To make it clear, let me provide the preprocessed source:&lt;/P&gt;
&lt;PRE&gt;double&lt;BR /&gt;    *UPtr = this-&amp;gt;get_TopLevel()-&amp;gt;get_Values(),&lt;BR /&gt;    *u_Ptr = uFlow-&amp;gt;levels[uFlow-&amp;gt;levelsCount - 1]-&amp;gt;get_Values() + nx + np,&lt;BR /&gt;    *uPtr = uFlow-&amp;gt;get_TopLevel()-&amp;gt;get_Values() + nx + np;&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(UPtr, uPtr, u_Ptr)&lt;BR /&gt;for (int k = 0; k &amp;lt; ns_2; k++)&lt;BR /&gt;{&lt;BR /&gt;    double&lt;BR /&gt;        *loc_UPtr = UPtr + k * nx_1 * ny_2,&lt;BR /&gt;        *loc_uPtr = uPtr + k * np,&lt;BR /&gt;        *loc_u_Ptr = u_Ptr + k * np;&lt;BR /&gt;&lt;BR /&gt;    for (unsigned int j = 1; j &amp;lt; ny_1; j++)&lt;BR /&gt;    {&lt;BR /&gt;        for (unsigned int i = 0; i &amp;lt; nx_1; i++)&lt;BR /&gt;        {&lt;BR /&gt;            *((loc_UPtr)) = quat_dtDivdx * (&lt;BR /&gt;                *((loc_u_Ptr)) + *((loc_u_Ptr + 1)) + *((loc_uPtr)) + *((loc_uPtr + 1)));&lt;BR /&gt;            if (abs(*((loc_UPtr))) &amp;gt; 1e0) throw *((loc_UPtr));; loc_UPtr++;&lt;BR /&gt;            loc_uPtr++; loc_u_Ptr++;&lt;BR /&gt;        }&lt;BR /&gt;        loc_uPtr++; loc_u_Ptr++;&lt;BR /&gt;    }&lt;BR /&gt;    loc_uPtr += nx2; loc_u_Ptr += nx2;&lt;BR /&gt;}&lt;BR /&gt;}&lt;/PRE&gt;
&lt;P&gt;So here, in the parallel version, I'm trying to give each k-iteration its own independent copies of the sliders (the names starting with loc_) with the corresponding offsets.&lt;/P&gt;
&lt;P&gt;Now, about timing. When I enclose the loop above in clock() calls, the result varies from 0.0149 to 0.016 sec, the same for the serial and parallel versions. If I change the clock() calls to omp_get_wtime(), the result varies from 0.0156 to 0.018 sec for serial and from 0.0110 to 0.118 sec for parallel. These timings differ a little from test to test; anyway, according to omp_get_wtime(), parallel seems to be 30% faster than serial.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2008 08:02:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901940#M4270</guid>
      <dc:creator>maemarcus</dc:creator>
      <dc:date>2008-04-21T08:02:01Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP no speedup</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901941#M4271</link>
      <description>I've heard that with OpenMP it is worse to parallelize code that uses pointers (as I do) rather than arrays indexed with []. Is that true?</description>
      <pubDate>Sun, 04 May 2008 07:22:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901941#M4271</guid>
      <dc:creator>maemarcus</dc:creator>
      <dc:date>2008-05-04T07:22:27Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP no speedup</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901942#M4272</link>
      <description>Pointers wouldn't necessarily be a problem if you made them and the loop indices private.&lt;BR /&gt;</description>
      <pubDate>Sun, 04 May 2008 13:46:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901942#M4272</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-05-04T13:46:10Z</dc:date>
    </item>
    <item>
      <title>Re: OpenMP no speedup</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901943#M4273</link>
      <description>OK, thanks!</description>
      <pubDate>Sun, 04 May 2008 13:58:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-no-speedup/m-p/901943#M4273</guid>
      <dc:creator>maemarcus</dc:creator>
      <dc:date>2008-05-04T13:58:28Z</dc:date>
    </item>
  </channel>
</rss>

