<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Poor parallelization for medium workloads - How does workload i in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807514#M877</link>
    <description>Thank you for the quick reply.&lt;BR /&gt;I am afraid I cannot post the actual code.&lt;BR /&gt;I can show the for-loop pragma:&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)&lt;BR /&gt;for(p = 0 ; p &amp;lt; K ; p++)&lt;BR /&gt;{&lt;BR /&gt; DOSOMETHING;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;K is the number of points.&lt;BR /&gt;P_pt is the point information (x, y, and other data).&lt;BR /&gt;start_output_pt is a pointer to the output data for each point.&lt;BR /&gt;&lt;BR /&gt;Can you elaborate on why parallelization only works efficiently for 100K+ instructions?&lt;BR /&gt;I ran an experiment and unrolled the for-loop to:&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)&lt;BR /&gt;for(p = 0 ; p &amp;lt; K ; p=p+2)&lt;BR /&gt;{&lt;BR /&gt; DOSOMETHING;&lt;BR /&gt; DOSOMETHING;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;This had no impact on running time, even though each loop iteration now has twice the workload.&lt;BR /&gt;&lt;BR /&gt;Any ideas?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Ianir.</description>
    <pubDate>Mon, 07 Feb 2011 14:06:41 GMT</pubDate>
    <dc:creator>Ianir_Ideses</dc:creator>
    <dc:date>2011-02-07T14:06:41Z</dc:date>
    <item>
      <title>Poor parallelization for medium workloads - How does workload impact parallelization ?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807511#M874</link>
      <description>Hi,&lt;BR /&gt;I am currently working on highly optimized code designed to run in an HPC environment.&lt;BR /&gt;This code computes a sequence of simple image processing operations in pixel neighborhoods for points in an image. Typically, I have on the order of 1200-2000 points per image.&lt;BR /&gt;&lt;BR /&gt;The code runs on an 8-core Intel Xeon CPU E5420 @ 2.50GHz. The OS is CentOS 5, 64 bit.&lt;BR /&gt;The code is written in C and compiled using the Intel compiler. Multithreading is done by an OpenMP for-loop (guided) pragma over the points.&lt;BR /&gt;&lt;BR /&gt;The problem I am facing is that I do not get the X8 (not even X7) performance boost I am expecting.&lt;BR /&gt;This problem gets worse as I decrease the number of points and is alleviated as I increase them.&lt;BR /&gt;&lt;BR /&gt;For example, for 1200 points I get a relative speedup (compared to a single thread) of X4.3, for 2400 points X6.3, for 3600 points X7.1. So the typical speedup (1200 points) is relatively low.&lt;BR /&gt;&lt;BR /&gt;I am using the latest VTune to analyze this issue; so far I am not seeing any dominant parameters that explain this behaviour. I isolated the serial code, and the speedup factors for the main parallelized for-loop alone are the same as detailed above. This suggests that it is not the serial parts that are holding the runtime back.&lt;BR /&gt;&lt;BR /&gt;I used the article "http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/" to measure tuning ratios.&lt;BR /&gt;&lt;BR /&gt;The ratios I measured look reasonable for both 1200 and 3600 points:&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN style="text-decoration: underline;"&gt;For 1200 points:&lt;/SPAN&gt;&lt;BR /&gt;&lt;P&gt;CPI = 0.80759&lt;BR /&gt;Parallelization_ratio = 0.91599&lt;BR /&gt;Modified_data_sharing_ratio = 0.00087244&lt;BR /&gt;L2_cache_miss = 324000&lt;BR /&gt;Branch_misprediction_ratio = 0.0077442&lt;BR /&gt;Bus_utilization_ratio = 0.18217&lt;/P&gt;&lt;SPAN style="text-decoration: underline;"&gt;For 3600 points:&lt;/SPAN&gt;&lt;BR /&gt;&lt;P&gt;CPI = 0.78496&lt;BR /&gt;Parallelization_ratio = 0.99238&lt;BR /&gt;Modified_data_sharing_ratio = 0.00085696&lt;BR /&gt;L2_cache_miss = 1089000&lt;BR /&gt;Branch_misprediction_ratio = 0.0073767&lt;BR /&gt;Bus_utilization_ratio = 0.2105&lt;/P&gt;According to the article above, both sets of ratios are acceptable; however, the speedup is not up to par.&lt;BR /&gt;Is there another important ratio or event that may indicate what the problem is?&lt;BR /&gt;&lt;BR /&gt;Thank you in advance,&lt;BR /&gt;Ianir.</description>
      <pubDate>Mon, 07 Feb 2011 12:32:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807511#M874</guid>
      <dc:creator>Ianir_Ideses</dc:creator>
      <dc:date>2011-02-07T12:32:27Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807512#M875</link>
      <description>I suspect the parallelization_ratio is only that high under VTune.&lt;BR /&gt;To get a higher parallelization_ratio under production load, your algorithm needs more work for each thread: about 100,000 or more instructions per thread.&lt;BR /&gt;My suggestion is to use task-based parallelization, where 1 task = 1 image, if you have a large number of images.</description>
      <pubDate>Mon, 07 Feb 2011 12:54:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807512#M875</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T12:54:19Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807513#M876</link>
      <description>Could you please show the whole code, or some pieces of it: the image declaration (like char image[1200]) and the parallelized for loop, including the pragma line?</description>
      <pubDate>Mon, 07 Feb 2011 12:59:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807513#M876</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T12:59:08Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807514#M877</link>
      <description>Thank you for the quick reply.&lt;BR /&gt;I am afraid I cannot post the actual code.&lt;BR /&gt;I can show the for-loop pragma:&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)&lt;BR /&gt;for(p = 0 ; p &amp;lt; K ; p++)&lt;BR /&gt;{&lt;BR /&gt; DOSOMETHING;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;K is the number of points.&lt;BR /&gt;P_pt is the point information (x, y, and other data).&lt;BR /&gt;start_output_pt is a pointer to the output data for each point.&lt;BR /&gt;&lt;BR /&gt;Can you elaborate on why parallelization only works efficiently for 100K+ instructions?&lt;BR /&gt;I ran an experiment and unrolled the for-loop to:&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided)&lt;BR /&gt;for(p = 0 ; p &amp;lt; K ; p=p+2)&lt;BR /&gt;{&lt;BR /&gt; DOSOMETHING;&lt;BR /&gt; DOSOMETHING;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;This had no impact on running time, even though each loop iteration now has twice the workload.&lt;BR /&gt;&lt;BR /&gt;Any ideas?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Ianir.</description>
      <pubDate>Mon, 07 Feb 2011 14:06:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807514#M877</guid>
      <dc:creator>Ianir_Ideses</dc:creator>
      <dc:date>2011-02-07T14:06:41Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807515#M878</link>
      <description>That's excellent parallel speedup for E54xx with schedule(guided). I suppose you set guided because you expect some loop iterations to do significantly more work than others. Possibly, particularly when there is just enough work to use all the cores, you will get better performance without schedule(guided), as setting affinity (e.g. KMP_AFFINITY=compact) works more effectively without (guided).</description>
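      <!--
      Editor's sketch (not part of the original post): one way to try the
      suggestion above, comparing guided against other schedules without
      recompiling, is OpenMP's schedule(runtime), which reads the schedule
      from the OMP_SCHEDULE environment variable. The function names
      work and run_points and the per-point computation are hypothetical
      stand-ins for DOSOMETHING.

```c
#include <assert.h>

/* Hypothetical stand-in for DOSOMETHING. */
static float work(float x) { return x * 2.0f + 1.0f; }

/*
 * schedule(runtime) defers the choice of schedule to the OMP_SCHEDULE
 * environment variable, so static, dynamic and guided can be compared
 * without rebuilding. Without OpenMP enabled the loop simply runs serially.
 */
void run_points(const float *in, float *out, int K)
{
    assert(in != 0 && out != 0 && K >= 0);
    int p;
#pragma omp parallel for shared(in, out) schedule(runtime)
    for (p = 0; p < K; p++) {
        out[p] = work(in[p]);
    }
}
```

      A run could then look like OMP_SCHEDULE="static" KMP_AFFINITY=compact ./app
      versus OMP_SCHEDULE="guided" ./app, where ./app is a hypothetical binary
      name. Whether static wins depends on how uneven the per-point work is.
      -->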
      <pubDate>Mon, 07 Feb 2011 14:13:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807515#M878</guid>
      <dc:creator>timintel</dc:creator>
      <dc:date>2011-02-07T14:13:34Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807516#M879</link>
      <description>I'm just interested: what specific operations are you doing with the images?</description>
      <pubDate>Mon, 07 Feb 2011 14:35:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807516#M879</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2011-02-07T14:35:43Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807517#M880</link>
      <description>Could you please share something about the point and P_pt declarations? At least the sizeof.</description>
      <pubDate>Mon, 07 Feb 2011 14:37:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807517#M880</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T14:37:43Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807518#M881</link>
      <description>Did you try other types of schedule? It seems you can get more performance using dynamic or guided WITH a specified chunk size. It's like your unrolling (for(p = 0 ; p &amp;lt; K ; p=p+2)), but simpler.&lt;BR /&gt;Please try:&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(P_pt, start_output_pt) schedule(guided, X)&lt;BR /&gt;for(p = 0 ; p &amp;lt; K ; p++) {DOSOMETHING;}&lt;BR /&gt;&lt;BR /&gt;#pragma omp parallel for shared(P_pt, start_output_pt) schedule(dynamic, X)&lt;BR /&gt;for(p = 0 ; p &amp;lt; K ; p++) {DOSOMETHING;}&lt;BR /&gt;&lt;BR /&gt;where X = K/m/num_threads, with m from 2 to 10.</description>
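      <!--
      Editor's sketch (not part of the original post): a minimal, compilable
      version of the suggestion above, assuming 4 floats per point as stated
      later in the thread. process_point, chunk_size and process_all are
      hypothetical names, and the per-point computation is a placeholder for
      DOSOMETHING. The chunk size is clamped to 1 so that small K still works.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for DOSOMETHING: any per-point work. */
static void process_point(const float *P_pt, float *out, int p)
{
    /* Each point is assumed to hold 4 floats; only two are used here. */
    out[p] = P_pt[4 * p] + P_pt[4 * p + 1];
}

/* X = K / m / num_threads, clamped to at least 1. */
static int chunk_size(int K, int m, int num_threads)
{
    assert(K > 0 && m > 0 && num_threads > 0);
    int x = K / m / num_threads;
    return x > 0 ? x : 1;
}

void process_all(const float *P_pt, float *out, int K, int m)
{
    int num_threads = 1;
#ifdef _OPENMP
    num_threads = omp_get_max_threads();
#endif
    int X = chunk_size(K, m, num_threads);
    int p;
    /* Swap dynamic for guided to try the other variant from the post. */
#pragma omp parallel for shared(P_pt, out) schedule(dynamic, X)
    for (p = 0; p < K; p++) {
        process_point(P_pt, out, p);
    }
}
```

      For K = 1200, m = 4 and 8 threads this gives X = 37 iterations per
      chunk. Larger m means smaller chunks: better load balance, more
      scheduling calls.
      -->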
      <pubDate>Mon, 07 Feb 2011 14:48:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807518#M881</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T14:48:52Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807519#M882</link>
      <description>About the 100K+ instructions:&lt;BR /&gt;As I see it, you used guided scheduling with the default chunk size of 1, which means the K iterations can be split into many small chunks for 8 cores. Each chunk is scheduled separately, in the order in which threads request work. That scheduling is overhead, which degrades your parallelization ratio.</description>
      <pubDate>Mon, 07 Feb 2011 14:57:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807519#M882</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T14:57:06Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807520#M883</link>
      <description>Yes:&lt;BR /&gt;float * P_pt = (float *) malloc(K*4*sizeof(float)); &lt;BR /&gt;(K is typically 1200-2000).&lt;BR /&gt;&lt;BR /&gt;The operations I do are basically a form of histograms in a spatial neighborhood.&lt;BR /&gt;&lt;BR /&gt;The guided schedule is indeed due to very different neighborhood sizes for different points, which results in varying runtime. I have tried setting the schedule to dynamic, but have not tried specifying chunk sizes, I will do so now. Thanks.</description>
      <pubDate>Mon, 07 Feb 2011 14:59:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807520#M883</guid>
      <dc:creator>Ianir_Ideses</dc:creator>
      <dc:date>2011-02-07T14:59:06Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807521#M884</link>
      <description>&lt;P&gt;Let's try to clarify the rule for assigning iterations in guided scheduling:&lt;BR /&gt;P1 = C*K/(num_threads - 1);&lt;BR /&gt;P2 = C*(K-P1)/(num_threads - 1);&lt;BR /&gt;P3 = C*(K-P1-P2)/(num_threads - 1);&lt;BR /&gt;...&lt;BR /&gt;Pn = C*(K-P1-...-P(n-1))/(num_threads - 1);&lt;BR /&gt;... while Pn &amp;gt; chunksize&lt;BR /&gt;P(n+1) = chunksize&lt;BR /&gt;...&lt;BR /&gt;P(n+m-1) = chunksize&lt;BR /&gt;P(n+m) = remaining iterations &amp;lt;= chunksize&lt;BR /&gt;where C &amp;lt;= 1&lt;BR /&gt;&lt;BR /&gt;With the default chunksize=1:&lt;BR /&gt;for K=1200, C=0.5 we have about 81 schedules, 14.6 iterations on average&lt;BR /&gt;for K=2000, C=0.5 we have about 89 schedules, 22.5 iterations on average. That is more work per schedule, which is why you get a better ratio for K=2000.&lt;BR /&gt;&lt;BR /&gt;With chunksize=20:&lt;BR /&gt;for K=1200, C=0.5 we have about 34 schedules, 35.3 iterations on average&lt;BR /&gt;for K=2000, C=0.5 we have about 41 schedules, 48.8 iterations on average&lt;/P&gt;</description>
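      <!--
      Editor's sketch (not part of the original post): the rule above can be
      simulated to count how many scheduling events a loop incurs. This
      follows the model in the post, with chunks truncated to whole
      iterations, which is an assumption; it does not describe what any
      particular OpenMP runtime does. guided_chunk_count is a hypothetical
      name.

```c
#include <assert.h>

/*
 * Count the chunks handed out under the guided model from the post:
 * each chunk is C * remaining / (num_threads - 1), truncated to an
 * integer, never smaller than min_chunk, and the final chunk takes
 * whatever is left.
 */
long guided_chunk_count(long K, int num_threads, double C, long min_chunk)
{
    assert(K > 0 && num_threads > 1 && C > 0.0 && C <= 1.0 && min_chunk > 0);
    long remaining = K;
    long count = 0;
    while (remaining > 0) {
        long chunk = (long)(C * (double)remaining / (double)(num_threads - 1));
        if (chunk < min_chunk) chunk = min_chunk;   /* enforce the chunksize floor */
        if (chunk > remaining) chunk = remaining;   /* last, possibly short, chunk */
        remaining -= chunk;
        count++;
    }
    return count;
}
```

      Calling it with K = 1200, 8 threads, C = 0.5 and min_chunk of 1 or 20
      should land close to the schedule counts quoted in the post; K divided
      by the returned count is the average number of iterations per schedule.
      -->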
      <pubDate>Mon, 07 Feb 2011 15:32:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807521#M884</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T15:32:24Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807522#M885</link>
      <description>What does the magic 4 mean in "float * P_pt = (float *) malloc(K*4*sizeof(float));"?&lt;BR /&gt;Pay attention to false sharing issues.</description>
      <pubDate>Mon, 07 Feb 2011 15:53:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807522#M885</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T15:53:34Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807523#M886</link>
      <description>4 is the number of data elements inside each P_pt point.&lt;BR /&gt;&lt;BR /&gt;I tried dynamic scheduling with different chunk sizes. What I saw was a decrease in the CPU time spent in libiomp5.so, which is good: the scheduling overhead was reduced. However, the runtime of my code has not changed, and the speedup is still not what I expect.&lt;BR /&gt;&lt;BR /&gt;As far as I understand, false sharing happens when a write by one thread invalidates a cache line that another thread is using. I am not sure this is the case here, since increasing the number of points (which means more for-loop iterations) led to the expected speedup. Moreover, the VTune parameters I measured do not seem to point to such a problem. Is there a hardware event in VTune that will measure such cases?&lt;BR /&gt;&lt;BR /&gt;Thanks again, I appreciate the effort.</description>
      <pubDate>Mon, 07 Feb 2011 16:15:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807523#M886</guid>
      <dc:creator>Ianir_Ideses</dc:creator>
      <dc:date>2011-02-07T16:15:32Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807524#M887</link>
      <description>&amp;gt;&amp;gt; I tried the dynamic scheduling with different chunk sizes&lt;BR /&gt;How about guided scheduling? I'm interested in results))&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;Is there a hardware event in VTune that will measure such cases ?&lt;BR /&gt;I don't know.</description>
      <pubDate>Mon, 07 Feb 2011 16:29:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807524#M887</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-07T16:29:51Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807525#M888</link>
      <description>&amp;gt;&amp;gt;The guided schedule is indeed due to very different neighborhood sizes for different points, which results in varying runtime. I have tried setting the schedule to dynamic, but have not tried specifying chunk sizes, I will do so now. Thanks&lt;BR /&gt;&lt;BR /&gt;Ideally you want to partition the work evenly, with as few thread scheduling interactions as possible. When your neighborhood sizes are relatively large (and of various sizes), you might want to use a small chunk size (in number of neighborhoods). When the size of the neighborhood is relatively small, a larger chunk size would be in order. Therefore, consider partitioning your work into groups by size of neighborhood. Then use a chunk size inversely proportional to the relative neighborhood size.&lt;BR /&gt;&lt;BR /&gt;large neighborhoods: chunk = 1&lt;BR /&gt;medium neighborhoods: chunk = n (n to be determined)&lt;BR /&gt;small neighborhoods: chunk = m (m to be determined)&lt;BR /&gt;&lt;BR /&gt;Whether you use 2, 3, ..., x different size groupings depends on your application.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
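      <!--
      Editor's sketch (not part of the original post): one possible reading
      of the grouping idea above. Points are bucketed by neighborhood size
      and each bucket gets its own loop with its own dynamic chunk size. The
      thresholds, chunk values, function names and the placeholder workload
      are all hypothetical and would need tuning against the real code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical thresholds and chunk sizes; tune them for the real workload. */
enum { LARGE_MIN = 64, MEDIUM_MIN = 16 };
enum { CHUNK_LARGE = 1, CHUNK_MEDIUM = 8, CHUNK_SMALL = 32 };

/* Hypothetical stand-in for the per-point work; cost grows with size. */
static long process_neighborhood(int size)
{
    long acc = 0;
    for (int i = 0; i < size * size; i++) acc += i % 7;
    return acc;
}

/* Larger neighborhoods get smaller chunks, and vice versa. */
int chunk_for_size(int size)
{
    assert(size >= 0);
    if (size >= LARGE_MIN) return CHUNK_LARGE;
    if (size >= MEDIUM_MIN) return CHUNK_MEDIUM;
    return CHUNK_SMALL;
}

/* Run one parallel loop per size group, each with its own chunk size. */
void process_by_group(const int *sizes, long *out, int K)
{
    static const int chunks[3] = { CHUNK_LARGE, CHUNK_MEDIUM, CHUNK_SMALL };
    int *idx = malloc((size_t)K * sizeof *idx);
    if (idx == NULL) return;
    for (int g = 0; g < 3; g++) {
        int n = 0;
        /* Collect the indices of the points that belong to this group. */
        for (int p = 0; p < K; p++)
            if (chunk_for_size(sizes[p]) == chunks[g]) idx[n++] = p;
        int chunk = chunks[g];
        int i;
#pragma omp parallel for shared(sizes, out, idx) schedule(dynamic, chunk)
        for (i = 0; i < n; i++)
            out[idx[i]] = process_neighborhood(sizes[idx[i]]);
    }
    free(idx);
}
```

      Each group pays for its own parallel region, so with only 1200-2000
      points the extra fork/join cost may offset part of the gain; measuring
      both variants is the only way to know.
      -->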
      <pubDate>Mon, 07 Feb 2011 16:50:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807525#M888</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-02-07T16:50:06Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807526#M889</link>
      <description>I tried several values for guided chunks, did not see any improvement, but setting it too high did cause the runtime to increase (it also shows in the VTune, as I see threads waiting).&lt;BR /&gt;&lt;BR /&gt;I think I will try to implement Jim Dempsey's suggestion, it looks like it may help with reducing scheduling overhead.&lt;BR /&gt;&lt;BR /&gt;It still doesn't solve the main problem - my main functions seem to work better when the workload is higher (say, number of iterations, or neighborhood size inside the function), but it is a start.&lt;BR /&gt;&lt;BR /&gt;Is there a recommended book that deals with such issues ?&lt;BR /&gt;&lt;BR /&gt;Thanks.&lt;BR /&gt;</description>
      <pubDate>Tue, 08 Feb 2011 07:52:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807526#M889</guid>
      <dc:creator>Ianir_Ideses</dc:creator>
      <dc:date>2011-02-08T07:52:33Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807527#M890</link>
      <description>&amp;gt;&amp;gt; Is there a recommended book that deals with such issues ?&lt;BR /&gt;&lt;BR /&gt;Sorry, I have no idea, since I know nothing about the calculations inside the iterations ((</description>
      <pubDate>Tue, 08 Feb 2011 08:08:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807527#M890</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-08T08:08:51Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807528#M891</link>
      <description>OK.&lt;BR /&gt;Just one more thing, you mentioned before that parallelization will work well if I have more than 100,000 instructions. Do you know what is the reason for degradation for a lower number of instructions ?</description>
      <pubDate>Tue, 08 Feb 2011 11:31:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807528#M891</guid>
      <dc:creator>Ianir_Ideses</dc:creator>
      <dc:date>2011-02-08T11:31:04Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807529#M892</link>
      <description>Answer: scheduling overhead.&lt;BR /&gt;The TBB documentation suggests at least 100K clock cycles. I just loosely restated that as instructions.&lt;BR /&gt;See the "Tutorial" and search for "&lt;B&gt;Packaging Overhead Versus Grainsize&lt;/B&gt;":&lt;BR /&gt;&lt;A href="http://threadingbuildingblocks.org/documentation.php"&gt;http://threadingbuildingblocks.org/documentation.php&lt;/A&gt;</description>
      <pubDate>Tue, 08 Feb 2011 13:53:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807529#M892</guid>
      <dc:creator>Ilnar</dc:creator>
      <dc:date>2011-02-08T13:53:13Z</dc:date>
    </item>
    <item>
      <title>Poor parallelization for medium workloads - How does workload i</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807530#M893</link>
      <description>Chapman, Jost, van der Pas "Using OpenMP" is among the few good textbooks on OpenMP. They have a brief discussion of typical overheads in OpenMP, not including some important ones, such as the influence of affinity and cache issues.&lt;BR /&gt;You could look at the asm output file from your compiler and get an idea about the function calls implied by your OpenMP usage. Details under Intel OpenMP are intentionally hidden; with gnu compilers, you should be able to capture the full source code with expansion of OpenMP directives.</description>
      <pubDate>Tue, 08 Feb 2011 13:59:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-parallelization-for-medium-workloads-How-does-workload/m-p/807530#M893</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-02-08T13:59:41Z</dc:date>
    </item>
  </channel>
</rss>

