<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Jingbo. in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100162#M126592</link>
    <description>Intel® Fortran Compiler forum thread: Why is my parallel code using OpenMP much slower than the sequential code?</description>
    <pubDate>Mon, 08 May 2017 15:20:13 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2017-05-08T15:20:13Z</dc:date>
    <item>
      <title>Why my parallel code by OpenMP is much slower than the sequential code?</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100152#M126582</link>
      <description>&lt;P&gt;I tried to use OpenMP to parallelize&amp;nbsp;an inner loop. The code is as follows (see also the attachment):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program test
implicit none

REAL(8)::time1,time2
real(8),allocatable,dimension(:)::b
integer:: k,j,i

j=70000
allocate(b(j))
b=0.0D+00

call cpu_time(time1)
do k=1,40000
  do i=1,j
    b(i)=real(i,kind=8)
  enddo
enddo
call cpu_time(time2)
write(*,*)'The cpu time (s) by the sequential code is',time2-time1

call cpu_time(time1)
do k=1,40000
  !$OMP PARALLEL DO private(i)
  do i=1,j
    b(i)=real(i,kind=8)
  enddo
  !$OMP END PARALLEL DO
enddo
call cpu_time(time2)
write(*,*)'The cpu time (s) by the parallel code is',time2-time1

end program test&lt;/PRE&gt;

&lt;P&gt;The file name of the code is 'test.f90'. I built from the command line:&lt;/P&gt;

&lt;P&gt;ifort /Qopenmp test.f90&lt;/P&gt;

&lt;P&gt;The results are:&lt;/P&gt;

&lt;P&gt;The cpu time (s) by the sequential code is&amp;nbsp;&amp;nbsp; 1.98437500000000&lt;BR /&gt;
	The cpu time (s) by the parallel code is&amp;nbsp;&amp;nbsp; 3.85937500000000&lt;/P&gt;

&lt;P&gt;The parallel code is much slower than the sequential code, and I cannot figure out the reason.&lt;/P&gt;

&lt;P&gt;I would greatly appreciate your help with this problem.&lt;/P&gt;

&lt;P&gt;Thanks a lot.&lt;/P&gt;</description>
      <pubDate>Sun, 07 May 2017 02:40:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100152#M126582</guid>
      <dc:creator>jingbo_W_</dc:creator>
      <dc:date>2017-05-07T02:40:26Z</dc:date>
    </item>
    <item>
      <title>You leave opportunities for</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100153#M126583</link>
      <description>&lt;P&gt;You leave opportunities for the compiler to take shortcuts, more so in the sequential case.&lt;/P&gt;

&lt;P&gt;CPU_TIME adds up the time spent on all threads.&amp;nbsp; The usual goal of parallelism is to reduce elapsed time by splitting the (increased) total CPU time among threads. OMP_GET_WTIME measures elapsed time. &amp;nbsp;SYSTEM_CLOCK also works well, e.g. on Linux or with gfortran.&lt;/P&gt;

&lt;P&gt;By omitting the simd clause, you are suggesting that the compiler drop SIMD vectorization when it introduces threading. &amp;nbsp;If you use Intel Fortran, the opt-report (/Qopt-report) gives useful information about this.&amp;nbsp;&lt;/P&gt;
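&lt;P&gt;For example (an illustrative sketch, not from the original post), a combined construct asks for both threading and SIMD on the inner loop:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! threads split the iteration space; each thread's chunk is vectorized
!$OMP PARALLEL DO SIMD PRIVATE(i)
do i=1,j
&amp;nbsp;&amp;nbsp; b(i)=real(i,kind=8)
enddo
!$OMP END PARALLEL DO SIMD&lt;/PRE&gt;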

&lt;P&gt;If you have hyperthreading and Intel OpenMP, you should find it useful to set one thread per core via OMP_NUM_THREADS and OMP_PLACES=cores.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 07 May 2017 04:38:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100153#M126583</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-05-07T04:38:42Z</dc:date>
    </item>
    <item>
      <title>Hello Tim,</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100154#M126584</link>
      <description>&lt;P&gt;Hello Tim,&lt;/P&gt;

&lt;P&gt;Could you please post an example for "omp_places=cores"? Is "omp_places" an environment variable? Is "cores" to be replaced by a number, e.g. omp_places=4?&lt;/P&gt;

&lt;P&gt;I also saw "export omp_places=cores" somewhere. Where, when, and how should this be set?&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;</description>
      <pubDate>Sun, 07 May 2017 10:57:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100154#M126584</guid>
      <dc:creator>Johannes_A_</dc:creator>
      <dc:date>2017-05-07T10:57:00Z</dc:date>
    </item>
    <item>
      <title>On my dual core laptop under</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100155#M126585</link>
      <description>&lt;P&gt;On my dual-core laptop, under cmd, I set the OpenMP environment variables:&lt;/P&gt;

&lt;P&gt;set OMP_NUM_THREADS=2&lt;/P&gt;

&lt;P&gt;set OMP_PLACES=cores&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 07 May 2017 11:57:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100155#M126585</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-05-07T11:57:08Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100156#M126586</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Dear Tim,&lt;/P&gt;

&lt;P&gt;Thank you very much for your nice comments!&lt;/P&gt;

&lt;P&gt;I followed your suggestions, modified the code (see the figure below or the attached file), and built the application with&lt;/P&gt;

&lt;P&gt;ifort /Qopenmp /Qopt-report test.f90&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="code_test.jpg"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/9530iBA1D906B08A85281/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="code_test.jpg" alt="code_test.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;The output file test.optrpt reports the following (my questions are marked in bold):&lt;/P&gt;

&lt;DIV&gt;Intel(R) Advisor can now assist with vectorization and show optimization&lt;BR /&gt;
	&amp;nbsp; report messages with your source code.&lt;BR /&gt;
	See "https://software.intel.com/en-us/intel-advisor-xe" for details.&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Report from: Interprocedural optimizations [ipo]&lt;/DIV&gt;

&lt;DIV&gt;INLINING OPTION VALUES:&lt;BR /&gt;
	&amp;nbsp; -Qinline-factor: 100&lt;BR /&gt;
	&amp;nbsp; -Qinline-min-size: 30&lt;BR /&gt;
	&amp;nbsp; -Qinline-max-size: 230&lt;BR /&gt;
	&amp;nbsp; -Qinline-max-total-size: 2000&lt;BR /&gt;
	&amp;nbsp; -Qinline-max-per-routine: 10000&lt;BR /&gt;
	&amp;nbsp; -Qinline-max-per-compile: 500000&lt;/DIV&gt;

&lt;DIV&gt;&lt;STRONG&gt;What do these inline option values mean?&lt;/STRONG&gt;&lt;BR /&gt;
	Begin optimization report for: TEST&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Report from: Interprocedural optimizations [ipo]&lt;/DIV&gt;

&lt;DIV&gt;INLINE REPORT: (TEST) [1] C:\HPC3D\SOR\test.f90(1,9)&lt;/DIV&gt;

&lt;DIV&gt;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; Report from: OpenMP optimizations [openmp]&lt;/DIV&gt;

&lt;DIV&gt;C:\HPC3D\SOR\test.f90(24:8-24:8):OMP:MAIN__:&amp;nbsp; OpenMP DEFINED LOOP WAS PARALLELIZED&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Report from: Loop nest, Vector &amp;amp; Auto-parallelization optimizations [loop, vec, par]&lt;/DIV&gt;

&lt;DIV&gt;&lt;BR /&gt;
	LOOP BEGIN at C:\HPC3D\SOR\test.f90(11,1)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; remark #25408: memset generated&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; remark #15542: loop was not vectorized: inner loop was already vectorized&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp; LOOP BEGIN at C:\HPC3D\SOR\test.f90(11,1)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; remark #15300: LOOP WAS VECTORIZED&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; LOOP END&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp; LOOP BEGIN at C:\HPC3D\SOR\test.f90(11,1)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;lt;Remainder loop for vectorization&amp;gt;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; LOOP END&lt;BR /&gt;
	LOOP END&lt;/DIV&gt;

&lt;DIV&gt;LOOP BEGIN at C:\HPC3D\SOR\test.f90(16,8)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; remark #15542: loop was not vectorized: inner loop was already vectorized&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp; LOOP BEGIN at C:\HPC3D\SOR\test.f90(15,2)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; remark #15300: LOOP WAS VECTORIZED&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; LOOP END&lt;BR /&gt;
	LOOP END&lt;/DIV&gt;

&lt;DIV&gt;&lt;STRONG&gt;It says that the outer loop was not vectorized but the inner one was. Why?&lt;/STRONG&gt;&lt;/DIV&gt;

&lt;DIV&gt;LOOP BEGIN at C:\HPC3D\SOR\test.f90(25,2)&lt;BR /&gt;
	&amp;lt;Peeled loop for vectorization&amp;gt;&lt;BR /&gt;
	LOOP END&lt;/DIV&gt;

&lt;DIV&gt;LOOP BEGIN at C:\HPC3D\SOR\test.f90(25,2)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; remark #15300: LOOP WAS VECTORIZED&lt;BR /&gt;
	LOOP END&lt;/DIV&gt;

&lt;DIV&gt;LOOP BEGIN at C:\HPC3D\SOR\test.f90(25,2)&lt;BR /&gt;
	&amp;lt;Remainder loop for vectorization&amp;gt;&lt;BR /&gt;
	LOOP END&lt;/DIV&gt;

&lt;DIV&gt;&lt;BR /&gt;
	Non-optimizable loops:&lt;/DIV&gt;

&lt;DIV&gt;&lt;BR /&gt;
	LOOP BEGIN at C:\HPC3D\SOR\test.f90(29,1)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.&amp;nbsp;&amp;nbsp; [ C:\HPC3D\SOR\test.f90(24,8) ]&lt;/DIV&gt;

&lt;DIV&gt;&lt;STRONG&gt;Why?&lt;/STRONG&gt;&lt;BR /&gt;
	LOOP END&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Report from: Code generation optimizations [cg]&lt;/DIV&gt;

&lt;DIV&gt;C:\HPC3D\SOR\test.f90(11,1):remark #34026: call to memset implemented as a call to optimized library version&lt;/DIV&gt;

&lt;DIV&gt;&lt;STRONG&gt;What does it mean?&lt;/STRONG&gt;&lt;BR /&gt;
	===========================================================================&lt;/DIV&gt;

&lt;DIV&gt;The results of the modified program are:&lt;/DIV&gt;

&lt;DIV&gt;The cpu time (s) by the sequential code is&amp;nbsp;&amp;nbsp; 1.24195449252147&lt;BR /&gt;
	The cpu time (s) by the parallel code is&amp;nbsp;&amp;nbsp; 1.49823977670167&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;The parallel code is still slower than the sequential code, which seems strange.&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;Thanks in advance!&lt;/DIV&gt;</description>
      <pubDate>Sun, 07 May 2017 16:10:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100156#M126586</guid>
      <dc:creator>jingbo_W_</dc:creator>
      <dc:date>2017-05-07T16:10:19Z</dc:date>
    </item>
    <item>
      <title>Your parallel code is slow</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100157#M126587</link>
      <description>&lt;P&gt;Your parallel code is slow because you start up the threading and close it down 40000 times.&lt;/P&gt;

&lt;P&gt;Try this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program test
use omp_lib   
implicit none

REAL(8)::time1,time2
real(8),allocatable,dimension(:)::b
integer:: k,j,i

j=70000
allocate(b(j))
b=0.0D+00

time1=OMP_get_wtime()
do k=1,40000
	do i=1,j
		b(i)=real(i,kind=8)
	enddo
enddo
time2=OMP_get_wtime()
write(*,*)'The cpu time (s) by the sequential code is',time2-time1

time1=OMP_get_wtime()

!$OMP PARALLEL DO PRIVATE(b,i)
do k=1,40000
	do i=1,j
		b(i)=real(i,kind=8)
	enddo
enddo
!$OMP END PARALLEL DO

time2=OMP_get_wtime()
write(*,*)'The cpu time (s) by the parallel code is',time2-time1
pause
end program test&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 07 May 2017 16:51:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100157#M126587</guid>
      <dc:creator>Andrew_Smith</dc:creator>
      <dc:date>2017-05-07T16:51:43Z</dc:date>
    </item>
    <item>
      <title>In your example, you don't</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100158#M126588</link>
      <description>&lt;P&gt;In your example, you don't have any inlineable code, so the inline limits aren't remotely approached.&lt;/P&gt;

&lt;P&gt;In your case, if the compiler were to examine the outer loops with a view toward loop collapse (making one loop out of two), it would probably skip the outer iterations whose results are immediately overwritten, exposing a fallacy in this method of assessing performance.&amp;nbsp; Since such an optimization may occur in non-parallel code but be suppressed by a parallel directive, it is of particular concern for the kind of conclusion you are trying to draw.&lt;/P&gt;
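&lt;P&gt;To illustrate the point (a sketch, not actual compiler output): because every outer iteration overwrites b with the same values, an optimizer is free to treat the whole nest as a single pass:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! what the k=1,40000 nest may effectively reduce to,
! since only the last iteration's stores are observable
do i=1,j
&amp;nbsp;&amp;nbsp; b(i)=real(i,kind=8)
enddo&lt;/PRE&gt;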

&lt;P&gt;The function calls referred to in the opt-report must be the library calls inserted by expansion of the OMP directives.&amp;nbsp; Evidently they aren't vectorizable, and they may be protected against inlining, since inlining them would be counter-productive.&lt;/P&gt;

&lt;P&gt;The compiler has made memset library calls to zero out arrays.&amp;nbsp; In the context of your example, the effect is practically indistinguishable from locally vectorized code expansion.&amp;nbsp; The remark is useful to assure you that the compiler didn't decide to skip the code, even though no locally vectorized loop is generated.&lt;/P&gt;

&lt;P&gt;The remark about a peeled loop being generated in your parallel region may have a bearing on the extra time taken there.&amp;nbsp; You would want to check whether adding flags such as /QxHost /align:array32byte has an effect.&amp;nbsp; The peeled loop might cause one thread to take extra time.&lt;/P&gt;</description>
      <pubDate>Sun, 07 May 2017 18:48:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100158#M126588</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-05-07T18:48:16Z</dc:date>
    </item>
    <item>
      <title>Andrew's code may need</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100159#M126589</link>
      <description>&lt;P&gt;Andrew's code may need modification to meet your needs&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;...

!$OMP PARALLEL PRIVATE(i)
do k=1,40000
&amp;nbsp; !$OMP DO
&amp;nbsp; do i=1,j
&amp;nbsp;&amp;nbsp;&amp;nbsp; b(i)=real(i,kind=8)
&amp;nbsp; enddo
enddo
!$OMP END PARALLEL
...&lt;/PRE&gt;

&lt;P&gt;The reason I say "may need modification" is that it depends on what your actual code is doing, as opposed to what the sketch code you provided is doing:&lt;/P&gt;

&lt;P&gt;If you have 40000 "things" (objects, jobs, entities), each a separate calculation, then Andrew's suggestion is correct.&lt;BR /&gt;
	If you have one "thing" iterated 40000 times (e.g. a simulation advancing through time), then the method above would be correct.&lt;/P&gt;

&lt;P&gt;Tim's comments should be read and considered. There are some "gotchas" you can fall into when first exploring parallelization: CPU_TIME vs. elapsed time; compiler optimization eliding code that generates unused results; compiler optimization removing unnecessary iterations of loops; compiler optimizations producing results calculable at compile time; ...&amp;nbsp;and then there are naïve expectations: no overhead for region entry, no overhead for work distribution, no overhead/interference for memory bus and cache resources.&lt;/P&gt;

&lt;P&gt;All of the comments posted here are intended to help you through your learning experience.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 May 2017 12:08:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100159#M126589</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-05-08T12:08:16Z</dc:date>
    </item>
    <item>
      <title>Andrew and Jim are correct in</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100160#M126590</link>
      <description>&lt;P&gt;Andrew and Jim are correct in principle that entering a parallel region repeatedly can be expensive, and this can be checked by the method Jim showed.&amp;nbsp; Widely used implementations of OpenMP, including Intel's, minimize the penalty for repeated entries by keeping the thread pool alive for an interval.&amp;nbsp; Setting KMP_BLOCKTIME=0 in the original version would demonstrate the full penalty for repeated entry to the parallel region.&lt;/P&gt;</description>
      <pubDate>Mon, 08 May 2017 13:45:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100160#M126590</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-05-08T13:45:41Z</dc:date>
    </item>
    <item>
      <title>Dear Jim,</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100161#M126591</link>
      <description>&lt;P&gt;Dear Jim,&lt;/P&gt;

&lt;P&gt;Your suggestions are really helpful. In my model, the outer iterations&amp;nbsp;are sequenced and are not suitable for parallelization.&lt;/P&gt;

&lt;P&gt;I am trying to fully understand your and Tim's comments. The results are now improved, as follows:&lt;/P&gt;

&lt;P&gt;The&amp;nbsp;elapsed time (s) by the sequential code is&amp;nbsp;&amp;nbsp; 1.28671266883248&lt;BR /&gt;
	The&amp;nbsp;elapsed time (s) by the parallel code is&amp;nbsp; 0.742385344517970&lt;/P&gt;

&lt;P&gt;The speedup factor is about 1.7.&amp;nbsp; My CPU is an Intel i5-4200U, which has 1 socket, 2 cores, and 4 logical processors. My naive expectation of the speedup factor was about 4; however, the actual one is only about 1.7. This could be due to the overhead/interference you mentioned.&amp;nbsp;Is there&amp;nbsp;any other reason?&amp;nbsp;What is a reasonable speedup factor?&lt;/P&gt;

&lt;P&gt;Can you suggest some literature for understanding the overhead of region entry, the overhead of work distribution, and the overhead/interference of the memory bus and cache resources?&lt;/P&gt;

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Jingbo&lt;/P&gt;</description>
      <pubDate>Mon, 08 May 2017 14:24:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100161#M126591</guid>
      <dc:creator>jingbo_W_</dc:creator>
      <dc:date>2017-05-08T14:24:15Z</dc:date>
    </item>
    <item>
      <title>Jingbo.</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100162#M126592</link>
      <description>&lt;P&gt;Jingbo.&lt;/P&gt;

&lt;P&gt;Your test program is essentially memory (write) bandwidth limited: there is very little computation between writes. Your CPU has 2 cores and 4 threads. More importantly, it has 2 memory channels. Memory-bandwidth-limited applications tend to scale with the number of memory channels rather than the number of hardware threads. Floating-point-intensive applications&amp;nbsp;(that are not memory intensive) tend to scale with the number of cores (vector units). Scalar-intensive applications (that are not memory intensive) tend to scale with the number of hardware threads (though to a lesser extent for applications with larger cache utilization). There is no such thing as a typical application; applications in general will have a mix of "intensiveness". You will have to take the knowledge you gain from experience (currently at beginner level) and use it to determine where/how to improve opportunities for vectorization and where/how to parallelize the code. This learning experience will take time, with some experimentation on your part. The users on this forum will assist you, and they assist best when you show initiative with the advice given.&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;&amp;nbsp;the outer iterations&amp;nbsp;are sequenced and are not suitable for parallelization&lt;/P&gt;

&lt;P&gt;Do not assume this without a complete understanding of what is being performed. Your code could be suitable for:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;Preamble | YourParallelCode | Postamble
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Preamble&amp;nbsp; | YourParallelCode | Postamble
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Preamble&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; | YourParallelCode | Postamble&lt;/PRE&gt;

&lt;P&gt;Where the preamble and postamble sections may be doing things like file i/o or&amp;nbsp;state propagation.&lt;/P&gt;

&lt;P&gt;IOW you can use parallelization to overlap the preamble/postamble with the (parallel) compute section.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 08 May 2017 15:20:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100162#M126592</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-05-08T15:20:13Z</dc:date>
    </item>
    <item>
      <title>The above description is</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100163#M126593</link>
      <description>&lt;P&gt;The above description is called a parallel pipeline. In this case the preamble and postamble run sequentially on two threads of their own, concurrently with the inner parallel compute section. Barring overhead, the estimated total runtime for 40000 iterations is:&lt;/P&gt;

&lt;P&gt;1 Preamble time + 40000 parallel time + 1 postamble time.&lt;/P&gt;

&lt;P&gt;As opposed to (if the whole loop is treated as "not suitable for parallelization"):&lt;/P&gt;

&lt;P&gt;40000&amp;nbsp;Preamble time + 40000 parallel time +&amp;nbsp;40000 postamble time.&lt;/P&gt;
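&lt;P&gt;A minimal sketch of such a pipeline using OpenMP task dependencies (Preamble/Compute/Postamble are hypothetical routines; real overlap across iterations would also need multiple buffers, omitted here for brevity):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!$OMP PARALLEL
!$OMP SINGLE
do k=1,40000
&amp;nbsp; !$OMP TASK DEPEND(OUT:buf)
&amp;nbsp; call Preamble(k,buf)&amp;nbsp;&amp;nbsp; ! e.g. file input for step k
&amp;nbsp; !$OMP END TASK
&amp;nbsp; !$OMP TASK DEPEND(INOUT:buf)
&amp;nbsp; call Compute(buf)&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ! the parallelizable work
&amp;nbsp; !$OMP END TASK
&amp;nbsp; !$OMP TASK DEPEND(IN:buf)
&amp;nbsp; call Postamble(k,buf)&amp;nbsp; ! e.g. file output / state propagation
&amp;nbsp; !$OMP END TASK
enddo
!$OMP END SINGLE
!$OMP END PARALLEL&lt;/PRE&gt;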

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 08 May 2017 15:29:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100163#M126593</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-05-08T15:29:11Z</dc:date>
    </item>
    <item>
      <title>As this example relies on</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100164#M126594</link>
      <description>&lt;P&gt;As this example relies on floating-point instructions, and particularly if you find the best performance with 1 thread on each of 2 cores, the threaded speedup of 1.7 is fairly typical.&amp;nbsp; I have the same CPU here.&amp;nbsp; You should see a significant advantage from setting /QxHost vs. omitting that option, but it may reduce the threading speedup somewhat.&lt;/P&gt;</description>
      <pubDate>Mon, 08 May 2017 20:05:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Why-my-parallel-code-by-OpenMP-is-much-slower-than-the/m-p/1100164#M126594</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-05-08T20:05:26Z</dc:date>
    </item>
  </channel>
</rss>

