<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Ideal vectorization speed-up with SSE2 and MIC512 - not AVX? in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017231#M3972</link>
    <description>Forum question: a compute-bound Fortran benchmark dominated by exp and sqrt shows near-ideal vectorization speed-up with SSE3 and on the Xeon Phi (MIC), but not with AVX. Full post and code in the item below.</description>
    <pubDate>Thu, 09 Apr 2015 14:48:20 GMT</pubDate>
    <dc:creator>PKM</dc:creator>
    <dc:date>2015-04-09T14:48:20Z</dc:date>
    <item>
      <title>Ideal vectorization speed-up with SSE2 and MIC512 - not AVX?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017231#M3972</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;

&lt;P&gt;In the process of optimizing a large Fortran research code, I have written a simple program that closely reproduces the performance characteristics of the more complicated case. The code spends essentially all of its time evaluating exponential functions and square roots in a vectorizable manner, so it is a compute-bound problem that should be extremely well suited to the Xeon Phi and to wide vector units in general.&lt;/P&gt;

&lt;P&gt;By running the program below I obtained vectorized and unvectorized performance results for SSE3/AVX/Xeon Phi compilations. The "funny" thing is that I get virtually ideal vectorization speed-up for SSE3 and on the Xeon Phi, but not for AVX. I am using the latest version of Parallel Studio on Windows, and I run the program on a Xeon E5-2650 v2 with a Xeon Phi 3120. Performance numbers from running the attached code follow below ...&lt;/P&gt;

&lt;P&gt;Any idea why the AVX speed-up is so far from ideal?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Any idea why I am not seeing better performance from the Xeon Phi? My code is clearly compute bound and embarrassingly parallel, uses aligned vector instructions, and does no allocations, yet the Xeon Phi is only 3x faster than the host CPU running on SSE3 instructions. For SSE3 instructions the peak performance of the host CPU should be 166 GFLOPS versus 1000 GFLOPS for the Phi, so I would expect something more in line with a 6x difference?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thank you very much in advance for your advice!&lt;/P&gt;

&lt;P&gt;C&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;CPU SSE3 (NoVec/Vec): 437/871 -&amp;gt; 2.0x vectorization speed-up&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;CPU AVX (NoVec/Vec): 437/1194 -&amp;gt; &lt;SPAN style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;2.7x vectorization speed-up&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;Xeon Phi (NoVec/Vec): 343/2591 -&amp;gt; 7.6x&amp;nbsp;&lt;SPAN style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;vectorization speed-up&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;module mComputations&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; real*8,dimension(-57:50) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;:: RAll&lt;BR /&gt;
	&amp;nbsp; real*8,dimension(-55:16) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;:: LambdaAll,LambdaAll2&lt;BR /&gt;
	&amp;nbsp; real*8,dimension(-55:16) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;:: Ri,Ei,Fi,Hi,Un,Ui&lt;BR /&gt;
	&amp;nbsp; real*8,dimension(-55:16,1:20) &amp;nbsp; :: RefAll,UAll,E2&lt;BR /&gt;
	&amp;nbsp; real*8,dimension(1:20) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;:: FiltResp &amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	!dir$ attributes offload : mic :: LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll&lt;BR /&gt;
	!DIR$ ATTRIBUTES ALIGN : 64 :: &amp;nbsp;LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll&lt;BR /&gt;
	!$OMP THREADPRIVATE(LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll) &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; contains&lt;BR /&gt;
	!dir$ attributes offload : mic :: DoComputations &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; subroutine DoComputations(iNoModels,iUseVectorization)&lt;BR /&gt;
	! &amp;nbsp; Input:&lt;BR /&gt;
	! &amp;nbsp; &amp;nbsp;iNoModels &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; -&amp;gt; Number of models to calculate. Set high for good statistics on the benchmark.&lt;BR /&gt;
	! &amp;nbsp; &amp;nbsp;iUseVectorization -&amp;gt; If true, the benchmark is run with aligned vector instructions. False, no vectorization.&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; use omp_lib&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; implicit none&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; integer, intent(in) :: iNoModels&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; logical, intent(in) :: iUseVectorization&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; integer :: I2,I,J,IJMinCalc,IJMaxCalc,ijmaxloop,ijminloop,NoModels,t,k,Models,NLayM&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; real*8 &amp;nbsp;:: SMy,Rs,ki2,time,E ! NLayM moved to the integer list: it indexes Sigma() and bounds the I2 loop&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; real*8 &amp;nbsp;:: Sigma(30),Thick(30), Timebegin,TimeEnd,Val,Kn2&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; real*8 &amp;nbsp;:: Nom,Denom&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; real*8 &amp;nbsp;:: Exparg&lt;BR /&gt;
	!DIR$ ATTRIBUTES ALIGN : 64 :: &amp;nbsp;Thick&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; if (iUseVectorization) then&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; print *,'Vectorized:'&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; else&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; print *,'Unvectorized:' &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end if &amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; NLayM=30&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; Sigma(:)=0.1&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; Thick(:)=2.5&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; ijMinCalc=-55&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; ijMaxCalc=16&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; ijMinLoop=1&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; ijMaxLoop=ijMaxCalc-ijmincalc+1&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; ! Variables&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; TimeBegin=omp_get_wtime()&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; !Loop over models&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; NoModels=iNoModels&lt;BR /&gt;
	!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(iUseVectorization,NoModels,NLayM,Thick,Sigma,ijmincalc,ijmaxcalc,ijMinLoop,ijMaxLoop)&lt;BR /&gt;
	!$OMP DO&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; do Models=1,NoModels&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; do t=1,31&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; time=log(2d0)/(1e-6*10**((t-1d0)/10d0)) &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; E=10**0.1&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; if (.not. iUseVectorization) then ! "==" is not valid between logicals; use .not. or .eqv. &amp;nbsp;&lt;BR /&gt;
	!DIR$ NOVECTOR &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do I = ijmincalc,ijmaxcalc&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Val = E**(I)*0.1d0&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; lambdaAll2(I) = Val*Val&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; enddo&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; !Loop over frequencies - 16&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do k=1,16&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; SMy=4*3.14e-7*time &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ! start from the lowest layer&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; kn2 = Smy*real(sigma(NLayM))&lt;BR /&gt;
	!DIR$ NOVECTOR&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do J=ijmincalc,ijmaxcalc&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Un(J) = sqrt(LambdaAll2(J)+kn2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fi(J) = 0&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; enddo &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1, going upward&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; rs = SMy*(sigma(I2)-sigma(I2+1))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ki2 = Smy*real(sigma(I2))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;!DIR$ NOVECTOR &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do J=ijmincalc,ijmaxcalc&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; !The critical loop is here!&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Ui(J) = sqrt(LambdaAll2(J)+ki2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Hi(J) = Ui(J)+Un(J)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Ri(J) = rs/(Hi(J)*Hi(J))&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; exparg = -2.d0*ui(j)*Thick(I2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Ei(J) = exp(exparg)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; nom = (Ei(J)*(Ri(J)+Fi(J)))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; denom = (1.d0+Ri(J)*Fi(J))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fi(J) = nom/denom&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Un(J) = Ui(J) &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; end do &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; end do &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; enddo&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; else&lt;BR /&gt;
	!DIR$ VECTOR ALIGNED &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do I = ijmincalc,ijmaxcalc&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Val = E**(I)*0.1d0&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; lambdaAll2(I) = Val*Val&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; enddo&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; !Loop over frequencies - 16&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do k=1,16&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; SMy=4*3.14e-7*time &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ! start from the lowest layer&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; kn2 = Smy*real(sigma(NLayM))&lt;BR /&gt;
	!DIR$ VECTOR ALIGNED &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do J=ijmincalc,ijmaxcalc&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Un(J) = sqrt(LambdaAll2(J)+kn2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fi(J) = 0&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; enddo &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1, going upward&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; rs = SMy*(sigma(I2)-sigma(I2+1))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ki2 = Smy*real(sigma(I2))&lt;/P&gt;

&lt;P&gt;!DIR$ VECTOR ALIGNED &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do J=ijmincalc,ijmaxcalc&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; !The critical loop is here!&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Ui(J) = sqrt(LambdaAll2(J)+ki2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Hi(J) = Ui(J)+Un(J)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Ri(J) = rs/(Hi(J)*Hi(J))&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; exparg = -2.d0*ui(j)*Thick(I2)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Ei(J) = exp(exparg)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; nom = (Ei(J)*(Ri(J)+Fi(J)))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; denom = (1.d0+Ri(J)*Fi(J))&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fi(J) = nom/denom&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Un(J) = Ui(J) &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; end do &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; end do &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; enddo&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; end if&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; end do&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end do&lt;BR /&gt;
	!$OMP END DO&lt;BR /&gt;
	!$OMP END PARALLEL &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; TimeEnd=omp_get_wtime()&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; print *,'Models/s=',NoModels*1d0/(TimeEnd-TimeBegin)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end subroutine DoComputations&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end module mComputations &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; program kernelopt&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; use mComputations&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; implicit none&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; integer :: i,j&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; real :: Depth(50),Values(50)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; logical :: UseVectorization&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; !!set environment variable KMP_AFFINITY=scatter before running&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; !Perform the same calculation with and without vectorization&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; print *,'CPU benchmark'&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; call omp_set_num_threads(8)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; call DoComputations(8*4000,.true.)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; call DoComputations(8*1000,.false.)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; !stop&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; print *,'Xeon phi benchmark'&amp;nbsp;&lt;BR /&gt;
	!DIR$ OFFLOAD BEGIN TARGET(mic:0)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; call omp_set_num_threads(224)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; call DoComputations(224*800,.true.)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; call DoComputations(224*100,.False.)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	!DIR$ END OFFLOAD&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end program kernelopt&lt;/P&gt;

</description>
      <pubDate>Thu, 09 Apr 2015 14:48:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017231#M3972</guid>
      <dc:creator>PKM</dc:creator>
      <dc:date>2015-04-09T14:48:20Z</dc:date>
    </item>
    <item>
      <title>Among possible reasons, if by</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017232#M3973</link>
      <description>&lt;P&gt;Among possible reasons, if by ideal you mean that AVX might have twice the parallelism of SSE3 on an "ideal" CPU:&lt;/P&gt;

&lt;P&gt;&amp;nbsp; simd sqrt and divide have to be split by hardware into 128-bit chunks for Ivy Bridge architecture, so AVX may not be faster than SSE3.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; If your critical loops aren't blocked for L1 cache, you run into the 128-bit width limitation on L2 access on Ivy Bridge.&lt;/P&gt;

&lt;P&gt;But, you have put NO VECTOR on what you call the critical loop, so you can't expect speedup over SSE3 there.&lt;/P&gt;

&lt;P&gt;The iterative (throughput-oriented) /Qprec-div- and /Qprec-sqrt- replacements for the SIMD divide and sqrt instructions attempt to overcome some of the limitations of the 128-bit chunking of the IEEE instructions, but will not help much, if at all, with latency.&amp;nbsp; You can experiment with the /Qimf-accuracy options to reduce the number of iterations (and the accuracy) and thereby improve latency.&lt;/P&gt;
      <pubDate>Thu, 09 Apr 2015 15:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017232#M3973</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-04-09T15:04:00Z</dc:date>
    </item>
    <item>
      <title>Thanks for your reply -</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017233#M3974</link>
      <description>&lt;P style="font-size: 12px;"&gt;&lt;STRONG&gt;Thanks for your reply - please see comments in bold below :-)&lt;/STRONG&gt;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&lt;SPAN style="line-height: 1.5;"&gt;Among possible reasons, if by ideal you mean that AVX might have twice the parallelism of SSE3 on an "ideal" CPU:&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&amp;nbsp; sqrt and divide have to be split by hardware into 128-bit chunks for Ivy Bridge architecture, so AVX may not be faster than SSE3.&amp;nbsp;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&lt;STRONG&gt;I have also tried compiling it for AVX2 and run it on a Haswell I have at home - same result?&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;if your critical loops aren't blocked for L1 cache you have a 128-bit width limitation on L2 access on Ivy Bridge.&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&lt;STRONG&gt;The code is constantly operating on the same 10 kilobyte of data or so per thread, so wouldn't that work right out of the box?&lt;/STRONG&gt;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;But, you have put NO VECTOR on what you call the critical loop, so you can't expect speedup over SSE3 there.&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&lt;STRONG&gt;There is an IF statement seperating the kernel of the code in two identical paths depending on the value of input &lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;iUseVectorization&lt;/SPAN&gt;&amp;nbsp;- one uses NO VECTOR for all loops, the other uses VECTOR ALIGNED.&lt;/STRONG&gt;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="font-size: 12px;"&gt;&lt;STRONG&gt;Any comments on the performance I am seeing on the Xeon Phi? 3x faster than SSE3 on an 8 core Ivy Bridge seems very low to me ...?&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Apr 2015 15:21:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017233#M3974</guid>
      <dc:creator>PKM</dc:creator>
      <dc:date>2015-04-09T15:21:57Z</dc:date>
    </item>
    <item>
      <title>Square roots, exponentials,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017234#M3975</link>
      <description>&lt;P&gt;Square roots, exponentials, and divides are all relatively expensive and have fairly complex performance characteristics --- especially if full precision is required.&lt;/P&gt;

&lt;P&gt;As Tim Prince noted, the 256-bit AVX divide instruction provides the same throughput as the 128-bit SSE2 divide instruction on Sandy Bridge, Ivy Bridge, and Haswell.&amp;nbsp; This is unchanged in AVX2.&amp;nbsp; Ivy Bridge included an improvement in FP divide throughput over Sandy Bridge, and that improvement is carried forward in Haswell, but on each of those platforms the 256-bit divide instruction provides no improvement in throughput relative to using 128-bit divide instructions.&lt;/P&gt;

&lt;P&gt;Another feature that may hurt AVX/AVX2 performance on Haswell is the lack of support for FP Addition on Port 0.&amp;nbsp;&amp;nbsp; This means that Sandy Bridge, Ivy Bridge, and Haswell all have the same limit of one 256-bit (4x64) AVX FP Add instruction every cycle.&amp;nbsp;&amp;nbsp; Haswell can issue 2 FP multiplies per cycle or 2 FMAs, but for Adds the performance is unchanged.&amp;nbsp;&amp;nbsp; Haswell can produce the effect of an FP Add on Port 0 using an FMA instruction (with one of the arguments to the multiplication set to 1.0), but this comes at the cost of both an extra register and increased latency -- both of which are often critical in the evaluation of complex functions.&lt;/P&gt;

&lt;P&gt;The 3x speedup of the Xeon Phi over the Xeon E5-2650 v2 seems excellent.&amp;nbsp; The Xeon E5-2650 v2 has a peak AVX performance of 198.4 GFLOPS at the maximum all-core Turbo frequency of 3.1 GHz, while the Xeon Phi 3120 has a peak performance of 1003.2 GFLOPS.&amp;nbsp;&amp;nbsp; But the Xeon Phi gets 1/2 of its peak performance from FMA instructions, and if the algorithms cannot use these, the 5:1 performance ratio for peak FLOPS drops to 2.5:1 if you are only computing Adds, Multiplies, or combinations of Adds and Multiplies that can't fit into the FMA instruction format(s).&amp;nbsp;&amp;nbsp; Since you are doing better than 2.5:1, this seems like a good result, not a disappointing one.&lt;/P&gt;
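
&lt;P&gt;The peak numbers quoted above can be checked with a little arithmetic. The sketch below assumes 8 cores at 3.1 GHz with 8 DP FLOP/cycle (AVX add + multiply) for the E5-2650 v2, and 57 cores at 1.1 GHz with 16 DP FLOP/cycle (8-wide DP FMA) for the Xeon Phi 3120 - consult the spec sheets to confirm these parameters for your exact SKUs:&lt;/P&gt;

```python
# Peak double-precision FLOPS estimates (core counts and clocks are
# assumptions taken from the discussion above; verify against spec sheets).
host_cores, host_ghz = 8, 3.1       # Xeon E5-2650 v2, max all-core Turbo
host_flop_per_cycle = 8             # AVX: 4-wide DP add + 4-wide DP mul per cycle
host_peak = host_cores * host_ghz * host_flop_per_cycle   # 198.4 GFLOPS

phi_cores, phi_ghz = 57, 1.1        # Xeon Phi 3120
phi_flop_per_cycle = 16             # 8-wide DP FMA counts as 16 FLOP/cycle
phi_peak = phi_cores * phi_ghz * phi_flop_per_cycle       # 1003.2 GFLOPS

print(phi_peak / host_peak)         # roughly 5x with FMA
print((phi_peak / 2) / host_peak)   # roughly 2.5x without FMA
```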

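&lt;P&gt;One way to tease these effects apart is to time each operation class on its own. A minimal host-side sketch (in Python with NumPy, which is assumed to be installed; purely illustrative - the real comparison would of course use the Fortran kernel and the Intel compiler):&lt;/P&gt;

```python
# Rough per-function throughput sketch: time vectorized sqrt, exp and
# divide separately over an array small enough to stay in L1 cache.
import time
import numpy as np

n = 1024
a = np.full(n, 1.2345)
reps = 20000

for name, fn in [("sqrt", np.sqrt), ("exp", np.exp),
                 ("divide", lambda x: 1.0 / x)]:
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(a)
    dt = time.perf_counter() - t0
    print(name, n * reps / dt / 1e9, "Gop/s")
```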
&lt;P&gt;If you want a more detailed understanding, you should simplify the benchmark to look at the performance of square roots, exponentials, and divides independently.&amp;nbsp; You would definitely want to include the Haswell platform in the comparison, since it has FMA support.&amp;nbsp;&amp;nbsp; Since the Xeon Phi has no hardware support for any of these operations, you would definitely need to follow Tim Prince's advice and experiment with the precision controls on the Xeon Phi.&amp;nbsp;&amp;nbsp; If I recall correctly, Xeon Phi provides slightly lower precision by default, but there are many options that may change the details of the comparison.&lt;/P&gt;
      <pubDate>Thu, 09 Apr 2015 18:40:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017234#M3975</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-04-09T18:40:06Z</dc:date>
    </item>
    <item>
      <title>Thank you very much for your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017235#M3976</link>
      <description>&lt;P&gt;Thank you very much for your feedback ...&lt;/P&gt;

&lt;P&gt;I tried playing around with&amp;nbsp;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;/Qimf-accuracy, but the effects on performance are very limited. Given that Knights Landing is not so far into the future, I think I will spend time doing additional optimization once that platform arrives. Based on the publicly available information on Knights Landing I would expect a compute-bound code like this to scale with the increase in peak performance over the current MIC generation, i.e. a ballpark performance increase of 3x. Do you see anything in the already announced information that should make me lower my expectations?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Best regards,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;C&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Apr 2015 09:31:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Ideal-vectorization-speed-up-with-SSE2-and-MIC512-not-AVX/m-p/1017235#M3976</guid>
      <dc:creator>PKM</dc:creator>
      <dc:date>2015-04-10T09:31:14Z</dc:date>
    </item>
  </channel>
</rss>

