
Moving/merging __m128d values to __m256d ones

Anastopoulos__Nikos

Hi, 

I have some code where at some point, after doing SSE3 computations with __m128d-typed values, I need to:

a) store a __m128d value into one of the two halves of a __m256d value (not cast it!)

b) paste two __m128d values side-by-side into a __m256d value 

Are there any AVX intrinsics to perform these operations? 

Thanks in advance, 

Nick

SergeyKostrov
Valued Contributor II
>>...a) store a __m128d value into one of the two halves of a __m256d value (not cast it!)

Do you want to broadcast the source values of the __m128d variable into a __m256d variable? If yes, the following intrinsic function needs to be used:

/*
 * Load with Broadcast
 * VBROADCASTF128 ymm1, m128
 * Load floating point values from the source operand and broadcast to all
 * elements of the destination
 */
extern __m256d __ICL_INTRINCC _mm256_broadcast_pd(__m128d const *);

If no, please explain how exactly you want to store the values. Thanks in advance.
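For reference, a minimal sketch of the broadcast case (the wrapper function and the values are illustrative assumptions, not part of the question):

#include <immintrin.h>

/* Broadcast one __m128d pair into both 128-bit lanes of a __m256d. */
__m256d broadcast_pair(void)
{
    __m128d pair = _mm_set_pd(2.0, 1.0);   /* { 1.0, 2.0 } */
    return _mm256_broadcast_pd(&pair);     /* { 1.0, 2.0, 1.0, 2.0 } */
}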
bronxzv
New Contributor II

n.anastop wrote:
Are there any AVX intrinsics to perform these operations? 

it is simply achieved using _mm256_insertf128_pd

 http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_insertf128_pd.htm

n.anastop wrote:
a) store a __m128d value into one of the two halves of a __m256d value (not cast it!)

__m256d v256; __m128d v128; // initialization missing

v256 = _mm256_insertf128_pd(v256,v128,0);  // insert in low 128-bit lane

v256 = _mm256_insertf128_pd(v256,v128,1);  // insert in high 128-bit lane

n.anastop wrote:
b) paste two __m128d values side-by-side into a __m256d value

 __m256d v256; __m128d v128Low,v128High; // initialization missing

v256 = _mm256_insertf128_pd(_mm256_castpd128_pd256(v128Low),v128High,1); // v128Low goes to the low 128-bit lane, v128High to the high one
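A complete version of case (b), with the initialization the snippets above leave out (the values are purely illustrative):

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    __m128d lo = _mm_set_pd(2.0, 1.0);   /* { 1.0, 2.0 } */
    __m128d hi = _mm_set_pd(4.0, 3.0);   /* { 3.0, 4.0 } */

    /* lo goes to the low 128-bit lane, hi to the high one */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);

    double out[4];
    _mm256_storeu_pd(out, v);            /* { 1.0, 2.0, 3.0, 4.0 } */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}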

 

important: if it's not possible to recompile your SSE3 code to VEX.128 AVX, you'll need to take care of the SSE to/from AVX transitions by adding the proper _mm256_zeroupper() intrinsics
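If you do end up mixing legacy (non-VEX) SSE code with AVX-256 code, a minimal sketch of where the _mm256_zeroupper() call goes (the kernel name and signature are assumptions for illustration):

#include <immintrin.h>

/* Hypothetical legacy SSE3 routine that cannot be recompiled with VEX encoding. */
extern __m128d legacy_sse3_kernel(__m128d a, __m128d b);

__m128d avx_then_sse(__m256d v, __m128d a)
{
    v = _mm256_add_pd(v, v);                 /* AVX-256 work dirties the upper halves of the YMM registers */
    __m128d lo = _mm256_castpd256_pd128(v);  /* keep the low half we still need */

    _mm256_zeroupper();                      /* clear the upper halves before any legacy SSE code executes */

    return legacy_sse3_kernel(lo, a);        /* no SSE<->AVX state-transition penalty now */
}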

Anastopoulos__Nikos

bronxzv wrote:
it is simply achieved using _mm256_insertf128_pd [...]
important: if it's not possible to recompile your SSE3 code to VEX.128 AVX, you'll need to take care of the SSE to/from AVX transitions by adding the proper _mm256_zeroupper() intrinsics

Thanks bronxzv! Your solution is exactly what I wanted.

Just one last question: when do the SSE <-> AVX transitions happen, and how can I avoid them? My code mixes SSE3 and AVX intrinsics, so I guess the ideal (in terms of performance) would be to run in AVX mode all the time. Is that possible somehow?

TimP
Honored Contributor III

With the Intel compilers, setting an AVX architecture target takes care of the SSE-to-AVX transitions by promoting SSE intrinsics to AVX-128 and inserting vzeroupper at function boundaries.  For other compilers, you may need, as bronxzv suggested, to add the _mm256_zeroupper() intrinsics explicitly where you shift from SSE to AVX, so as to break dependencies.  In my experience, the facilities that take care of it automatically have been satisfactory, so I haven't had to deal with the other side.

bronxzv
New Contributor II

n.anastop wrote:
My code uses mixed SSE3 and AVX intrinsics, so I guess the ideal (in terms of performance) would be to run always in AVX mode. Is that possible somehow?

Sure, that will be the best solution. If you have the SSE3 source code with intrinsics such as _mm_addsub_pd, simply recompile it for AVX-128; with the Intel C++ compiler, for example, just add the /QxAVX flag and compile. AVX-128 can in itself be a bit faster than SSE3 thanks to the 3-operand instructions, and you'll be able to call AVX-256 code without having to worry about the transitions.

Maybe later you'll be able to replace AVX-128 with AVX-256, using _mm256_addsub_pd and the like, step by step in order to minimize the risk of regressions.
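To illustrate the step-by-step widening, a sketch of an AVX-128 loop and its AVX-256 counterpart (the array names and the assumption that n is a multiple of 2 and 4, respectively, are mine):

#include <immintrin.h>

/* AVX-128: 2 doubles per iteration; with an AVX code-generation target the
   _mm_addsub_pd intrinsic is emitted as the VEX.128 form of vaddsubpd. */
void addsub_128(const double *a, const double *b, double *r, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d x = _mm_loadu_pd(a + i);
        __m128d y = _mm_loadu_pd(b + i);
        _mm_storeu_pd(r + i, _mm_addsub_pd(x, y));
    }
}

/* AVX-256: same computation, 4 doubles per iteration. */
void addsub_256(const double *a, const double *b, double *r, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d x = _mm256_loadu_pd(a + i);
        __m256d y = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(r + i, _mm256_addsub_pd(x, y));
    }
}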

McCalpinJohn
Honored Contributor III

I found that if the compiler "understands" what you are doing, it is happy to convert SSE intrinsics into AVX instructions.

That certainly surprised me when I tried to compare the performance of a sum reduction coded with SSE intrinsics against one coded with AVX intrinsics.  When I saw that the performance was the same, I looked at the assembly listing and discovered that the generated code was the same for the two cases.

I re-compiled the code with SSE intrinsics without the "-xAVX" option and got the SSE instructions I expected....

bronxzv
New Contributor II

John D. McCalpin wrote:
 That certainly surprised me when I tried to compare the performance of a sum reduction coded with SSE intrinsics against one coded with AVX intrinsics.

Do you mean you were using AVX-128 intrinsics? I wasn't aware those even existed in any compiler; which compiler are you using?

McCalpinJohn
Honored Contributor III

To "bronxzv":

My original code used SSE2 intrinsics to study the performance of the vector summation kernel with various numbers of partial sum variables (in pairs in SSE registers) and various prefetching strategies, while the new code used 256-bit AVX intrinsics so its partial sum variables were in groups of four in AVX registers.  (Then I did the whole exercise over again using the 512-bit MIC SIMD instruction set.)
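For readers following along, a sketch of a sum reduction with four partial sums kept in AVX registers (the unroll factor and the final horizontal reduction are my assumptions, not the exact kernel described above):

#include <immintrin.h>

double sum_avx(const double *a, long n)   /* n assumed to be a multiple of 16 */
{
    __m256d s0 = _mm256_setzero_pd(), s1 = _mm256_setzero_pd();
    __m256d s2 = _mm256_setzero_pd(), s3 = _mm256_setzero_pd();
    for (long i = 0; i < n; i += 16) {    /* four independent partial sums hide the add latency */
        s0 = _mm256_add_pd(s0, _mm256_loadu_pd(a + i));
        s1 = _mm256_add_pd(s1, _mm256_loadu_pd(a + i + 4));
        s2 = _mm256_add_pd(s2, _mm256_loadu_pd(a + i + 8));
        s3 = _mm256_add_pd(s3, _mm256_loadu_pd(a + i + 12));
    }
    /* horizontal reduction of the four accumulators */
    __m256d s = _mm256_add_pd(_mm256_add_pd(s0, s1), _mm256_add_pd(s2, s3));
    __m128d t = _mm_add_pd(_mm256_castpd256_pd128(s), _mm256_extractf128_pd(s, 1));
    t = _mm_add_pd(t, _mm_unpackhi_pd(t, t));
    return _mm_cvtsd_f64(t);
}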

To me this is just a reminder that the compiler intrinsics are not the same as assembly language, so for many of the experiments I started with a "base" assembly code file generated by the compiler and modified it by hand.  This was especially easy for fiddling with software prefetching.
One of these days I will get comfortable enough with inline assembly to try generating the entire kernel by that route, but right now I just don't have enough practice with that approach.

bronxzv
New Contributor II

John D. McCalpin wrote:
My original code used SSE2 intrinsics to study the performance of the vector summation kernel with various numbers of partial sum variables (in pairs in SSE registers) and various prefetching strategies, while the new code used 256-bit AVX intrinsics so its partial sum variables were in groups of four in AVX registers.

Yes, that makes sense, but when you compile the SSE2 code with the -xAVX flag it should generate AVX-128 instructions, unlike the 256-bit AVX intrinsics, which generate AVX-256 instructions; that's why I don't understand how you can get the same generated code and the same performance. Maybe your workloads are bandwidth constrained (*), so AVX-256 was not providing better performance and the generated code looked almost the same on both paths, just one with xmm registers and the other with ymm registers?

(*) btw, are you "Mr STREAM" McCalpin?

McCalpinJohn
Honored Contributor III

I originally thought that the compiler had taken my 128-bit SSE2 intrinsics and changed them to 256-bit AVX instructions, but I just checked and I was remembering this incorrectly.  The compiler changed my 128-bit SSE2 intrinsics to 128-bit AVX instructions, so there was not any semantic change.   Removing the "-xAVX" flag resulted in the generation of 128-bit SSE2 instructions as I originally intended.
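In other words, the intrinsic is unchanged and only the encoding of the generated instruction differs; a small sketch (the instruction names in the comments are the expected output under the stated flags, not captured compiler listings):

#include <emmintrin.h>

__m128d add2(__m128d a, __m128d b)
{
    return _mm_add_pd(a, b);   /* without -xAVX: addpd  xmm, xmm       (legacy SSE2 encoding) */
                               /* with    -xAVX: vaddpd xmm, xmm, xmm  (VEX-encoded AVX-128)  */
}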

The code is definitely bandwidth-limited, but the number of concurrent cache misses generated by the hardware depends on both the number of partial sums and on whether memory is accessed in one contiguous stream or via multiple streams (i.e., different 4KiB pages).   This is discussed in some detail in a series of five blog posts starting with:
http://blogs.utexas.edu/jdm4372/2010/11/03/optimizing-amd-opteron-memory-bandwidth-part-1-single-thread-read-only/

I have run these tests on Xeon 5600 (Westmere EP), Xeon E5 (Sandy Bridge EP), and Xeon Phi (MIC) processors, but have not written up the results yet.   I think I understand the Xeon Phi results now, so it is probably time to publish those reports.    One interesting number is my best case single-threaded read bandwidth on Xeon E5-2680 of over 17.3 GB/s (using Version 013), which is more than twice as fast as any of the results reported in the blog.

bronxzv
New Contributor II

John D. McCalpin wrote:
 The compiler changed my 128-bit SSE2 intrinsics to 128-bit AVX instructions, so there was not any semantic change.   Removing the "-xAVX" flag resulted in the generation of 128-bit SSE2 instructions as I originally intended.

thank you for the clarification

John D. McCalpin wrote:
 The code is definitely bandwidth-limited.

This explains why AVX-256 wasn't faster than AVX-128. I have even observed cases where AVX-256 is slower than AVX-128 for some workloads; there is a very simple example (bandwidth bound with a 3:1 load:store ratio) towards the end of this thread http://software.intel.com/en-us/forums/topic/277905 showing this effect: AVX-256 is nearly 2x faster than AVX-128 with a 100% L1D hit rate, then its performance advantage decreases with increasing L1D misses, and it's even slower than AVX-128 when the workload fits in the L2 cache!
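For context, a sketch of a bandwidth-bound kernel with a 3:1 load:store ratio in the spirit of that thread (the exact arithmetic and array names are assumptions):

#include <immintrin.h>

void triad3(const double *a, const double *b, const double *c, double *d, long n)
{
    for (long i = 0; i < n; i += 4) {         /* n assumed to be a multiple of 4 */
        __m256d x = _mm256_loadu_pd(a + i);   /* 3 loads ...                     */
        __m256d y = _mm256_loadu_pd(b + i);
        __m256d z = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(d + i, _mm256_add_pd(_mm256_mul_pd(x, y), z));   /* ... per 1 store */
    }
}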

John D. McCalpin wrote:
 , but the number of concurrent cache misses generated by the hardware depends on both the number of partial sums and on whether memory is accessed in one contiguous stream or via multiple streams (i.e., different 4KiB pages).   This is discussed in some detail in a series of five blog posts starting with:
http://blogs.utexas.edu/jdm4372/2010/11/03/optimizing-amd-opteron-memory...

 

I'll have a look, thanks

John D. McCalpin wrote:
I have run these tests on Xeon 5600 (Westmere EP), Xeon E5 (Sandy Bridge EP), and Xeon Phi (MIC) processors, but have not written up the results yet.   I think I understand the Xeon Phi results now, so it is probably time to publish those reports.    One interesting number is my best case single-threaded read bandwidth on Xeon E5-2680 of over 17.3 GB/s (using Version 013), which is more than twice as fast as any of the results reported in the blog.

I have no practical experience with Xeon Phi, but I'll be interested to read about your findings; I look forward to your reports.

 

 

Bernard
Valued Contributor I

>>>(*) btw, are you "Mr STREAM" McCalpin?>>>

I think that user jdmccalpin is Dr. McCalpin.

bronxzv
New Contributor II

iliyapolak wrote:
I think that user jdmccalpin is Dr. McCalpin.

sure, as confirmed by the link to his blog; also, he's best known as "Dr. Bandwidth", not "Mr STREAM" as I wrongly remembered

 

 

Bernard
Valued Contributor I

bronxzv wrote:
sure, as confirmed by the link to his blog; also, he's best known as "Dr. Bandwidth", not "Mr STREAM" [...]

That was not so hard.
