<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic But if you write const _ in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941128#M3906</link>
    <description>&lt;P&gt;But if you write const __m256i var = something; isn't the compiler free to align/order that value properly in the read-only data segment?&lt;BR /&gt;Why would it ever need to use unaligned loads then, when it can guarantee that the data will be properly aligned even without explicitly specifying __declspec(align(32))?&lt;/P&gt;
&lt;P&gt;By the way, specifying alignment on a __m256i variable doesn't force aligned loads in 13.1 update 5.&lt;/P&gt;</description>
    <pubDate>Mon, 08 Jul 2013 23:11:48 GMT</pubDate>
    <dc:creator>levicki</dc:creator>
    <dc:date>2013-07-08T23:11:48Z</dc:date>
    <item>
      <title>How to extract DWORD from upper half of 256-bit register?</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941110#M3888</link>
      <description>&lt;P&gt;Congratulations to Intel CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?&lt;/P&gt;
&lt;P&gt;Any ideas/tricks to work around this engineering "oversight"?&lt;/P&gt;
</description>
      <pubDate>Tue, 02 Jul 2013 11:51:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941110#M3888</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-02T11:51:38Z</dc:date>
    </item>
    <item>
      <title>Igor, There are many</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941111#M3889</link>
      <description>Igor, there are many intrinsic functions for extraction in the &lt;STRONG&gt;immintrin.h&lt;/STRONG&gt; header file (search for every place where the word 'extract' is used). If the instruction you expected to see is missing, why not apply a workaround and use what is available now?

As I understand it, you need to extract signed or unsigned 32-bit values from the &lt;STRONG&gt;__m256i&lt;/STRONG&gt; union:

...
typedef union  _MMINTRIN_TYPE(32) &lt;STRONG&gt;__m256i&lt;/STRONG&gt; {
#if !defined(_MSC_VER)
    /*
     * To support GNU compatible initialization with initializers list,
     * make first union member to be of int64 type.
     */
    __int64             m256i_gcc_compatibility[4];
#endif
    __int8              m256i_i8[32];
    __int16             m256i_i16[16];
    &lt;STRONG&gt;__int32             m256i_i32[8]&lt;/STRONG&gt;;
    __int64             m256i_i64[4];
    unsigned __int8     m256i_u8[32];
    unsigned __int16    m256i_u16[16];
    &lt;STRONG&gt;unsigned __int32    m256i_u32[8]&lt;/STRONG&gt;;
    unsigned __int64    m256i_u64[4];
} __m256i;
...

Is that correct?</description>
      <pubDate>Wed, 03 Jul 2013 14:51:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941111#M3889</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-07-03T14:51:00Z</dc:date>
    </item>
    <item>
      <title>Sergey,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941112#M3890</link>
      <description>&lt;P&gt;Sergey,&lt;/P&gt;
&lt;P&gt;What I want is to extract arbitrary DWORD from say YMM0 register. For XMM0 register, the instruction for extracting DWORD 3 is PEXTRD eax, XMM0, 3 while there is no such instruction to extract DWORD 7 from YMM0.&lt;/P&gt;
&lt;P&gt;Yes, I could use intrinsics, write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] but that does not translate to a single assembler instruction. You can understand my post as a complaint about non-orthogonality of AVX2 extensions.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jul 2013 15:37:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941112#M3890</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-03T15:37:40Z</dc:date>
    </item>
    <item>
      <title>Hi Igor,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941113#M3891</link>
      <description>Hi Igor,

&amp;gt;&amp;gt;...What I want is to extract arbitrary DWORD from say YMM0 register. For XMM0 register, the instruction for extracting
&amp;gt;&amp;gt;DWORD 3 is PEXTRD eax, XMM0, 3 while there is no such instruction to extract DWORD 7 from YMM0.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;Yes, I could use intrinsics, write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] but
&amp;gt;&amp;gt;&lt;STRONG&gt;that does not translate to a single assembler instruction&lt;/STRONG&gt;. You can understand my post as a complaint about
&amp;gt;&amp;gt;non-orthogonality of AVX2 extensions.

Thanks for the clarification. I'll take a look at the Instruction Set Manual; I'm surprised that such an extraction is not available.</description>
      <pubDate>Wed, 03 Jul 2013 23:06:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941113#M3891</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-07-03T23:06:03Z</dc:date>
    </item>
    <item>
      <title>You will notice that is not</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941114#M3892</link>
      <description>&lt;P&gt;You will notice that it is not the only missing instruction.&lt;/P&gt;
&lt;P&gt;The whole AVX business reminds me of extending AX to EAX -- you get access to 32 bits (EAX), 16 bits (AX), but there is no cheap access to the upper 16-bit register half except through shifts and masks. Same with AVX, just instead of 32 and 16 it is 256 and 128.&lt;/P&gt;
&lt;P&gt;Another part where they did not make the instruction set orthogonal is parallel bit shift -- it does not exist for words and bytes, which in my opinion would be the most common use cases.&lt;/P&gt;
&lt;P&gt;The final part of my complaint is that if they had already decided not to implement VPEXTRD eax, ymm0, 7, they could at least have documented the fastest alternative with 2 or 3 instructions instead of having all of us guess and test.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jul 2013 23:34:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941114#M3892</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-03T23:34:00Z</dc:date>
    </item>
    <item>
      <title>What about these two</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941115#M3893</link>
      <description>What about these two intrinsic functions?

&lt;STRONG&gt;[ immintrin.h ( Intel version ) ]&lt;/STRONG&gt;
...
extern __m128i __ICL_INTRINCC &lt;STRONG&gt;_mm256_extractf128_si256&lt;/STRONG&gt;( __m256i, const int );
...
extern __m128i __ICL_INTRINCC &lt;STRONG&gt;_mm256_extracti128_si256&lt;/STRONG&gt;( __m256i, const int );
...

I think they are almost what you need, but they still don't return a DWORD type.

&lt;STRONG&gt;Note:&lt;/STRONG&gt; Microsoft's version of &lt;STRONG&gt;immintrin.h&lt;/STRONG&gt; doesn't have a declaration for the second function, that is, &lt;STRONG&gt;_mm256_extracti128_si256&lt;/STRONG&gt;.</description>
      <pubDate>Thu, 04 Jul 2013 00:30:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941115#M3893</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-07-04T00:30:12Z</dc:date>
    </item>
    <item>
      <title>Quote:Igor Levicki wrote</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941116#M3894</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;why PEXTRD/PINSRD (among many others) were not promoted to 256 bits in AVX2?&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;to be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior for both 128-bit lanes with no cross-lane dependency), a 256-bit VPEXTRD would have to return 2 results in two destination GPRs, which isn't possible with VEX encoding&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;Any ideas/tricks to work around this&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;extracts: depending on your use case a single VPERMD will do the trick (with&amp;nbsp;proper indices&amp;nbsp;in a register initialized out of your critical loop), you'll have your result in the low double word of the destination YMM, if you really need the result in a GPR the fastest sequence AFAIK is&amp;nbsp;VEXTRACTI128 followed by VPEXTRD&lt;/P&gt;
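&lt;P&gt;A minimal intrinsics sketch of the two extract sequences described above. The function names and the array-based interface are illustrative assumptions, not something from this thread, and the target attribute is a GCC/Clang convention used here so the sketch builds without a command-line switch. The compiler is expected, though not guaranteed, to emit VEXTRACTI128 + VPEXTRD for the first function and VPERMD + VMOVD for the second:&lt;/P&gt;
<![CDATA[
```c
#include <immintrin.h>
#include <stdint.h>

/* Fixed index: DWORD 7 of the 256-bit value is element 3 of the upper
   128-bit lane, so pull out the upper lane and then extract from it.
   (Hypothetical helper; the array argument stands in for a YMM register.) */
__attribute__((target("avx2")))
uint32_t extract_dword7(const uint32_t src[8])
{
    __m256i v  = _mm256_loadu_si256((const __m256i *)src);
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* upper 128-bit lane */
    return (uint32_t)_mm_extract_epi32(hi, 3);     /* element 3 of that lane */
}

/* Runtime index: VPERMD copies element idx (low 3 bits are used) into
   every slot of the result; the low DWORD is then moved to a GPR. */
__attribute__((target("avx2")))
uint32_t extract_dword_var(const uint32_t src[8], int idx)
{
    __m256i v    = _mm256_loadu_si256((const __m256i *)src);
    __m256i sel  = _mm256_set1_epi32(idx);               /* index vector */
    __m256i perm = _mm256_permutevar8x32_epi32(v, sel);  /* VPERMD */
    return (uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(perm));
}
```
]]>
&lt;P&gt;In a real loop the index vector would be built once outside the loop, as suggested above; building it inside the helper only keeps the sketch short.&lt;/P&gt;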
&lt;P&gt;inserts: for your&amp;nbsp;insertions from a GPR&amp;nbsp;I suggest to use a VPINSRD, VINSERTI128 sequence&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2013 05:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941116#M3894</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-07-04T05:05:00Z</dc:date>
    </item>
    <item>
      <title>Quote:bronxzv wrote:to be</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941117#M3895</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;to be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior for both 128-bit lanes with no cross-lane dependency), a 256-bit VPEXTRD would have to return 2 results in two destination GPRs, which isn't possible with VEX encoding&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;But I disagree!&lt;/P&gt;
&lt;P&gt;While for other instructions doing the same thing in the lower and upper lane is essential, INSERT/EXTRACT instructions are a different thing altogether -- they should not be promoted in the same way. Their purpose is &lt;STRONG&gt;scalar access to vector elements&lt;/STRONG&gt;, not parallel processing, so they should just be extended to allow access to all elements.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;extracts: depending on your use case a single VPERMD will do the trick (with&amp;nbsp;proper indices&amp;nbsp;in a register initialized out of your critical loop), you'll have your result in the low double word of the destination YMM, if you really need the result in a GPR the fastest sequence AFAIK is&amp;nbsp;VEXTRACTI128 followed by VPEXTRD&lt;P&gt;&lt;/P&gt;
&lt;P&gt;inserts: for your&amp;nbsp;insertions from a GPR&amp;nbsp;I suggest to use a VPINSRD, VINSERTI128 sequence&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
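&lt;P&gt;For reference, the quoted insert sequence might look like this with intrinsics. This is a sketch under assumptions: the helper name and the in-memory interface are hypothetical, the target attribute is a GCC/Clang convention, and the extra VEXTRACTI128 is only there because the sketch starts from a complete 256-bit value rather than a lane that is already at hand:&lt;/P&gt;
<![CDATA[
```c
#include <immintrin.h>
#include <stdint.h>

/* Replace DWORD 7 (element 3 of the upper 128-bit lane) with value.
   (Hypothetical helper; the array stands in for a YMM register.) */
__attribute__((target("avx2")))
void insert_dword7(uint32_t dst[8], uint32_t value)
{
    __m256i v  = _mm256_loadu_si256((const __m256i *)dst);
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* fetch the upper lane */
    hi = _mm_insert_epi32(hi, (int)value, 3);      /* VPINSRD into the lane */
    v  = _mm256_inserti128_si256(v, hi, 1);        /* VINSERTI128 puts it back */
    _mm256_storeu_si256((__m256i *)dst, v);
}
```
]]>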
&lt;P&gt;Yes, I figured that out but still it would be better if the set was made orthogonal to begin with. I see no good reason not to expand PEXTRD/PINSRD to allow indices from 4 to 7.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2013 08:31:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941117#M3895</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-04T08:31:06Z</dc:date>
    </item>
    <item>
      <title>Quote:Igor Levicki wrote:But</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941118#M3896</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;But I disagree!&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;the choice was probably made to simplify the hardware design rather than for programmer convenience; one can also argue that pack/unpack isn't convenient the way it was expanded to 256 bits, or that the 128-bit shifts weren't promoted to 256-bit shifts, which isn't "orthogonal" either&lt;/P&gt;
&lt;P&gt;all in all I'll say that VPERMD is more convenient than legacy extracts since the element index can be set dynamically (ymm idx register) instead&amp;nbsp;of statically (immediate value), it is incredibly useful for a lot of other use cases, I found a new use for it yesterday for example: dynamically specified broadcast, unlike native broadcast where the low element is replicated you can specify the index of the element to be replicated&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2013 09:21:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941118#M3896</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-07-04T09:21:00Z</dc:date>
    </item>
    <item>
      <title>I wonder... did you manage to</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941119#M3897</link>
      <description>&lt;P&gt;I wonder... did you manage to get theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?&lt;/P&gt;
&lt;P&gt;I am seeing ~33% so far, this may well be caused by the "simplified hardware design" you mention.&lt;/P&gt;
</description>
      <pubDate>Thu, 04 Jul 2013 09:39:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941119#M3897</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-04T09:39:05Z</dc:date>
    </item>
    <item>
      <title>Quote:Igor Levicki wrote:I</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941120#M3898</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;I wonder... did you manage to get theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;actually the max theoretical speedup is 2x i.e. 100% (even more with new instructions like VPERMD)&amp;nbsp;but I have no single&amp;nbsp;test with only integer instructions so I can't&amp;nbsp;report any real world values for integer only, the best speedup I measured with production code is 1.82x (82%) for mixed int and fp when comparing a SSE2 path with an AVX2 path (incl. FMA), note that this is for a single kernel with high L1D cache locality, not a full application&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;I am seeing ~33% so far, this may well be caused by the "simplified hardware design" you mention.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;my&amp;nbsp;"simplified design" remark was for the two fully distinct&amp;nbsp;execution stacks with duplicated 128-bit execution units, it has nothing to do with any&amp;nbsp;throughput limitation, your deceptive speedup may be due to incomplete vectorization (hint: you mentioned scalar inserts/extracts as important for you so&amp;nbsp;I suppose&amp;nbsp;they are used&amp;nbsp;in some of your hotspots) or L2$/LLC$/memory bandwidth limitation (or both)&lt;/P&gt;
&lt;P&gt;if you want better optimization advices I'll suggest to post&amp;nbsp;code snippets of your hotspots&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2013 10:14:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941120#M3898</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-07-04T10:14:00Z</dc:date>
    </item>
    <item>
      <title>When I said 50% I actually</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941121#M3899</link>
      <description>&lt;P&gt;When I said 50% I actually meant 50% shorter execution time which would translate into 2x speedup. Sorry for confusion.&lt;/P&gt;
&lt;P&gt;Attached is the code with simple test driver. My results are:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; test_C : 6345.035 ms&lt;BR /&gt;test_SSE4.1 : 3944.771 ms&lt;BR /&gt;&amp;nbsp; test_AVX2 : 2190.420 ms&lt;/P&gt;
&lt;P&gt;Difference is 1.80x here too, but that difference gets smaller (1.51x) if you change pragma for SSE4.1 function and let compiler generate 128-bit SSE with VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- compiler uses vpbroadcastb which is not in SSE4.1 set. I didn't bother to check whether speedup is due to vpbroadcastb use or due to VEX+3op but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify arch compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code causing severe performance penalty by state transitions.&lt;/P&gt;
&lt;P&gt;The CPI for test_AVX2() is 0.345 out of theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.&lt;/P&gt;
&lt;P&gt;Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned -- it still uses vmovdqu. I think I will just go back to using pure assembler and living with the nightmare of maintaining two versions of ASM code for 32-bit and 64-bit rather than letting the compiler do whatever it wants with intrinsics.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2013 11:19:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941121#M3899</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-04T11:19:58Z</dc:date>
    </item>
    <item>
      <title>Quote:Igor Levicki wrote:When</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941122#M3900</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;When I said 50% I actually meant 50% shorter execution time which would translate into 2x speedup. Sorry for confusion.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;so the 33% you were mentioning stands for a x1.49 speedup as per this definition &lt;A href="http://en.wikipedia.org/wiki/Speedup"&gt;http://en.wikipedia.org/wiki/Speedup&lt;/A&gt;&amp;nbsp;this looks pretty good already&lt;/P&gt;
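&lt;P&gt;The conversion used here is plain arithmetic: a fractional reduction r in execution time corresponds to a speedup of 1 / (1 - r), and two measured times give a speedup of t_old / t_new. A small C sketch, with hypothetical helper names, to make the numbers in this thread easy to re-check:&lt;/P&gt;

```c
/* Speedup implied by a fractional reduction in execution time,
   e.g. 0.33 (33% shorter) gives about 1.49, and 0.50 gives exactly 2.0. */
double speedup_from_time_reduction(double reduction)
{
    return 1.0 / (1.0 - reduction);
}

/* Speedup computed directly from two measured run times. */
double speedup_from_times(double t_old_ms, double t_new_ms)
{
    return t_old_ms / t_new_ms;
}
```

&lt;P&gt;Fed with the posted timings, speedup_from_times(3944.771, 2190.420) comes out at roughly 1.80, which matches the SSE4.1 to AVX2 figure quoted in this thread.&lt;/P&gt;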
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;Attached is the code with simple test driver. My results are:&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; test_C : 6345.035 ms&lt;BR /&gt;test_SSE4.1 : 3944.771 ms&lt;BR /&gt;&amp;nbsp; test_AVX2 : 2190.420 ms&lt;/P&gt;
&lt;P&gt;Difference is 1.80x here too,&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;a 1.80x speedup looks very good to me, there is maybe not much room for improvement, probably nothing obvious I suppose&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;but that difference gets smaller (1.51x) if you change pragma for SSE4.1 function and let compiler generate 128-bit SSE with VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- compiler uses vpbroadcastb which is not in SSE4.1 set. I didn't bother to check whether speedup is due to vpbroadcastb use or due to VEX+3op but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify arch compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code causing severe performance penalty by state transitions.&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The CPI for test_AVX2() is 0.345 out of theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;this CPI looks indeed very good, so I suppose your optimizations are already well done&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt; Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned -- it still uses vmovdqu.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;because the encoding is more compact AFAIK (so potentially slightly fewer uop-cache/icache misses in a big application); apart from second-order effects like icache misses, vmovdqu is exactly as fast as vmovdqa; note that it is the same for fp code, where vmovups is preferred (by the Intel compiler) over vmovaps&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2013 12:07:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941122#M3900</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-07-04T12:07:00Z</dc:date>
    </item>
    <item>
      <title>&gt; Finally, I don't understand</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941123#M3901</link>
      <description>&lt;P&gt;&amp;gt; Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned&lt;/P&gt;
&lt;P&gt;AFAIK, on Sandy Bridge and later CPUs, movdqa and movdqu perform the same when memory is aligned. See the Architecture Optimization Manual, Table C-12a. The same holds for vmovdqa and vmovdqu; note that vmovdqa still faults on unaligned memory, whereas the memory operands of most other VEX-encoded instructions no longer require alignment. I think I even saw a recommendation to always use vmovdqu somewhere, but I can't remember the document now.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jul 2013 07:55:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941123#M3901</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2013-07-08T07:55:24Z</dc:date>
    </item>
    <item>
      <title>Quote:andysem wrote:AFAIK, in</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941124#M3902</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;andysem wrote:&lt;BR /&gt;AFAIK, on Sandy Bridge and later CPUs, movdqa and movdqu perform the same when memory is aligned. See the Architecture Optimization Manual, Table C-12a. The same holds for vmovdqa and vmovdqu; note that vmovdqa still faults on unaligned memory, whereas the memory operands of most other VEX-encoded instructions no longer require alignment. I think I even saw a recommendation to always use vmovdqu somewhere, but I can't remember the document now.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Well, 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jul 2013 09:02:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941124#M3902</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-08T09:02:20Z</dc:date>
    </item>
    <item>
      <title>There's no un-aligned penalty</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941125#M3903</link>
      <description>&lt;P&gt;There's no unaligned penalty on SB, IB, and HW (Sandy Bridge, Ivy Bridge, and Haswell) for 128-bit loads, so long as you're within the same cacheline. When a memory access spans a cacheline or a page boundary, you take a significant latency hit of ~5 and ~28 clks, respectively, on that load. So as long as you don't span cachelines or pages, your loads, whether aligned or unaligned in SSE/AVX, will not take longer.&lt;/P&gt;
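&lt;P&gt;Whether a given load crosses such a boundary can be checked with simple address arithmetic. The helper below is a hypothetical sketch, not something from this thread; 64-byte cachelines and 4096-byte pages are assumed values that the caller passes in, and size must be at least 1:&lt;/P&gt;
<![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte range [addr, addr + size) touches more than one
   block of `boundary` bytes (64 for an assumed cacheline, 4096 for an
   assumed page), 0 otherwise. Requires size >= 1. */
int crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    uintptr_t first_block = addr / boundary;
    uintptr_t last_block  = (addr + size - 1) / boundary;
    return first_block != last_block;
}
```
]]>
&lt;P&gt;For example, a 32-byte load at offset 48 within a 64-byte line spans two lines, while the same load at offset 32 does not.&lt;/P&gt;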
&lt;P&gt;Perfwise&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jul 2013 13:04:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941125#M3903</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2013-07-08T13:04:58Z</dc:date>
    </item>
    <item>
      <title>Quote:Igor Levicki wrote:Well</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941126#M3904</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Igor Levicki wrote:&lt;BR /&gt;Well, 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;as I posted above (sorry but my post was delayed by moderation for several days!) maybe the compiler uses unaligned moves because the encoding is more compact (to be verified)&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jul 2013 15:40:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941126#M3904</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-07-08T15:40:00Z</dc:date>
    </item>
    <item>
      <title>Just an update.. upon HW in</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941127#M3905</link>
      <description>&lt;P&gt;Just an update: on HW (Haswell), 256-bit loads that are not 32-byte aligned incur no alignment penalty when using VMOVUPS, but there is still a penalty for spanning a cacheline boundary or a page boundary.&lt;/P&gt;
&lt;P&gt;Perfwise&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jul 2013 22:57:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941127#M3905</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2013-07-08T22:57:08Z</dc:date>
    </item>
    <item>
      <title>But if you write const _</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941128#M3906</link>
      <description>&lt;P&gt;But if you write const __m256i var = something; isn't the compiler free to align/order that value properly in the read-only data segment?&lt;BR /&gt;Why would it ever need to use unaligned loads then, when it can guarantee that the data will be properly aligned even without explicitly specifying __declspec(align(32))?&lt;/P&gt;
&lt;P&gt;By the way, specifying alignment on a __m256i variable doesn't force aligned loads in 13.1 update 5.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jul 2013 23:11:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941128#M3906</guid>
      <dc:creator>levicki</dc:creator>
      <dc:date>2013-07-08T23:11:48Z</dc:date>
    </item>
    <item>
      <title>This is a short follow up on</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941129#M3907</link>
      <description>This is a short follow up on Igor's test results:

&amp;gt;&amp;gt;Attached is the code with simple test driver. My results are:
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;     test_C : 6345.035 ms
&amp;gt;&amp;gt;test_SSE4.1 : 3944.771 ms
&amp;gt;&amp;gt;  test_AVX2 : 2190.420 ms

&lt;STRONG&gt;Intel Core i7-3840QM ( 2.80 GHz )&lt;/STRONG&gt;
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846

&lt;STRONG&gt;[ 64-bit Windows 7 Professional / 64-bit test ]&lt;/STRONG&gt;

     test_C : 12904.534 ms
test_SSE4.1 :  6502.829 ms

&lt;STRONG&gt;[ 64-bit Windows 7 Professional / 32-bit test ]&lt;/STRONG&gt;

     test_C : 12423.721 ms
test_SSE4.1 :  7097.624 ms</description>
      <pubDate>Tue, 09 Jul 2013 17:39:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/How-to-extract-DWORD-from-upper-half-of-256-bit-register/m-p/941129#M3907</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-07-09T17:39:36Z</dc:date>
    </item>
  </channel>
</rss>

