<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Alignment requirements for _mm256_maskload_pd in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996752#M4800</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Are there any alignment requirements (beyond 8 bytes) for _mm256_maskload_pd and likewise for &lt;SPAN class="sig"&gt;&lt;SPAN class="name"&gt;_mm256_maskstore_pd?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Mon, 11 May 2015 07:41:02 GMT</pubDate>
    <dc:creator>Stephen_G_1</dc:creator>
    <dc:date>2015-05-11T07:41:02Z</dc:date>
    <item>
      <title>Alignment requirements for _mm256_maskload_pd</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996752#M4800</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Are there any alignment requirements (beyond 8 bytes) for _mm256_maskload_pd and likewise for &lt;SPAN class="sig"&gt;&lt;SPAN class="name"&gt;_mm256_maskstore_pd?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 11 May 2015 07:41:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996752#M4800</guid>
      <dc:creator>Stephen_G_1</dc:creator>
      <dc:date>2015-05-11T07:41:02Z</dc:date>
    </item>
    <item>
      <title>Hi Stephen,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996753#M4801</link>
      <description>&lt;P&gt;Hi Stephen,&lt;BR /&gt;
	&lt;BR /&gt;
	My timing experiments with both on Haswell show that:&lt;BR /&gt;
	&lt;BR /&gt;
	_mm256_maskload_pd&lt;BR /&gt;
	does not depend on alignment and is 4(!) times as slow as&amp;nbsp;&lt;SPAN style="font-size: 13.007999420166px; line-height: 15.6096038818359px;"&gt;_mm256_loadu_pd.&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	_mm256_maskstore_pd&lt;BR /&gt;
	&lt;SPAN style="font-size: 13.007999420166px; line-height: 15.6096038818359px;"&gt;is 12% to 25% slower than&amp;nbsp;_mm256_storeu_pd&amp;nbsp;if the access does not cross a cache line boundary ((addr % 64) &amp;lt;= 32)&amp;nbsp;and&lt;BR /&gt;
	has the same speed as _mm256_storeu_pd&amp;nbsp;otherwise (3 times slower than with&amp;nbsp;((addr % 64) &amp;lt;= 32)).&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	Neither depends on the mask.&lt;/P&gt;</description>
      <pubDate>Mon, 11 May 2015 13:26:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996753#M4801</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2015-05-11T13:26:50Z</dc:date>
    </item>
    <item>
      <title>Hello,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996754#M4802</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;Haswell seems to behave specially with respect to alignment. I had code using aligned loads that worked for all kinds of unaligned addresses. I only noticed this when my code suddenly crashed on Sandy Bridge. I traced it back and found that on Haswell _mm256_load_xx appears to work with any address.&lt;/P&gt;

&lt;P&gt;If maskload is that expensive, it looks like it is implemented similarly to the scatter-gather instructions.&lt;/P&gt;
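
&lt;P&gt;For reference, a minimal sketch of the masked load under discussion, called on an address that is only 8-byte aligned. The helper name maskload_unaligned_ok and the GCC/Clang target attribute are illustrative assumptions, not something from this thread.&lt;/P&gt;

```c
#include "immintrin.h"  /* AVX intrinsics; the quoted form also searches the system include path */

/* Hypothetical helper for illustration. The GCC/Clang-specific target
   attribute enables AVX for this one function without -mavx. */
__attribute__((target("avx")))
int maskload_unaligned_ok(void)
{
    /* buf is 32-byte aligned, so buf + 1 is only 8-byte aligned. */
    double buf[8] __attribute__((aligned(32))) = {0, 1, 2, 3, 4, 5, 6, 7};
    double expect[4] = {1, 2, 3, 0};
    double out[4];
    int i;

    /* The sign bit of each 64-bit mask lane selects that lane.
       Arguments are given high lane first: lanes 0..2 on, lane 3 off. */
    __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    __m256d v = _mm256_maskload_pd(buf + 1, mask);
    _mm256_storeu_pd(out, v);

    /* Selected lanes hold buf[1..3]; the masked-out lane reads as zero. */
    for (i = 0; i != 4; ++i) {
        if (out[i] != expect[i])
            return 0;
    }
    return 1;
}
```

&lt;P&gt;On an AVX-capable CPU the helper should return 1. It only checks that the call is valid at this alignment; it says nothing about speed.&lt;/P&gt;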

&lt;P&gt;Has anyone tested which method is best for unaligned loads on Haswell?&lt;/P&gt;</description>
      <pubDate>Tue, 12 May 2015 07:34:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996754#M4802</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2015-05-12T07:34:40Z</dc:date>
    </item>
    <item>
      <title>Hi Christian,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996755#M4803</link>
      <description>&lt;P&gt;Hi Christian,&lt;BR /&gt;
	&lt;BR /&gt;
	&lt;SPAN style="font-size: 12px; line-height: 14.4000024795532px;"&gt;_mm256_load_xx can't work with an unaligned address. This code crashes:&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; __ALIGN(32) float&amp;nbsp;&amp;nbsp; &amp;nbsp;_f[100];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; float * volatile&amp;nbsp;&amp;nbsp; &amp;nbsp;f = _f;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; volatile __m256&amp;nbsp;&amp;nbsp; &amp;nbsp;v = _mm256_load_ps(f + 1);&lt;BR /&gt;
	The compiler is usually smart enough to replace your loads with &lt;/SPAN&gt;&lt;SPAN style="font-size: 13.007999420166px; line-height: 15.6096038818359px;"&gt;load&lt;STRONG&gt;u&lt;/STRONG&gt;'s.&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;The best way to do an unaligned 256-bit load is:&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; __m256&amp;nbsp;&amp;nbsp; &amp;nbsp;v;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="font-size: 13.007999420166px; line-height: 15.6096038818359px;"&gt;float&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; p[16] = {1, 0, 0, 0, 1};&lt;BR /&gt;
	&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p)), _mm_loadu_ps(p + 4), 1);&lt;BR /&gt;
	&lt;BR /&gt;
	It loads the low-order half and inserts the high-order half from memory.&lt;BR /&gt;
	This is the method used by the Intel compiler.&amp;nbsp;VC and GCC call&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.007999420166px; line-height: 15.6096038818359px;"&gt;_mm256_loadu_ps.&lt;BR /&gt;
	&lt;BR /&gt;
	The same approach can be used for&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.007999420166px; line-height: 15.6096038818359px;"&gt;unaligned stores.&lt;/SPAN&gt;&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 12 May 2015 16:03:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996755#M4803</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2015-05-12T16:03:59Z</dc:date>
    </item>
    <item>
      <title>Hello Vladimir,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996756#M4804</link>
      <description>&lt;P&gt;Hello Vladimir,&lt;/P&gt;

&lt;P&gt;Ah, it is reassuring that an aligned load does not work with an unaligned address. Since the loop extracted a sliding window with an increment of one and a fixed window size, the compiler seems to have realized that the address can't be aligned all the time.&lt;/P&gt;

&lt;P&gt;I already use _mm256_loadu_xx. What about &lt;SPAN class="sig"&gt;&lt;SPAN class="name"&gt;_mm256_loadu2_m128 and _mm256_lddqu_si256? The latter in particular might perform better, according to the intrinsics guide.&lt;/SPAN&gt;&lt;/SPAN&gt; But it does not give any details. Agner Fog's instruction table only lists lddqu ymm, m128 but no vlddqu ymm, m256.&lt;/P&gt;</description>
      <pubDate>Wed, 13 May 2015 08:03:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996756#M4804</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2015-05-13T08:03:58Z</dc:date>
    </item>
    <item>
      <title>Hello Christian,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996757#M4805</link>
      <description>&lt;P&gt;Hello Christian,&lt;BR /&gt;
	&lt;BR /&gt;
	&lt;SPAN style="font-size: 12px; line-height: 14.4000024795532px;"&gt;&amp;nbsp;_mm256_lddqu_si256 is the same speed as&amp;nbsp;_mm256_loadu_si256.&lt;BR /&gt;
	&lt;BR /&gt;
	_mm256_loadu2_m128 is &lt;/SPAN&gt;substantially&amp;nbsp;faster than loadu because it uses insertion, as I said before.&lt;BR /&gt;
	It's ~28% faster when loading from a small region in a loop (data in L1 cache).&lt;/P&gt;</description>
      <pubDate>Wed, 13 May 2015 08:59:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996757#M4805</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2015-05-13T08:59:00Z</dc:date>
    </item>
    <item>
      <title>Here is something about lddqu</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996758#M4806</link>
      <description>&lt;P&gt;Here is something about lddqu and movdqu: &lt;A href="https://software.intel.com/en-us/blogs/2012/04/16/history-of-one-cpu-instructions-part-1-lddqumovdqu-explained" target="_blank"&gt;https://software.intel.com/en-us/blogs/2012/04/16/history-of-one-cpu-instructions-part-1-lddqumovdqu-explained&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;The article says that Core 2 and later systems implement lddqu and movdqu similarly but does not clarify how exactly. Given that the original lddqu was not suitable for all memory types, I would guess that in recent architectures both lddqu and movdqu load 16 bytes (32 bytes in the case of ymm registers) of unaligned data. I have not seen confirmation of this, though.&lt;/P&gt;
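
&lt;P&gt;A minimal sketch that compares the two 256-bit load intrinsics on a deliberately misaligned address. It only checks that both return the same 32 bytes; it does not show how the hardware implements them. The helper name lddqu_matches_loadu and the GCC/Clang target attribute are illustrative assumptions.&lt;/P&gt;

```c
#include "immintrin.h"  /* AVX intrinsics */
#include "string.h"     /* memcmp */

/* Hypothetical helper for illustration. The GCC/Clang-specific target
   attribute enables AVX for this one function without -mavx. */
__attribute__((target("avx")))
int lddqu_matches_loadu(void)
{
    unsigned char src[64];
    unsigned char a[32];
    unsigned char b[32];
    int i;

    for (i = 0; i != 64; ++i)
        src[i] = (unsigned char)i;

    /* src + 3 is at an odd offset from src, so it cannot be 32-byte aligned
       whenever src itself is at least 4-byte aligned. */
    __m256i x = _mm256_lddqu_si256((const __m256i *)(src + 3));
    __m256i y = _mm256_loadu_si256((const __m256i *)(src + 3));

    _mm256_storeu_si256((__m256i *)a, x);
    _mm256_storeu_si256((__m256i *)b, y);

    /* Both results must equal the 32 source bytes starting at src + 3. */
    if (memcmp(a, src + 3, 32) != 0)
        return 0;
    return memcmp(a, b, 32) == 0;
}
```

&lt;P&gt;On an AVX-capable CPU the helper should return 1, meaning the two intrinsics are interchangeable as far as the loaded data is concerned.&lt;/P&gt;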

</description>
      <pubDate>Thu, 14 May 2015 10:56:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996758#M4806</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2015-05-14T10:56:36Z</dc:date>
    </item>
    <item>
      <title>Thanks for all the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996759#M4807</link>
      <description>&lt;P&gt;Thanks for all the information!&lt;/P&gt;

&lt;P&gt;Vladimir,&lt;/P&gt;

&lt;P&gt;As you said, with VC and the Intel compiler loadu2 won't give me a benefit, since loadu is already implemented with the insert.&lt;/P&gt;
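
&lt;P&gt;A minimal sketch of the insert-based load mentioned in this thread, checked against a plain _mm256_loadu_ps on the same unaligned address. The helper name split_load_matches_loadu and the GCC/Clang target attribute are illustrative assumptions. Which instructions a given compiler actually emits for either form is not shown here.&lt;/P&gt;

```c
#include "immintrin.h"  /* SSE and AVX intrinsics */

/* Hypothetical helper for illustration. The GCC/Clang-specific target
   attribute enables AVX for this one function without -mavx. */
__attribute__((target("avx")))
int split_load_matches_loadu(void)
{
    float p[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    float a[8];
    float b[8];
    int i;

    /* Low 128 bits from one unaligned load, high 128 bits inserted
       from a second unaligned load four floats further on. */
    __m256 split = _mm256_insertf128_ps(
        _mm256_castps128_ps256(_mm_loadu_ps(p + 1)),
        _mm_loadu_ps(p + 5), 1);

    /* Single 256-bit unaligned load of the same eight floats. */
    __m256 whole = _mm256_loadu_ps(p + 1);

    _mm256_storeu_ps(a, split);
    _mm256_storeu_ps(b, whole);

    /* Both vectors must hold p[1..8]. */
    for (i = 0; i != 8; ++i) {
        if (a[i] != p[i + 1])
            return 0;
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

&lt;P&gt;On an AVX-capable CPU the helper should return 1. Any speed difference between the two forms has to be measured separately.&lt;/P&gt;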

&lt;P&gt;andysem,&lt;/P&gt;

&lt;P&gt;I read through the article. I agree, on modern CPUs both instructions should behave the same way, as they support SSSE3.&lt;/P&gt;

</description>
      <pubDate>Wed, 20 May 2015 07:26:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Alignment-requirements-for-mm256-maskload-pd/m-p/996759#M4807</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2015-05-20T07:26:46Z</dc:date>
    </item>
  </channel>
</rss>

