<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic "I addressed this recently in" in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170006#M6647</link>
    <description>&lt;P&gt;Forum topic: SSE and AVX behavior with aligned/unaligned instructions.&lt;/P&gt;</description>
    <pubDate>Fri, 08 Dec 2017 16:51:00 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2017-12-08T16:51:00Z</dc:date>
    <item>
      <title>SSE and AVX behavior with aligned/unaligned instructions</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170000#M6641</link>
      <description>&lt;P&gt;We've learned that if the compiler emits an aligned SSE memory move instruction for an unaligned address, it will cause a SEGV. Will the same occur with AVX? Or, in the case of AVX, does the resulting behavior amount only to undesirable performance?&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2017 22:17:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170000#M6641</guid>
      <dc:creator>Mark_D_9</dc:creator>
      <dc:date>2017-12-07T22:17:42Z</dc:date>
    </item>
    <item>
      <title>The Fine Manual gives details</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170001#M6642</link>
      <description>&lt;P&gt;I think you answered the question yourself: aligned instructions require alignment!&lt;/P&gt;

&lt;P&gt;In more detail:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/intel-sdm"&gt;The Fine Manual&lt;/A&gt; gives details of the properties of each instruction, as does the online intrinsics guide; &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX&amp;amp;expand=3585&amp;amp;cats=Load"&gt;here&lt;/A&gt; you can see the properties of AVX load instructions. You will observe that there are both aligned and unaligned loads, for instance:&lt;/P&gt;

&lt;DIV class="signature" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; cursor: pointer; padding-top: 2px; color: rgb(0, 0, 0); font-size: 16px;"&gt;&lt;SPAN class="sig" style="color: rgb(102, 102, 102);"&gt;&lt;SPAN class="rettype" style="color: rgb(0, 0, 136);"&gt;_m256d&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="name" style="color: rgb(0, 0, 0);"&gt;_mm256_load_pd&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN class="param_type" style="color: rgb(0, 0, 136);"&gt;double const *&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="param_name" style="color: rgb(0, 102, 85);"&gt;mem_addr&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="details" style="padding: 10px; font-size: 12.8px; color: rgb(0, 0, 0); font-family: &amp;quot;Intel Clear&amp;quot;, &amp;quot;Intel Clear IE&amp;quot;, Tahoma, sans-serif;"&gt;
	&lt;H1 style="font-size: 14.08px;"&gt;Synopsis&lt;/H1&gt;

	&lt;DIV class="synopsis" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; padding: 10px; line-height: 16.64px;"&gt;&lt;SPAN class="sig" style="color: rgb(102, 102, 102);"&gt;&lt;SPAN class="rettype" style="color: rgb(0, 0, 136);"&gt;__m256d&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="name" style="color: rgb(0, 0, 0);"&gt;_mm256_load_pd&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN class="param_type" style="color: rgb(0, 0, 136);"&gt;double const *&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="param_name" style="color: rgb(0, 102, 85);"&gt;mem_addr&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;BR /&gt;
		#include "immintrin.h"&lt;BR /&gt;
		Instruction: vmovapd ymm, m256&lt;BR /&gt;
		CPUID Flags:&amp;nbsp;&lt;SPAN class="cpuid"&gt;AVX&lt;/SPAN&gt;&lt;/DIV&gt;

	&lt;H1 style="font-size: 14.08px;"&gt;Description&lt;/H1&gt;

	&lt;DIV class="description" style="padding: 10px; font-size: 14.08px;"&gt;Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into&amp;nbsp;&lt;SPAN class="desc_var dst" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; color: rgb(0, 102, 85);"&gt;dst&lt;/SPAN&gt;.&amp;nbsp;&lt;SPAN class="desc_var mem_addr" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; color: rgb(0, 102, 85);"&gt;mem_addr&lt;/SPAN&gt;&amp;nbsp;must be aligned on a 32-byte boundary or a general-protection exception may be generated.&lt;SPAN style="color: rgb(83, 87, 94); font-family: Arial, 宋体, Tahoma, Helvetica, sans-serif; font-size: 1em;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;

&lt;P&gt;and&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="rettype" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px; color: rgb(0, 0, 136);"&gt;__m256d&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="name" style="color: rgb(0, 0, 0); font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px;"&gt;_mm256_loadu_pd&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px;"&gt;&amp;nbsp;(&lt;/SPAN&gt;&lt;SPAN class="param_type" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px; color: rgb(0, 0, 136);"&gt;double const *&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="param_name" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px; color: rgb(0, 102, 85);"&gt;mem_addr&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; font-size: 16px;"&gt;)&lt;/SPAN&gt;&lt;/P&gt;

&lt;DIV class="description" style="color: rgb(0, 0, 0); font-family: &amp;quot;Intel Clear&amp;quot;, &amp;quot;Intel Clear IE&amp;quot;, Tahoma, sans-serif; padding: 10px; font-size: 14.08px;"&gt;
	&lt;DIV class="details" style="padding: 10px; font-size: 12.8px;"&gt;
		&lt;H1 style="font-size: 14.08px;"&gt;Synopsis&lt;/H1&gt;

		&lt;DIV class="synopsis" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; padding: 10px; line-height: 16.64px;"&gt;&lt;SPAN class="sig" style="color: rgb(102, 102, 102);"&gt;&lt;SPAN class="rettype" style="color: rgb(0, 0, 136);"&gt;__m256d&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="name" style="color: rgb(0, 0, 0);"&gt;_mm256_loadu_pd&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN class="param_type" style="color: rgb(0, 0, 136);"&gt;double const *&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN class="param_name" style="color: rgb(0, 102, 85);"&gt;mem_addr&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;BR /&gt;
			#include "immintrin.h"&lt;BR /&gt;
			Instruction: vmovupd ymm, m256&lt;BR /&gt;
			CPUID Flags:&amp;nbsp;&lt;SPAN class="cpuid"&gt;AVX&lt;/SPAN&gt;&lt;/DIV&gt;

		&lt;H1 style="font-size: 14.08px;"&gt;Description&lt;/H1&gt;

		&lt;DIV class="description" style="padding: 10px; font-size: 14.08px;"&gt;Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into&amp;nbsp;&lt;SPAN class="desc_var dst" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; color: rgb(0, 102, 85);"&gt;dst&lt;/SPAN&gt;.&amp;nbsp;&lt;SPAN class="desc_var mem_addr" style="font-family: &amp;quot;Oxygen Mono&amp;quot;, Monaco, monospace; color: rgb(0, 102, 85);"&gt;mem_addr&lt;/SPAN&gt;&amp;nbsp;does not need to be aligned on any particular boundary.&lt;/DIV&gt;
	&lt;/DIV&gt;
&lt;/DIV&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2017 09:46:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170001#M6642</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2017-12-08T09:46:45Z</dc:date>
    </item>
    <item>
      <title>Aligned VEX-encoded loads and</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170002#M6643</link>
      <description>&lt;P&gt;Aligned VEX-encoded loads and stores (i.e. vmovdqa) still require aligned memory operands. However, memory operands of other VEX-encoded instructions (e.g. vpaddd) need not be aligned. You will still pay a performance penalty for unaligned memory accesses, though. Refer to the Intel Software Developer's Manual for the description of particular instructions.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2017 11:43:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170002#M6643</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2017-12-08T11:43:15Z</dc:date>
    </item>
    <item>
      <title>SSE instructions on AVX cpu</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170003#M6644</link>
      <description>SSE instructions on an AVX CPU accept more cases of misaligned data than on earlier CPUs.  I don't think this is well documented, in case that is part of your question.</description>
      <pubDate>Fri, 08 Dec 2017 11:54:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170003#M6644</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-12-08T11:54:06Z</dc:date>
    </item>
    <item>
      <title>@Tim P. AFAIK, legacy SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170004#M6645</link>
      <description>&lt;P&gt;@&lt;A href="https://software.intel.com/en-us/user/336903"&gt;Tim P.&lt;/A&gt; AFAIK, legacy SSE instructions (i.e. non-VEX-encoded) haven't changed and still require aligned memory operands where they previously did. Only the VEX-encoded equivalents have relaxed requirements.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2017 12:06:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170004#M6645</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2017-12-08T12:06:47Z</dc:date>
    </item>
    <item>
      <title>SSE arithmetic instructions</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170005#M6646</link>
      <description>SSE arithmetic instructions may accept an unaligned operand on an AVX CPU. It goes without saying that it is inadvisable to try to take advantage of this: it raises the possibility of unexpected failure when changing CPUs. 
Then again, Intel may have changed this undocumented behavior on more recent CPUs after originally planning to match AMD (who might also have changed).</description>
      <pubDate>Fri, 08 Dec 2017 14:31:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170005#M6646</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-12-08T14:31:00Z</dc:date>
    </item>
    <item>
      <title>I addressed this recently in</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170006#M6647</link>
      <description>&lt;P&gt;I addressed this recently in a different forum topic, but I can't find the reference right now....&lt;/P&gt;

&lt;P&gt;In the beginning, SSE supported unaligned 128-bit loads/stores only via the MOVUPS instruction. All 128-bit memory references that were input arguments to other instructions were required to be 128-bit aligned to avoid a protection fault. In the earliest SSE systems, MOVAPS was faster, so it was preferred when the data was known to be aligned. Later systems eliminated the performance penalty of MOVUPS in the case where the data was aligned, so compilers switched to generating MOVUPS even in cases where they knew the data was aligned.&lt;/P&gt;

&lt;P&gt;AVX relaxed the alignment restrictions on the memory input arguments of both 128-bit and 256-bit instructions. But every generation of processor has had different performance penalties for executing these memory references without natural alignment.&lt;/P&gt;

&lt;P&gt;From memory:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Sandy Bridge
		&lt;UL&gt;
			&lt;LI&gt;Loads
				&lt;UL&gt;
					&lt;LI&gt;2 loads per cycle (up to 128-bit) in the absence of bank conflicts or cache line crossing.
						&lt;UL&gt;
							&lt;LI&gt;I.e., no penalty for unaligned loads that do not cross a cache line boundary.&lt;/LI&gt;
						&lt;/UL&gt;
					&lt;/LI&gt;
					&lt;LI&gt;128-bit loads that cross a cache line boundary reduce the rate to 1 load every 2 cycles.&lt;/LI&gt;
					&lt;LI&gt;256-bit loads take 2 cycles, but two can execute in parallel in the absence of bank conflicts or cache line crossing.&lt;/LI&gt;
					&lt;LI&gt;256-bit loads that cross a cache line boundary reduce the rate to 1 load every 4 cycles.&lt;/LI&gt;
					&lt;LI&gt;Loads that cross a 4KiB page boundary have a larger penalty, but at least part of that penalty can be overlapped with subsequent loads.&amp;nbsp; The detailed mechanisms are not clear.&lt;/LI&gt;
				&lt;/UL&gt;
			&lt;/LI&gt;
			&lt;LI&gt;Stores
				&lt;UL&gt;
					&lt;LI&gt;Big (?) penalty for any sized store that crosses a cache line boundary.&lt;/LI&gt;
					&lt;LI&gt;Huge (&amp;gt;100 cycle) penalty for any store that crosses a 4KiB page boundary.&lt;/LI&gt;
				&lt;/UL&gt;
			&lt;/LI&gt;
			&lt;LI&gt;Because there are only 2 address generation units, it is not possible to perform 2 loads and 1 store per cycle.
				&lt;UL&gt;
					&lt;LI&gt;2 256-bit loads plus 1 256-bit store every 2 cycles is supported, but it is extremely difficult to avoid bank conflicts in this case.&lt;/LI&gt;
				&lt;/UL&gt;
			&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Ivy Bridge
		&lt;UL&gt;
			&lt;LI&gt;I think there were reductions in the penalties for cache-line and page crossing, but I don't recall that I ever measured them in detail.&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Haswell
		&lt;UL&gt;
			&lt;LI&gt;Loads
				&lt;UL&gt;
					&lt;LI&gt;2 loads per cycle (up to 256-bit) for any alignment in the absence of cache line crossing.&lt;/LI&gt;
					&lt;LI&gt;1 load per cycle for any sized load that crosses a cache line boundary.&lt;/LI&gt;
					&lt;LI&gt;Loads that cross a 4KiB page boundary have a larger penalty, but at least part of that penalty can be overlapped with subsequent loads.&amp;nbsp; The detailed mechanisms are not clear.&lt;/LI&gt;
				&lt;/UL&gt;
			&lt;/LI&gt;
			&lt;LI&gt;Stores
				&lt;UL&gt;
					&lt;LI&gt;One store per cycle (any size or alignment) as long as it does not cross a cache line boundary.&lt;/LI&gt;
					&lt;LI&gt;I think that the penalties for cache-line-crossing and 4KiB-page-crossing are much smaller than on SNB, but I don't have the numbers handy.&lt;/LI&gt;
				&lt;/UL&gt;
			&lt;/LI&gt;
			&lt;LI&gt;A 3rd address generation unit was added to allow 2 loads plus 1 store per cycle.&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Skylake Xeon
		&lt;UL&gt;
			&lt;LI&gt;I have not tested this yet, but it certainly supports 2 512-bit aligned loads per cycle, or 1 512-bit aligned load plus any other load that does not cross a cache line boundary.
				&lt;UL&gt;
					&lt;LI&gt;This could be built on the same physical interface that Haswell uses -- dual-read-port, 512-bit port width.&lt;/LI&gt;
				&lt;/UL&gt;
			&lt;/LI&gt;
			&lt;LI&gt;Skylake Xeon does not appear to be able to support 2 512-bit loads plus 1 512-bit store per cycle, but the reported performance is slightly higher than 2 512-bit loads per cycle.&amp;nbsp;&amp;nbsp; I have not checked to see whether this inability to fully overlap also applies to 128-bit and/or 256-bit 2-load-plus-1-store combinations.&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 08 Dec 2017 16:51:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170006#M6647</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-12-08T16:51:00Z</dc:date>
    </item>
    <item>
      <title>Elimination of compiler use</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170007#M6648</link>
      <description>Elimination of compiler use of aligned loads applies only to AVX and newer targets, as older ISA code selection in Intel compilers is generally optimized for the oldest corresponding CPU. 
The penalty for 256-bit unaligned accesses on Sandy Bridge was so large that compilers would always split them into pairs of 128-bit accesses.  Ivy Bridge greatly reduced the penalty, but not to the extent that compilers needed to eliminate the splitting. Intel compilers, when directed to generate both Sandy Bridge and Ivy Bridge paths, should produce only the path optimized for Sandy Bridge.
I suppose the incentive for some of us to learn about AVX-512 specifics is reduced because of the lack of client CPU support.</description>
      <pubDate>Sat, 09 Dec 2017 01:46:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-and-AVX-behavior-with-aligned-unaligned-instructions/m-p/1170007#M6648</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-12-09T01:46:23Z</dc:date>
    </item>
  </channel>
</rss>

