<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Would be processor always use in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081583#M5705</link>
    <description>&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Would the processor always use this technique? What if all registers are already in use? Are more registers available internally within a single processor core?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;AFAIK, register renaming works on any write to a register, probably even when not strictly required. There are many more internal registers in the CPU than are exposed through xmm/ymm names, so running out of spare internal registers is rare in practice; if it does happen, the renamer simply stalls until one is freed. For example, this article (&lt;A href="http://www.realworldtech.com/haswell-cpu/3/"&gt;http://www.realworldtech.com/haswell-cpu/3/&lt;/A&gt;) states there are 144&amp;nbsp;vector registers in Sandy Bridge and 168 in Haswell, and only 16 of them are exposed to the code.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Another interesting question: if there are a lot of shadow registers, why is the number of real registers so much smaller?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Because exposing more registers would require more space in the instruction encoding.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;And why is it not possible to do the same in case #3 (the dependencies are clear): decode the instructions in parallel, start them in parallel inside the pipeline AND use the internal result of the previous operation without waiting for the result to reach the real XMM register? I remember something about that, but I am not sure whether it is implemented by Intel or whether that was completely wrong information.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;In order to execute an instruction, all its input values have to be ready. That means that the previous results have to be computed first; the CPU can forward a result to the dependent instruction as soon as it is produced, but not before. That's what forces these instructions to go sequentially. Renaming a register is cheap, and this step is probably indivisible from executing the instruction.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;I.e. there are exactly 16 __m128i variables, so the 16 xmm registers could be used without "caching" - but caching is used.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Having 16 __m128i variables does not guarantee that this many registers are needed. In fact, the correspondence between variables and registers is rather weak. The compiler is able to rearrange the code so that some of the variables are not used, and this way it may free some registers for other purposes. Some additional registers may be required for temporary results or for implementing non-destructive operations. Loop unrolling also significantly increases register consumption. So, while I don't have a very high opinion of MSVC, it may have had valid reasons to spill registers.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 30 Jan 2017 10:29:00 GMT</pubDate>
    <dc:creator>andysem</dc:creator>
    <dc:date>2017-01-30T10:29:00Z</dc:date>
    <item>
      <title>Question about latency</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081578#M5700</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Years ago I read and heard various mysterious things about latencies caused by register choice when using AVX, and about why it is better to use AVX instead of SSE: non-destructive operations were said to perform better than destructive ones. Now, years later, that may be different. In short:&lt;/P&gt;

&lt;P&gt;Assume there are simple values (signed bytes, words, etc.; it does not matter), say in xmm0, xmm1, xmm2, xmm3.&lt;BR /&gt;
	We want to calculate the sum, max or min (it does not matter which).&lt;BR /&gt;
	We are free to use other registers too, so we use something like AVXOP R1, R2, R3 (here AVXOP stands for a real instruction).&lt;BR /&gt;
	Now we can do the following:&lt;/P&gt;

&lt;P&gt;1. (always use a different register as the destination operand)&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em;"&gt;AVXOP xmm4, xmm0, xmm1&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 13.008px;"&gt;AVXOP xmm5, xmm2, xmm3&lt;BR /&gt;
	AVXOP xmm6, xmm4, xmm5 ; xmm6 is the result, 3 additional registers used in total&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;2.&amp;nbsp;(reuse a source register as the destination operand, but minimize "cross use" of a potentially unready result of one operation as the destination register of the following operation)&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em;"&gt;AVXOP xmm0, xmm0, xmm1&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 13.008px;"&gt;AVXOP xmm2, xmm2, xmm3&lt;BR /&gt;
	AVXOP xmm4, xmm0, xmm2 ; xmm4 is the result, 1 additional register used in total&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;3.&amp;nbsp;(always use the same register as the destination operand,&amp;nbsp;without minimizing "cross use" of a potentially unready result of one operation as the destination register of the following operation)&lt;BR /&gt;
	AVXOP xmm0, xmm0, xmm1&lt;BR /&gt;
	AVXOP xmm0, xmm0, xmm2&lt;BR /&gt;
	AVXOP&amp;nbsp;xmm0, xmm0, xmm3&amp;nbsp;; xmm0 is the result, no additional registers used.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;If they are all equal, the next question would be whether it is better to use SSE instead, because the instructions are shorter and have lower latency, so that code like:&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;SSEOP xmm0, xmm1&lt;/SPAN&gt;&lt;BR style="font-size: 13.008px;" /&gt;
	&lt;SPAN style="font-size: 13.008px;"&gt;SSEOP&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;xmm0, xmm2&lt;/SPAN&gt;&lt;BR style="font-size: 13.008px;" /&gt;
	&lt;SPAN style="font-size: 13.008px;"&gt;SSEOP&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;xmm0, xmm3&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;; xmm0 is result&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;, total use no additional registers.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;would beat the three variants above.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;One more question: does Intel have a solution to reduce stalls when SSE and AVX (AVX2) are intermixed?&lt;/P&gt;

&lt;P&gt;And the last question (now with real code snippet):&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm0 = _mm_load_si128((__m128i*)(rsi + rdx * 2 - 0x00000010));&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm5 = _mm_load_si128((__m128i*)(rsi + rdx * 2));&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm6 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000010));&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm7 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000020));&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm11 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000030));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;//; [X - 2, Y - 2..X + 2, Y - 2] == &amp;gt; XMM0..XMM4&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm1 = _mm_alignr_epi8(xmm6, xmm5, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm2 = _mm_alignr_epi8(xmm6, xmm5, 2);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm3 = _mm_alignr_epi8(xmm5, xmm0, 15);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm4 = _mm_alignr_epi8(xmm5, xmm0, 14);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm0 = _mm_max_epi8(xmm1, xmm2);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm2 = _mm_max_epi8(xmm3, xmm4);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm3 = _mm_max_epi8(xmm0, xmm5);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm8 = _mm_max_epi8(xmm3, xmm2); // MAX(XMM0, XMM1, XMM2, XMM3, XMM4) == &amp;gt; RESULT IN XMM8&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; ...... xmm7 and xmm11 will be used from here on.&lt;/P&gt;

&lt;P&gt;Question: is it a good idea to start loading as early as possible (all 5 values) to get them from memory (with high probability the values are not cached), or is it better to interleave the code like this:&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm0 = _mm_load_si128((__m128i*)(rsi + rdx * 2 - 0x00000010));&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm5 = _mm_load_si128((__m128i*)(rsi + rdx * 2));&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm6 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000010));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;//; [X - 2, Y - 2..X + 2, Y - 2] == &amp;gt; XMM0..XMM4&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm1 = _mm_alignr_epi8(xmm6, xmm5, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm2 = _mm_alignr_epi8(xmm6, xmm5, 2);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm3 = _mm_alignr_epi8(xmm5, xmm0, 15);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm4 = _mm_alignr_epi8(xmm5, xmm0, 14);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp; &amp;nbsp; xmm7 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000020));&lt;/SPAN&gt;&lt;BR style="font-size: 13.008px;" /&gt;
	&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm11 = _mm_load_si128((__m128i*)(rsi + rdx * 2 + 0x00000030));&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;xmm0 = _mm_max_epi8(xmm1, xmm2);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm2 = _mm_max_epi8(xmm3, xmm4);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm3 = _mm_max_epi8(xmm0, xmm5);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xmm8 = _mm_max_epi8(xmm3, xmm2); // MAX(XMM0, XMM1, XMM2, XMM3, XMM4) == &amp;gt; RESULT IN XMM8&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp; &amp;nbsp; ...... xmm7 and xmm11 will be used from here on.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Or is it all the same, and handled by the prefetcher anyway? (Using a prefetch instruction or something like register preloading is not in question here.)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;

&lt;P&gt;Alex&lt;/P&gt;</description>
      <pubDate>Sun, 29 Jan 2017 15:09:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081578#M5700</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2017-01-29T15:09:08Z</dc:date>
    </item>
    <item>
      <title>Regarding the first part of</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081579#M5701</link>
      <description>&lt;P&gt;Regarding the first part of your question, the cases #1 and #2 are equivalent from the perspective of performance. This is so because of register renaming that happens in the CPU. So even when your destination register is one of the input registers, the CPU internally saves the result of the instruction into a new register, which is then referred to as the destination register in the instruction. In other words:&lt;/P&gt;

&lt;PRE class="brush:;"&gt;AVXOP xmm0, xmm0, xmm1&lt;/PRE&gt;

&lt;P&gt;is equivalent to&lt;/P&gt;

&lt;PRE class="brush:;"&gt;AVXOP xmm0', xmm0, xmm1

[rename xmm0' to xmm0, so that the following instructions refer to the result of this instruction]&lt;/PRE&gt;

&lt;P&gt;Case #3 behaves differently and is likely to be slower because all three instructions form a single data dependency chain. This means that while cases #1 and #2 can potentially execute the first two instructions in parallel (provided that the CPU is capable of doing that), in case #3 all three instructions must be executed sequentially.&lt;/P&gt;
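
&lt;P&gt;To make the difference concrete, here is a minimal sketch of a toy scheduling model, not a description of any real CPU. It assumes perfect register renaming, unlimited execution ports and a latency of one cycle for the placeholder AVXOP; the function name critical_path and the LATENCY constant are made up for this illustration.&lt;/P&gt;

```python
# Toy model: each instruction is (destination, source_a, source_b).
# Assumptions: perfect renaming (only true dependencies matter),
# unlimited execution ports, and a fixed latency per instruction.
LATENCY = 1  # assumed latency of the placeholder AVXOP, in cycles


def critical_path(instructions):
    """Return the cycle in which the last result becomes available."""
    ready = {}  # register name -> cycle at which its newest value is ready
    finish = 0
    for dest, src_a, src_b in instructions:
        # An instruction can start only when both of its inputs are ready.
        start = max(ready.get(src_a, 0), ready.get(src_b, 0))
        ready[dest] = start + LATENCY
        finish = max(finish, ready[dest])
    return finish


case1 = [("xmm4", "xmm0", "xmm1"), ("xmm5", "xmm2", "xmm3"), ("xmm6", "xmm4", "xmm5")]
case2 = [("xmm0", "xmm0", "xmm1"), ("xmm2", "xmm2", "xmm3"), ("xmm4", "xmm0", "xmm2")]
case3 = [("xmm0", "xmm0", "xmm1"), ("xmm0", "xmm0", "xmm2"), ("xmm0", "xmm0", "xmm3")]

for name, case in (("#1", case1), ("#2", case2), ("#3", case3)):
    print(name, critical_path(case))  # prints 2, 2 and 3 cycles respectively
```

&lt;P&gt;In this model the choice of destination register makes no difference between #1 and #2; only the shape of the dependency graph matters, and #3 is one step longer because nothing can overlap.&lt;/P&gt;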

&lt;P&gt;Regarding SSE vs. AVX encoding and non-destructive instructions, the latter make it possible to eliminate the register-to-register movdqa instructions that are otherwise required to achieve the same effect. This reduces the code size and also relieves the instruction decoder. In many cases just recompiling the same code for AVX can provide a noticeable speedup.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;If they are all equal, the next question would be whether it is better to use SSE instead, because the instructions are shorter and have lower latency&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Equivalent SSE and AVX instructions have the same latency and throughput, AFAIK. At least, I haven't seen a case where they are not. Also, I don't think there is any significant difference in instruction size between SSE and AVX (well, between SSE2 and AVX2 anyway). In both cases, instructions take about 4 bytes on average, when memory operands and immediate constants are not involved.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;One more question: does Intel have a solution to reduce stalls when SSE and AVX (AVX2) are intermixed?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I assume you mean the penalties caused by mixing 256-bit vector instructions from AVX/AVX2 with 128-bit vector instructions from SSE. You can already avoid those penalties by issuing vzeroupper or vzeroall instructions. Recent Intel architectures also reduced the penalties, but they are far from zero.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Question: is it a good idea to start loading as early as possible (all 5 values) to get them from memory (with high probability the values are not cached), or is it better to interleave the code&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;It is generally a good idea to start loading data beforehand, but I wouldn't give general advice like that. There are many things to consider. First, your compiler may reorder code as it sees fit, so there may actually be no difference in how you arrange it. Second, there is instruction reordering in the CPU, with a rather large window in recent architectures, so the exact instruction order matters less these days. Third, by issuing early loads you basically reserve registers that could have been useful for other purposes. That could in turn cause register spills and harm performance more than you could potentially win by the early load. Then you should consider the number of load ports your target CPU has. The CPU cannot issue more load instructions per cycle than it has load ports, no matter how you order the code. In general, you should profile your code to see if one way or the other is beneficial for your case.&lt;/P&gt;
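
&lt;P&gt;As a rough illustration of the load port limit, the sketch below computes a lower bound on the number of cycles needed just to issue a group of loads. The helper name min_load_issue_cycles is made up, and the figure of two load ports is an assumption for the example; check the documentation of your target CPU.&lt;/P&gt;

```python
import math


def min_load_issue_cycles(num_loads, load_ports):
    """Lower bound on the cycles needed to issue num_loads load instructions."""
    return math.ceil(num_loads / load_ports)


# The snippet in the question issues 5 loads.
# Two load ports is an assumption, not a statement about a specific CPU.
print(min_load_issue_cycles(5, 2))  # 3
print(min_load_issue_cycles(5, 1))  # 5
```

&lt;P&gt;The bound depends only on the number of loads and ports, not on where the loads sit in the instruction stream, and it says nothing about cache misses, which usually dominate.&lt;/P&gt;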

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 29 Jan 2017 20:07:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081579#M5701</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2017-01-29T20:07:38Z</dc:date>
    </item>
    <item>
      <title>  Hello Andy,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081580#M5702</link>
      <description>&lt;P&gt;&amp;nbsp; Hello Andy,&lt;/P&gt;

&lt;P&gt;first of all, many thanks for useful information!&lt;/P&gt;

&lt;P&gt;Second, I write directly in assembly language because the VS compiler often produces badly performing code, saving and restoring values even though there are enough registers, etc.&lt;/P&gt;

&lt;P&gt;I had completely&amp;nbsp;forgotten about internal register renaming; too much WPF, WCF and other distractions :)&lt;/P&gt;

&lt;P&gt;But... Would the processor always use this technique? What if all registers are already in use? Are more registers available internally within a single processor core? And why is it not possible to do the same in case #3 (the dependencies are clear): decode the instructions in parallel, start them in parallel inside the pipeline AND use the internal result of the previous operation without waiting for the result to reach the real XMM register? I remember something about that, but I am not sure whether it is implemented by Intel or whether that was completely wrong information.&lt;/P&gt;

&lt;P&gt;Really good to know that VZERO... can avoid the penalties :)&lt;/P&gt;

&lt;P&gt;What about the last question? The code is written in assembly. Should I start 6 loads (MOV(NT)DQA) per core sequentially, or intermix them with other instructions as shown above? Could it be that 6 loads cannot be started back to back, so that doing so incurs a penalty? Once again, the compiler produces really bad code (as shown in another topic here), so I need to write in assembly.&lt;/P&gt;

&lt;P&gt;Alex&lt;/P&gt;</description>
      <pubDate>Sun, 29 Jan 2017 21:06:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081580#M5702</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2017-01-29T21:06:04Z</dc:date>
    </item>
    <item>
      <title>Interesting exchange, but I'm</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081581#M5703</link>
      <description>&lt;P&gt;Interesting exchange, but I'm perplexed about which compiler you complain about. I would much prefer to select the most efficient of several available compilers rather than optimize asm for a specific unspecified cpu.&lt;/P&gt;

&lt;P&gt;Shadow registers seem to be a nearly inexhaustible resource if you optimize number of threads per core and allow your compiler to work. &amp;nbsp;If you mix SSE and AVX intrinsics, Intel C++ will try to avoid the transition stalls by promoting SSE to AVX or adding vzeroupper at function returns.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jan 2017 00:32:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081581#M5703</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-01-30T00:32:34Z</dc:date>
    </item>
    <item>
      <title>  Hello Tim,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081582#M5704</link>
      <description>&lt;P&gt;&amp;nbsp; Hello Tim,&lt;/P&gt;

&lt;P&gt;you are correct about compilers and assembler, but there is a big problem. The boss wants the cheapest tools, the cheapest work,&amp;nbsp;the cheapest staff (where 2/3 do not know what to do at all, and 1/3 must do the work of the other 2/3 too) and wants to invest nothing, yet they want to be the "big number one" by selling the&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;cheapest&amp;nbsp;&lt;/SPAN&gt;systems - simply crazy. P.S. I will no longer work for that company because of such conditions :)&lt;/P&gt;

&lt;P&gt;The compiler is the VS 2015 C++ compiler, and the code was written with intrinsics. As mentioned in the other thread, the compiler produces significantly slower code. I.e. there are exactly 16 __m128i variables, so the 16 xmm registers could be used without "caching" - but caching is used. Another problem: var1, var2, var3 are calculated in that order and also stored in that order, but the compiler rearranges the code into var3, var2, var1 order, and the result is a pipeline stall. And the next really crazy thing is that 4x _mm_stream_si128() are reordered so that the resulting MOVNTDQ instructions are intermixed with other instructions, which prevents the cache line from being written as a whole and results in very poor performance. I have also noticed that rewriting the code in assembly gives a speedup of a factor of 1.5-2.5 AND sometimes prevents cache pollution, so that other algorithms speed up too.&lt;/P&gt;

&lt;P&gt;The next problem: the code vectorization must be done manually because there is a bunch of "cheapest" code. The number of threads must be balanced manually too, because there are many other threads, so additional threads can slow down the overall performance.&lt;/P&gt;

&lt;P&gt;One question: what do you mean by "&lt;SPAN style="font-size: 12px;"&gt;Shadow registers seem to be a nearly inexhaustible resource if you optimize number of threads per core"? How many shadow registers are available per core? Another interesting question: if there are a lot of shadow registers, why is the number of real registers so much smaller?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Alex&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jan 2017 09:14:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081582#M5704</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2017-01-30T09:14:00Z</dc:date>
    </item>
    <item>
      <title>Would be processor always use</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081583#M5705</link>
      <description>&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Would the processor always use this technique? What if all registers are already in use? Are more registers available internally within a single processor core?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;AFAIK, register renaming works on any write to a register, probably even when not strictly required. There are many more internal registers in the CPU than are exposed through xmm/ymm names, so running out of spare internal registers is rare in practice; if it does happen, the renamer simply stalls until one is freed. For example, this article (&lt;A href="http://www.realworldtech.com/haswell-cpu/3/"&gt;http://www.realworldtech.com/haswell-cpu/3/&lt;/A&gt;) states there are 144&amp;nbsp;vector registers in Sandy Bridge and 168 in Haswell, and only 16 of them are exposed to the code.&lt;/P&gt;
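
&lt;P&gt;A minimal sketch of the idea, heavily simplified: a rename table maps each architectural name to a physical register, and every write takes a fresh physical register from a free list. The Renamer class is hypothetical and made up for this illustration; the 168 is the Haswell figure from the article above, and the sketch never returns registers to the free list, which a real CPU does when instructions retire.&lt;/P&gt;

```python
from collections import deque


class Renamer:
    """Toy rename table: architectural register name -> physical register index."""

    def __init__(self, architectural, physical_count):
        self.free = deque(range(physical_count))
        # Initially every architectural register owns one physical register.
        self.table = {name: self.free.popleft() for name in architectural}

    def rename(self, dest, sources):
        # Sources are looked up first, so they refer to the previous values.
        physical_sources = [self.table[s] for s in sources]
        # Every write gets a fresh physical register (simplification: no recycling).
        new_dest = self.free.popleft()
        self.table[dest] = new_dest
        return new_dest, physical_sources


names = [f"xmm{i}" for i in range(16)]
renamer = Renamer(names, 168)
print(renamer.rename("xmm0", ["xmm0", "xmm1"]))  # (16, [0, 1])
print(renamer.rename("xmm0", ["xmm0", "xmm2"]))  # (17, [16, 2])
```

&lt;P&gt;The second instruction reads physical register 16, the result of the first, so the true dependency is preserved even though both write the same architectural name.&lt;/P&gt;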

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Another interesting question: if there are a lot of shadow registers, why is the number of real registers so much smaller?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Because exposing more registers would require more space in the instruction encoding.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;And why is it not possible to do the same in case #3 (the dependencies are clear): decode the instructions in parallel, start them in parallel inside the pipeline AND use the internal result of the previous operation without waiting for the result to reach the real XMM register? I remember something about that, but I am not sure whether it is implemented by Intel or whether that was completely wrong information.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;In order to execute an instruction, all its input values have to be ready. That means that the previous results have to be computed first; the CPU can forward a result to the dependent instruction as soon as it is produced, but not before. That's what forces these instructions to go sequentially. Renaming a register is cheap, and this step is probably indivisible from executing the instruction.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;I.e. there are exactly 16 __m128i variables, so the 16 xmm registers could be used without "caching" - but caching is used.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Having 16 __m128i variables does not guarantee that this many registers are needed. In fact, the correspondence between variables and registers is rather weak. The compiler is able to rearrange the code so that some of the variables are not used, and this way it may free some registers for other purposes. Some additional registers may be required for temporary results or for implementing non-destructive operations. Loop unrolling also significantly increases register consumption. So, while I don't have a very high opinion of MSVC, it may have had valid reasons to spill registers.&lt;/P&gt;
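
&lt;P&gt;To illustrate how loosely variable names map to register demand, here is a small sketch that counts how many values are live at the same time in the straight-line intrinsics snippet from the first post. The function name max_live and the tuple encoding are made up for this example; it ignores temporaries the compiler may add, so it is a lower bound on register pressure, not a prediction of what MSVC does.&lt;/P&gt;

```python
def max_live(program, live_out=()):
    """Peak number of simultaneously live values in straight-line code.

    program is a list of (dest, sources); live_out names values used later.
    """
    live = set(live_out)
    peak = len(live)
    # Walk backwards: a value is live from its definition to its last use.
    for dest, sources in reversed(program):
        live.discard(dest)     # the value does not exist before its definition
        live.update(sources)   # inputs must be live just before the instruction
        peak = max(peak, len(live))
    return peak


# The snippet from the question: five loads, four alignr, four max operations.
program = [
    ("xmm0", []), ("xmm5", []), ("xmm6", []), ("xmm7", []), ("xmm11", []),
    ("xmm1", ["xmm6", "xmm5"]), ("xmm2", ["xmm6", "xmm5"]),
    ("xmm3", ["xmm5", "xmm0"]), ("xmm4", ["xmm5", "xmm0"]),
    ("xmm0", ["xmm1", "xmm2"]), ("xmm2", ["xmm3", "xmm4"]),
    ("xmm3", ["xmm0", "xmm5"]), ("xmm8", ["xmm3", "xmm2"]),
]
live_out = {"xmm7", "xmm8", "xmm11"}  # the result plus the two values used later

print(len({dest for dest, _ in program}))  # 10 distinct variable names
print(max_live(program, live_out))         # 7 values live at the same time
```

&lt;P&gt;Ten names appear in the snippet, yet at most seven values are alive at once in this model, so counting variables alone tells you little about whether a spill is justified.&lt;/P&gt;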

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jan 2017 10:29:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081583#M5705</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2017-01-30T10:29:00Z</dc:date>
    </item>
    <item>
      <title>  Many thanks for very</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081584#M5706</link>
      <description>&lt;P&gt;&amp;nbsp; Many thanks for the very useful information, andysem.&lt;/P&gt;

&lt;P&gt;An instruction may (but need not always) take one or two more bytes to encode if more real registers are used, but this can significantly reduce the total number of instructions executed and can additionally prevent register "caching", so the real speedup clearly outweighs somewhat "bigger code" (OK, the methods are small and fit in the I-cache) :)&lt;/P&gt;

&lt;P&gt;About the MSVC compiler: I've looked at the generated assembly code and run several benchmarks. That's why I decided to (re)write it in assembly. See my other posts here; you will be surprised by the instruction order.&lt;/P&gt;

&lt;P&gt;P.S. Is Andy your real name?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Alex&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jan 2017 10:41:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081584#M5706</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2017-01-30T10:41:00Z</dc:date>
    </item>
    <item>
      <title>FWIW my experience is MSVC</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081585#M5707</link>
      <description>&lt;P&gt;FWIW my experience is MSVC&amp;nbsp;2015.3 can be&amp;nbsp;inefficient at register allocation, resulting in&amp;nbsp;spilling 2-3 registers sooner than necessary.&amp;nbsp; No measurable impact from instruction ordering or below-threshold register use in my cases, but the slowdown's predictably precipitous when VC decides to spill two instead of fitting a loop into 15 registers like it should.&amp;nbsp; I've also found providing register hints tends to result in worse spilling rather than getting VC to understand it's being told the optimal register allocation.&amp;nbsp; I'd build in another compiler which can figure out the obvious solution before dropping to assembly but, if it has to be VC, well, yeah.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Feb 2017 05:23:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081585#M5707</guid>
      <dc:creator>Todd_W_</dc:creator>
      <dc:date>2017-02-05T05:23:34Z</dc:date>
    </item>
    <item>
      <title>   Hi Todd,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081586#M5708</link>
      <description>&lt;P&gt;&amp;nbsp; &amp;nbsp;Hi Todd,&lt;/P&gt;

&lt;P&gt;many thanks for sharing your experience with me! I have exactly the same problem :( Register hinting does not help either. That's why I rewrite some important pieces of code in assembly (for lack of a better compiler).&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2017 23:44:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081586#M5708</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2017-02-06T23:44:07Z</dc:date>
    </item>
    <item>
      <title>You're welcome, mate. </title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081587#M5709</link>
      <description>&lt;P&gt;You're welcome, mate.&amp;nbsp; Hopefully things'll improve some as Microsoft continues modernizing their compiler code base.&amp;nbsp; My expectations are low, though.&lt;/P&gt;</description>
      <pubDate>Sun, 12 Feb 2017 23:14:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Question-about-latency/m-p/1081587#M5709</guid>
      <dc:creator>Todd_W_</dc:creator>
      <dc:date>2017-02-12T23:14:13Z</dc:date>
    </item>
  </channel>
</rss>

