<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why do programs using AVX instructions degrade more severely under cold-start conditions? in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Why-do-programs-using-AVX-instructions-degrade-more-severely/m-p/1440451#M8134</link>
    <description>&lt;P&gt;As described in &lt;A href="https://www.agner.org/optimize/blog/read.php?i=415#427" target="_self"&gt;Agner's blog&lt;/A&gt; and on &lt;A href="https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)" target="_self"&gt;WikiChip&lt;/A&gt;, there is a warm-up phase during which AVX instructions cannot execute at full rate.&lt;BR /&gt;My tests are consistent with this explanation.&lt;BR /&gt;Is there any official Intel documentation that describes this warm-up phase?&lt;/P&gt;</description>
    <pubDate>Wed, 21 Dec 2022 12:07:37 GMT</pubDate>
    <dc:creator>Jipeng-Zhang</dc:creator>
    <dc:date>2022-12-21T12:07:37Z</dc:date>
    <item>
      <title>Why do programs using AVX instructions degrade more severely under cold-start conditions?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Why-do-programs-using-AVX-instructions-degrade-more-severely/m-p/1439750#M8133</link>
      <description>&lt;P&gt;I'm implementing elliptic curve cryptography algorithms (namely X25519 and Ed25519) using AVX-512 IFMA instructions.&lt;BR /&gt;When I compared my vectorized implementation with an x86-64 assembly version, I found that the vectorized implementation suffers a severe slowdown under cold-start conditions.&lt;BR /&gt;In a warm-start test, the function is executed 1000 times to warm the instruction and data caches before I start recording CPU cycles (CC).&lt;BR /&gt;In a cold-start test, the function is executed directly, without warming the caches, and its CC is recorded.&lt;/P&gt;
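&lt;P&gt;The two test modes can be sketched roughly as follows. This is only an illustration of the methodology in Python, not the harness that produced the numbers in this post: it reports wall-clock nanoseconds from `time.perf_counter_ns` instead of CPU cycles, and the names `measure_cold_and_warm` and `workload` are hypothetical.&lt;/P&gt;
&lt;PRE&gt;
```python
import time


def measure_cold_and_warm(fn, warmup_runs=1000):
    """Time one call of fn before and after a warm-up loop.

    Returns (cold_ns, warm_ns). Wall-clock nanoseconds stand in for the
    CPU cycle counts (CC) discussed in the post.
    """
    # Cold start: time the very first call, with nothing warmed up.
    start = time.perf_counter_ns()
    fn()
    cold_ns = time.perf_counter_ns() - start

    # Warm-up: run the function repeatedly so the caches are populated.
    for _ in range(warmup_runs):
        fn()

    # Warm start: time one more call after the warm-up loop.
    start = time.perf_counter_ns()
    fn()
    warm_ns = time.perf_counter_ns() - start
    return cold_ns, warm_ns


def workload():
    # Hypothetical stand-in for the function under test.
    return sum(i * i for i in range(10_000))


cold_ns, warm_ns = measure_cold_and_warm(workload)
print(f"cold: {cold_ns} ns, warm: {warm_ns} ns")
```
&lt;/PRE&gt;
&lt;P&gt;A real measurement of native code would read a cycle counter around the call and would normally repeat each mode many times, since a single timed call is noisy.&lt;/P&gt;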
&lt;P&gt;My tests show that the x86-64 assembly implementation slows down very little under cold-start conditions.&lt;BR /&gt;The vectorized implementation, however, slows down by a factor of 2 to 3: its CC under cold-start conditions is about 2 to 3 times its CC under warm-start conditions.&lt;/P&gt;
&lt;P&gt;To find the cause of this problem, I tried the following.&lt;/P&gt;
&lt;P&gt;Attempt 1: Measure the code size.&lt;BR /&gt;The code sizes of the vectorized implementation and the x86-64 implementation are both close to 32KB (which is also the size of the L1I cache).&lt;BR /&gt;I also measured their L1I misses with the perf tool: about 5320 for the vectorized implementation and about 5100 for the x86-64 implementation. The gap between these two numbers is small, so L1I misses are not enough to explain the problem.&lt;/P&gt;
&lt;P&gt;Attempt 2: Top-down analysis.&lt;BR /&gt;Under cold-start conditions, the bottlenecks of both implementations are the CPU front-end and the CPU back-end, in similar proportions. The CPU front-end share is 31.3% and 31.3%, and the CPU back-end share is 34% and 28.1%, for the vectorized implementation and the x86-64 implementation, respectively.&lt;BR /&gt;The top-down results therefore cannot explain the problem either.&lt;/P&gt;
&lt;P&gt;Attempt 3: Analyze the instruction encoding.&lt;BR /&gt;For x86-64 assembly instructions, the encoding length of `add %al, (%rax)` is 2 bytes, and the encoding length of `add %al, 0x53ab6345(%rdx)` is 6 bytes.&lt;BR /&gt;For AVX-512 instructions, the encoding length of `vpaddq (%rdx), %zmm7, %zmm4` is 6 bytes, and the encoding length of `vpaddq 0x40(%rdx), %zmm5, %zmm3` is 7 bytes.&lt;BR /&gt;On average, the encoding of an AVX-512 instruction is longer than that of an x86-64 instruction.&lt;BR /&gt;But this observation seems to be of little significance, because the total code sizes are close.&lt;/P&gt;
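&lt;P&gt;The byte counts above can be reproduced by adding up the parts of each encoding. The sketch below is a simplified model that covers only these four examples: it assumes a one-byte opcode, a ModRM byte, no SIB byte and no immediate, and a 4-byte EVEX prefix for the AVX-512 instructions. The helper name `encoding_length` is hypothetical.&lt;/P&gt;
&lt;PRE&gt;
```python
def encoding_length(prefix_bytes, disp_bytes, opcode_bytes=1, modrm_bytes=1):
    """Sum the byte counts of the parts of one instruction encoding."""
    return prefix_bytes + opcode_bytes + modrm_bytes + disp_bytes


EVEX_PREFIX_BYTES = 4  # EVEX prefix: the 0x62 byte plus three payload bytes

# (instruction, prefix bytes, displacement bytes) for the four examples
examples = [
    ("add %al, (%rax)", 0, 0),
    ("add %al, 0x53ab6345(%rdx)", 0, 4),
    ("vpaddq (%rdx), %zmm7, %zmm4", EVEX_PREFIX_BYTES, 0),
    ("vpaddq 0x40(%rdx), %zmm5, %zmm3", EVEX_PREFIX_BYTES, 1),
]

lengths = {asm: encoding_length(prefix, disp) for asm, prefix, disp in examples}
for asm, length in lengths.items():
    print(f"{length} bytes: {asm}")
```
&lt;/PRE&gt;
&lt;P&gt;In this model the EVEX prefix alone accounts for the extra 4 bytes of the register-indirect `vpaddq` compared with the register-indirect `add`.&lt;/P&gt;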
&lt;P&gt;At this point I don't know how to explain this behavior.&lt;BR /&gt;I realize this may be a complicated question, but I would be very grateful for any ideas.&lt;/P&gt;
&lt;P&gt;I can provide more relevant data if required.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Dec 2022 13:30:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Why-do-programs-using-AVX-instructions-degrade-more-severely/m-p/1439750#M8133</guid>
      <dc:creator>Jipeng-Zhang</dc:creator>
      <dc:date>2022-12-19T13:30:18Z</dc:date>
    </item>
    <item>
      <title>Re: Why do programs using AVX instructions degrade more severely under cold-start conditions?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Why-do-programs-using-AVX-instructions-degrade-more-severely/m-p/1440451#M8134</link>
      <description>&lt;P&gt;As described in &lt;A href="https://www.agner.org/optimize/blog/read.php?i=415#427" target="_self"&gt;Agner's blog&lt;/A&gt; and on &lt;A href="https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)" target="_self"&gt;WikiChip&lt;/A&gt;, there is a warm-up phase during which AVX instructions cannot execute at full rate.&lt;BR /&gt;My tests are consistent with this explanation.&lt;BR /&gt;Is there any official Intel documentation that describes this warm-up phase?&lt;/P&gt;</description>
      <pubDate>Wed, 21 Dec 2022 12:07:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Why-do-programs-using-AVX-instructions-degrade-more-severely/m-p/1440451#M8134</guid>
      <dc:creator>Jipeng-Zhang</dc:creator>
      <dc:date>2022-12-21T12:07:37Z</dc:date>
    </item>
  </channel>
</rss>

