<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic performance degradation after spin wait in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/performance-degradation-after-spin-wait/m-p/1459357#M8149</link>
    <description>&lt;P&gt;Hi, I'm tuning my program for low latency.&lt;/P&gt;
&lt;P&gt;I have a tight calculation function, calc(), which uses SIMD floating-point instructions heavily.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I measured the performance of calc() with the perf command. It shows that calc() takes ~10k instructions and ~5k CPU cycles on average.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, when I call calc() after a spin-wait like this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;while(true) {
  if (!flag.load(std::memory_order_acquire)) {
    continue;
  }

  calc();
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;calc() takes about 10k cycles, while other perf counters such as `l1d-cache-misses`, `llc-misses`, `branch-misses` and `instructions` remain the same.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Can anyone help explain why this happens and what I should do to avoid it? I want to keep calc() as fast as possible.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, I have 2 interesting findings:&lt;/P&gt;
&lt;P&gt;1. If the flag is set within a very short period (less than 1 ms), I see no performance degradation in calc().&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. If I add some throwaway SIMD floating-point calculation inside the spin-wait loop, calc() achieves the expected performance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sat, 25 Feb 2023 15:16:52 GMT</pubDate>
    <dc:creator>VariantF</dc:creator>
    <dc:date>2023-02-25T15:16:52Z</dc:date>
    <item>
      <title>performance degradation after spin wait</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/performance-degradation-after-spin-wait/m-p/1459357#M8149</link>
      <description>&lt;P&gt;Hi, I'm tuning my program for low latency.&lt;/P&gt;
&lt;P&gt;I have a tight calculation function, calc(), which uses SIMD floating-point instructions heavily.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I measured the performance of calc() with the perf command. It shows that calc() takes ~10k instructions and ~5k CPU cycles on average.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, when I call calc() after a spin-wait like this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;while(true) {
  if (!flag.load(std::memory_order_acquire)) {
    continue;
  }

  calc();
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;calc() takes about 10k cycles, while other perf counters such as `l1d-cache-misses`, `llc-misses`, `branch-misses` and `instructions` remain the same.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Can anyone help explain why this happens and what I should do to avoid it? I want to keep calc() as fast as possible.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, I have 2 interesting findings:&lt;/P&gt;
&lt;P&gt;1. If the flag is set within a very short period (less than 1 ms), I see no performance degradation in calc().&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. If I add some throwaway SIMD floating-point calculation inside the spin-wait loop, calc() achieves the expected performance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Feb 2023 15:16:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/performance-degradation-after-spin-wait/m-p/1459357#M8149</guid>
      <dc:creator>VariantF</dc:creator>
      <dc:date>2023-02-25T15:16:52Z</dc:date>
    </item>
    <item>
      <title>Re: performance degradation after spin wait</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/performance-degradation-after-spin-wait/m-p/1459359#M8150</link>
      <description>&lt;P&gt;My CPU is a 13900K. I also tested on a 12900K and on Ice Lake CPUs such as the Xeon 8368. They appear to show the same behaviour.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I noticed in the `Optimization Reference Manual` that there is something called `Thread Director`, which can automatically detect thread classes at runtime, and that there is a special class called `Pause (spin-wait) dominated code`. I don't know if this is related, but it looks as if, after some period of time, the CPU detects that the thread is in a spin-wait loop and then reduces the resources allocated to it?&lt;/P&gt;</description>
      <pubDate>Sat, 25 Feb 2023 15:22:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/performance-degradation-after-spin-wait/m-p/1459359#M8150</guid>
      <dc:creator>VariantF</dc:creator>
      <dc:date>2023-02-25T15:22:21Z</dc:date>
    </item>
  </channel>
</rss>

