<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic AVX-512 array transformation slower when transforming in batches of 8 compared to 7 or 9 in Mobile and Desktop Processors</title>
    <link>https://community.intel.com/t5/Mobile-and-Desktop-Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We are coming from &lt;A href="https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe&lt;/A&gt;, where we could not find the root cause. That thread already contains a few ideas about what the issue might be, but we still cannot explain the effect.&lt;/P&gt;
&lt;P&gt;I will therefore repost the question here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please consider the following minimal example &lt;CODE&gt;minimal.cpp&lt;/CODE&gt; (&lt;A href="https://godbolt.org/z/qbW7q7xMa" target="_self"&gt;https://godbolt.org/z/qbW7q7xMa&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;immintrin.h&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;ctime&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

#define NUMBER_OF_TUPLES 134'217'728UL

void transform_7(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 7) {
    size_t endOfBatch = std::min(startOfBatch + 7, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
      idx++;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_8(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 8) {
    size_t endOfBatch = std::min(startOfBatch + 8, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
      auto _converted = _mm512_cvtepu64_pd(_loaded);

      _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
      idx += 8;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_9(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 9) {
    size_t endOfBatch = std::min(startOfBatch + 9, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      if (endOfBatch - idx &amp;gt;= 8) {
        auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
        idx += 8;
      } else {
        output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

template &amp;lt;size_t batch_size&amp;gt;
void do_benchmark() {
  auto* input = (int64_t*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t));
  auto* output = (double*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double));

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    input[i] = i;
  }

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    output[i] = 0;
  }

  asm volatile("" : : "r,m"(input) : "memory");
  asm volatile("" : : "r,m"(output) : "memory");

  auto t = std::clock();

  if constexpr (batch_size == 7) {
    transform_7(input, output);
  } else if constexpr (batch_size == 8) {
    transform_8(input, output);
  } else {
    transform_9(input, output);
  }

  auto elapsed = std::clock() - t;

  std::cout &amp;lt;&amp;lt; "Elapsed time for a batch size of " &amp;lt;&amp;lt; batch_size &amp;lt;&amp;lt; ": " &amp;lt;&amp;lt; elapsed &amp;lt;&amp;lt; std::endl;
}

int main() {
  do_benchmark&amp;lt;7&amp;gt;();
  do_benchmark&amp;lt;8&amp;gt;();
  do_benchmark&amp;lt;9&amp;gt;();
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It transforms the &lt;CODE&gt;input&lt;/CODE&gt; array of &lt;CODE&gt;int64_t&lt;/CODE&gt; into the &lt;CODE&gt;output&lt;/CODE&gt; array of &lt;CODE&gt;double&lt;/CODE&gt; in batches of a given &lt;CODE&gt;batch_size&lt;/CODE&gt;. Whenever 8 or more tuples remain, we use the following AVX-512 intrinsics to process them all at once and thereby increase performance.&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _loaded = _mm512_loadu_epi64(&amp;amp;(*input)[idx]);
&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&amp;amp;(*output)[idx], _converted);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Otherwise, we fall back to the scalar implementation.&lt;/P&gt;
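For reference, the three transform functions above differ only in the batch width and in whether an 8-wide vectorized body is available. The common batching-with-tail pattern can be sketched scalar-only (hypothetical helper name, not from the original code) as:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar-only sketch of the batched transform: walk the input in batches of
// `batch_size` and convert int64_t -> double one element at a time.
// In the real code, an AVX-512 body replaces the inner loop whenever
// 8 or more elements remain in the current batch.
std::vector<double> transform_batched(const std::vector<int64_t>& input,
                                      std::size_t batch_size) {
  std::vector<double> output(input.size(), 0.0);
  for (std::size_t start = 0; start < input.size(); start += batch_size) {
    std::size_t end = std::min(start + batch_size, input.size());
    for (std::size_t idx = start; idx < end; ++idx) {
      output[idx] = static_cast<double>(input[idx]);
    }
  }
  return output;
}
```

With a batch size of 8, every batch takes only the 8-wide path; with 7, only the scalar path; with 9, the 8-wide path followed by a 1-element scalar tail.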
&lt;P&gt;To make sure that the compiler doesn't collapse the two loops, we insert an &lt;CODE&gt;asm volatile("" : : "r,m"(output) : "memory")&lt;/CODE&gt; barrier, so that the output data is considered flushed after each batch.&lt;/P&gt;
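The empty asm statement relies on GCC/Clang extended inline assembly: the "memory" clobber tells the compiler that arbitrary memory may be read or written at that point, so pending stores must be materialized before the barrier and the loops cannot be fused across it. A minimal standalone sketch of the idiom (hypothetical function name, GCC/Clang only):

```cpp
#include <cstddef>
#include <cstdint>

// Fill out[0..n) with squares, with a compiler barrier after each store.
// The empty asm with a "memory" clobber is the same idiom as in the post:
// the compiler must assume the asm may read `out`, so it cannot defer,
// elide, or reorder the store past the barrier, and cannot fuse iterations.
void squares_with_barrier(int64_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<int64_t>(i) * static_cast<int64_t>(i);
    asm volatile("" : : "r,m"(out) : "memory");  // compiler barrier only
  }
}
```

Note that this is purely a compile-time barrier; it emits no instructions and does not order memory with respect to other CPU cores.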
&lt;P&gt;It is compiled and executed on an &lt;CODE&gt;Intel(R) Xeon(R) Gold 5220R CPU&lt;/CODE&gt; using&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;clang++ -std=c++20 -march=cascadelake -O3 minimal.cpp -o minimal
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Executing the code, however, results in the following surprising output&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;Elapsed time for a batch size of 7: 200119&lt;BR /&gt;Elapsed time for a batch size of 8: 479755&lt;BR /&gt;Elapsed time for a batch size of 9: 216272
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It shows that, for some reason, the code is roughly 2x slower with a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 8, while both a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 7 and of 9 are significantly faster.&lt;/P&gt;
&lt;P&gt;This is surprising to me, since a batch size of 8 should be the perfect configuration: it only has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so significantly slower, though?&lt;/P&gt;</description>
    <pubDate>Fri, 28 Oct 2022 07:02:59 GMT</pubDate>
    <dc:creator>bdaase</dc:creator>
    <dc:date>2022-10-28T07:02:59Z</dc:date>
    <item>
      <title>AVX-512 array transformation slower when transforming in batches of 8 compared to 7 or 9</title>
      <link>https://community.intel.com/t5/Mobile-and-Desktop-Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We are coming from &lt;A href="https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe&lt;/A&gt;, where we could not find the root cause. That thread already contains a few ideas about what the issue might be, but we still cannot explain the effect.&lt;/P&gt;
&lt;P&gt;I will therefore repost the question here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please consider the following minimal example &lt;CODE&gt;minimal.cpp&lt;/CODE&gt; (&lt;A href="https://godbolt.org/z/qbW7q7xMa" target="_self"&gt;https://godbolt.org/z/qbW7q7xMa&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;immintrin.h&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;ctime&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

#define NUMBER_OF_TUPLES 134'217'728UL

void transform_7(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 7) {
    size_t endOfBatch = std::min(startOfBatch + 7, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
      idx++;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_8(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 8) {
    size_t endOfBatch = std::min(startOfBatch + 8, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
      auto _converted = _mm512_cvtepu64_pd(_loaded);

      _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
      idx += 8;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_9(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 9) {
    size_t endOfBatch = std::min(startOfBatch + 9, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      if (endOfBatch - idx &amp;gt;= 8) {
        auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
        idx += 8;
      } else {
        output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

template &amp;lt;size_t batch_size&amp;gt;
void do_benchmark() {
  auto* input = (int64_t*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t));
  auto* output = (double*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double));

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    input[i] = i;
  }

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    output[i] = 0;
  }

  asm volatile("" : : "r,m"(input) : "memory");
  asm volatile("" : : "r,m"(output) : "memory");

  auto t = std::clock();

  if constexpr (batch_size == 7) {
    transform_7(input, output);
  } else if constexpr (batch_size == 8) {
    transform_8(input, output);
  } else {
    transform_9(input, output);
  }

  auto elapsed = std::clock() - t;

  std::cout &amp;lt;&amp;lt; "Elapsed time for a batch size of " &amp;lt;&amp;lt; batch_size &amp;lt;&amp;lt; ": " &amp;lt;&amp;lt; elapsed &amp;lt;&amp;lt; std::endl;
}

int main() {
  do_benchmark&amp;lt;7&amp;gt;();
  do_benchmark&amp;lt;8&amp;gt;();
  do_benchmark&amp;lt;9&amp;gt;();
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It transforms the &lt;CODE&gt;input&lt;/CODE&gt; array of &lt;CODE&gt;int64_t&lt;/CODE&gt; into the &lt;CODE&gt;output&lt;/CODE&gt; array of &lt;CODE&gt;double&lt;/CODE&gt; in batches of a given &lt;CODE&gt;batch_size&lt;/CODE&gt;. Whenever 8 or more tuples remain, we use the following AVX-512 intrinsics to process them all at once and thereby increase performance.&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _loaded = _mm512_loadu_epi64(&amp;amp;(*input)[idx]);
&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&amp;amp;(*output)[idx], _converted);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Otherwise, we fall back to the scalar implementation.&lt;/P&gt;
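For reference, the three transform functions above differ only in the batch width and in whether an 8-wide vectorized body is available. The common batching-with-tail pattern can be sketched scalar-only (hypothetical helper name, not from the original code) as:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar-only sketch of the batched transform: walk the input in batches of
// `batch_size` and convert int64_t -> double one element at a time.
// In the real code, an AVX-512 body replaces the inner loop whenever
// 8 or more elements remain in the current batch.
std::vector<double> transform_batched(const std::vector<int64_t>& input,
                                      std::size_t batch_size) {
  std::vector<double> output(input.size(), 0.0);
  for (std::size_t start = 0; start < input.size(); start += batch_size) {
    std::size_t end = std::min(start + batch_size, input.size());
    for (std::size_t idx = start; idx < end; ++idx) {
      output[idx] = static_cast<double>(input[idx]);
    }
  }
  return output;
}
```

With a batch size of 8, every batch takes only the 8-wide path; with 7, only the scalar path; with 9, the 8-wide path followed by a 1-element scalar tail.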
&lt;P&gt;To make sure that the compiler doesn't collapse the two loops, we insert an &lt;CODE&gt;asm volatile("" : : "r,m"(output) : "memory")&lt;/CODE&gt; barrier, so that the output data is considered flushed after each batch.&lt;/P&gt;
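The empty asm statement relies on GCC/Clang extended inline assembly: the "memory" clobber tells the compiler that arbitrary memory may be read or written at that point, so pending stores must be materialized before the barrier and the loops cannot be fused across it. A minimal standalone sketch of the idiom (hypothetical function name, GCC/Clang only):

```cpp
#include <cstddef>
#include <cstdint>

// Fill out[0..n) with squares, with a compiler barrier after each store.
// The empty asm with a "memory" clobber is the same idiom as in the post:
// the compiler must assume the asm may read `out`, so it cannot defer,
// elide, or reorder the store past the barrier, and cannot fuse iterations.
void squares_with_barrier(int64_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<int64_t>(i) * static_cast<int64_t>(i);
    asm volatile("" : : "r,m"(out) : "memory");  // compiler barrier only
  }
}
```

Note that this is purely a compile-time barrier; it emits no instructions and does not order memory with respect to other CPU cores.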
&lt;P&gt;It is compiled and executed on an &lt;CODE&gt;Intel(R) Xeon(R) Gold 5220R CPU&lt;/CODE&gt; using&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;clang++ -std=c++20 -march=cascadelake -O3 minimal.cpp -o minimal
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Executing the code, however, results in the following surprising output&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;Elapsed time for a batch size of 7: 200119&lt;BR /&gt;Elapsed time for a batch size of 8: 479755&lt;BR /&gt;Elapsed time for a batch size of 9: 216272
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It shows that, for some reason, the code is roughly 2x slower with a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 8, while both a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 7 and of 9 are significantly faster.&lt;/P&gt;
&lt;P&gt;This is surprising to me, since a batch size of 8 should be the perfect configuration: it only has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so significantly slower, though?&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2022 07:02:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Mobile-and-Desktop-Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776</guid>
      <dc:creator>bdaase</dc:creator>
      <dc:date>2022-10-28T07:02:59Z</dc:date>
    </item>
  </channel>
</rss>

