Processors
Intel® Processors, Tools, and Utilities
14656 Discussions

AVX-512 array transformation slower when transforming in batches of 8 compared to 7 or 9

bdaase
Novice
458 Views

Hi all,

 

We are coming from https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe, in which could not find the root cause of the question. It also already contains a few ideas of what could have been the issue, however we still cannot explain the effect.

I will therefore repost the question here.

 

Please consider the following minimal example minimal.cpp (https://godbolt.org/z/qbW7q7xMa).

 

 

 

#include <immintrin.h>
#include <ctime>

#include <algorithm>
#include <iostream>
#include <vector>

#define NUMBER_OF_TUPLES 134'217'728UL

void transform_7(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += 7) {
    size_t endOfBatch = std::min(startOfBatch + 7, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      output[idx] = static_cast<double>(input[idx]);
      idx++;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_8(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch +=  {
    size_t endOfBatch = std::min(startOfBatch + 8, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      auto _loaded = _mm512_loadu_epi64(&input[idx]);
      auto _converted = _mm512_cvtepu64_pd(_loaded);

      _mm512_storeu_epi64(&output[idx], _converted);
      idx += 8;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_9(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += 9) {
    size_t endOfBatch = std::min(startOfBatch + 9, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      if (endOfBatch - idx >=  {
        auto _loaded = _mm512_loadu_epi64(&input[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&output[idx], _converted);
        idx += 8;
      } else {
        output[idx] = static_cast<double>(input[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

template <size_t batch_size>
void do_benchmark() {
  auto* input = (int64_t*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t));
  auto* output = (double*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double));

  for (size_t i = 0; i < NUMBER_OF_TUPLES; ++i) {
    input[i] = i;
  }

  for (size_t i = 0; i < NUMBER_OF_TUPLES; ++i) {
    output[i] = 0;
  }

  asm volatile("" : : "r,m"(input) : "memory");
  asm volatile("" : : "r,m"(output) : "memory");

  auto t = std::clock();

  if constexpr (batch_size == 7) {
    transform_7(input, output);
  } else if constexpr (batch_size ==  {
    transform_8(input, output);
  } else {
    transform_9(input, output);
  }

  auto elapsed = std::clock() - t;

  std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;
}

int main() {
  do_benchmark<7>();
  do_benchmark<8>();
  do_benchmark<9>();
}

 

It transforms the input array of int64_t to the output array of double in batches of a given batch_size. We have inserted the following AVX-512 intrinsics in case there are more or equal than 8 tuples in the input, to process them all at once and therefore increase the performance.

auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
auto _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&(*output)[idx], _converted);

Otherwise, we fall back to the scalar implementation.

To make sure that the compiler doesn't collapse the two loops, we use the asm volatile("" : : "r,m"(output->data()) : "memory") call, to make sure that the output data is flushed after each batch.

It is compiled and executed on an Intel(R) Xeon(R) Gold 5220R CPU using

clang++ -std=c++20 -march=cascadelake -O3 minimal.cpp -o minimal

Executing the code, however, results in the following surprising output

Elapsed time for a batch size of 7: 200119
Elapsed time for a batch size of 8: 479755
Elapsed time for a batch size of 9: 216272

It shows, that for some reason, using a batch_size of 8, the code is 2x slower. However, both, using a batch_size of 7 or 9, is significantly faster.

This is surprising to me, since a batch size of 8 should be the perfect configuration, since it only has to use the AVX-512 instructions and can always perfectly process 64 Byte at a time. Why is this case so significantly slower, though?

Labels (1)
0 Kudos
0 Replies
Reply