<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic AVX-512 array transformation slower when transforming in batches of 8 compared to 7 or 9 in Mobile and Desktop Processors</title>
    <link>https://community.intel.com/t5/Mobile-and-Desktop-Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We are coming from &lt;A href="https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe&lt;/A&gt;, where we could not find the root cause. That thread already contains a few ideas about what the issue might be, but we still cannot explain the effect.&lt;/P&gt;
&lt;P&gt;I will therefore repost the question here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please consider the following minimal example &lt;CODE&gt;minimal.cpp&lt;/CODE&gt; (&lt;A href="https://godbolt.org/z/qbW7q7xMa" target="_self"&gt;https://godbolt.org/z/qbW7q7xMa&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;immintrin.h&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;ctime&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

#define NUMBER_OF_TUPLES 134'217'728UL

void transform_7(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 7) {
    size_t endOfBatch = std::min(startOfBatch + 7, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
      idx++;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_8(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 8) {
    size_t endOfBatch = std::min(startOfBatch + 8, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
      auto _converted = _mm512_cvtepu64_pd(_loaded);

      _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
      idx += 8;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_9(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 9) {
    size_t endOfBatch = std::min(startOfBatch + 9, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      if (endOfBatch - idx &amp;gt;= 8) {
        auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
        idx += 8;
      } else {
        output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

template &amp;lt;size_t batch_size&amp;gt;
void do_benchmark() {
  auto* input = (int64_t*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t));
  auto* output = (double*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double));

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    input[i] = i;
  }

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    output[i] = 0;
  }

  asm volatile("" : : "r,m"(input) : "memory");
  asm volatile("" : : "r,m"(output) : "memory");

  auto t = std::clock();

  if constexpr (batch_size == 7) {
    transform_7(input, output);
  } else if constexpr (batch_size == 8) {
    transform_8(input, output);
  } else {
    transform_9(input, output);
  }

  auto elapsed = std::clock() - t;

  std::cout &amp;lt;&amp;lt; "Elapsed time for a batch size of " &amp;lt;&amp;lt; batch_size &amp;lt;&amp;lt; ": " &amp;lt;&amp;lt; elapsed &amp;lt;&amp;lt; std::endl;
}

int main() {
  do_benchmark&amp;lt;7&amp;gt;();
  do_benchmark&amp;lt;8&amp;gt;();
  do_benchmark&amp;lt;9&amp;gt;();
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It transforms the &lt;CODE&gt;input&lt;/CODE&gt; array of &lt;CODE&gt;int64_t&lt;/CODE&gt; into the &lt;CODE&gt;output&lt;/CODE&gt; array of &lt;CODE&gt;double&lt;/CODE&gt; in batches of a given &lt;CODE&gt;batch_size&lt;/CODE&gt;. Whenever 8 or more tuples remain, we use the following AVX-512 intrinsics to process them all at once and thereby increase performance.&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _loaded = _mm512_loadu_epi64(&amp;amp;(*input)[idx]);
&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&amp;amp;(*output)[idx], _converted);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Otherwise, we fall back to the scalar implementation.&lt;/P&gt;
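For reference, the three transform functions above differ only in the batch width and in whether an 8-wide vectorized body is available. The common batching-with-tail pattern can be sketched scalar-only (hypothetical helper name, not from the original code) as:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar-only sketch of the batched transform: walk the input in batches of
// `batch_size` and convert int64_t -> double one element at a time.
// In the real code, an AVX-512 body replaces the inner loop whenever
// 8 or more elements remain in the current batch.
std::vector<double> transform_batched(const std::vector<int64_t>& input,
                                      std::size_t batch_size) {
  std::vector<double> output(input.size(), 0.0);
  for (std::size_t start = 0; start < input.size(); start += batch_size) {
    std::size_t end = std::min(start + batch_size, input.size());
    for (std::size_t idx = start; idx < end; ++idx) {
      output[idx] = static_cast<double>(input[idx]);
    }
  }
  return output;
}
```

With a batch size of 8, every batch takes only the 8-wide path; with 7, only the scalar path; with 9, the 8-wide path followed by a 1-element scalar tail.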
&lt;P&gt;To make sure that the compiler doesn't collapse the two loops, we insert an &lt;CODE&gt;asm volatile("" : : "r,m"(output) : "memory")&lt;/CODE&gt; barrier, so that the output data is considered flushed after each batch.&lt;/P&gt;
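The empty asm statement relies on GCC/Clang extended inline assembly: the "memory" clobber tells the compiler that arbitrary memory may be read or written at that point, so pending stores must be materialized before the barrier and the loops cannot be fused across it. A minimal standalone sketch of the idiom (hypothetical function name, GCC/Clang only):

```cpp
#include <cstddef>
#include <cstdint>

// Fill out[0..n) with squares, with a compiler barrier after each store.
// The empty asm with a "memory" clobber is the same idiom as in the post:
// the compiler must assume the asm may read `out`, so it cannot defer,
// elide, or reorder the store past the barrier, and cannot fuse iterations.
void squares_with_barrier(int64_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<int64_t>(i) * static_cast<int64_t>(i);
    asm volatile("" : : "r,m"(out) : "memory");  // compiler barrier only
  }
}
```

Note that this is purely a compile-time barrier; it emits no instructions and does not order memory with respect to other CPU cores.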
&lt;P&gt;It is compiled and executed on an &lt;CODE&gt;Intel(R) Xeon(R) Gold 5220R CPU&lt;/CODE&gt; using&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;clang++ -std=c++20 -march=cascadelake -O3 minimal.cpp -o minimal
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Executing the code, however, results in the following surprising output&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;Elapsed time for a batch size of 7: 200119&lt;BR /&gt;Elapsed time for a batch size of 8: 479755&lt;BR /&gt;Elapsed time for a batch size of 9: 216272
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It shows that, for some reason, the code is roughly 2x slower with a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 8, while both a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 7 and of 9 are significantly faster.&lt;/P&gt;
&lt;P&gt;This is surprising to me, since a batch size of 8 should be the perfect configuration: it only has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so significantly slower, though?&lt;/P&gt;</description>
    <pubDate>Fri, 28 Oct 2022 07:02:59 GMT</pubDate>
    <dc:creator>bdaase</dc:creator>
    <dc:date>2022-10-28T07:02:59Z</dc:date>
    <item>
      <title>AVX-512 array transformation slower when transforming in batches of 8 compared to 7 or 9</title>
      <link>https://community.intel.com/t5/Mobile-and-Desktop-Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We are coming from &lt;A href="https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe&lt;/A&gt;, where we could not find the root cause. That thread already contains a few ideas about what the issue might be, but we still cannot explain the effect.&lt;/P&gt;
&lt;P&gt;I will therefore repost the question here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please consider the following minimal example &lt;CODE&gt;minimal.cpp&lt;/CODE&gt; (&lt;A href="https://godbolt.org/z/qbW7q7xMa" target="_self"&gt;https://godbolt.org/z/qbW7q7xMa&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;immintrin.h&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;ctime&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

#define NUMBER_OF_TUPLES 134'217'728UL

void transform_7(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 7) {
    size_t endOfBatch = std::min(startOfBatch + 7, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
      idx++;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_8(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 8) {
    size_t endOfBatch = std::min(startOfBatch + 8, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
      auto _converted = _mm512_cvtepu64_pd(_loaded);

      _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
      idx += 8;
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

void transform_9(int64_t* input, double* output) {
  for (size_t startOfBatch = 0; startOfBatch &amp;lt; NUMBER_OF_TUPLES; startOfBatch += 9) {
    size_t endOfBatch = std::min(startOfBatch + 9, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx &amp;lt; endOfBatch;) {
      if (endOfBatch - idx &amp;gt;= 8) {
        auto _loaded = _mm512_loadu_epi64(&amp;amp;input[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&amp;amp;output[idx], _converted);
        idx += 8;
      } else {
        output[idx] = static_cast&amp;lt;double&amp;gt;(input[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output) : "memory");
  }
}

template &amp;lt;size_t batch_size&amp;gt;
void do_benchmark() {
  auto* input = (int64_t*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t));
  auto* output = (double*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double));

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    input[i] = i;
  }

  for (size_t i = 0; i &amp;lt; NUMBER_OF_TUPLES; ++i) {
    output[i] = 0;
  }

  asm volatile("" : : "r,m"(input) : "memory");
  asm volatile("" : : "r,m"(output) : "memory");

  auto t = std::clock();

  if constexpr (batch_size == 7) {
    transform_7(input, output);
  } else if constexpr (batch_size == 8) {
    transform_8(input, output);
  } else {
    transform_9(input, output);
  }

  auto elapsed = std::clock() - t;

  std::cout &amp;lt;&amp;lt; "Elapsed time for a batch size of " &amp;lt;&amp;lt; batch_size &amp;lt;&amp;lt; ": " &amp;lt;&amp;lt; elapsed &amp;lt;&amp;lt; std::endl;
}

int main() {
  do_benchmark&amp;lt;7&amp;gt;();
  do_benchmark&amp;lt;8&amp;gt;();
  do_benchmark&amp;lt;9&amp;gt;();
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It transforms the &lt;CODE&gt;input&lt;/CODE&gt; array of &lt;CODE&gt;int64_t&lt;/CODE&gt; into the &lt;CODE&gt;output&lt;/CODE&gt; array of &lt;CODE&gt;double&lt;/CODE&gt; in batches of a given &lt;CODE&gt;batch_size&lt;/CODE&gt;. Whenever 8 or more tuples remain, we use the following AVX-512 intrinsics to process them all at once and thereby increase performance.&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _loaded = _mm512_loadu_epi64(&amp;amp;(*input)[idx]);
&lt;SPAN class="hljs-keyword"&gt;auto&lt;/SPAN&gt; _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&amp;amp;(*output)[idx], _converted);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Otherwise, we fall back to the scalar implementation.&lt;/P&gt;
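For reference, the three transform functions above differ only in the batch width and in whether an 8-wide vectorized body is available. The common batching-with-tail pattern can be sketched scalar-only (hypothetical helper name, not from the original code) as:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar-only sketch of the batched transform: walk the input in batches of
// `batch_size` and convert int64_t -> double one element at a time.
// In the real code, an AVX-512 body replaces the inner loop whenever
// 8 or more elements remain in the current batch.
std::vector<double> transform_batched(const std::vector<int64_t>& input,
                                      std::size_t batch_size) {
  std::vector<double> output(input.size(), 0.0);
  for (std::size_t start = 0; start < input.size(); start += batch_size) {
    std::size_t end = std::min(start + batch_size, input.size());
    for (std::size_t idx = start; idx < end; ++idx) {
      output[idx] = static_cast<double>(input[idx]);
    }
  }
  return output;
}
```

With a batch size of 8, every batch takes only the 8-wide path; with 7, only the scalar path; with 9, the 8-wide path followed by a 1-element scalar tail.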
&lt;P&gt;To make sure that the compiler doesn't collapse the two loops, we insert an &lt;CODE&gt;asm volatile("" : : "r,m"(output) : "memory")&lt;/CODE&gt; barrier, so that the output data is considered flushed after each batch.&lt;/P&gt;
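The empty asm statement relies on GCC/Clang extended inline assembly: the "memory" clobber tells the compiler that arbitrary memory may be read or written at that point, so pending stores must be materialized before the barrier and the loops cannot be fused across it. A minimal standalone sketch of the idiom (hypothetical function name, GCC/Clang only):

```cpp
#include <cstddef>
#include <cstdint>

// Fill out[0..n) with squares, with a compiler barrier after each store.
// The empty asm with a "memory" clobber is the same idiom as in the post:
// the compiler must assume the asm may read `out`, so it cannot defer,
// elide, or reorder the store past the barrier, and cannot fuse iterations.
void squares_with_barrier(int64_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<int64_t>(i) * static_cast<int64_t>(i);
    asm volatile("" : : "r,m"(out) : "memory");  // compiler barrier only
  }
}
```

Note that this is purely a compile-time barrier; it emits no instructions and does not order memory with respect to other CPU cores.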
&lt;P&gt;It is compiled and executed on an &lt;CODE&gt;Intel(R) Xeon(R) Gold 5220R CPU&lt;/CODE&gt; using&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;clang++ -std=c++20 -march=cascadelake -O3 minimal.cpp -o minimal
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Executing the code, however, results in the following surprising output&lt;/P&gt;
&lt;PRE class="lang-cpp s-code-block"&gt;&lt;CODE class="hljs language-cpp"&gt;Elapsed time for a batch size of 7: 200119&lt;BR /&gt;Elapsed time for a batch size of 8: 479755&lt;BR /&gt;Elapsed time for a batch size of 9: 216272
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It shows that, for some reason, the code is roughly 2x slower with a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 8, while both a &lt;CODE&gt;batch_size&lt;/CODE&gt; of 7 and of 9 are significantly faster.&lt;/P&gt;
&lt;P&gt;This is surprising to me, since a batch size of 8 should be the perfect configuration: it only has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so significantly slower, though?&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2022 07:02:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Mobile-and-Desktop-Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776</guid>
      <dc:creator>bdaase</dc:creator>
      <dc:date>2022-10-28T07:02:59Z</dc:date>
    </item>
  </channel>
</rss>

