Hi all,
We are coming from https://stackoverflow.com/questions/74069410/why-is-transforming-an-array-using-avx-512-instructions-significantly-slower-whe, where we could not find the root cause of the problem. The thread already contains a few ideas about what the issue might be, but we still cannot explain the effect, so I am reposting the question here.
Please consider the following minimal example, minimal.cpp (https://godbolt.org/z/qbW7q7xMa):
#include <immintrin.h>

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

#define NUMBER_OF_TUPLES 134'217'728UL

// Scalar conversion in batches of 7.
void transform_7(int64_t* input, double* output) {
    for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += 7) {
        size_t endOfBatch = std::min(startOfBatch + 7, NUMBER_OF_TUPLES);
        for (size_t idx = startOfBatch; idx < endOfBatch;) {
            output[idx] = static_cast<double>(input[idx]);
            idx++;
        }
        asm volatile("" : : "r,m"(output) : "memory");
    }
}

// AVX-512 conversion in batches of 8: exactly one full vector per batch.
void transform_8(int64_t* input, double* output) {
    for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += 8) {
        size_t endOfBatch = std::min(startOfBatch + 8, NUMBER_OF_TUPLES);
        for (size_t idx = startOfBatch; idx < endOfBatch;) {
            auto _loaded = _mm512_loadu_epi64(&input[idx]);
            auto _converted = _mm512_cvtepu64_pd(_loaded);
            _mm512_storeu_epi64(&output[idx], _converted);
            idx += 8;
        }
        asm volatile("" : : "r,m"(output) : "memory");
    }
}

// Batches of 9: one AVX-512 step plus one scalar element per batch.
void transform_9(int64_t* input, double* output) {
    for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += 9) {
        size_t endOfBatch = std::min(startOfBatch + 9, NUMBER_OF_TUPLES);
        for (size_t idx = startOfBatch; idx < endOfBatch;) {
            if (endOfBatch - idx >= 8) {
                auto _loaded = _mm512_loadu_epi64(&input[idx]);
                auto _converted = _mm512_cvtepu64_pd(_loaded);
                _mm512_storeu_epi64(&output[idx], _converted);
                idx += 8;
            } else {
                output[idx] = static_cast<double>(input[idx]);
                idx++;
            }
        }
        asm volatile("" : : "r,m"(output) : "memory");
    }
}

template <size_t batch_size>
void do_benchmark() {
    auto* input = (int64_t*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t));
    auto* output = (double*)aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double));
    for (size_t i = 0; i < NUMBER_OF_TUPLES; ++i) {
        input[i] = i;
    }
    for (size_t i = 0; i < NUMBER_OF_TUPLES; ++i) {
        output[i] = 0;
    }
    asm volatile("" : : "r,m"(input) : "memory");
    asm volatile("" : : "r,m"(output) : "memory");
    auto t = std::clock();
    if constexpr (batch_size == 7) {
        transform_7(input, output);
    } else if constexpr (batch_size == 8) {
        transform_8(input, output);
    } else {
        transform_9(input, output);
    }
    auto elapsed = std::clock() - t;
    std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;
    free(input);
    free(output);
}

int main() {
    do_benchmark<7>();
    do_benchmark<8>();
    do_benchmark<9>();
}
It transforms the input array of int64_t to the output array of double in batches of a given batch_size. We have inserted the following AVX-512 intrinsics for the case in which there are 8 or more tuples left in the batch, so that they are processed all at once, which should increase performance:
auto _loaded = _mm512_loadu_epi64(&input[idx]);
auto _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&output[idx], _converted);
Otherwise, we fall back to the scalar implementation.
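As an aside, the tail would not strictly require a scalar loop: AVX-512 masking could handle it in one shot. A minimal sketch of that alternative (not what the benchmark above uses, and the function name is ours, for illustration only):

#include <immintrin.h>
#include <cstdint>

// Hypothetical masked tail: convert the last `remaining` (1..7) elements with a
// single masked load/convert/store. Lanes outside the mask are neither read
// nor written, so no out-of-bounds access occurs.
void convert_tail(const int64_t* input, double* output, size_t remaining) {
    const __mmask8 mask = static_cast<__mmask8>((1u << remaining) - 1);
    __m512i loaded = _mm512_maskz_loadu_epi64(mask, input);
    __m512d converted = _mm512_cvtepu64_pd(loaded);
    _mm512_mask_storeu_pd(output, mask, converted);
}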
To make sure that the compiler doesn't collapse the two loops, we use the asm volatile("" : : "r,m"(output) : "memory") statement after each batch: its "memory" clobber forces the compiler to assume that arbitrary memory may be read or written at that point, so the output data has to be flushed before the next batch starts.
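In isolation, the idiom looks like this (a minimal sketch; the function name is ours):

#include <cstddef>

// The empty asm body emits no instructions, but the "memory" clobber tells the
// compiler that arbitrary memory may be read or written here. Stores to buf
// therefore have to be completed before the barrier and reloaded after it, and
// the two loops cannot be fused across it.
void barrier_demo(double* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) buf[i] = static_cast<double>(i);
    asm volatile("" : : "r,m"(buf) : "memory");
    for (std::size_t i = 0; i < n; ++i) buf[i] *= 2.0;
}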
It is compiled and executed on an Intel(R) Xeon(R) Gold 5220R CPU using

clang++ -std=c++20 -march=cascadelake -O3 minimal.cpp -o minimal
Executing the code, however, results in the following surprising output:
Elapsed time for a batch size of 7: 200119
Elapsed time for a batch size of 8: 479755
Elapsed time for a batch size of 9: 216272
It shows that, for some reason, the code is about 2x slower with a batch_size of 8, while both a batch_size of 7 and a batch_size of 9 are significantly faster. (std::clock reports ticks of CLOCKS_PER_SEC, which POSIX fixes at 1'000'000, so the runs take roughly 0.20 s, 0.48 s, and 0.22 s.) This is surprising to me, since a batch size of 8 should be the perfect configuration: it only has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so significantly slower, though?
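For completeness, one follow-up experiment we could still run, assuming (and this is only a guess on our side) that the effect has to do with the two 1 GiB buffers being laid out at the same offset modulo the 4 KiB page size: shift the output buffer by one cache line and re-time the batch-size-8 case.

// Hypothetical de-aliasing experiment: over-allocate the output buffer and
// offset it by 64 bytes, so that output[idx] no longer shares its address
// modulo 4096 with input[idx]. If 4K aliasing between the wide stores and the
// subsequent loads contributes to the slowdown, this offset should narrow the
// gap; if the timings stay the same, that idea can be ruled out.
auto* raw = static_cast<char*>(aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double) + 64));
auto* output = reinterpret_cast<double*>(raw + 64);  // still 64-byte aligned, different page offset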