Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Extremely poor vectorized code generation by Intel C++ compiler (AVX512)

Jakob__Wenzel
Beginner

FWIW, this is another attempt to draw some attention to an unfortunate vectorization-related issue with the Intel C++ compiler. I've tried escalating the problem via Intel Premier Support (issue #6000162527), where it seems to have died (still no technical feedback after two months).

I'm using the Intel C++ compiler to develop a highly vectorized graphics application for an AVX512 (KNL) machine. My application makes heavy use of a C++ template abstraction layer that facilitates writing vectorized code using the AoS (Array of Structures) approach.

Unfortunately, the current release (16.0.3) of ICPC generates exceedingly poor code for this approach, which has become a total blocker for my project. The C++ file at the bottom demonstrates some of my difficulties with a trivial example.

The most relevant part of the attached file is this function, which sums three streams of 3D vectors and stores the result in a fourth stream.

void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2, DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}

The file also contains all needed helper types, including a wrapper around __m512 ("Float16"), a class that stores a number of Float16s on the heap ("DynamicArray"), and a Vector3 class that can represent 3D vectors of DynamicArrays, Float16 instances, or pointers to Float16 instances. I've tried to make it as compact as possible while still demonstrating the poor code generation.

GCC and Clang generate excellent code for this. For instance, with GCC (g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s) the loop turns into:

L3:
  vmovaps (%r11,%rax), %zmm0
  vmovaps (%r10,%rax), %zmm1
  vaddps 0(%r13,%rax), %zmm0, %zmm0
  vmovaps (%r9,%rax), %zmm2
  vaddps (%r12,%rax), %zmm1, %zmm1
  vaddps (%rbx,%rax), %zmm2, %zmm2
  vaddps (%rsi,%rax), %zmm0, %zmm0
  vaddps (%rcx,%rax), %zmm1, %zmm1
  vaddps (%rdx,%rax), %zmm2, %zmm2
  vmovaps %zmm2, (%r14,%rax)
  vmovaps %zmm1, (%r15,%rax)
  vmovaps %zmm0, (%r8,%rax)
  leaq 64(%rax), %rax
  cmpq %rax, %rdi
  jne L3

i.e., 9 aligned loads, 6 adds, and 3 aligned stores, plus loop overhead. Great! This is exactly what I expect to get. For various reasons, though, I would like to use the Intel compiler for my project.

Here is the output from the Intel compiler for comparison (icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s):

# LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.3: # Preds L_B1.5 L_B1.2
  movq (%r9), %r13 #77.32 c1
  movq (%rsi), %r12 #77.18 c1
  addq %rax, %r13 #77.18 c5 stall 1
  movq 8(%rsi), %r11 #77.18 c5
  movq 16(%rsi), %r10 #77.18 c5
  addq %rax, %r12 #77.18 c5
  vmovups (%r13), %zmm0 #77.46 c9 stall 1
  movq 8(%r9), %r13 #77.32 c9
  addq %rax, %r11 #77.18 c9 
  addq %rax, %r10 #77.18 c9
  addq %rax, %r13 #77.18 c13 stall 1
  vmovups (%r13), %zmm2 #77.46 c15
  movq 16(%r9), %r13 #77.32 c15
  addq %rax, %r13 #77.18 c19 stall 1 
  vmovups (%r13), %zmm4 #77.46 c21
  movq (%rcx), %r13 #77.46 c21
  addq %rax, %r13 #77.18 c25 stall 1
  vaddps (%r13), %zmm0, %zmm1 #77.46 c27
  vmovups %zmm1, (%rsp) #77.46 c33 stall 2
  movq 8(%rcx), %r13 #77.46 c33 
  addq %rax, %r13 #77.18 c37 stall 1
  vaddps (%r13), %zmm2, %zmm3 #77.46 c39
  vmovups %zmm3, 64(%rsp) #77.46 c45 stall 2
  movq 16(%rcx), %r13 #77.46 c45
  addq %rax, %r13 #77.18 c49 stall 1
  vaddps (%r13), %zmm4, %zmm5 #77.46 c51
  vmovups %zmm5, 128(%rsp) #77.46 c57 stall 2
  vmovups (%rsp), %zmm6 #77.46 c63 stall 2
  vmovups %zmm6, 192(%rsp) #77.46 c69 stall 2
  vmovups 64(%rsp), %zmm7 #77.46 c69
  vmovups %zmm7, 256(%rsp) #77.46 c75 stall 2
  vmovups 128(%rsp), %zmm8 #77.46 c75
  vmovups %zmm8, 320(%rsp) #77.46 c81 stall 2
# LOE rax rdx rcx rbx rsi rdi r8 r9 r10 r11 r12 r14 r15
L_B1.4: # Preds L_B1.3
  movq (%r8), %r13 #77.60 c1
  vmovups 192(%rsp), %zmm0 #77.46 c1 
  addq %rax, %r13 #77.18 c5 stall 1
  vmovups 256(%rsp), %zmm2 #77.60 c5
  vaddps (%r13), %zmm0, %zmm1 #77.60 c7
  vmovups %zmm1, 384(%rsp) #77.60 c13 stall 2
  movq 8(%r8), %r13 #77.60 c13
  addq %rax, %r13 #77.18 c17 stall 1
  vmovups 320(%rsp), %zmm4 #77.60 c17
  vaddps (%r13), %zmm2, %zmm3 #77.60 c19
  vmovups %zmm3, 448(%rsp) #77.60 c25 stall 2
  movq 16(%r8), %r13 #77.60 c25
  addq %rax, %r13 #77.18 c29 stall 1
  vaddps (%r13), %zmm4, %zmm5 #77.60 c31
  vmovups %zmm5, 512(%rsp) #77.60 c37 stall 2
  vmovups 384(%rsp), %zmm6 #77.60 c43 stall 2
  vmovups %zmm6, 576(%rsp) #77.60 c49 stall 2
  vmovups 448(%rsp), %zmm7 #77.60 c49
  vmovups %zmm7, 640(%rsp) #77.60 c55 stall 2
  vmovups 512(%rsp), %zmm8 #77.60 c55
  vmovups %zmm8, 704(%rsp) #77.60 c61 stall 2
L_B1.5: # Preds L_B1.4
  vmovups 576(%rsp), %zmm0 #77.21 c1
  vmovups %zmm0, (%r12) #77.21 c7 stall 2
  vmovups 640(%rsp), %zmm1 #77.21 c7
  vmovups %zmm1, (%r11) #77.21 c13 stall 2
  vmovups 704(%rsp), %zmm2 #77.21 c13
  vmovups %zmm2, (%r10) #77.21 c19 stall 2
  addq $1, %rdx #76.5 c19
  addq $64, %rax #76.5 c19
  cmpq %rdi, %rdx #76.5 c21
  jb L_B1.3 # Prob 82% #76.5 c23
# LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.6: # Preds L_B1.5

Yikes! Needless to say, this excessive spilling and stack traffic eliminates all benefits of vectorization. I have a hard time pinning down the root cause, but ICPC spills a large number of intermediate results to the stack even though it doesn't need to: there are plenty of registers available to do all of these computations without ever touching the stack.
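For reference, here is roughly what I would write by hand with raw intrinsics, and what I would hope the abstraction layer compiles down to. This is just a sketch with hypothetical flattened parameters (three destination and nine source pointers, all assumed 64-byte aligned); it needs only a handful of live zmm registers and no stack traffic:

#include <immintrin.h>
#include <cstddef>

// d[k][j] = a[k][j] + b[k][j] + c[k][j] for components k = 0..2, processing
// 16 floats per iteration. All pointers are assumed 64-byte aligned.
void arraySumIntrin(size_t N, float *const d[3], const float *const a[3],
                    const float *const b[3], const float *const c[3]) {
    for (size_t i = 0; i < N * 16; i += 16) {
        for (int k = 0; k < 3; ++k) {
            __m512 r = _mm512_add_ps(
                _mm512_add_ps(_mm512_load_ps(a[k] + i),
                              _mm512_load_ps(b[k] + i)),
                _mm512_load_ps(c[k] + i));
            _mm512_store_ps(d[k] + i, r);
        }
    }
}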

I'm quite desperate at this point and hope that this issue can be resolved somehow. I'd be happy to provide any kind of additional details if this would be helpful.

Thank you in advance,
Wenzel Jakob

 

This is the full program that reproduces the issue:

// Simple toy program which adds three streams of 3D vectors and writes to a
// fourth stream
//
// Compiled with:
//
// Intel compiler:
//   $ icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s
//
// GCC:
//   $ g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s 


#include <cstring>
#include <cstdlib>
#include <cstddef>
#include <functional>
#include <type_traits>   // std::enable_if_t, std::is_pointer
#include <utility>       // std::forward
#include <immintrin.h>

/// Wrapper around an AVX512 float vector (16x float)
struct alignas(64) Float16 {
    __m512 value;

    /// Add two Float16 vectors
    Float16 operator+(Float16 f) const {
        return Float16{_mm512_add_ps(value, f.value)};
    }
};

/// Dynamically allocated list of Float16 packets stored on the heap
struct DynamicArray {
    Float16 *values;

    /// Get one "packet" by reference
    Float16 *packet(size_t i) { return values + i; }
};

/// Array of Structures style 3D vector, can be templated with T=Float16, T=Float16* or T=DynamicArray
template <typename T> struct Vector3 {
public:
    /// Initialize with component values
    template <typename... Args>
    Vector3(Args &&... args) : values{ std::forward<Args>(args)... } {}

    /// Access component 'i' of the vector (normal path)
    template <typename Type = T, std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return values[i]; }

    /// Ditto, const version
    template <typename Type = T, std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return values[i]; }

    /// Access component 'i' of the vector (if it is a pointer, dereference it)
    template <typename Type = T, std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return *(values[i]); }

    /// Ditto, const version
    template <typename Type = T, std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return *(values[i]); }

    /// Assign component values of another 3D vector to this vector
    template <typename T2> Vector3& operator=(Vector3<T2> v2) {
        for (int i = 0; i < 3; ++i)
            coeff(i) = v2.coeff(i);
        return *this;
    }

    /// Get a slice of pointers to static arrays (at index i). Assumes that 'T' is a DynamicArray
    auto slice(size_t i) {
        return Vector3<Float16 *>(
            values[0].packet(i),
            values[1].packet(i),
            values[2].packet(i));
    }

    /// Add two 3D vectors and return the result
    template <typename T2> auto operator+(const Vector3<T2> &v2) {
        return Vector3<decltype(coeff(0) + v2.coeff(0))>(
            coeff(0) + v2.coeff(0),
            coeff(1) + v2.coeff(1),
            coeff(2) + v2.coeff(2));
    }

    T values[3];
};

using DynamicVector3 = Vector3<DynamicArray>;

void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2, DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}
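
The file is only compiled to assembly (-S), so it does not need a main(). For anyone who wants to actually run it, a hypothetical driver along the following lines could be appended (using _mm_malloc/_mm_free from the intrinsics headers for 64-byte alignment; this is not part of the original reproducer):

// Hypothetical driver: allocate four streams of N zero-initialized,
// 64-byte-aligned Float16 packets per component, run arraySum, clean up.
int main() {
    const size_t N = 1024;
    auto makeArray = [N] {
        Float16 *p = (Float16 *) _mm_malloc(N * sizeof(Float16), 64);
        std::memset(p, 0, N * sizeof(Float16));
        return DynamicArray { p };
    };
    DynamicVector3 v1 { makeArray(), makeArray(), makeArray() },
                   v2 { makeArray(), makeArray(), makeArray() },
                   v3 { makeArray(), makeArray(), makeArray() },
                   v4 { makeArray(), makeArray(), makeArray() };

    arraySum(N, v1, v2, v3, v4);

    DynamicVector3 *streams[] = { &v1, &v2, &v3, &v4 };
    for (DynamicVector3 *v : streams)
        for (int k = 0; k < 3; ++k)
            _mm_free(v->values[k].values);
    return 0;
}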

 

Judith_W_Intel
Employee

 

This is in our internal bugs database as DPD200413889.

The latest notes indicate that it is actively being worked on and that there is a suggested workaround:

Investigation    [2016-10-03 13:35:08 +03:00]:
===================================================================================

It looks like a lot of this damage is caused by using a “&(*(..))” pattern to cast one pointer into another kind of pointer:

 

42      2               (*((P64.Float16_R_V$14*) (&(this_13577_V$44(P64)->values_V$4)(P64))(P64)))(P64) = (*((P64.Float16_R_V$14*) t0(P64)))(P64);

 

That comes from:

 

   Vector3(Args &&... args) : values{ std::forward<Args>(args)... } {}

 

I can try to reduce the above to a simple assignment of *t0 to "this->values", which might improve optimization. Or it might cause disambiguation to behave incorrectly (type aliasing violation). I can try it out when I get a chance, maybe within a few weeks.

 

At the end of the day, however, I think Nikolay had a point in the CQ; the user might be able to get better performance in the client code (which is not shown) if they use the vectorization technology we already provide. SDLT took quite some time to get right in the compiler, and this may turn out to be a repeat of that effort.


Workaround    [2016-10-03 13:16:05 +03:00]:
===================================================================================
This code can't be auto-vectorized because it uses _mm512 intrinsics. I'd really recommend that the customer rewrite the code in a vector-friendly manner: an AoS data layout is not vectorizer-friendly, and vector code for it will always be poor. They could try the SDLT library, which is available in 16.0; it allows writing efficient code using C++ features and nice syntax.
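
For reference, an SDLT version of the example would look roughly like the sketch below. This is only an outline based on the documented 16.0 interface (sdlt::soa1d_container, SDLT_PRIMITIVE); the exact header and accessor names should be verified against the sdlt headers, and #pragma omp simd requires the corresponding OpenMP SIMD compiler option:

#include <sdlt/sdlt.h>

struct Vec3 { float x, y, z; };
SDLT_PRIMITIVE(Vec3, x, y, z)

// The containers store Vec3 in SoA layout internally; the loop is written in
// plain scalar style and left to the auto-vectorizer.
void arraySumSdlt(int n, sdlt::soa1d_container<Vec3> &d,
                  const sdlt::soa1d_container<Vec3> &a,
                  const sdlt::soa1d_container<Vec3> &b,
                  const sdlt::soa1d_container<Vec3> &c) {
    auto dAcc = d.access();
    auto aAcc = a.const_access();
    auto bAcc = b.const_access();
    auto cAcc = c.const_access();
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        Vec3 va = aAcc[i], vb = bAcc[i], vc = cAcc[i];
        dAcc[i] = Vec3{ va.x + vb.x + vc.x,
                        va.y + vb.y + vc.y,
                        va.z + vb.z + vc.z };
    }
}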

 

Jakob__Wenzel
Beginner

Dear Judith,

Thank you for the update. Just a quick response to some of the suggestions brought up by the compiler developers:

I'd really recommend that the customer rewrite the code in a vector-friendly manner: an AoS data layout is not vectorizer-friendly, and vector code for it will always be poor.

Apologies, this was purely a typo/incorrect use of terminology on my part. My project in fact uses the SoA ("Structure of Arrays") data organization. The template code I posted is essentially part of a (much, much larger) tool that makes writing this type of code more convenient.

This template layer (very intentionally) does not use the compiler's built-in vectorization. My project (a kind of memory-coherent Hamiltonian Monte Carlo algorithm) is simply too complicated to be handled by any kind of auto-vectorization. The entire architecture of the application is built with vectorization in mind, and the fine-grained control needed to accomplish that is easier to achieve with intrinsics.

The fact that a "std::forward<Args>(args)" call can essentially turn off the register allocator is highly problematic for my application. For now I have switched to Clang as a workaround (it produces decent code here), but this is not a good long-term solution for me. I really hope that this can be fixed in ICPC; it seems to me that a fix would help other projects as well.
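
In case it helps others who hit the same pattern: the investigation note above suggests that the damage starts at the forwarding constructor, so one experiment (which I have not verified) would be to replace it with a plain per-component constructor:

template <typename T> struct Vector3 {
    // ... rest of the class as before ...

    /// Hypothetical replacement for the variadic forwarding constructor;
    /// this may avoid the "&(*(..))" pointer-cast pattern mentioned in the
    /// investigation note. Effect on code generation is unverified.
    Vector3(const T &v0, const T &v1, const T &v2) : values{ v0, v1, v2 } {}

    T values[3];
};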

Best,
Wenzel

 
