FWIW, this is another attempt to get some attention for an unfortunate vectorization-related issue with the Intel C++ compiler. I've tried escalating the problem via Intel Premier Support (issue #6000162527), where it seems to have died (still no technical feedback after two months).
I'm using the Intel C++ compiler to develop a highly vectorized graphics application for an AVX512 (KNL) machine. My application makes heavy use of a C++ template abstraction layer which facilitates writing vectorized code using the AoS ("array of structures") approach.
Unfortunately, the current release (16.0.3) of ICPC generates exceedingly poor code for this approach, which has become a total blocker for my project. The C++ file at the bottom contains a trivial example that demonstrates some of my difficulties.
The most relevant part of the attached file is this function, which sums three streams of 3D vectors and stores the result in a fourth stream:
void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2,
              DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}
The file also contains all needed helper types, including a wrapper around __m512 ("Float16"), a class that stores a number of Float16s on the heap ("DynamicArray"), and a Vector3 class that can represent 3D vectors of DynamicArrays, Float16 instances, or pointers to Float16 instances. I've tried to make it as compact as possible while still demonstrating the poor code generation.
GCC and Clang generate excellent code for this. For instance, with GCC (g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s) the loop turns into:
L3:
        vmovaps   (%r11,%rax), %zmm0
        vmovaps   (%r10,%rax), %zmm1
        vaddps    0(%r13,%rax), %zmm0, %zmm0
        vmovaps   (%r9,%rax), %zmm2
        vaddps    (%r12,%rax), %zmm1, %zmm1
        vaddps    (%rbx,%rax), %zmm2, %zmm2
        vaddps    (%rsi,%rax), %zmm0, %zmm0
        vaddps    (%rcx,%rax), %zmm1, %zmm1
        vaddps    (%rdx,%rax), %zmm2, %zmm2
        vmovaps   %zmm2, (%r14,%rax)
        vmovaps   %zmm1, (%r15,%rax)
        vmovaps   %zmm0, (%r8,%rax)
        leaq      64(%rax), %rax
        cmpq      %rax, %rdi
        jne       L3
i.e., 9 aligned loads, 6 adds, and 3 aligned stores, plus loop bookkeeping. This is exactly what I expect to get. For various reasons, however, I would like to use the Intel compiler for my project.
Here is the output from the Intel compiler for comparison (icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s):
                                # LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.3:                         # Preds L_B1.5 L_B1.2
        movq      (%r9), %r13                    #77.32 c1
        movq      (%rsi), %r12                   #77.18 c1
        addq      %rax, %r13                     #77.18 c5 stall 1
        movq      8(%rsi), %r11                  #77.18 c5
        movq      16(%rsi), %r10                 #77.18 c5
        addq      %rax, %r12                     #77.18 c5
        vmovups   (%r13), %zmm0                  #77.46 c9 stall 1
        movq      8(%r9), %r13                   #77.32 c9
        addq      %rax, %r11                     #77.18 c9
        addq      %rax, %r10                     #77.18 c9
        addq      %rax, %r13                     #77.18 c13 stall 1
        vmovups   (%r13), %zmm2                  #77.46 c15
        movq      16(%r9), %r13                  #77.32 c15
        addq      %rax, %r13                     #77.18 c19 stall 1
        vmovups   (%r13), %zmm4                  #77.46 c21
        movq      (%rcx), %r13                   #77.46 c21
        addq      %rax, %r13                     #77.18 c25 stall 1
        vaddps    (%r13), %zmm0, %zmm1           #77.46 c27
        vmovups   %zmm1, (%rsp)                  #77.46 c33 stall 2
        movq      8(%rcx), %r13                  #77.46 c33
        addq      %rax, %r13                     #77.18 c37 stall 1
        vaddps    (%r13), %zmm2, %zmm3           #77.46 c39
        vmovups   %zmm3, 64(%rsp)                #77.46 c45 stall 2
        movq      16(%rcx), %r13                 #77.46 c45
        addq      %rax, %r13                     #77.18 c49 stall 1
        vaddps    (%r13), %zmm4, %zmm5           #77.46 c51
        vmovups   %zmm5, 128(%rsp)               #77.46 c57 stall 2
        vmovups   (%rsp), %zmm6                  #77.46 c63 stall 2
        vmovups   %zmm6, 192(%rsp)               #77.46 c69 stall 2
        vmovups   64(%rsp), %zmm7                #77.46 c69
        vmovups   %zmm7, 256(%rsp)               #77.46 c75 stall 2
        vmovups   128(%rsp), %zmm8               #77.46 c75
        vmovups   %zmm8, 320(%rsp)               #77.46 c81 stall 2
                                # LOE rax rdx rcx rbx rsi rdi r8 r9 r10 r11 r12 r14 r15
L_B1.4:                         # Preds L_B1.3
        movq      (%r8), %r13                    #77.60 c1
        vmovups   192(%rsp), %zmm0               #77.46 c1
        addq      %rax, %r13                     #77.18 c5 stall 1
        vmovups   256(%rsp), %zmm2               #77.60 c5
        vaddps    (%r13), %zmm0, %zmm1           #77.60 c7
        vmovups   %zmm1, 384(%rsp)               #77.60 c13 stall 2
        movq      8(%r8), %r13                   #77.60 c13
        addq      %rax, %r13                     #77.18 c17 stall 1
        vmovups   320(%rsp), %zmm4               #77.60 c17
        vaddps    (%r13), %zmm2, %zmm3           #77.60 c19
        vmovups   %zmm3, 448(%rsp)               #77.60 c25 stall 2
        movq      16(%r8), %r13                  #77.60 c25
        addq      %rax, %r13                     #77.18 c29 stall 1
        vaddps    (%r13), %zmm4, %zmm5           #77.60 c31
        vmovups   %zmm5, 512(%rsp)               #77.60 c37 stall 2
        vmovups   384(%rsp), %zmm6               #77.60 c43 stall 2
        vmovups   %zmm6, 576(%rsp)               #77.60 c49 stall 2
        vmovups   448(%rsp), %zmm7               #77.60 c49
        vmovups   %zmm7, 640(%rsp)               #77.60 c55 stall 2
        vmovups   512(%rsp), %zmm8               #77.60 c55
        vmovups   %zmm8, 704(%rsp)               #77.60 c61 stall 2
L_B1.5:                         # Preds L_B1.4
        vmovups   576(%rsp), %zmm0               #77.21 c1
        vmovups   %zmm0, (%r12)                  #77.21 c7 stall 2
        vmovups   640(%rsp), %zmm1               #77.21 c7
        vmovups   %zmm1, (%r11)                  #77.21 c13 stall 2
        vmovups   704(%rsp), %zmm2               #77.21 c13
        vmovups   %zmm2, (%r10)                  #77.21 c19 stall 2
        addq      $1, %rdx                       #76.5 c19
        addq      $64, %rax                      #76.5 c19
        cmpq      %rdi, %rdx                     #76.5 c21
        jb        L_B1.3        # Prob 82%       #76.5 c23
                                # LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.6:                         # Preds L_B1.5
Yikes! Needless to say, this excessive spilling to stack memory eliminates all benefits of vectorization. I have a hard time pinpointing the exact root cause, but ICPC spills a large number of intermediate results to the stack even though it doesn't need to: there are plenty of registers available to perform all of these computations without ever touching the stack.
I'm quite desperate at this point and hope that this issue can be resolved somehow. I'd be happy to provide additional details if that would be helpful.
Thank you in advance,
Wenzel Jakob
This is the full program which reproduces the issue:
// Simple toy program which adds three streams of 3D vectors and writes to a
// fourth stream
//
// Compiled with:
//
// Intel compiler:
// $ icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s
//
// GCC:
// $ g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s

#include <cstring>
#include <cstdlib>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <immintrin.h>

/// Wrapper around an AVX512 float vector (16x)
struct alignas(64) Float16 {
    __m512 value;

    /// Add two Float16 vectors
    Float16 operator+(Float16 f) const {
        return Float16{_mm512_add_ps(value, f.value)};
    }
};

/// List of static arrays which are stored on the heap
struct DynamicArray {
    Float16 *values;

    /// Get one "packet" by reference
    Float16 *packet(size_t i) { return values + i; }
};

/// Array of Structures style 3D vector; can be templated with T=Float16,
/// T=Float16* or T=DynamicArray
template <typename T> struct Vector3 {
public:
    /// Initialize with component values
    template <typename... Args>
    Vector3(Args &&... args) : values{ std::forward<Args>(args)... } {}

    /// Access component 'i' of the vector (normal path)
    template <typename Type = T,
              std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return values[i]; }

    /// Ditto, const version
    template <typename Type = T,
              std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return values[i]; }

    /// Access component 'i' of the vector (if it is a pointer, dereference it)
    template <typename Type = T,
              std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return *(values[i]); }

    /// Ditto, const version
    template <typename Type = T,
              std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return *(values[i]); }

    /// Assign component values of another 3D vector to this vector
    template <typename T2> Vector3& operator=(Vector3<T2> v2) {
        for (int i = 0; i < 3; ++i)
            coeff(i) = v2.coeff(i);
        return *this;
    }

    /// Get a slice of pointers to static arrays (at index i). Assumes that
    /// 'T' is a DynamicArray
    auto slice(size_t i) {
        return Vector3<Float16 *>(values[0].packet(i), values[1].packet(i),
                                  values[2].packet(i));
    }

    /// Add two 3D vectors and return the result
    template <typename T2> auto operator+(const Vector3<T2> &v2) {
        return Vector3<decltype(coeff(0) + v2.coeff(0))>(
            coeff(0) + v2.coeff(0),
            coeff(1) + v2.coeff(1),
            coeff(2) + v2.coeff(2));
    }

    T values[3];
};

using DynamicVector3 = Vector3<DynamicArray>;

void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2,
              DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}
This is in our internal bugs database as DPD200413889.
The latest notes indicate it is actively being worked on and there is a suggested workaround, i.e.:
Investigation [2016-10-03 13:35:08 +03:00]:
===================================================================================
42 2 (*((P64.Float16_R_V$14*) (&(this_13577_V$44(P64)->values_V$4)(P64))(P64)))(P64) = (*((P64.Float16_R_V$14*) t0(P64)))(P64);
That comes from:
Vector3(Args &&... args) : values{ std::forward<Args>(args)... } {}
I can try to reduce the above to a simple assignment of *t0 to "this->values", which might improve optimization. Or it might cause the disambiguator to behave incorrectly (a type-aliasing violation). I can try it out when I get a chance, maybe within a few weeks.
At the end of the day, however, I think Nikolay had a point in the CQ; the user might be able to get better performance in the client code (which is not shown) if they use the vectorization technology we already provide. SDLT took quite some time to get right in the compiler, and this may turn out to be a repeat of that effort.
Workaround [2016-10-03 13:16:05 +03:00]:
===================================================================================
This code couldn't be vectorized because it uses _mm512 intrinsics. I'd really like to recommend that your customer rewrite his code in a vector-friendly manner. An AoS data layout is not vectorizer-friendly, and vector code will always be poor. He can try the SDLT library, which is available in 16.0; it allows him to write efficient code using C++ features and nice syntax.
Dear Judith,
thank you for the update. Just a quick response to some of the suggestions brought up by the compiler developers:
I'd really like to recommend that your customer rewrite his code in a vector-friendly manner. An AoS data layout is not vectorizer-friendly, and vector code will always be poor.
Apologies, this was purely a typo and incorrect use of terminology on my part. My project in fact uses the SoA ("structure of arrays") data organization. The template code I posted is essentially part of a (much larger) tool that makes writing this type of code more convenient.
This template layer (very intentionally) does not use the built-in compiler vectorization. My project (a kind of memory-coherent Hamiltonian Monte Carlo algorithm) is simply too complicated to be handled by any kind of auto-vectorization. The entire architecture of the application is built with vectorization in mind, and the fine-grained control needed to accomplish that is easier to achieve with intrinsics.
The fact that a "std::forward<Args>(args)" call can essentially turn off the register allocator is highly problematic for my application. For now I have switched to Clang as a workaround (it produces decent code here), but this is not a good long-term solution for me. I really hope that this can be fixed in ICPC; it seems to me that a fix would help other projects as well.
Best,
Wenzel