Solved: Question about writing SIMD code manually

Raymond_S_ · ‎04-22-2016

Dear all:

We all know that the Intel compiler can automatically vectorize code. My question is: when does a developer have to write SIMD code manually instead of relying on automatic vectorization? In which situations is the compiler unable to vectorize automatically?

I remember there is an article in the Intel Developer Zone that explains this, but I cannot find it.

Please help, thanks.

McCalpinJohn · ‎04-22-2016

The answer is simple: "it depends..."

I often use SIMD intrinsics or inline assembler when I want to be able to control exactly what instructions get executed. This is not a normal use case, but it is common in performance analysis.

For "real" codes, I have decided to use SIMD intrinsics a few times because the compiler was unable to produce good code from any of the variations of the source code that I tried.

In one case (computing the squared-magnitude of a vector of float complex values in interleaved storage format) the compiler vectorized the code, but did it so badly that the performance was slower than the scalar code. Using a few SIMD intrinsics gave me a pretty good (~2x) speedup. This falls into the category of code that needs to move data around within the vector registers.
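
To illustrate the kind of in-register data movement involved, here is a minimal sketch of the squared-magnitude case. It is not the code from that project: the function name `squared_magnitude` and the choice of SSE shuffles are assumptions made for this example, and it falls back to scalar code when SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper for illustration.
 * in : n complex values stored interleaved as re0, im0, re1, im1, ...
 * out: n squared magnitudes, out[i] = re_i*re_i + im_i*im_i */
void squared_magnitude(const float *in, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four complex values (eight floats) per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(in + 2 * i);      /* re0 im0 re1 im1 */
        __m128 b = _mm_loadu_ps(in + 2 * i + 4);  /* re2 im2 re3 im3 */
        a = _mm_mul_ps(a, a);                     /* square every lane */
        b = _mm_mul_ps(b, b);
        /* De-interleave inside the registers: even lanes hold re^2,
         * odd lanes hold im^2. */
        __m128 re2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));
        __m128 im2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));
        _mm_storeu_ps(out + i, _mm_add_ps(re2, im2));
    }
#endif
    /* Scalar remainder (and the whole loop when SSE is unavailable). */
    for (; i < n; i++) {
        float re = in[2 * i];
        float im = in[2 * i + 1];
        out[i] = re * re + im * im;
    }
}
```

The two shuffles are the "move data around" step: they turn two interleaved registers into one register of squared real parts and one of squared imaginary parts, so a single add produces four results.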

In another case (computing a wide sliding sum for each element of an input vector) the compiler vectorized the code, but did not realize that it was possible to use a "recursive doubling" technique to reduce the operation count. (I.e., to compute 64-wide sums, the compiler generated code that performed 64 adds for each element. The recursive doubling approach computes 8-wide sums using 7 adds, then adds 8-wide sums to get 16-wide sums, then adds 16-wide sums to get 32-wide sums, and finally adds 32-wide sums to get 64-wide sums. This requires only 10 adds per element instead of the 64 that the compiler generates. I was unable to find a way to construct C-language source code that would enable the compiler to find this transformation on its own.) This falls into the category of code that needs to move data around within the vector registers and into the category of code that can make use of higher-level transformations that are not easily expressed in C.
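
The doubling idea itself can be sketched in plain C on arrays. This is only an illustration of the technique, not the in-register intrinsics version described above: the function name `sliding_sum_doubling` and the scratch-buffer interface are assumptions, and it doubles all the way from width 1 (so a 64-wide sum costs 6 adds per element here rather than the 7 + 3 of the scheme above).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration.
 * Computes buf[i] = in[i] + in[i+1] + ... + in[i+width-1]
 * for i = 0 .. n-width, where width is a power of two.
 * buf must hold n floats. Returns the number of valid sums
 * (n - width + 1), or 0 if the arguments are unusable. */
size_t sliding_sum_doubling(const float *in, float *buf,
                            size_t n, size_t width)
{
    if (width == 0 || (width & (width - 1)) != 0 || n < width)
        return 0;

    memcpy(buf, in, n * sizeof *buf);   /* width-1 sums are the input */

    size_t len = n;                     /* number of valid entries */
    for (size_t w = 1; w < width; w *= 2) {
        /* Combine two w-wide sums that are w apart into a 2w-wide sum.
         * Ascending i reads buf[i + w] before it is overwritten. */
        len -= w;
        for (size_t i = 0; i < len; i++)
            buf[i] += buf[i + w];
    }
    return len;
}
```

Each pass costs one add per element, so the total is log2(width) adds per element instead of width. Note that the additions are performed in a different order than in the straightforward loop, so floating-point results can differ in the last bits.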

In another case, the compiler was unable to vectorize code that stored the indices of the elements of a vector that were flagged by a comparison. (The compiler had no trouble vectorizing the comparison and no trouble doing merges or replacements based on the results of the comparison, but computing and storing the indices of the results where the compare was true was too much for it.) I show how I vectorized this code at https://software.intel.com/en-us/forums/intel-c-compiler/topic/609838. The key here was that I knew that the comparison would be true infrequently, so I was able to structure the code to exploit this and to use sneaky bit manipulation tricks to handle the cases for which 1 or 2 elements had a "true" comparison.
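
A simplified sketch of the general structure, exploiting the fact that matches are rare, might look like the following. It is not the code from the linked post and does not include its bit manipulation tricks: the function name `find_above`, the greater-than comparison, and the 4-wide SSE vectors are assumptions made for this example.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper for illustration.
 * Stores in idx the indices i for which x[i] > threshold, in
 * ascending order, and returns how many were found.
 * idx must have room for up to n entries. */
size_t find_above(const float *x, size_t n, float threshold, size_t *idx)
{
    size_t count = 0;
    size_t i = 0;
#if defined(__SSE__)
    const __m128 t = _mm_set1_ps(threshold);
    for (; i + 4 <= n; i += 4) {
        /* Vector compare, then collapse the 4 lane results to 4 bits. */
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(_mm_loadu_ps(x + i), t));
        if (mask == 0)
            continue;                   /* expected common case: no hits */
        /* Rare case: walk the set bits and record their indices. */
        for (int lane = 0; lane < 4; lane++)
            if (mask & (1 << lane))
                idx[count++] = i + (size_t)lane;
    }
#endif
    /* Scalar remainder (and the whole loop when SSE is unavailable). */
    for (; i < n; i++)
        if (x[i] > threshold)
            idx[count++] = i;
    return count;
}
```

The design choice is to keep the hot path (compare, movemask, test for zero) fully vectorized and branch to scalar index extraction only when at least one lane matches, which pays off only when matches are infrequent.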
