Hello,
icc 17 does not vectorize the following simple code:
```c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char const *argv[])
{

#define SIZE 1024*1024*1024ULL
    unsigned long i;
    float *A, *B, *C;

    A = malloc(sizeof(float) * SIZE);
    B = malloc(sizeof(float) * SIZE);
    C = malloc(sizeof(float) * SIZE);

    for (i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];

    printf("done %ld %f\n", SIZE, C[SIZE-1]);
    return 0;
}
```
The vectorization report states:
```
LOOP BEGIN at main.c(15,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
   remark #25478: While Loop Unrolled by 2
LOOP END
```
This does not make sense and looks like a bug. Workarounds include reducing the SIZE value slightly (e.g., to 2^30-1), or removing the U from the integer suffix at the end of the define (i.e., changing ULL to LL).
Is this a known problem in icc? Are there any better workarounds?
Best regards,
Unsigned integer loops are not strictly countable. Change your loop control variable i and the limit SIZE to signed types of the same sizeof:

```c
#define SIZE 1024LL*1024*1024
...
long long i;
...
```
Jim Dempsey
While I agree with the principle of avoiding unsigned variables as loop indices, in this case there should be no problem with counting the loop -- all of the types involved are the same (unsigned long == unsigned long long) and the loop is not decrementing.
This code works fine on my systems as long as the SIZE value is smaller than the one included above. I tested values as large as 1023*1024*1024ULL and had no problems with vectorization.
The code has a few bugs -- it is missing two include files and the final printf is out of bounds (since the terminal value of "i" is equal to SIZE, not "SIZE-1"). None of these change the fundamental weirdness of finding 2^30 as the value for which vectorization is no longer attempted....
I second what Jim suggested to get the loop vectorized.
Sergey Kostrov wrote:
As it was already pointed by Jim there is a problem with declaration of the counter variable. A more portable solution is as follows:
```c
...
size_t i;
...
```
This is no different than the original code, since "size_t" is unsigned.
To fix the problem (with this particular value of SIZE), *both* "i" and "SIZE" must be declared to be of signed types. If either "i" or the value of SIZE is unsigned, the same failure to vectorize occurs at a SIZE value of 1024*1024*1024.
```c
// Data Type Defines - Platform Dependent
//                                                   // Sizeof      // Sizeof
//                                                   // 16/32-bit   // 64-bit
#define RTsize_t size_t                              // 4           // 8

#if ( defined ( _M_IX86_16 ) )
typedef short _RTV64C RTssize_t;                     // 2           // N/A
typedef unsigned short _RTV64C RTusize_t;            // 2           // N/A
#endif

#if ( defined ( _M_IX86 ) )
typedef __int32 _RTV64C RTssize_t;                   // 4           // N/A
typedef unsigned __int32 _RTV64C RTusize_t;          // 4           // N/A
#endif

#if ( defined ( _M_X64 ) || defined ( _M_AMD64 ) || defined ( _M_AMDX86_64 ) )
typedef __int64 RTssize_t;                           // N/A         // 8
typedef unsigned __int64 RTusize_t;                  // N/A         // 8
#endif

#if ( defined ( _M_IA64 ) )
typedef __int64 RTssize_t;                           // N/A         // 8
typedef unsigned __int64 RTusize_t;                  // N/A         // 8
#endif

#if ( defined ( _M_ARM ) || defined ( _M_SH ) || \
      defined ( _M_MRX000 ) || defined ( _M_ALPHA ) || defined ( _M_PPC ) || \
      defined ( _M_M68K ) )
typedef __int32 _RTV64C RTssize_t;                   // 4           // N/A
typedef unsigned __int32 _RTV64C RTusize_t;          // 4           // N/A
#endif

#if ( defined ( _M_QUANTUM ) )
// Defined by a Host Platform selected
#endif
```
Thank you all for your inputs. I updated my original post to fix the small mistakes in the original code that were pointed out (this is a small test code extracted from a larger application).
My concern is that, despite the workarounds, this issue is very hard for a developer to debug, especially because the vectorization report message itself is incorrect.
Now that you know what to look for, most of these issues can be caught with a visual walk-through. I do not know of a compiler option that generates a report suggesting changes to the choice of loop index variable type. Sometimes you will find that the std:: templates give you problems in this respect (their use of size_t, as opposed to intptr_t).
Jim Dempsey
This is an especially irritating case because the loop does vectorize for smaller values of the SIZE parameter, and while the transition occurs at an "interesting" value (2^30 elements per array == 2^32 Bytes per array), it is not at all obvious why this should have an impact on vectorization -- especially in a code that uses only 64-bit variables.
@Sergey -- yes, there is definitely an issue with signed vs unsigned types here. Expanding the code to use all four combinations of 64-bit signed/unsigned loop index variables with 64-bit signed/unsigned loop limits shows that any use of an unsigned value inhibits vectorization when the SIZE value is set to 2^30 (but not for smaller values).
```c
#include <stdlib.h>
#include <stdio.h>

#define USIZE 1024*1024*1024ULL
#define SSIZE 1024*1024*1024LL

int main(int argc, char const *argv[])
{
    unsigned long long i;
    long long j;
    float *A, *B, *C;
    float MiB;

    printf("sizeof(i) is %lu\n", sizeof(i));
    MiB = (float) sizeof(float) * (float)USIZE / 1024.0 / 1024.0;
    printf("MiB per array is %f\n", MiB);

    A = malloc(sizeof(float) * USIZE);
    B = malloc(sizeof(float) * USIZE);
    C = malloc(sizeof(float) * USIZE);
    printf("done with mallocs\n");

    for (i=0; i<USIZE; i++) {                  // line 22
        A[i] = 1.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }
    printf("done with initialization\n");

    // unsigned index, unsigned loop limit
    for (i=0; i<USIZE; i++)                    // line 30
        C[i] = A[i] + B[i];

    // unsigned index, signed loop limit
    for (i=0; i<SSIZE; i++)                    // line 34
        A[i] = B[i] + C[i];

    // signed index, unsigned loop limit
    for (j=0; j<USIZE; j++)                    // line 38
        B[j] = C[j] + A[j];

    // signed index, signed loop limit
    for (j=0; j<SSIZE; j++)                    // line 42
        C[j] = A[j] + B[j];

    i = USIZE/2;
    printf("done %ld %f\n", SSIZE, C[i]);
    return 0;
}
```
The code was compiled with icc 16.0.1 (17.0.2 showed identical behavior):
```
icc -O -xCORE-AVX2 -qopt-report-phase=vec test.c -o test
```
The test.optrpt file shows that only the last loop (signed index variable, signed loop limit) is vectorized when SIZE is 2^30.
```
Report from: Vector optimizations [vec]

LOOP BEGIN at test.c(42,5)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at test.c(42,5)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at test.c(42,5)
<Remainder loop for vectorization>
LOOP END

Non-optimizable loops:

LOOP BEGIN at test.c(22,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
LOOP END

LOOP BEGIN at test.c(30,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
LOOP END

LOOP BEGIN at test.c(34,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
LOOP END

LOOP BEGIN at test.c(38,5)
   remark #15523: loop was not vectorized: loop control variable j was found, but loop iteration count cannot be computed before executing the loop
LOOP END
```
When I reduce the value of SIZE in the program above from 1024*1024*1024 to 1023*1024*1024, every loop vectorizes.
Weird things happen when the array length is increased further. For example, if I set the SIZE variables to 2^31 (or larger), the loop at line 42 fails to vectorize with the erroneous message:
```
LOOP BEGIN at test.c(42,5)
   remark #15315: loop was not vectorized: estimated number of iterations (1) is insufficient for vectorization
LOOP END
```
It vectorizes at SIZE=(2^31-1), which makes me suspect the compiler accidentally used a 32-bit variable in this part of the analysis?
Update -- I was surprised not to have run into the failure of vectorization on loops with explicit lengths greater than 2^32-1 with the STREAM benchmark, so I went back and re-tested it.
It turns out that the problem *does* show up in the STREAM benchmark, but only in loops that don't contain OpenMP directives (or if the code is compiled without OpenMP support). As in the case above, a STREAM_ARRAY_SIZE of 2^31-1 vectorizes fine, while a STREAM_ARRAY_SIZE of 2^32 or larger does not vectorize. The optimization report messages are the same as I saw in the example above, e.g.:
```
LOOP BEGIN at stream_5-10.c(327,2)
   remark #15315: loop was not vectorized: estimated number of iterations (1) is insufficient for vectorization
LOOP END
```
Since I don't think I have ever compiled STREAM with an array size >2^32 without also including the -qopenmp flag, I did not notice this before. The error has a minor impact on the current version of STREAM (5.10): the main loop in the validation routine will not vectorize if the STREAM_ARRAY_SIZE variable is 2^32 or larger. This does not change the reported performance, but it slightly increases the time required for validation. I don't know why I did not include an OpenMP directive on that loop before -- the next release will include the appropriate OpenMP pragma to parallelize the main loop in the validation routine.
I filed a bug report using the new support system:
Case#: 02761654
Case Subject: Bug in the compiler vectorization phase with large constant loop bounds.
Good example, John.
It seems the compiler is internally using 32-bit integers to qualify loop counts, rather than the bit width of the values (or expressions) used in the loop. I would consider this a compiler bug.
Also, in cases where 64-bit indexes are used (and the compiler recognizes this) but the generated code may require 32-bit indexing (e.g., gather instructions with 32-bit index values), there may be a genuine issue. However, since the compiler can generate multiple code paths for different instruction sets (and/or peel and remainder code), it could equally well partition the loop into sections that work within 32-bit indexing, span the intervening interval with less-vectorized code, and then resume fully vectorized execution.
Jim Dempsey
The current version of STREAM in C is at
- http://www.cs.virginia.edu/stream/FTP/Code/stream.c
- This standard version of STREAM uses statically allocated arrays to avoid a number of optimization problems, but in some environments this causes additional difficulties with very large arrays (e.g., requiring "-mcmodel=medium").
- http://www.cs.virginia.edu/stream/FTP/Code/Versions/stream_5-10_posix_memalign.c
- This code is equivalent to the standard version, but uses "posix_memalign" for dynamic allocation.
Both versions use the preprocessor variable STREAM_ARRAY_SIZE to determine the array size at compile time. If this value is larger than 2^32-1, the definition will need to include a length suffix ("L" or "LL") to avoid trouble. (All of the systems that I use have 64-bit "long" variables in C. It is possible that systems with 32-bit "long" won't work as expected.)
The code is configured to support 64-bit array indices (using "ssize_t" for the index variables), and has been tested with array sizes of up to 4.08e10 (38.6 bits of index, 41.6 bits of byte address for each array, 43.2 bits of byte address for the 3 arrays together). This test used a tiny bit more than 8.9 TiB of memory on a 16 TiB system.
In the standard version of stream.c, the loop starting at line 465 should have an OpenMP parallel for pragma, but doesn't. For STREAM_ARRAY_SIZE of 2^32 or greater this one won't be vectorized. The loops at 492, 511, 530 have the same problem, but don't typically get executed (they only exist to count the number of errors if the error norm exceeds the maximum allowed value).
The STREAM benchmark has always been intended to provide an indication of the performance of simple, vectorizable code when run through real compilers. It is intended to be relatively easy to optimize, but performance differences between STREAM and "best case, hand-optimized" code are a feature, not a bug!
The performance numbers of 25 GB/s peak, 13 GB/s vectorized OpenMP, and 4 GB/s non-vector, single thread are a little unusual -- what sort of platform did you measure those on? The 25 GB/s peak looks like a dual-channel configuration with DDR3/1600. I have results from a Xeon E3-1270 (Sandy Bridge, 3.4 GHz, 4-core, 2-channel DDR3/1333) that has a peak BW of 21.33 GB/s and gets 13 GB/s with a single thread without streaming stores and 18 GB/s with a single thread with streaming stores. (Note that the 13 GB/s Triad value corresponds to an actual data traffic of 13*4/3=17.3 GB/s when write allocates are included, so it is in the expected range.) I don't have any results without vectorization on that system, but based on the Xeon E5-2680 (Sandy Bridge EP) results I would not expect to see any degradation from disabling vectorization for the single thread case (relative to the case without streaming stores -- more on this below).
Most of the Intel platforms that I have tested recently have relatively low penalties for failure to vectorize. (One exception is Intel's requirement for vectorization to get streaming stores, since Intel processors do not support a scalar streaming store instruction for data in xmm/ymm/zmm registers.) On Xeon E5 (v1), the best single-thread performance was obtained with vectorization disabled (though the difference was quite small). For Xeon E5 v3, vectorization is essential -- single-thread STREAM Triad performance improves from 12.8 GB/s with scalar code to 18.7 GB/s with SSE4.1 (128-bit) vectorization and to 18.9 GB/s with AVX2 (256-bit) vectorization.
A similar ratio applies to single-core performance on the Xeon Phi x200 (Knights Landing), but the overall story is much more complex.... If a user does not understand the need to use lots of threads to get memory bandwidth, then they probably should not be running on this system.
Late last week I tracked down the cause of a slight performance degradation with STREAM on 2-socket Intel systems with HyperThreading enabled and KMP_AFFINITY=scatter. We had observed that this was about 2% slower than we expected, but it was not obvious why. Using the Home Agent performance counters in the Xeon E5 v3 uncore, I was able to show that Transparent Huge Pages were the problem.

With KMP_AFFINITY=scatter, consecutive threads are placed on alternating sockets, but with Transparent Huge Pages the entire 2MiB page is placed on the socket of the thread that is the first to access the page (when zeroing the array at the start of the program). So if the last cache line of a 2MiB page belongs to thread 1 running on socket 1, then the 2MiB page will be located on socket 1, even though thread 0 running on socket 0 is responsible for processing every other cache line in the page. For this test case, each array was composed of 80,000,000 doubles, which works out to 12.7 2MiB pages for each of the 24 threads to process in each array. The observed remote access rate was 4%, which seems reasonable for a 12.7-page "chunk size" and is consistent with the observed 2% performance degradation.

The expected performance was restored by disabling large pages (making each "chunk" about 6500 4KiB pages) or by creating an explicit proclist that put the first half of the threads on separate cores on socket 0 and the second half of the threads on separate cores on socket 1. Either of these reduced the remote access rate to under 0.2%. This is exactly the sort of unexpected (performance-degrading) interaction of features that STREAM has always been intended to help find.....
