Missing vectorization with large, constant, loop trip counts

Matthias_D_ · ‎04-15-2017

Hello,

icc 17 does not vectorize the following simple code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char const *argv[])
{
    #define SIZE 1024*1024*1024ULL

    unsigned long i;
    float *A,*B,*C;

    A = malloc(sizeof(float) * SIZE);
    B = malloc(sizeof(float) * SIZE);
    C = malloc(sizeof(float) * SIZE);

    for (i=0; i<SIZE; i++)
        C = A + B;

    printf("done %ld %f\n", SIZE, C[SIZE-1]);

    return 0;
}

The vectorization report states:

LOOP BEGIN at main.c(15,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
   remark #25478: While Loop Unrolled by 2  
LOOP END

, which does not make sense and looks like a bug. Reducing the SIZE value slightly (e.g., to 2^30-1), or removing the U from the integer qualifier at the end of the define (i.e., from ULL to LL) are workarounds.

Is this a known problem in icc? Are there any better workarounds?

Best regards,

jimdempseyatthecove · ‎04-17-2017

Unsigned integer loops are not strictly countable. Change your loop control variable i, and the limit SIZE to signed and same sizeof.

#define SIZE 1024LL*1024*1024
...
long long i;
...

Jim Dempsey

McCalpinJohn · ‎04-17-2017

While I agree with the principle of avoiding unsigned variables as loop indices, in this case there should be no problem with counting the loop -- all of the types involved are the same (unsigned long == unsigned long long) and the loop is not decrementing.

This code works finE on my systems as long as the SIZE variable is a smaller value than the one included above. I tested for values as large as 1023*1024*1024ULL and had no problems with vectorization.

The code has a few bugs -- it is missing two include files and the final printf is out of bounds (since the terminal value of "i" is equal to SIZE, not "SIZE-1"). None of these change the fundamental weirdness of finding 2^30 as the value for which vectorization is no longer attempted....

Anoop_M_Intel · ‎04-17-2017

I second what Jim suggested to get the loop vectorized.

SergeyKostrov · ‎04-17-2017

As it was already pointed by Jim there is a problem with declaration of the counter variable. A more portable solution is as follows: ... size_t i; ...

SergeyKostrov · ‎04-17-2017

>>...While I agree with the principle of avoiding unsigned variables as loop indices... It was already proven ( a significant amount of time was spent on it ) that there is No any difference in generated codes when a counter variable is declared as 'int' or 'unsigned int' for simple for-loops. Matthias, please take a look at an article dedicated to that subject: What to do when Auto Vectorization fails? ( https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails ) Published on Jan 17, 2017

McCalpinJohn · ‎04-18-2017

Sergey Kostrov wrote:

As it was already pointed by Jim there is a problem with declaration of the counter variable. A more portable solution is as follows:
...
size_t i;
...

This is no different than the original code, since "size_t" is unsigned.

To fix the problem (with this particular value of SIZE), *both* "i" and "SIZE" must be declared to be of signed types. If either "i" or the value of SIZE is unsigned, the same failure to vectorize occurs at a SIZE value of 1024*1024*1024.

SergeyKostrov · ‎04-19-2017

This is not a matter of signed or unsigned because on: 16-bit platforms - sizeof( size_t ) = 2 32-bit platforms - sizeof( size_t ) = 4 64-bit platforms - sizeof( size_t ) = 8 As an extension of that universal type I use the following...

SergeyKostrov · ‎04-19-2017

 // Data Type Defines - Platform Dependent     // Sizeof     // Sizeof
                                               // 16/32-bit  // 64-bit
 #define RTsize_t      size_t                  //  4         // 8

 #if ( defined ( _M_IX86_16 ) )
 typedef short    _RTV64C  RTssize_t;          //  2         // N/A
 typedef unsigned short  _RTV64C  RTusize_t;   //  2         // N/A
 #endif

 #if ( defined ( _M_IX86 ) )
 typedef __int32    _RTV64C  RTssize_t;        //  4         // N/A
 typedef unsigned __int32 _RTV64C  RTusize_t;  //  4         // N/A
 #endif

 #if ( defined ( _M_X64 ) || defined ( _M_AMD64 ) || defined ( _M_AMDX86_64 ) )
 typedef __int64       RTssize_t;              //  N/A       // 8
 typedef unsigned __int64    RTusize_t;        //  N/A       // 8
 #endif

 #if ( defined ( _M_IA64 ) )
 typedef __int64       RTssize_t;              //  N/A       // 8
 typedef unsigned __int64    RTusize_t;        //  N/A       // 8
 #endif

 #if ( defined ( _M_ARM ) || defined ( _M_SH ) ||           \
    defined ( _M_MRX000 ) || defined ( _M_ALPHA ) || defined ( _M_PPC ) ||    \
    defined ( _M_M68K ) )
 typedef __int32    _RTV64C  RTssize_t;        //  4         // N/A
 typedef unsigned __int32 _RTV64C  RTusize_t;  //  4         // N/A
 #endif

 #if ( defined ( _M_QUANTUM ) )
 // Defined by a Host Platform selected
 #endif

Matthias_D_ · ‎04-19-2017

Thank you all for your inputs. I updated my original post to fix the small mistakes in the original code that were pointed out (this is a small test code extracted from a larger application).

My concern is that despite the workarounds, this issue is very hard to debug for the developer, including an incorrect vectorization report.

jimdempseyatthecove · ‎04-20-2017

Now that you know what to look for, most of the issues may be caught with a visual walk-through. I do not know if there is an option that can generate a report that suggest changes to choice of loop index variable type. Sometimes you will find that the std:: templates will give you problems (with respect to use of size_t as opposed to intptr_t).

Jim Dempsey

SergeyKostrov · ‎04-20-2017

>>...if there is an option that can generate a report that suggest changes to choice of loop index variable type... Use opt-report[ ={0-5} ] with n greater than 3 to get detailed information. I agree that these reports very often simply report what is going on and do not suggest explicitly what needs to be changed ( in some cases... ).

McCalpinJohn · ‎04-20-2017

This is an especially irritating case because the loop does vectorize for smaller values of the SIZE parameter, and while the transition occurs at an "interesting" value (2^30 elements per array == 2^32 Bytes per array), it is not at all obvious why this should have an impact on vectorization -- especially in a code that uses only 64-bit variables.

@Sergey -- yes, there is definitely an issue with signed vs unsigned types here. Expanding the code to use all four combinations of 64-bit signed/unsigned loop index variables with 64-bit signed/unsigned loop limits shows that any use of an unsigned value inhibits vectorization when the SIZE value is set to 2^30 (but not for smaller values).

#include <stdlib.h>
#include <stdio.h>
#define USIZE 1024*1024*1024ULL
#define SSIZE 1024*1024*1024LL
int main(int argc, char const *argv[])
{

	unsigned long long i;
	long long j;
    float *A,*B,*C;
	float MiB;

	printf("sizeof(i) is %lu\n",sizeof(i));
	MiB = (float) sizeof(float) * (float)USIZE / 1024.0 / 1024.0;
	printf("MiB per array is %f\n",MiB);

    A = malloc(sizeof(float) * USIZE);
    B = malloc(sizeof(float) * USIZE);
    C = malloc(sizeof(float) * USIZE);
	printf("done with mallocs\n");

    for (i=0; i<USIZE; i++) {			// line 22
		A = 1.0;
		B = 2.0;
		C = 0.0;
	}
	printf("done with initialization\n");

	// unsigned index, unsigned loop limit	// line 30
    for (i=0; i<USIZE; i++)
        C = A + B;

	// unsigned index, signed loop limit	// line 34
    for (i=0; i<SSIZE; i++)
        A = B + C;

	// signed index, unsigned loop limit	// line 38
    for (j=0; j<USIZE; j++)
        B = C + A;

	// signed index, signed loop limit	// line 42
    for (j=0; j<SSIZE; j++)
        C = A + B;

	i = USIZE/2;
    printf("done %ld %f\n", SSIZE, C);

    return 0;
}

The code was compiled with icc 16.0.1 (17.0.2 showed identical behavior):

icc -O -xCORE-AVX2 -qopt-report-phase=vec test.c -o test

The test.optrpt file shows that only the last loop (signed index variable, signed loop limit) is vectorized when SIZE is 2^30.

    Report from: Vector optimizations [vec]

LOOP BEGIN at test.c(42,5)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at test.c(42,5)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at test.c(42,5)
<Remainder loop for vectorization>
LOOP END

Non-optimizable loops:

LOOP BEGIN at test.c(22,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
LOOP END

LOOP BEGIN at test.c(30,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
LOOP END

LOOP BEGIN at test.c(34,5)
   remark #15523: loop was not vectorized: loop control variable i was found, but loop iteration count cannot be computed before executing the loop
LOOP END

LOOP BEGIN at test.c(38,5)
   remark #15523: loop was not vectorized: loop control variable j was found, but loop iteration count cannot be computed before executing the loop
LOOP END

When I reduce the value of SIZE in the program above from 1024*1024*1024 to 1023*1024*1024, every loop vectorizes.

Weird things happen when the vector length is increased. For example, if I set the SIZE variables to 2^31 (or larger), the loop at 42 fails to vectorize with the erroneous message:

LOOP BEGIN at test.c(42,5)
remark #15315: loop was not vectorized: estimated number of iterations (1) is insufficient for vectorization
LOOP END

It vectorizes at SIZE=(2^31-1), which makes me suspect the compiler accidentally used a 32-bit variable in this part of the analysis?

McCalpinJohn · ‎04-20-2017

Update -- I was surprised not to have run into the failure of vectorization on loops with explicit lengths greater than 2^32-1 with the STREAM benchmark, so I went back and re-tested it.

It turns out that the problem *does* show up in the STREAM benchmark, but only in loops that don't contain OpenMP directives (or if the code is compiled without OpenMP support). As in the case above, a STREAM_ARRAY_SIZE of 2^31-1 vectorizes fine, while a STREAM_ARRAY_SIZE of 2^32 or larger does not vectorize. The optimization report messages are the same as I saw in the example above, e.g.:

LOOP BEGIN at stream_5-10.c(327,2)
remark #15315: loop was not vectorized: estimated number of iterations (1) is insufficient for vectorization
LOOP END

Since I don't think I have ever compiled STREAM with an array size >2^32 without also including the -qopenmp flag, I did not notice this before. This error has a minor impact on the current version of STREAM (5.10) -- the main loop in the validation routine will not vectorize due to this error if the STREAM_ARRAY_SIZE variable is 2^32 or larger. This does not change the reported performance, but it does slightly increase the time required for the validation. I don't know why I did not include an OpenMP directive on that loop before -- the next release will include the appropriate OpenMP pragma parallelize the main loop in the validation routine.

I filed a bug report using the new support system:

Case#: 02761654
Case Subject: Bug in the compiler vectorization phase with large constant loop bounds.

jimdempseyatthecove · ‎04-21-2017

Good example John,

It would seem like the compiler is internally using 32-bit integers for qualification of loop counts as opposed to using the bit size of the values (or expressions) used in the loop. I would consider this a compiler bug.

Also, in the cases where 64-bit indexes are used (and the compiler recognizes this) but where the generated code may (could) produce code that requires 32-bit indexing (e.g. gather instructions of 32-bit values), that there may be an issue. However, in these cases, since the compiler can generate multiple code paths for different instruction sets (and/or peel and remainder codes), the compiler could equally well partition the loop into sections that work within 32-bit indexing, then span the intervening interval using lesser vectorized code, then resume fully vectorized.

Jim Dempsey

SergeyKostrov · ‎04-21-2017

>>... >>#define USIZE 1024*1024*1024ULL >>#define SSIZE 1024*1024*1024LL >>... How often in practice such declarations are used? Maybe in academics it is a defacto-standard but in a real software engineering real variables, 32-bit, 64-bit, etc types, need to be used to support a concept of dynamic programming.

SergeyKostrov · ‎04-21-2017

>>...STREAM benchmark... Where could I download the latest set of sources?

McCalpinJohn · ‎04-24-2017

The current version of STREAM in C is at

http://www.cs.virginia.edu/stream/FTP/Code/stream.c
- This standard version of STREAM uses statically allocated arrays to avoid a number of optimization problems, but in some environments this causes additional difficulties with very large arrays (e.g., requiring "-mcmodel=medium").
http://www.cs.virginia.edu/stream/FTP/Code/Versions/stream_5-10_posix_memalign.c
- This code is equivalent to the standard version, but uses "posix_memalign" for dynamic allocation.

Both versions use the preprocessor variable STREAM_ARRAY_SIZE to determine the array size at compile time. If this value is larger than 2^32-1, the definition will need to include a length suffix ("L" or "LL") to avoid trouble. (All of the systems that I use have 64-bit "long" variables in C. It is possible that systems with 32-bit "long" won't work as expected.)

The code is configured to support 64-bit array indices (using "ssize_t" for the index variables), and has been tested with array sizes of up to 4.08e10 (38.6 bits of index, 41.6 bits of byte address for each array, 43.2 bits of byte address for the 3 arrays together). This test used tiny bit more than 8.9 TiB of memory on a 16 TiB system.

In the standard version of stream.c, the loop starting at line 465 should have an OpenMP parallel for pragma, but doesn't. For STREAM_ARRAY_SIZE of 2^32 or greater this one won't be vectorized. The loops at 492, 511, 530 have the same problem, but don't typically get executed (they only exist to count the number of errors if the error norm exceeds the maximum allowed value).

SergeyKostrov · ‎04-25-2017

>>...The current version of STREAM in C is at... Thank you, John! It is a very small piece of codes...

SergeyKostrov · ‎04-26-2017

>>...The current version of STREAM in C is at... John, I will upload some performance results ( to Watercooler forum ) and let you know as soon as it is done. I think a benchmark that is not dependent on vectorization of inner loops is needed. The problem is that if an inner loop is Not vectorized and doesn't use OpenMP processing then performance numbers, that is memory bandwidth, are significantly lower when compared to theoretical numbers. For example, ~25GB/s ( theoretical ) vs. ~13GB/s ( with vectorization and OpenMP ) vs. ~4GB/s ( without vectorization and without OpenMP ) do not look good.

McCalpinJohn · ‎04-26-2017

The STREAM benchmark has always been intended to provide an indication of performance of simple, vectorizable code when run through real compilers. It is intended to be relatively easy to optimize, but performance differences between STREAM and "best case, hand-optimized" core are a feature, not a bug!

The performance numbers of 25 GB/s peak, 13 GB/s vectorized OpenMP, and 4 GB/s non-vector, single thread are a little unusual -- what sort of platform did you measure those on? The 25 GB/s peak looks like a dual-channel configuration with DDR3/1600. I have results from a Xeon E3-1270 (Sandy Bridge, 3.4 GHz, 4-core, 2-channel DDR3/1333) that has a peak BW of 21.33 GB/s and gets 13 GB/s with a single thread without streaming stores and 18 GB/s with a single thread with streaming stores. (Note that the 13 GB/s Triad value corresponds to an actual data traffic of 13*4/3=17.3 GB/s when write allocates are included, so it is in the expected range.) I don't have any results without vectorization on that system, but based on the Xeon E5-2680 (Sandy Bridge EP) results I would not expect to see any degradation from disabling vectorization for the single thread case (relative to the case without streaming stores -- more on this below).

Most of the Intel platforms that I have tested recently have relatively low penalties for failure to vectorize. (One exception is Intel's requirement for vectorization to get streaming stores, since Intel processors do not support a scalar streaming store instruction for data in xmm/ymm/zmm registers.) On Xeon E5 (v1), the best single-thread performance was obtained with vectorization disabled (though the difference was quite small). For Xeon E5 v3, vectorization is essential -- single-thread STREAM Triad performance improves from 12.8 GB/s with scalar code to 18.7 GB/s with SSE4.1 (128-bit) vectorization and to 18.9 GB/s with AVX2 (256-bit) vectorization.

A similar ratio applies to single-core performance on the Xeon Phi x200 (Knights Landing), but the overall story is much more complex.... If a user does not understand the need to use lots of threads to get memory bandwidth, then they probably should not be running on this system.

Late last week I tracked down the cause of a slight performance degradation with STREAM on 2-socket Intel systems with HyperThreading enabled and KMP_AFFINITY=scatter. We had observed that this was about 2% slower than we expected, but it was not obvious why. Using the Home Agent performance counters in the Xeon E5 v3 uncore, I was able to show that Transparent Huge Pages were the problem. With KMP_AFFINITY=scatter, consecutive threads are placed on alternating sockets, but with Transparent Huge Pages the entire 2MiB page is placed on the socket where the thread is running that is the first to access the page (when zeroing the array at the start of the program). So if the last cache line of a 2MiB page belongs to thread 1 running on socket 1, then the 2MiB page will be located on socket 1, even though thread 0 running on socket 0 is responsible for processing every other cache line in the page. For this test case, each array was composed of 80,000,000 doubles, which works out to 12.7 2MiB pages for each of the 24 threads to process in each array. The observed remote access rate was 4%, which seems reasonable for a 12.7-page "chunk size", and is consistent with the observed 2% performance degradation. The expected performance was restored by disabling large pages (making each "chunk" about 6500 4KiB pages) or by creating an explicit proclist that put the first half of the threads on separate cores on socket 0 and the second half of the threads in separate cores on socket 1. Either of these reduced the remote access rate to under 0.2%. This is exactly the sort of unexpected (performance-degrading) interaction of features that STREAM has always been intended to help find.....