Optimization causes array index out of bounds

Alexander_K_2 · 03-19-2014

I'm still testing the ffte benchmark. Running it on the host with -O3 -xAVX -openmp works flawlessly and the performance looks great. Now I wanted to use that code on the Intel Xeon Phi. So I replaced the -xAVX option with -mmic in order to create a native binary. With ulimit and KMP_STACKSIZE I increased the stack to avoid a stack overflow.

Running the code on the Phi gives the following error (-g -traceback):
$ ./speed1d
N =
10
forrtl: severe (154): array index out of bounds
Image              PC                Routine            Line        Source
speed1d            000000000049252B Unknown               Unknown Unknown
speed1d            0000000000490EB4 Unknown               Unknown Unknown
speed1d            000000000045BE07 Unknown               Unknown Unknown
speed1d            000000000043AAC5 Unknown               Unknown Unknown
speed1d            000000000040F721 Unknown               Unknown Unknown
libpthread.so.0    00007F47F8B1B800 Unknown               Unknown Unknown
speed1d            000000000040AD3F fft5a_                    170 kernel.f
speed1d            00000000004061A3 fft235_                   148 fft235.f
speed1d            0000000000403F25 zfft1d_                    56 zfft1d.f
speed1d            0000000000403634 MAIN__                     31 speed1d.f
speed1d            000000000040346C Unknown               Unknown Unknown
libc.so.6          00007F47F85CF634 Unknown               Unknown Unknown
speed1d            0000000000403369 Unknown               Unknown Unknown

Compiling with -O2 gives the same error. With -O1 or -O0 everything works without any problems. When I ran the -O3 build under gdb on the Phi, I noticed that the input variable N is somehow not correctly initialized after the READ call in the code, so the out-of-bounds access reported later in the program may be a consequence of that.

I tried -heap-arrays without effect. Using -check all didn't help because it disables optimization, so the problem no longer shows up.

Can you point out the direction in which I need to investigate?

Ron_Green · 03-19-2014

I need more information. First, compiler version and MPSS version.

Next, what is your setting for OMP_NUM_THREADS on the Phi and on the host where it worked? Is it the same number of threads?

Finally, it would help if I could try this myself. Is this the right link: http://openbenchmarking.org/test/pts/ffte ; and are you using 1.0.1 or the older 1.0.0 version?

ron

Alexander_K_2 · 03-19-2014

Hi Ron,

I am using the latest FFTE version, 6.0, from here: http://ffte.jp/

The compiler reports ifort (IFORT) 14.0.1 20131008. The MPSS is version 2.1 I think.

Originally I had not set the OMP_NUM_THREADS. I just did a quick test with OMP_NUM_THREADS=4 on the host and on the mic. The error still appears on the mic.

I hope this information helps.

Alexander_K_2 · 03-20-2014

I ran some additional debugging today. Running the program with an input size (N) that is a multiple of 16, so that the access matches the cache line size, gives no error. Other values still do. When I remove all "!DIR$ VECTOR ALIGNED" directives from the code, then it works fine and performs better than at the -O1 level.

Is this a compiler problem or a code problem? The arrays are not declared aligned but the error still exists when the code is compiled with -align or -align array64byte. The option -opt-assume-safe-padding was never specified.

jimdempseyatthecove · 03-20-2014

The error message: forrtl: severe (154): array index out of bounds

Is not generated due to an alignment issue, rather this is an indication that you have enabled runtime array bounds checking (good for development testing) and some section of code is accessing the array with an index that is out of the range of permitted indexes. This test is performed by additional code (asserts) inserted by the compiler.

If the input variable N is being misread, e.g. read as 0 or negative number (possibly leftover junk), a declaration of an array or allocation, with size of .le. 0 would yield an empty array (any reference is out of bounds).

Note, had you not had array bounds checking enabled, your program may have run without noticeable error, trashing code and data in the process.

Note, it might be advised to add a sanity check on the input data such as N, and issue a meaningful error message.
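A minimal sketch of such a sanity check, assuming the size is read with an IOSTAT specifier. The module name, the function name size_is_valid, and the upper limit nmax are hypothetical and not part of FFTE:

```fortran
! Hypothetical helper, not part of FFTE: validate a size read from input.
module input_check
  implicit none
contains
  ! .true. only if the READ succeeded (ios == 0) and n lies in [1, nmax].
  logical function size_is_valid(n, ios, nmax)
    integer, intent(in) :: n      ! value delivered by the READ
    integer, intent(in) :: ios    ! IOSTAT returned by the READ
    integer, intent(in) :: nmax   ! assumed largest size the arrays can hold
    size_is_valid = (ios == 0) .and. (n >= 1) .and. (n <= nmax)
  end function size_is_valid
end module input_check
```

In the driver this could be used right after something like READ (5,*,IOSTAT=IOS) N, printing N and stopping with a clear message when the function returns .false., so that a misread N is reported before any array is touched.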

Jim Dempsey

jimdempseyatthecove · 03-20-2014

By the way, do not run your multi-threaded code with the stack size set to unlimited or to overly large values.

Stack size is not dynamic, it is fixed at thread startup (or specified at thread startup using different api). If the first thread gets all of (or half of) virtual memory for stack, where do the 2nd and later threads get their stack from?

Jim Dempsey

TimP · 03-20-2014

Since the task runs OK on the host (presumably with somewhere between 4 and 20 threads), an increase in ulimit -s for MIC is more likely to be useful than an increase in OMP_STACKSIZE (same thing as KMP_STACKSIZE). The latter would default to 4MB on both host and MIC. As Jim hinted, if you have say 120 threads on MIC, increasing the stack of each thread by 4MB would consume 480MB more of what you allowed in ulimit.
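The arithmetic behind that estimate is simply thread count times per-thread stack size. A small sketch, with hypothetical names, that works in whatever unit the per-thread size is given:

```fortran
! Hypothetical helper: total stack reserved by a team of threads.
module stack_budget
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Result is in the same unit as per_thread (bytes, MB, ...).
  pure function total_thread_stack(nthreads, per_thread) result(total)
    integer, intent(in)        :: nthreads
    integer(int64), intent(in) :: per_thread
    integer(int64)             :: total
    ! Convert first so the product is formed in 64-bit arithmetic.
    total = int(nthreads, int64) * per_thread
  end function total_thread_stack
end module stack_budget
```

With 120 threads and 4 units per thread the function returns 480 units, which is the figure quoted above.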

I never heard a resolution of discussions about making it easier to increase ulimit -s on MIC (or simply making the default bigger).

Alexander_K_2 · 03-21-2014

Indeed, overly large values for KMP_STACKSIZE are not advisable. I did not see increasing the stack size as a solution to the problem, but since the arrays are created on the stack I need to increase it in order to run the program at all.

What this does not explain (or I just don't get it) is that the unchanged code always fails with -O2 or -O3, but works when I remove the "!DIR$ VECTOR ALIGNED" statements.

Another interesting fact: when I run the code with OMP_NUM_THREADS=1, I get good performance on both the MIC and the host CPU. But when I run the code, e.g. on the host, with as many threads as cores, taskset reports that they get pinned to two cores and the performance decreases by a factor of 10. I have not found the reason for this yet.

jimdempseyatthecove · 03-23-2014

"!DIR$ VECTOR ALIGNED"

Is a guarantee by you to the compiler that the data is indeed aligned. As such, the compiler can generate aligned vector instructions without inserting a preamble in front of the loop that tests the data alignment (and takes a different code path depending on the result).

Should your data not be aligned, then the aligned data instructions would GP fault.
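One way to find out whether that guarantee actually holds is to test an array's starting address at run time. A minimal sketch using ISO_C_BINDING; the module and function names are hypothetical, and the element type is assumed to be double precision real rather than whatever FFTE uses:

```fortran
! Hypothetical helper: test whether an array starts on a given byte boundary.
module align_check
  use, intrinsic :: iso_c_binding, only: c_double, c_intptr_t, c_loc
  implicit none
contains
  ! .true. if the address of a(1) is a multiple of boundary bytes.
  logical function is_aligned(a, n, boundary)
    integer, intent(in)                :: n
    real(c_double), intent(in), target :: a(n)
    integer, intent(in)                :: boundary   ! e.g. 64
    integer(c_intptr_t) :: addr
    ! Reinterpret the C address of the array as an integer.
    addr = transfer(c_loc(a), addr)
    is_aligned = (modulo(addr, int(boundary, c_intptr_t)) == 0)
  end function is_aligned
end module align_check
```

Calling is_aligned(A, N, 64) on each array used in a loop marked with the directive would show whether the 64-byte assumption is met for the arrays as they are actually passed in.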

*** You are not seeing GP faults ***
*** You are seeing index out of range ***

This suggests that data used for indexing an array is being corrupted, miscalculated, or used before it has been calculated.

Assume some portion of your code is calculating indices and placing them into an array.
Assume further that index[x] depends on index[x-1].

Then this would introduce a temporal (time) dependency in the calculation of the indices, such that it may not be safe to use vectorized code to calculate the indices (without taking the temporal issues into consideration).
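To make that kind of dependency concrete, here is a hypothetical loop in which each index is built from the previous one. It is only an illustration; FFTE's own index computation is not shown in this thread, so this is not claimed to be the failing loop:

```fortran
! Hypothetical illustration of a loop-carried dependence in index setup.
module index_recurrence
  implicit none
contains
  ! Fill start(1:n) with start(1) = 1 and start(i) = start(i-1) + length(i-1).
  subroutine build_starts(length, start, n)
    integer, intent(in)  :: n
    integer, intent(in)  :: length(n)
    integer, intent(out) :: start(n)
    integer :: i
    if (n < 1) return
    start(1) = 1
    do i = 2, n
      ! Iteration i reads the value that iteration i-1 has just written,
      ! so the iterations must execute in order.
      start(i) = start(i-1) + length(i-1)
    end do
  end subroutine build_starts
end module index_recurrence
```

If several iterations of such a loop were executed at once without accounting for the recurrence, later elements of start would be computed from stale values, and any array indexed through them could go out of bounds.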

RE: But when I run the code, e.g. on the host, with as many threads as cores, taskset reports, that they get pinned to two cores and the performance decreases by factor 10. I have not found the reason for this though.

What is your host processor? Does it have HT? Is HT enabled? What are your OMP and KMP environment variables?

Is the host a server where a system administrator may restrict the number of logical processors?

Reduction in performance can be indicative of oversubscription of threads or cache line evictions or thread synchronization issues.

Jim Dempsey

Alexander_K_2 · 03-26-2014

The host is a dual-socket system with two 8-core processors and HT (32 threads). I already see the performance decrease when I set OMP_NUM_THREADS to 2 and KMP_AFFINITY to scatter.

jimdempseyatthecove · 03-26-2014

Can you insert an assert into kernel.f, FFT5A as first statement:

      IF (SIZE(W) .LT. 4*L) THEN
        PRINT *, "SIZE(W) = ", SIZE(W), " .LT. ", 4*L
        STOP
      ENDIF

Jim Dempsey