Saidani__Tarik · ‎09-06-2012

Hi,

I've experienced a strange behaviour of the Intel Fortran compiler (ifort version 12.1.4) that I think is a bug.

The loop with the label 10 in the source code below is not vectorized by the 32-bit compiler because of a dependency, whereas it is vectorized by the 64-bit compiler. This leads to differences in the output between the binary built with the 32-bit compiler and the one built with the 64-bit compiler. Inserting a !DEC$ NOVECTOR before the loop prevents the compiler from vectorizing it, and the 64-bit version then no longer shows differences from the 32-bit one.

Could you please confirm that this is a bug in the compiler? If it is not, could you give me some hints on how to avoid the issue?

Many thanks,
Tarik

[fortran]
      SUBROUTINE CONV_DP (A, IA, B, IB, C, IC, N, M)
C
      DOUBLE PRECISION A(*), B(*), C(*)
      INTEGER IA, IB, IC, N, M
C
      INTEGER AI
C        Pointer to A vector
      INTEGER AJ
C        Pointer to A vector
      INTEGER BI
C        Pointer to B vector
      INTEGER CI
C        Pointer to C vector
      INTEGER I, J
C        Indices for DO loops
C
C     Verify parameters
C
      IF (N .LE. 0) GOTO 9999
      IF (M .LE. 0) GOTO 9999
      IF (IC .EQ. 0) GOTO 9999
C
C     Initialize pointers
C
      AI = 1
      BI = 1
      CI = 1
C
C     To insure that C may overlay A, do the following
C
      IF (M .GT. 1) THEN
C
         DO 20 I=1, N
            BI = 1
            AJ = AI
            C(CI) = A(AJ) * B(BI)
C
            DO 10 J=1, M-1
               AJ = AJ + IA
               BI = BI + IB
               C(CI) = C(CI) + A(AJ) * B(BI)
   10       CONTINUE
C
            AI = AI + IA
            CI = CI + IC
   20    CONTINUE
C
      ELSE
         CALL VSMUL_DP (A, IA, B, C, IC, N)
      ENDIF
C
 9999 CONTINUE
      RETURN
      END
[/fortran]
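
For reference, the NOVECTOR workaround mentioned above could look roughly like the following. This is only a minimal sketch in fixed-form source: the function DOT_STRIDED and the driver NOVEC_DEMO are hypothetical names that reproduce just the inner reduction, not the original routine. In fixed form the directive prefix has to start in column 1, and it applies to the DO loop that immediately follows it.

[fortran]
      PROGRAM NOVEC_DEMO
C     Hypothetical driver; data chosen only for illustration.
      DOUBLE PRECISION A(8), B(8), S
      DOUBLE PRECISION DOT_STRIDED
      INTEGER K
      DO 5 K = 1, 8
         A(K) = DBLE(K)
         B(K) = 1.0D0
    5 CONTINUE
C     Elements 1, 3, 5, 7 of A times elements 1..4 of B
      S = DOT_STRIDED (A, 2, B, 1, 4)
      PRINT *, S
      END
C
      DOUBLE PRECISION FUNCTION DOT_STRIDED (A, IA, B, IB, M)
      DOUBLE PRECISION A(*), B(*)
      INTEGER IA, IB, M
      INTEGER AJ, BI, J
      DOUBLE PRECISION S
      AJ = 1
      BI = 1
      S = A(AJ) * B(BI)
C     Ask the compiler not to vectorize the next loop only
!DEC$ NOVECTOR
      DO 10 J = 1, M-1
         AJ = AJ + IA
         BI = BI + IB
         S = S + A(AJ) * B(BI)
   10 CONTINUE
      DOT_STRIDED = S
      RETURN
      END
[/fortran]

Compilers that do not know the directive treat the line as an ordinary comment, so the source stays portable.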

Heinz_B_Intel · ‎09-06-2012

I tested your code using the 32-bit and 64-bit versions of release 12.1 Update 5 and the just-released 13.0 compiler. In all cases the loop gets vectorized as expected. It may well be that your version (12.1 Update 4) was the last one that showed the problem; I don't currently have this compiler installed to test it. Please update to a newer release. If you still see the difference, please send me the exact compilation line with all the arguments you used.

jimdempseyatthecove · ‎09-06-2012

Unless this routine is inlined (e.g. via IPO), the strides (IA, IB and IC) would be unknown, and in particular not known to be 1. Therefore vectorization would not be possible unless you have a CPU and code generation capability supporting scatter/gather. I am not sure why the vectorization report lists the loops as vectorized (except when the strides are known to be 1). Please explain. Jim Dempsey

TimP · ‎09-06-2012

Recent compilers have introduced "simulated gather/scatter" (primarily for SSE4 and newer architectures) which could perform scalar moves to and from memory but perform the arithmetic by parallel instructions and report vectorization. I see nothing here which would indicate a possibility of numerical differences, unless you have violated the Fortran standard by overlapping the arguments, as the comment suggests. If you do that, you must set /assume:dummy_aliases to tell the compiler you aren't adhering to true Fortran. Yes, I know a few experienced developers who disbelieve this, even though the standard has implied this for 45 years. Other than that, too much is left unspecified here to draw a conclusion. I don't see how the code could be compiled as presented, although guesses could be made about the intent.

TimP · ‎09-06-2012

Well, there's another annoyance with the change in the forum format: no way to know whether we are in the Linux or Windows forum while posting a reply. Linux spelling of the option: -assume dummy_aliases

Heinz_B_Intel · ‎09-07-2012

Replying to Jim: There is no need for a "real" gather/scatter, since the compiler can verify that all array expressions in the loop body are linear in the loop index (J): CI, AJ and BI are "induction" variables. This still doesn't allow "vector" ("packed") loads and stores, but the address computation is simple and allows rather efficient individual ("scalar") data moves, using a MOVSD and a MOVHPD for one 128-bit SSE "block". The computation can be done packed. Thus the loop is "vectorized", at least to a large extent, which is why the compiler emits the corresponding message.
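
To illustrate why a packed reduction can give a different result from the scalar loop, here is a scalar emulation of a 2-wide reduction. It is only a sketch and not the code the compiler actually generates; the program name LANES and the data are made up to make the rounding visible. Each "lane" accumulates every second product, and the lanes are added at the end, so the additions happen in a different order than in the source loop.

[fortran]
      PROGRAM LANES
C     Hypothetical example; data chosen to make rounding visible.
      INTEGER M
      PARAMETER (M = 6)
      DOUBLE PRECISION A(M), B(M)
      DOUBLE PRECISION SSEQ, S0, S1
      INTEGER J
      DATA A /1.0D16, 1.0D0, -1.0D16, 1.0D0, 1.0D0, 1.0D0/
      DATA B /6 * 1.0D0/
C     Sequential reduction, in source order
      SSEQ = 0.0D0
      DO 10 J = 1, M
         SSEQ = SSEQ + A(J) * B(J)
   10 CONTINUE
C     Two-lane reduction: odd elements in S0, even elements in S1
      S0 = 0.0D0
      S1 = 0.0D0
      DO 20 J = 1, M, 2
         S0 = S0 + A(J)   * B(J)
         S1 = S1 + A(J+1) * B(J+1)
   20 CONTINUE
      PRINT *, 'sequential:', SSEQ
      PRINT *, 'two lanes: ', S0 + S1
      END
[/fortran]

With strict IEEE double-precision evaluation (for example at -O0, or with a value-safe floating-point model) the sequential sum should come out as 3 and the two-lane sum as 4 for this data. With x87 extended precision, or if the compiler itself reorders the first loop, the two numbers may agree.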

Saidani__Tarik · ‎09-07-2012

Hi Heinz,

I've tried the new version of the compiler (ifort version 13.0.0) and the problem is still there, i.e. not vectorized on 32-bit and vectorized on 64-bit. The compiler flags that I use are listed below:

FFLAGS := -O3 -g -msse3 -axSSE4.2,AVX -ip -warn all -zero -align all \
          -nogen-interfaces -extend-source -vec-report1 \
          -openmp-report2 -auto \
          -funroll-loops -finline-functions -debug extended -traceback \
          -fp-model precise -fp-model source -fp-stack-check \
          -fstack-security-check -fimf-arch-consistency=true \
          -check arg_temp_created -check uninit -check pointers \
          -check format -check output_conversion -falign-stack=maintain-16-byte

Many thanks,
Tarik

TimP · ‎09-07-2012

Just to repeat, in case you didn't read my previous reply: if you want to suppress vectorization which would run afoul of a violation of the Fortran standard on aliasing of dummy arguments, you must include -assume dummy_aliases in your option list. It's not a bug in the compiler when it performs optimizations in accordance with the Fortran standard. If you are turning off gen-interfaces because it is detecting violations, please turn it on and fix those violations. You would have to compile with interprocedural analysis between caller and callee if you want the compiler to have a chance to detect an aliasing violation or take it into account without the option. Even in that case, I doubt you can count on satisfactory results unless you are willing to make your program standard compliant.
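
As an illustration of the aliasing rule referred to above, here is a minimal sketch with hypothetical names (NOALIAS, SCALE2) that are not part of the original code. Passing the same array as both an input and an output dummy argument is what the standard disallows when one of them is written; reading from a temporary copy is one way to stay conforming without -assume dummy_aliases.

[fortran]
      PROGRAM NOALIAS
C     Hypothetical example; SCALE2 stands in for any routine
C     that reads one dummy array and writes another.
      INTEGER N
      PARAMETER (N = 4)
      DOUBLE PRECISION X(N), TMP(N)
      INTEGER K
      DO 1 K = 1, N
         X(K) = DBLE(K)
    1 CONTINUE
C     Non-conforming: X would be read and written via two dummies
C        CALL SCALE2 (X, X, N)
C     Conforming alternative: read from a copy, write to X
      DO 2 K = 1, N
         TMP(K) = X(K)
    2 CONTINUE
      CALL SCALE2 (TMP, X, N)
      PRINT *, X
      END
C
      SUBROUTINE SCALE2 (A, C, N)
      INTEGER N, K
      DOUBLE PRECISION A(*), C(*)
      DO 10 K = 1, N
         C(K) = 2.0D0 * A(K)
   10 CONTINUE
      RETURN
      END
[/fortran]

The copy costs extra memory traffic, so whether it is preferable to the compiler option depends on the application.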

jimdempseyatthecove · ‎09-07-2012

To test for a potential compiler optimization problem, can you allocate (or declare) arrays A and B 1 (SSE) or 3 (AVX) elements larger, i.e. add padding, and zero these additional elements, but otherwise use the correct in-use value M for the array sizes? If this corrects the result, it would indicate that the synthesized gather for C(CI) = C(CI) + A(AJ) * B(BI) is in error, in other words that it is pulling in junk data from beyond the end of the A and B arrays. The padding with 0.0 would be benign (it would not introduce an error in the result, though the code would still be incorrect). Jim Dempsey
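
A possible shape for such a padding test is sketched below. Everything here is an assumption made for illustration: the harness PADTEST, the sizes, the unit strides, and CONVK, which is a reduced copy of the M > 1 branch of CONV_DP so that the example is self-contained. For a real test the original routine would be called instead, with the same compiler options that show the problem.

[fortran]
      PROGRAM PADTEST
C     Hypothetical harness; sizes and data are illustrative only.
      INTEGER N, M, PAD, LA, LB
      PARAMETER (N = 4, M = 3, PAD = 3)
C     In-use lengths for unit strides
      PARAMETER (LA = N + M - 1, LB = M)
      DOUBLE PRECISION A(LA+PAD), B(LB+PAD), C(N)
      INTEGER K
C     Zero everything, including the padding elements
      DO 1 K = 1, LA + PAD
         A(K) = 0.0D0
    1 CONTINUE
      DO 2 K = 1, LB + PAD
         B(K) = 0.0D0
    2 CONTINUE
C     Fill only the in-use part
      DO 3 K = 1, LA
         A(K) = DBLE(K)
    3 CONTINUE
      DO 4 K = 1, LB
         B(K) = 1.0D0 / DBLE(K)
    4 CONTINUE
C     Pass the true length M, not M + PAD
      CALL CONVK (A, 1, B, 1, C, 1, N, M)
      PRINT *, (C(K), K = 1, N)
      END
C
C     Reduced copy of the M > 1 branch of CONV_DP
      SUBROUTINE CONVK (A, IA, B, IB, C, IC, N, M)
      DOUBLE PRECISION A(*), B(*), C(*)
      INTEGER IA, IB, IC, N, M
      INTEGER AI, AJ, BI, CI, I, J
      AI = 1
      CI = 1
      DO 20 I = 1, N
         BI = 1
         AJ = AI
         C(CI) = A(AJ) * B(BI)
         DO 10 J = 1, M-1
            AJ = AJ + IA
            BI = BI + IB
            C(CI) = C(CI) + A(AJ) * B(BI)
   10    CONTINUE
         AI = AI + IA
         CI = CI + IC
   20 CONTINUE
      RETURN
      END
[/fortran]

Comparing the output of a build with PAD = 0 against one with PAD = 3 would then show whether elements beyond the in-use range influence the result.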

Heinz_B_Intel · ‎09-10-2012

Hi Tarik,

Yes, I can now confirm the difference using 13.0. The key switches are "-O3 -fp-model source"; all the rest is irrelevant here. However, in both cases I see the code being vectorized, but differently: for 32-bit the outer loop is vectorized, while for 64-bit the inner loop is vectorized and unrolled. With -O2, no vectorization is done at all for 32-bit; for 64-bit there is no change compared to -O3.

Please note that this in general is not a compiler fault. Since 64-bit has twice as many registers (both general-purpose and SSE), the code generation heuristics must differ; for example, the number of available SSE registers is a key parameter in deciding whether or not to vectorize. The report of the dependency for IA32 is correct too: the reduction inherently creates a dependence which can only be resolved by vectorizing the code, which isn't done here.

However, I am also wondering why "fp-model source" does not enforce the same numerical result on both architectures. Since I can't run the code (no complete application), please do a check for me: compile the code for 64-bit once without any fp-model switch and once with "fp-model source" (in both cases using -O3), and let me know whether the results differ. If they do, that would certainly be something to fix.

Heinz

Saidani__Tarik · ‎09-10-2012

Hi Heinz, I have tried building with and without fp-model source. Indeed, this makes a difference: without it, the results are the same, whereas with fp-model source they differ between 32-bit and 64-bit. I have attached two files for you to test: main.c (the driver) and convo_dp.f (the convolution code, called from main). Many thanks, Tarik

Heinz_B_Intel · ‎09-10-2012

Hi Tarik, thanks. I will have a closer look and come back to you. Heinz

Heinz_B_Intel · ‎09-11-2012

Hi Tarik, I looked at it again and I think I now understand the issue you reported. See my attachment bug.tar, which is a reduced version of your code (I replaced the driver with Fortran code). It compiles to incorrect code in the vectorizer on Intel64 but compiles correctly for IA32. The "fp-model source" switch is needed to show the bug. I have escalated this issue to engineering with high priority; the case number is DPD200236211. I don't have any workaround other than disabling vectorization for the loop nest (unless you can do without the fp-model switch). I will inform you here when the issue is fixed. Please have a look at the attachment and let me know whether there is an additional problem related to this sample code. Thanks a lot for submitting such a small reproducer. Heinz

Saidani__Tarik · ‎09-13-2012

Hi Heinz, thanks for your answer. I'll put a !DEC$ NOVECTOR in the code in the meantime. Regards, Tarik

Heinz_B_Intel · ‎02-25-2013

The problem is fixed in the latest compiler release, available from registrationcenter.intel.com (download package l_fcompxe_2013.2.146). I will mark this thread as closed.

different behaviour of the vectorizer between 32 and 64-bit compiler