
Manuel_P_ · ‎05-15-2014

Hi guys,

I condensed our project down to a piece of code that lets you reproduce the following issue. When I compile this in Release configuration (Debug works), I get this compiler error:

1>------ Build started: Project: ng-gtest, Configuration: Release x64 ------
1> CilkTest.cpp
1>" : error : 010101_239
1>
1> compilation aborted for General\CilkTest.cpp (code 4)
========== Build: 0 succeeded, 1 failed, 3 up-to-date, 0 skipped ==========

This is our compiler: Intel(R) C++ Intel(R) 64 Compiler XE for Intel(R) 64, version 14.0.3
Package ID: w_ccompxe_2013_sp1.3.202
OS: Windows 7, x64

This is the code:

    #include <math.h>

    const int VecSize = 8;

    const short* acdata;
    const short* lowdata;
    const unsigned short* meas_data;
    const unsigned short* rdval;
    short trident[2 * VecSize];
    short speed[2 * VecSize];
    float spdfact[VecSize];
    float spdfact2[VecSize];
    float tdat[VecSize];
    float array1[VecSize];
    float array2[VecSize];
    const float *input_01;
    const float *input_02;
    float val0;
    float vvv9;
    float agn;

    void get_g(float ag[VecSize], const float ae[VecSize], const float* pp)
    {
        float a01[VecSize];
        a01[:] = ae[:];
        if (a01[:] >= 360.0f) a01[:] -= 360.0f;
        if (a01[:] >= 360.0f) a01[:] -= 360.0f;
        if (a01[:] < 0.0f) a01[:] += 360.0f;
        if (a01[:] < 0.0f) a01[:] += 360.0f;
        int i0[VecSize], i1[VecSize];
        i1[:] = static_cast<int>(a01[:]);
        i0[:] = i1[:] + 1;
        float g0[VecSize], g1[VecSize];
        g1[:] = pp[i1[:]];
        g0[:] = pp[i0[:]];
        ag[:] = g0[:] - (g1[:] - g0[:]) * (a01[:] - static_cast<float>(i0[:]));
    }

    void f(float prlo[VecSize], const int cntr)
    {
        float cop[VecSize];
        short cod[2 * VecSize];
        short maxm[2 * VecSize];
        cod[0:VecSize:2] = acdata[cntr:VecSize];
        cod[1:VecSize:2] = lowdata[cntr:VecSize];
        maxm[0:VecSize:2] = lowdata[cntr:VecSize];
        maxm[1:VecSize:2] = acdata[cntr:VecSize];
        cop[:] = (1.0f / float(255 * 255)) * static_cast<float>(
            cod[0:VecSize:2] * trident[0:VecSize:2] + cod[1:VecSize:2] * trident[1:VecSize:2]);
        float music[VecSize];
        music[:] = (1.0f / float(16383 * 16383)) * static_cast<float>(
            maxm[0:VecSize:2] * speed[0:VecSize:2] + maxm[1:VecSize:2] * speed[1:VecSize:2]);
        float velo2[VecSize];
        float brigh[VecSize];
        velo2[:] = static_cast<float>(rdval[cntr:VecSize]) * spdfact[:];
        brigh[:] = asinf(velo2[:]);
        float denom[VecSize];
        denom[:] = cop[:] * array1[:] + array2[:] / brigh[:];
        float accel[VecSize];
        accel[:] = atanf(music[:] / denom[:]);
        accel[:] = atan2f(accel[:], velo2[:]);
        bool haMask[VecSize];
        haMask[:] = accel[:] < 0.0f;
        if (haMask[:] & (music[:] > 0.0f)) accel[:] += float(9.81 / 2);
        if (!haMask[:] & (music[:] < 0.0f)) accel[:] -= float(9.81 / 2);
        float diff[VecSize];
        diff[:] = array1[:] * brigh[:] - array2[:] * cop[:] * velo2[:];
        float prod[VecSize];
        prod[:] = sinf(accel[:]) * diff[:];
        float accel2[VecSize];
        accel2[:] = atanf(prod[:] / music[:]);
        float halter[VecSize] = { 0.0f };
        if (cop[:] <= 0.0f) halter[:] = 2.8182963f;
        float valter[VecSize];
        valter[:] = cop[:] > 0 ? velo2[:] - tdat[:] : velo2[:] + tdat[:];
        if (music[:] == 0) {
            accel[:] = halter[:];
            accel2[:] = valter[:];
        }
        accel[:] *= float(9.81);
        accel2[:] *= float(9.81);
        float v8[VecSize];
        v8[:] = 25.7385f - accel2[:];
        float hxx[VecSize];
        hxx[:] = fabsf(accel[:]) * float(1.38e-23);
        float hnn[VecSize];
        hnn[:] = -accel[:];
        float vgg[VecSize], gx2[VecSize], hms[VecSize], sv[VecSize];
        get_g(vgg, accel2, input_02);
        get_g(gx2, v8, input_02);
        get_g(hms, hnn, input_01);
        sv[:] = ((1.0f - hxx[:]) * (val0 - vgg[:])) + (hxx[:] * (vvv9 - gx2[:]));
        float p0[VecSize];
        p0[:] = hms[:] - sv[:];
        prlo[:] = static_cast<float>(meas_data[cntr:VecSize]) * spdfact2[:] - p0[:] - agn;
    }

Barry_T_Intel · ‎05-15-2014

I've been unable to reproduce the problem. I extracted the code into a file, and issued the following command:

bash-3.2$ icl test.cpp
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0 Build 20140303
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

test.cpp
Microsoft (R) Incremental Linker Version 9.00.30729.01
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:test.exe
test.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol main referenced in function __tmainCRTStartup
test.exe : fatal error LNK1120: 1 unresolved externals

So the file was successfully compiled and failed in the linker. The problem may already be fixed. I'm using a nightly compiler build from March 3 which may be newer than the compiler you've got. To be sure, I'll need a Visual Studio build log to reproduce the command line options.

- Barry

Brandon_H_Intel · ‎05-15-2014

I'm using the same compiler with Microsoft Visual Studio* 2013, and I'm not seeing a problem:

1>------ Build started: Project: test, Configuration: Release x64 ------

1> icl /Qvc12 "/Qlocation,link,C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64" /Zi /W3 /O2 /Oi /Qipo /Qftz- -D __INTEL_COMPILER=1400 -D WIN32 -D NDEBUG -D _CONSOLE -D _LIB -D _UNICODE -D UNICODE /EHsc /MD /GS /Gy /Zc:wchar_t /Zc:forScope /Fox64\Release\ /Fdx64\Release\vc120.pdb /TP test.cpp

1>

1> Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.202 Build 20140422

1>

1> test.cpp

========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

Can you turn off the Startup banner (/nologo) and see if your command line matches mine above? And are you using a different Visual Studio version?

Manuel_P_ · ‎05-16-2014

Hi, I think you need vectorization to reproduce the problem. I use Visual Studio Premium 2013. This is my output:

1>------ Build started: Project: ng-gtest, Configuration: Release x64 ------
1> icl /Qvc12 "/Qlocation,link,D:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64" /Zi /W3 /MP /O3 /Oi /Qip /Qftz -D __INTEL_COMPILER=1400 -D WIN32 -D NDEBUG -D _CONSOLE -D NG_EXPORTS -D NG_DLL_ID=1 -D USE_TBB_PARALLEL -D USE_TBB_RWLOCK -D NOMINMAX -D BOOST_FILESYSTEM_NO_DEPRECATED -D _WINDLL -D _SCL_SECURE_NO_WARNINGS -D BOOST_MULTI_INDEX_DISABLE_SERIALIZATION -D NOMINMAX -D _WINDLL -D _VARIADIC_MAX=10 -D _UNICODE -D UNICODE /EHsc /MD /GS /fp:precise /QxSSE3 /Zc:wchar_t /Zc:forScope /Qstd=c++11 /Qrestrict /Fo.\tmp\msvc_x64_ur\ /Fd.\tmp\msvc_x64_ur\vc120.pdb /TP General\CilkTest.cpp
1>
1> Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.202 Build 20140422
1> Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
1>
1> CilkTest.cpp
1> *** Compiling Cilk test code in Debug (fails to compile in Release with w_ccompxe_2013_sp1.3.202.
1>" : error : 010101_239
1>
1> compilation aborted for General\CilkTest.cpp (code 4)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

Barry_T_Intel · ‎05-16-2014

/fp:precise seems to be the critical option. I've submitted CQ256573 on this problem.

Thank you for reporting it.

- Barry

Manuel_P_ · ‎05-16-2014

You are right, /fp:precise and /fp:strict don't work. /fp:fast and /fp:fast=2 both work.

Thanks, Barry, for your investigations.

Cheers,

Martin

Brandon_H_Intel · ‎05-16-2014

I see it too. Looks like something that broke between the 12.1 and 13.0 compilers. Unfortunately, I don't see any easy workarounds beyond not using /fp:precise, which isn't ideal. When there's progress made on the investigation here, I'll update the thread.

Manuel_P_ · ‎05-16-2014

Hi Brandon,

unfortunately it's worse than I thought. Although /fp:fast compiles successfully, the resulting code is buggy.

I was able to get the code to compile with /fp:precise by splitting the function into smaller pieces and using __declspec(noinline), and this binary passed our tests. The binary built with /fp:fast caused an access violation, whether or not I split the function.
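For readers who want to try the same workaround, here is a rough sketch of what "splitting into smaller pieces with __declspec(noinline)" could look like. It is not the code that was actually used: the helper names `compute_brigh` and `compute_denom` and the `NOINLINE_HINT` macro are made up for illustration, the non-Microsoft branch of the macro is an addition, and plain loops stand in for the Cilk Plus array notation so the sketch builds with any C++11 compiler.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical macro: the thread only mentions __declspec(noinline);
// the GCC/Clang spelling is added here so the sketch is portable.
#if defined(_MSC_VER)
#  define NOINLINE_HINT __declspec(noinline)
#else
#  define NOINLINE_HINT __attribute__((noinline))
#endif

const int VecSize = 8;

// Hypothetical first piece: the velo2/brigh step of f(), written as a loop.
NOINLINE_HINT void compute_brigh(float velo2[VecSize], float brigh[VecSize],
                                 const unsigned short* rdval,
                                 const float spdfact[VecSize]) {
    assert(rdval != nullptr);
    for (int i = 0; i < VecSize; ++i) {
        velo2[i] = static_cast<float>(rdval[i]) * spdfact[i];
        brigh[i] = std::asin(velo2[i]);
    }
}

// Hypothetical second piece: the denom step of f(), kept in its own
// non-inlined function so the optimizer sees smaller units.
NOINLINE_HINT void compute_denom(float denom[VecSize],
                                 const float cop[VecSize],
                                 const float array1[VecSize],
                                 const float array2[VecSize],
                                 const float brigh[VecSize]) {
    for (int i = 0; i < VecSize; ++i) {
        denom[i] = cop[i] * array1[i] + array2[i] / brigh[i];
    }
}
```

The caller then invokes the pieces in the same order as the original statements. Whether this avoids the internal error on a given compiler build has to be checked case by case.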

From that I would conclude that there is an issue with CilkPlus which leads to either an internal error or buggy code.

Unfortunately I cannot provide test data at this stage. It's hard to extract this from our test environment.

Regards,
Martin

Brandon_H_Intel · ‎05-16-2014

Martin,

It may be a good idea to check what kind of access violation it is. If it's a bad address, then we're probably stuck until we can get a test case from you, but if it's a stack overflow or something along those lines, it might be easier to work around and reproduce.

Manuel_P_ · ‎05-19-2014

Hi Brandon,

the stack seems to be corrupt. The line "get_g(gx2, v8, input_02);" is compiled into these instructions.

000007FEE36DA28A movsxd      rbp,dword ptr [rsp+20h]
000007FEE36DA28F movsxd      r10,dword ptr [rsp+30h]
000007FEE36DA294 movsxd      rdi,dword ptr [rsp+24h]
000007FEE36DA299 movsxd      r11,dword ptr [rsp+34h]
000007FEE36DA29E movsxd      r8,dword ptr [rsp+28h]
000007FEE36DA2A3 movsxd      r12,dword ptr [rsp+38h]
000007FEE36DA2A8 vmovss      xmm2,dword ptr [rax+rbp*4]
000007FEE36DA2AD vmovss      xmm15,dword ptr [rax+rbp*4+4]
000007FEE36DA2B3 vmovss      xmm9,dword ptr [rax+r10*4]
000007FEE36DA2B9 vmovss      xmm4,dword ptr [rax+r10*4+4]
000007FEE36DA2C0 movsxd      r9,dword ptr [rsp+2Ch]
000007FEE36DA2C5 movsxd      r14,dword ptr [rsp+3Ch]
000007FEE36DA2CA vinsertps   xmm7,xmm2,dword ptr [rax+rdi*4],10h
000007FEE36DA2D1 vinsertps   xmm5,xmm15,dword ptr [rax+rdi*4+4],10h
000007FEE36DA2D9 vinsertps   xmm8,xmm9,dword ptr [rax+r11*4],10h
000007FEE36DA2E0 vinsertps   xmm2,xmm4,dword ptr [rax+r11*4+4],10h
000007FEE36DA2E8 vinsertps   xmm6,xmm7,dword ptr [rax+r8*4],20h
000007FEE36DA2EF vinsertps   xmm3,xmm5,dword ptr [rax+r8*4+4],20h
000007FEE36DA2F7 vinsertps   xmm1,xmm8,dword ptr [rax+r12*4],20h
000007FEE36DA2FE vinsertps   xmm7,xmm2,dword ptr [rax+r12*4+4],20h
000007FEE36DA306 vinsertps   xmm10,xmm6,dword ptr [rax+r9*4],30h
000007FEE36DA30D vinsertps   xmm6,xmm3,dword ptr [rax+r9*4+4],30h
000007FEE36DA315 vinsertps   xmm14,xmm1,dword ptr [rax+r14*4],30h
000007FEE36DA31C vinsertps   xmm9,xmm7,dword ptr [rax+r14*4+4],30h

r11 has this value: 0xffffffff80000000, therefore the instruction at address 000007FEE36DA2D9 (vinsertps xmm8,xmm9,dword ptr [rax+r11*4],10h) causes an access violation.

As you can see, r11 is loaded from the stack with movsxd r11,dword ptr [rsp+34h], and the stack at address [rsp+34h] is: 00 00 00 80.
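The sign extension described above can be checked in isolation. The small sketch below (function names are made up for illustration) rebuilds the 32-bit value from the bytes 00 00 00 80, sign-extends it the way movsxd does, and applies the *4 scaling from the addressing mode. As a side note that is not confirmed anywhere in this thread: 0x80000000 is also the "integer indefinite" value that the x86 SSE/AVX float-to-int conversion instructions produce for NaN or out-of-range inputs, so a bad result of static_cast<int> in get_g is one plausible origin of that stack slot.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Rebuild a 32-bit value from four little-endian bytes and sign-extend it
// to 64 bits, which is what "movsxd r11, dword ptr [rsp+34h]" does.
std::int64_t movsxd_from_bytes(const unsigned char bytes[4]) {
    assert(bytes != nullptr);
    const std::uint32_t raw =
        static_cast<std::uint32_t>(bytes[0]) |
        (static_cast<std::uint32_t>(bytes[1]) << 8) |
        (static_cast<std::uint32_t>(bytes[2]) << 16) |
        (static_cast<std::uint32_t>(bytes[3]) << 24);
    std::int32_t as_signed = 0;
    std::memcpy(&as_signed, &raw, sizeof as_signed);  // reinterpret the bits
    return static_cast<std::int64_t>(as_signed);      // sign extension
}

// Byte offset added to rax by the addressing mode [rax + index*4].
std::int64_t scaled_byte_offset(std::int64_t index) {
    return index * 4;
}
```

For the bytes 00 00 00 80 this gives an index of -2147483648 (0xffffffff80000000 as an unsigned 64-bit pattern) and a byte offset of -8589934592, i.e. 8 GiB below the table base in rax, which matches the observed access violation.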

Regards,

Martin

Manuel_P_ · ‎05-19-2014

Hi Brandon,

the stack seems to be corrupt.

The source code line "get_g(gx2, v8, input_02);" gets translated into these instructions:

000007FEE36DA299 movsxd      r11,dword ptr [rsp+34h]
000007FEE36DA29E movsxd      r8,dword ptr [rsp+28h]
000007FEE36DA2A3 movsxd      r12,dword ptr [rsp+38h]
000007FEE36DA2A8 vmovss      xmm2,dword ptr [rax+rbp*4]
000007FEE36DA2AD vmovss      xmm15,dword ptr [rax+rbp*4+4]
000007FEE36DA2B3 vmovss      xmm9,dword ptr [rax+r10*4]
000007FEE36DA2B9 vmovss      xmm4,dword ptr [rax+r10*4+4]
000007FEE36DA2C0 movsxd      r9,dword ptr [rsp+2Ch]
000007FEE36DA2C5 movsxd      r14,dword ptr [rsp+3Ch]
000007FEE36DA2CA vinsertps   xmm7,xmm2,dword ptr [rax+rdi*4],10h
000007FEE36DA2D1 vinsertps   xmm5,xmm15,dword ptr [rax+rdi*4+4],10h
000007FEE36DA2D9 vinsertps   xmm8,xmm9,dword ptr [rax+r11*4],10h
000007FEE36DA2E0 vinsertps   xmm2,xmm4,dword ptr [rax+r11*4+4],10h

The instruction at address 000007FEE36DA2D9 (vinsertps xmm8,xmm9,dword ptr [rax+r11*4],10h) causes the access violation, because r11 has the value 0xffffffff80000000. As you can see, r11 is loaded from the stack (movsxd r11,dword ptr [rsp+34h]), and the stack at address [rsp+34h] is: 00 00 00 80.

Regards,

Martin

Manuel_P_ · ‎05-19-2014

Hi again Brandon,

after some investigation I can say that the reason for the wrong values on the stack is a mathematical instability of the algorithm caused by the trigonometric functions and divisions which obviously only occurs with /fp:fast but not with /fp:precise. Actually the algorithm was designed to be robust against mathematical instabilities.
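Independent of the compiler bug, the table lookup itself can be made to tolerate non-finite or wildly out-of-range angles. The following is only a sketch under assumptions: `get_g_guarded` is a hypothetical scalar variant of get_g for a single lane, the `fallback` parameter is invented, the table `pp` is assumed to have at least 361 entries (indices 0 to 360), and std::fmod replaces the original "subtract 360 twice" wrapping, which changes behavior for inputs outside roughly [-720, 1080).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical guarded scalar version of get_g for one angle.
// Assumes pp points to at least 361 floats (indices 0..360).
float get_g_guarded(float angle, const float* pp, float fallback) {
    assert(pp != nullptr);
    if (!std::isfinite(angle)) {
        return fallback;                 // NaN or infinity: never index the table
    }
    float a = std::fmod(angle, 360.0f);  // result lies in (-360, 360)
    if (a < 0.0f) a += 360.0f;
    if (a >= 360.0f) a = 0.0f;           // a tiny negative value can round up to 360
    const int i1 = static_cast<int>(a);  // 0..359
    const int i0 = i1 + 1;               // 1..360
    // Same interpolation formula as in the original get_g.
    return pp[i0] - (pp[i1] - pp[i0]) * (a - static_cast<float>(i0));
}
```

With this shape the integer conversion only ever sees a value in [0, 360), so the index cannot become 0x80000000 regardless of the floating-point model in use.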

So we do need to get that code compiled with /fp:precise. Did you raise a bug?

Regards,

Martin

Nick_M_3 · ‎05-19-2014

martin.toeltsch@symena.com wrote:

after some investigation I can say that the reason for the wrong values on the stack is a mathematical instability of the algorithm caused by the trigonometric functions and divisions which obviously only occurs with /fp:fast but not with /fp:precise. Actually the algorithm was designed to be robust against mathematical instabilities.

Well, perhaps. However, I suggest that you would be better off attending to its robustness. In one place, you have:

    float hxx[VecSize];
    hxx[:] = fabsf(accel[:]) * float(1.38e-23);
    ...
    sv[:] = ((1.0f - hxx[:]) * (val0 - vgg[:])) + (hxx[:] * (vvv9 - gx2[:]));

That raises alarm bells in my head! At best, you have a very poorly scaled problem, and poor scaling is a classic cause of numerical problems. Despite modern belief, the use of floating-point (as distinct from fixed-point) is NOT a solution to all scaling problems, though you would have to look at some very serious (and probably 1960s or 1970s) textbooks to see discussions of the issues.
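To make the scaling concern concrete, here is a small single-lane sketch (helper names are made up; the accel magnitudes used with it are assumptions, not values from the real data). In single precision, 1.0f - hxx rounds back to exactly 1.0f whenever hxx is below about 3e-8, and with a factor of 1.38e-23 that holds for any accel of ordinary magnitude. The blend then degenerates to val0 - vgg, and the second term contributes nothing.

```cpp
#include <cassert>
#include <cmath>

// One lane of the blend from the thread, with the same constants.
float blend(float accel, float val0, float vgg, float vvv9, float gx2) {
    const float hxx = std::fabs(accel) * float(1.38e-23);
    return ((1.0f - hxx) * (val0 - vgg)) + (hxx * (vvv9 - gx2));
}

// True when the weight (1 - hxx) is indistinguishable from 1 in float,
// i.e. when hxx is lost entirely in the subtraction.
bool weight_is_absorbed(float accel) {
    assert(std::isfinite(accel));
    const float hxx = std::fabs(accel) * float(1.38e-23);
    const float weight = 1.0f - hxx;
    return weight == 1.0f;
}
```

For example, weight_is_absorbed(9.81f) is true, and blend(9.81f, 10.0f, 4.0f, 100.0f, 50.0f) returns exactly 6.0f, the same as val0 - vgg alone.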

Internal compiler error 010101_239