Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Compiler bug - openmp

marko_l_
Beginner

The following code should produce a rather simple output of increasing real parts and decreasing imaginary parts, something along the lines of:
0+0i
1-1i
2-2i
3-3i
4-4i
5-5i
6-6i
etc...
99-99i

The code in question:

#include <iostream>
#include <iomanip>
#include <new>
#include <cstdlib>
#include <omp.h>
#include <mkl/mkl.h>

using namespace std;

typedef long long ll;

ostream& operator<<(ostream& lhs, MKL_Complex16& rhs) {
	lhs << fixed << setprecision(20) << rhs.real << '+' << fixed << setprecision(20) << rhs.imag << "*I";
	return lhs;
}

int main(void) {
	omp_set_num_threads(8);
	volatile ll ena = 1;
//#define ena 1
	ll temp_it = 100;
	ll ld = 2;
	MKL_Complex16** temp1 = new MKL_Complex16*[ld + ld]();
	for (ll i = 0; i < ld; i++) {
		temp1[ld + i] = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
		for (ll j = 0; j < temp_it; j++) {
			temp1[ld + i][j].real = i * j;
			temp1[ld + i][j].imag = -i * j;
		}
	}
	MKL_Complex16* M = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
#pragma omp parallel for
	for (ll i = 0; i < temp_it; i++) {
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
	}
	for (ll j = 0; j < ld; j++) {
		mkl_free(temp1[ld + j]);
	}
	delete[] temp1;
	for (ll i = 0; i < temp_it; i++) {
		cout << M[i] << endl;
	}
	mkl_free(M);
	system("pause");
	return 0;
}

And that is indeed what the code above produces. The generated assembly for the addition is:

	for (ll i = 0; i < temp_it; i++) {
00007FF769FB1530  inc         r11  
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF769FB1533  mov         rbp,qword ptr [r10]  
00007FF769FB1536  shl         rbp,3  
00007FF769FB153A  movsd       xmm0,mmword ptr [rcx+r9]  
00007FF769FB1540  lea         rbp,[rbp+r14*8]  
00007FF769FB1545  mov         rbp,qword ptr [r8+rbp]  
00007FF769FB1549  addsd       xmm0,mmword ptr [rbp+r9]  
00007FF769FB1550  movsd       mmword ptr [rax+r9],xmm0  
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
00007FF769FB1556  mov         rbp,qword ptr [r10]  
00007FF769FB1559  shl         rbp,3  
00007FF769FB155D  movsd       xmm1,mmword ptr [rcx+r9+8]  
00007FF769FB1564  lea         rbp,[rbp+r14*8]  
00007FF769FB1569  mov         rbp,qword ptr [r8+rbp]  
00007FF769FB156D  addsd       xmm1,mmword ptr [rbp+r9+8]  
00007FF769FB1574  movsd       mmword ptr [rax+r9+8],xmm1  

That looks about right, as is the result. 

However, suppose we switch to the commented-out #define of "ena", like so:

	//volatile ll ena = 1;
#define ena 1

Doing so means that the compiler no longer has to check the value of "ena" at every step, and the code should be ever so slightly faster, but yield the same result. Instead, we get the following: 

0.00000000000000000000+0.00000000000000000000*I
0.00000000000000000000+-1.00000000000000000000*I
0.00000000000000000000+-2.00000000000000000000*I
0.00000000000000000000+-3.00000000000000000000*I
0.00000000000000000000+-4.00000000000000000000*I
0.00000000000000000000+-5.00000000000000000000*I
0.00000000000000000000+-6.00000000000000000000*I
0.00000000000000000000+-7.00000000000000000000*I
8.00000000000000000000+-8.00000000000000000000*I
9.00000000000000000000+-9.00000000000000000000*I
10.00000000000000000000+-10.00000000000000000000*I
11.00000000000000000000+-11.00000000000000000000*I
12.00000000000000000000+-12.00000000000000000000*I
0.00000000000000000000+-13.00000000000000000000*I
0.00000000000000000000+-14.00000000000000000000*I
0.00000000000000000000+-15.00000000000000000000*I
0.00000000000000000000+-16.00000000000000000000*I
0.00000000000000000000+-17.00000000000000000000*I
0.00000000000000000000+-18.00000000000000000000*I
0.00000000000000000000+-19.00000000000000000000*I
0.00000000000000000000+-20.00000000000000000000*I
21.00000000000000000000+-21.00000000000000000000*I
22.00000000000000000000+-22.00000000000000000000*I
23.00000000000000000000+-23.00000000000000000000*I
24.00000000000000000000+-24.00000000000000000000*I
25.00000000000000000000+-25.00000000000000000000*I
0.00000000000000000000+-26.00000000000000000000*I
0.00000000000000000000+-27.00000000000000000000*I
0.00000000000000000000+-28.00000000000000000000*I
0.00000000000000000000+-29.00000000000000000000*I
0.00000000000000000000+-30.00000000000000000000*I
0.00000000000000000000+-31.00000000000000000000*I
0.00000000000000000000+-32.00000000000000000000*I
0.00000000000000000000+-33.00000000000000000000*I
34.00000000000000000000+-34.00000000000000000000*I
35.00000000000000000000+-35.00000000000000000000*I
36.00000000000000000000+-36.00000000000000000000*I
37.00000000000000000000+-37.00000000000000000000*I
38.00000000000000000000+-38.00000000000000000000*I
0.00000000000000000000+-39.00000000000000000000*I
0.00000000000000000000+-40.00000000000000000000*I
0.00000000000000000000+-41.00000000000000000000*I
0.00000000000000000000+-42.00000000000000000000*I
0.00000000000000000000+-43.00000000000000000000*I
0.00000000000000000000+-44.00000000000000000000*I
0.00000000000000000000+-45.00000000000000000000*I
0.00000000000000000000+-46.00000000000000000000*I
47.00000000000000000000+-47.00000000000000000000*I
48.00000000000000000000+-48.00000000000000000000*I
49.00000000000000000000+-49.00000000000000000000*I
50.00000000000000000000+-50.00000000000000000000*I
51.00000000000000000000+-51.00000000000000000000*I
0.00000000000000000000+-52.00000000000000000000*I
0.00000000000000000000+-53.00000000000000000000*I
0.00000000000000000000+-54.00000000000000000000*I
0.00000000000000000000+-55.00000000000000000000*I
0.00000000000000000000+-56.00000000000000000000*I
0.00000000000000000000+-57.00000000000000000000*I
0.00000000000000000000+-58.00000000000000000000*I
0.00000000000000000000+-59.00000000000000000000*I
60.00000000000000000000+-60.00000000000000000000*I
61.00000000000000000000+-61.00000000000000000000*I
62.00000000000000000000+-62.00000000000000000000*I
63.00000000000000000000+-63.00000000000000000000*I
0.00000000000000000000+-64.00000000000000000000*I
0.00000000000000000000+-65.00000000000000000000*I
0.00000000000000000000+-66.00000000000000000000*I
0.00000000000000000000+-67.00000000000000000000*I
0.00000000000000000000+-68.00000000000000000000*I
0.00000000000000000000+-69.00000000000000000000*I
0.00000000000000000000+-70.00000000000000000000*I
0.00000000000000000000+-71.00000000000000000000*I
72.00000000000000000000+-72.00000000000000000000*I
73.00000000000000000000+-73.00000000000000000000*I
74.00000000000000000000+-74.00000000000000000000*I
75.00000000000000000000+-75.00000000000000000000*I
0.00000000000000000000+-76.00000000000000000000*I
0.00000000000000000000+-77.00000000000000000000*I
0.00000000000000000000+-78.00000000000000000000*I
0.00000000000000000000+-79.00000000000000000000*I
0.00000000000000000000+-80.00000000000000000000*I
0.00000000000000000000+-81.00000000000000000000*I
0.00000000000000000000+-82.00000000000000000000*I
0.00000000000000000000+-83.00000000000000000000*I
84.00000000000000000000+-84.00000000000000000000*I
85.00000000000000000000+-85.00000000000000000000*I
86.00000000000000000000+-86.00000000000000000000*I
87.00000000000000000000+-87.00000000000000000000*I
0.00000000000000000000+-88.00000000000000000000*I
0.00000000000000000000+-89.00000000000000000000*I
0.00000000000000000000+-90.00000000000000000000*I
0.00000000000000000000+-91.00000000000000000000*I
0.00000000000000000000+-92.00000000000000000000*I
0.00000000000000000000+-93.00000000000000000000*I
0.00000000000000000000+-94.00000000000000000000*I
0.00000000000000000000+-95.00000000000000000000*I
96.00000000000000000000+-96.00000000000000000000*I
97.00000000000000000000+-97.00000000000000000000*I
98.00000000000000000000+-98.00000000000000000000*I
99.00000000000000000000+-99.00000000000000000000*I

As we can see, the output follows a pattern of 8 wrong answers (only the real parts are wrong) followed by 5 correct ones. Setting the number of threads to lower values only makes this worse, with more of the results being wrong: namely 96 wrong and 4 correct, in that order. The calculation of the imaginary part does not change much in this case, although the real and imaginary assignments (which start out in the same loop) end up in two different loops. The following is the part that computes most of the real values (but not the ones that come out correct):

	for (ll i = 0; i < temp_it; i++) {
00007FF68B4D14F0  add         r13,8  
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF68B4D14F4  mov         rdx,qword ptr [r8+rsi*8]  
00007FF68B4D14F8  add         rdx,r14  
00007FF68B4D14FB  movsd       xmm1,mmword ptr [rdx+r15]  
00007FF68B4D1501  movhpd      xmm1,qword ptr [rdx+r15+8]  
00007FF68B4D1508  movsd       xmm0,mmword ptr [rdx+r15+10h]  
00007FF68B4D150F  movaps      xmm2,xmm1  
00007FF68B4D1512  movhpd      xmm0,qword ptr [rdx+r15+18h]  
00007FF68B4D1519  unpcklpd    xmm2,xmm0  
00007FF68B4D151D  unpckhpd    xmm1,xmm0  
00007FF68B4D1521  addpd       xmm2,xmm1  
00007FF68B4D1525  movsd       mmword ptr [r15+r11],xmm2  
00007FF68B4D152B  movhpd      qword ptr [r15+r11+10h],xmm2  
00007FF68B4D1532  mov         rdx,qword ptr [r8+rsi*8]  
00007FF68B4D1536  add         rdx,r14  
00007FF68B4D1539  movsd       xmm4,mmword ptr [rdx+r15+20h]  
00007FF68B4D1540  movhpd      xmm4,qword ptr [rdx+r15+28h]  
00007FF68B4D1547  movsd       xmm3,mmword ptr [rdx+r15+30h]  
00007FF68B4D154E  movaps      xmm5,xmm4  
00007FF68B4D1551  movhpd      xmm3,qword ptr [rdx+r15+38h]  
00007FF68B4D1558  unpcklpd    xmm5,xmm3  
00007FF68B4D155C  unpckhpd    xmm4,xmm3  
00007FF68B4D1560  addpd       xmm5,xmm4  
00007FF68B4D1564  movsd       mmword ptr [r15+r11+20h],xmm5  
00007FF68B4D156B  movhpd      qword ptr [r15+r11+30h],xmm5  
00007FF68B4D1572  mov         rdx,qword ptr [r8+rsi*8]  
00007FF68B4D1576  add         rdx,r14  
00007FF68B4D1579  movsd       xmm1,mmword ptr [rdx+r15+40h]  
00007FF68B4D1580  movhpd      xmm1,qword ptr [rdx+r15+48h]  
00007FF68B4D1587  movsd       xmm0,mmword ptr [rdx+r15+50h]  
00007FF68B4D158E  movaps      xmm2,xmm1  
00007FF68B4D1591  movhpd      xmm0,qword ptr [rdx+r15+58h]  
00007FF68B4D1598  unpcklpd    xmm2,xmm0  
00007FF68B4D159C  unpckhpd    xmm1,xmm0  
00007FF68B4D15A0  addpd       xmm2,xmm1  
00007FF68B4D15A4  movsd       mmword ptr [r15+r11+40h],xmm2  
00007FF68B4D15AB  movhpd      qword ptr [r15+r11+50h],xmm2  
00007FF68B4D15B2  mov         rdx,qword ptr [r8+rsi*8]  
00007FF68B4D15B6  add         rdx,r14  
00007FF68B4D15B9  movsd       xmm4,mmword ptr [rdx+r15+60h]  
00007FF68B4D15C0  movhpd      xmm4,qword ptr [rdx+r15+68h]  
00007FF68B4D15C7  movsd       xmm3,mmword ptr [rdx+r15+70h]  
00007FF68B4D15CE  movaps      xmm0,xmm4  
00007FF68B4D15D1  movhpd      xmm3,qword ptr [rdx+r15+78h]  
00007FF68B4D15D8  unpcklpd    xmm0,xmm3  
00007FF68B4D15DC  unpckhpd    xmm4,xmm3  
00007FF68B4D15E0  addpd       xmm0,xmm4  
00007FF68B4D15E4  movsd       mmword ptr [r15+r11+60h],xmm0  
00007FF68B4D15EB  movhpd      qword ptr [r15+r11+70h],xmm0  

To the best of my understanding, the loop is unrolled several times, although rdx is reloaded every time one of these... operations... takes place (there is no reason to do that; the array isn't going anywhere). The truly mindblowing thing, however, is what a single such operation does (I might have gotten something wrong, but the result is consistent with my reading of this code).

It starts off by loading one element of temp1[ld + 0] or temp1[ld + 1] (I can't quite tell which) into xmm1, real part and imaginary part in the two halves of the register. So far so good. It then loads the next element of the SAME array (this is no longer good...) into xmm0. At this point each of the two registers holds a complex number, but not the ones we want. It then copies xmm1 into xmm2, and the next step is somewhat amazing: the two unpack instructions rearrange the values so that one register now holds both real parts and the other both imaginary parts. It then adds the two registers together (so, real + imaginary, something that rarely makes sense; clearly the result will be zero, due to the way we filled the initial arrays) and stores the results into two consecutive real parts of the array M.
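To spell that out (assuming I am reading the unpack sequence correctly), the vectorized loop effectively computes something like the following sketch of the apparent behaviour, which is not what the source asks for:

	// What the source requests for each element i:
	//   M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
	// What the generated code appears to compute instead, reading only one
	// of the two source arrays and mixing real and imaginary lanes:
	//   M[i].real = temp1[ld + x][i].real + temp1[ld + x][i].imag;
	// Since temp1[ld + 1][i] == { i, -i }, that sum is i + (-i) == 0,
	// which matches the zeroed real parts in the output above.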

This was tested on an i7-4720HQ with the following command line:

/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\DMRG.pch" 

Adding AVX2 support simply vectorizes some of these operations, but the logic behind them (or rather, the lack of it) remains the same, as does the result.

Changing the order within the code (imaginary first) also has no effect on the end results. 

For reference, I am using Windows 10 with Visual Studio 2015 Update 3 and Intel Parallel Studio XE Update 3 (so, the latest compiler). 

Thank you for your time. 

TimP
Honored Contributor III

Evidently, your volatile definition prevents the compiler from vectorizing.  Does setting vectorization off explicitly (e.g. by -Qvec- or by #pragma novector) produce a correct result?    Do you target SSE3? Can you simply assign each M[] in one step rather than by real and imaginary parts?

You could turn off unrolling by -Qunroll:0 or #pragma unroll(0).
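Roughly, the pragmas and the one-step assignment would look something like this (a sketch based on the loop in your first post, untested):

	#pragma novector        // or build with -Qvec- to disable auto-vectorization
	#pragma unroll(0)       // or build with -Qunroll:0 to disable unrolling
	#pragma omp parallel for
	for (ll i = 0; i < temp_it; i++) {
		// assign each M[i] in one step instead of separate .real/.imag stores
		MKL_Complex16 a = temp1[ld + 0][i];
		MKL_Complex16 b = temp1[ld + ena][i];
		M[i] = MKL_Complex16{ a.real + b.real, a.imag + b.imag };
	}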

I'm having difficulty saving your code in compilable form.

marko_l_
Beginner

The volatile was added to test the hypothesis that the issue is with the constant +1 offset: using volatile forces the compiler to actually read "ena" instead of hardcoding the access to temp1[ld + 1]. (This appears to be the case.)

The optimization was set to either AVX2, using /QxCORE-AVX2 and /QaxCORE-AVX2, or None, which I assumed would mean no vectorization.

Adding AVX2 produces the following (even worse) results; this is with the #define, of course:

0.00000000000000000000+0.00000000000000000000*I
0.00000000000000000000+-1.00000000000000000000*I
0.00000000000000000000+-2.00000000000000000000*I
0.00000000000000000000+-3.00000000000000000000*I
0.00000000000000000000+-4.00000000000000000000*I
0.00000000000000000000+-5.00000000000000000000*I
0.00000000000000000000+-6.00000000000000000000*I
0.00000000000000000000+-7.00000000000000000000*I
0.00000000000000000000+-8.00000000000000000000*I
0.00000000000000000000+-9.00000000000000000000*I
0.00000000000000000000+-10.00000000000000000000*I
0.00000000000000000000+-11.00000000000000000000*I
12.00000000000000000000+-12.00000000000000000000*I
0.00000000000000000000+-13.00000000000000000000*I
0.00000000000000000000+-14.00000000000000000000*I
0.00000000000000000000+-15.00000000000000000000*I
0.00000000000000000000+-16.00000000000000000000*I
0.00000000000000000000+-17.00000000000000000000*I
0.00000000000000000000+-18.00000000000000000000*I
0.00000000000000000000+-19.00000000000000000000*I
0.00000000000000000000+-20.00000000000000000000*I
0.00000000000000000000+-21.00000000000000000000*I
0.00000000000000000000+-22.00000000000000000000*I
0.00000000000000000000+-23.00000000000000000000*I
0.00000000000000000000+-24.00000000000000000000*I
25.00000000000000000000+-25.00000000000000000000*I
0.00000000000000000000+-26.00000000000000000000*I
0.00000000000000000000+-27.00000000000000000000*I
0.00000000000000000000+-28.00000000000000000000*I
0.00000000000000000000+-29.00000000000000000000*I
0.00000000000000000000+-30.00000000000000000000*I
0.00000000000000000000+-31.00000000000000000000*I
0.00000000000000000000+-32.00000000000000000000*I
0.00000000000000000000+-33.00000000000000000000*I
0.00000000000000000000+-34.00000000000000000000*I
0.00000000000000000000+-35.00000000000000000000*I
0.00000000000000000000+-36.00000000000000000000*I
0.00000000000000000000+-37.00000000000000000000*I
38.00000000000000000000+-38.00000000000000000000*I
0.00000000000000000000+-39.00000000000000000000*I
0.00000000000000000000+-40.00000000000000000000*I
0.00000000000000000000+-41.00000000000000000000*I
0.00000000000000000000+-42.00000000000000000000*I
0.00000000000000000000+-43.00000000000000000000*I
0.00000000000000000000+-44.00000000000000000000*I
0.00000000000000000000+-45.00000000000000000000*I
0.00000000000000000000+-46.00000000000000000000*I
0.00000000000000000000+-47.00000000000000000000*I
0.00000000000000000000+-48.00000000000000000000*I
0.00000000000000000000+-49.00000000000000000000*I
0.00000000000000000000+-50.00000000000000000000*I
51.00000000000000000000+-51.00000000000000000000*I
0.00000000000000000000+-52.00000000000000000000*I
0.00000000000000000000+-53.00000000000000000000*I
0.00000000000000000000+-54.00000000000000000000*I
0.00000000000000000000+-55.00000000000000000000*I
0.00000000000000000000+-56.00000000000000000000*I
0.00000000000000000000+-57.00000000000000000000*I
0.00000000000000000000+-58.00000000000000000000*I
0.00000000000000000000+-59.00000000000000000000*I
0.00000000000000000000+-60.00000000000000000000*I
0.00000000000000000000+-61.00000000000000000000*I
0.00000000000000000000+-62.00000000000000000000*I
0.00000000000000000000+-63.00000000000000000000*I
0.00000000000000000000+-64.00000000000000000000*I
0.00000000000000000000+-65.00000000000000000000*I
0.00000000000000000000+-66.00000000000000000000*I
0.00000000000000000000+-67.00000000000000000000*I
0.00000000000000000000+-68.00000000000000000000*I
0.00000000000000000000+-69.00000000000000000000*I
0.00000000000000000000+-70.00000000000000000000*I
0.00000000000000000000+-71.00000000000000000000*I
0.00000000000000000000+-72.00000000000000000000*I
0.00000000000000000000+-73.00000000000000000000*I
0.00000000000000000000+-74.00000000000000000000*I
0.00000000000000000000+-75.00000000000000000000*I
0.00000000000000000000+-76.00000000000000000000*I
0.00000000000000000000+-77.00000000000000000000*I
0.00000000000000000000+-78.00000000000000000000*I
0.00000000000000000000+-79.00000000000000000000*I
0.00000000000000000000+-80.00000000000000000000*I
0.00000000000000000000+-81.00000000000000000000*I
0.00000000000000000000+-82.00000000000000000000*I
0.00000000000000000000+-83.00000000000000000000*I
0.00000000000000000000+-84.00000000000000000000*I
0.00000000000000000000+-85.00000000000000000000*I
0.00000000000000000000+-86.00000000000000000000*I
0.00000000000000000000+-87.00000000000000000000*I
0.00000000000000000000+-88.00000000000000000000*I
0.00000000000000000000+-89.00000000000000000000*I
0.00000000000000000000+-90.00000000000000000000*I
0.00000000000000000000+-91.00000000000000000000*I
0.00000000000000000000+-92.00000000000000000000*I
0.00000000000000000000+-93.00000000000000000000*I
0.00000000000000000000+-94.00000000000000000000*I
0.00000000000000000000+-95.00000000000000000000*I
0.00000000000000000000+-96.00000000000000000000*I
0.00000000000000000000+-97.00000000000000000000*I
0.00000000000000000000+-98.00000000000000000000*I
0.00000000000000000000+-99.00000000000000000000*I

The imaginary part is still correct: 

		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
00007FF7361017FC  mov         rbx,qword ptr [rax+rsi*8]  
00007FF736101800  mov         rax,qword ptr [rax+rsi*8+8]  
00007FF736101805  vmovsd      xmm0,qword ptr [r8+rbx-8]  
00007FF73610180C  vaddsd      xmm1,xmm0,mmword ptr [r8+rax-8]  
00007FF736101813  vmovsd      qword ptr [r8+rcx-8],xmm1  

The real part is now vectorized further using the larger registers: 

		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF7361014F7  mov         r12,qword ptr [rax+rsi*8]  
00007FF7361014FB  add         r12,r10  
00007FF7361014FE  vmovupd     ymm2,ymmword ptr [r12+r8]  
00007FF736101504  vmovupd     ymm1,ymmword ptr [r12+r8+20h]  
00007FF73610150B  vperm2f128  ymm4,ymm2,ymm1,20h  
00007FF736101511  vperm2f128  ymm5,ymm2,ymm1,31h  
00007FF736101517  vmovupd     ymm2,ymmword ptr [r12+r8+40h]  
00007FF73610151E  vmovupd     ymm1,ymmword ptr [r12+r8+60h]  
00007FF736101525  vunpcklpd   ymm3,ymm4,ymm5  
00007FF736101529  vunpckhpd   ymm0,ymm4,ymm5  
00007FF73610152D  vperm2f128  ymm5,ymm2,ymm1,20h  
00007FF736101533  vperm2f128  ymm2,ymm2,ymm1,31h  
00007FF736101539  vunpcklpd   ymm4,ymm5,ymm2  
00007FF73610153D  vunpckhpd   ymm5,ymm5,ymm2  
00007FF736101541  vaddpd      ymm0,ymm3,ymm0  
00007FF736101545  vaddpd      ymm4,ymm4,ymm5  
00007FF736101549  vextractf128 xmm3,ymm0,1  
00007FF73610154F  vmovsd      qword ptr [r8+r9],xmm0  
00007FF736101555  vmovhpd     qword ptr [r8+r9+10h],xmm0  
00007FF73610155C  vextractf128 xmm0,ymm4,1  
00007FF736101562  vmovsd      qword ptr [r8+r9+20h],xmm3  
00007FF736101569  vmovhpd     qword ptr [r8+r9+30h],xmm3  
00007FF736101570  vmovsd      qword ptr [r8+r9+40h],xmm4  
00007FF736101577  vmovhpd     qword ptr [r8+r9+50h],xmm4  
00007FF73610157E  vmovsd      qword ptr [r8+r9+60h],xmm0  
00007FF736101585  vmovhpd     qword ptr [r8+r9+70h],xmm0  
00007FF73610158C  mov         r14,qword ptr [rax+rsi*8]  
00007FF736101590  add         r14,r10  
00007FF736101593  vmovupd     ymm3,ymmword ptr [r14+r8+80h]  
00007FF73610159D  vmovupd     ymm2,ymmword ptr [r14+r8+0A0h]  
00007FF7361015A7  vmovupd     ymm0,ymmword ptr [r14+r8+0C0h]  
00007FF7361015B1  vmovupd     ymm4,ymmword ptr [r14+r8+0E0h]  
00007FF7361015BB  vperm2f128  ymm1,ymm3,ymm2,20h  
00007FF7361015C1  vperm2f128  ymm5,ymm3,ymm2,31h  
00007FF7361015C7  vunpcklpd   ymm3,ymm1,ymm5  
00007FF7361015CB  vunpckhpd   ymm2,ymm1,ymm5  
00007FF7361015CF  vperm2f128  ymm1,ymm0,ymm4,20h  
00007FF7361015D5  vperm2f128  ymm5,ymm0,ymm4,31h  
00007FF7361015DB  vunpcklpd   ymm0,ymm1,ymm5  
00007FF7361015DF  vunpckhpd   ymm1,ymm1,ymm5  
00007FF7361015E3  vaddpd      ymm2,ymm3,ymm2  
00007FF7361015E7  vaddpd      ymm0,ymm0,ymm1  
00007FF7361015EB  vextractf128 xmm3,ymm2,1  
00007FF7361015F1  vextractf128 xmm1,ymm0,1  
00007FF7361015F7  vmovsd      qword ptr [r8+r9+80h],xmm2  
00007FF736101601  vmovhpd     qword ptr [r8+r9+90h],xmm2  
00007FF73610160B  vmovsd      qword ptr [r8+r9+0A0h],xmm3  
00007FF736101615  vmovhpd     qword ptr [r8+r9+0B0h],xmm3  
00007FF73610161F  vmovsd      qword ptr [r8+r9+0C0h],xmm0  
00007FF736101629  vmovhpd     qword ptr [r8+r9+0D0h],xmm0  
00007FF736101633  vmovsd      qword ptr [r8+r9+0E0h],xmm1  
00007FF73610163D  vmovhpd     qword ptr [r8+r9+0F0h],xmm1  

Using the volatile integer still produces correct results.

Adding #pragma novector before #pragma omp parallel for also solves the issue. So the problem lies in the automatic vectorization, and only of the real part; the imaginary part never gets vectorized (perhaps due to alignment?). The exact arrangement that works is shown below.
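For clarity, the arrangement that produces correct results here is simply:

	#pragma novector
	#pragma omp parallel for
	for (ll i = 0; i < temp_it; i++) {
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
	}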

As for getting it running: I copied the code into a new project, set the compiler options, and it compiles fine. But you do have to link the MKL libraries for MKL_Complex16, mkl_calloc and mkl_free, although it is possible that these are not relevant.
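(For reference, linking MKL is normally handled either through the project's Intel MKL option in Visual Studio or with something along the lines of icl /Qopenmp /Qmkl test.cpp on Windows and icpc -qopenmp -mkl test.cpp on Linux; the exact flags may differ between compiler versions.)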

The following code without MKL produces the same exact error (which makes sense):

#include <iostream>
#include <iomanip>
#include <new>
#include <cstdlib>
#include <omp.h>

using namespace std;

typedef long long ll;

#define mkl_calloc(X,Y,Z) calloc(X,Y)
#define mkl_free(X) free(X)

struct MKL_Complex16 {
	double real;
	double imag;
};

ostream& operator<<(ostream& lhs, MKL_Complex16& rhs) {
	lhs << fixed << setprecision(20) << rhs.real << '+' << fixed << setprecision(20) << rhs.imag << "*I";
	return lhs;
}

int main(void) {
	omp_set_num_threads(8);
	//volatile ll ena = 1;
	#define ena 1
	ll temp_it = 100;
	ll ld = 2;
	MKL_Complex16** temp1 = new MKL_Complex16*[ld + ld]();
	for (ll i = 0; i < ld; i++) {
		temp1[ld + i] = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
		for (ll j = 0; j < temp_it; j++) {
			temp1[ld + i][j].real = i * j;
			temp1[ld + i][j].imag = -i * j;
		}
	}
	MKL_Complex16* M = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
#pragma omp parallel for
	for (ll i = 0; i < temp_it; i++) {
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
	}
	for (ll j = 0; j < ld; j++) {
		mkl_free(temp1[ld + j]);
	}
	delete[] temp1;
	for (ll i = 0; i < temp_it; i++) {
		cout << M[i] << endl;
	}
	mkl_free(M);
	system("pause");
	return 0;
}

The only real difference is that the arrays aren't aligned. Still, adding #pragma novector in front of #pragma omp parallel for does solve the issue. But considering that this shouldn't be necessary, and that the compiler issues no warnings, something is going terribly wrong.

Best regards,
Marko

marko_l_
Beginner

Interestingly enough, removing #pragma omp parallel for did previously help with this issue (while this was part of a larger project). At this point, however, removing the parallelization appears to cause ALL of the real parts to become 0.

As far as I can tell, the assembly code remains similar, adding the most unusual things together, for reasons unknown. 

However, perhaps the issue is not with openmp, I can't say for sure, as I have not run that many tests. 

Best regards,
Marko

marko_l_
Beginner

It seems to have been a while since the last reply. Have you made any progress trying to compile the code?

Best regards,
Marko

jimdempseyatthecove
Honored Contributor III

Try this:

Using the volatile ena

...
 volatile ll ena = 1;
...
  for (ll j = 0; j < temp_it; j++) {
   temp1[ld + i][j].real = i * j * ena;
   temp1[ld + i][j].imag = -i * j * ena;
  }
...
 }
#pragma omp parallel for
 for (ll i = 0; i < temp_it; i++) {
  M[i].real = temp1[ld + 0][i].real + temp1[ld + 1][i].real; // +1 not +ena
  M[i].imag = temp1[ld + 0][i].imag + temp1[ld + 1][i].imag;
 }
...

If the above corrects the problem, then it indicates that the compiler optimization is being overly aggressive (and erroneous) in trying to make the temp1 array loop invariant, with known values, for use in the parallel for loop. Initializing the temp1 array as above may lead the compiler to treat temp1 as having indeterminable values, and thus not available for loop-invariant substitution.

An alternate test would be to use the original code without the volatile, but print out the contents of temp1 _before_ you run the parallel for loop.
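Something along these lines, placed right before the parallel for loop (a sketch; adjust the names to your code):

  // dump temp1 before the parallel loop to verify its contents
  for (ll i = 0; i < ld; i++)
   for (ll j = 0; j < temp_it; j++)
    cout << "temp1[" << (ld + i) << "][" << j << "] = " << temp1[ld + i][j] << endl;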

Jim Dempsey

marko_l_
Beginner

Hi Jim!

Adding the volatile, as you suggested, makes no difference, which makes sense given the assembly generated by the compiler. The current code, which still easily reproduces the error, is:

#include <iostream>
#include <iomanip>
#include <new>
#include <cstdlib>
#include <cstdio>

//#define _OVR

using namespace std;

typedef long long ll;

struct Complex16 {
	double real;
	double imag;
};

#ifdef _OVR
Complex16 operator+(Complex16& lhs, Complex16& rhs) {
	return{ lhs.real + rhs.real, lhs.imag + rhs.imag };
}
#endif

ostream& operator<<(ostream& lhs, Complex16& rhs) {
	lhs << fixed << setprecision(20) << rhs.real << '+' << fixed << setprecision(20) << rhs.imag << "*I";
	return lhs;
}

int main(void) {
	ll temp_it = 100;
	ll ld = 2;
	Complex16** temp1 = new Complex16*[ld + ld]();
	for (ll i = 0; i < ld; i++) {
		temp1[ld + i] = (Complex16*)calloc(temp_it, sizeof(Complex16));
		for (ll j = 0; j < temp_it; j++) {
			temp1[ld + i][j].real = i * j;
			temp1[ld + i][j].imag = -i * j;
		}
	}
	Complex16* M = (Complex16*)calloc(temp_it, sizeof(Complex16));
//#pragma novector
	for (ll i = 0; i < temp_it; i++) {
#ifndef _OVR
		M[i].real = temp1[ld + 0][i].real + temp1[ld + 1][i].real;
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + 1][i].imag;
#else
		M[i] = temp1[ld + 0][i] + temp1[ld + 1][i];
#endif
	}
	for (ll j = 0; j < ld; j++) {
		free(temp1[ld + j]);
	}
	delete[] temp1;
	for (ll i = 0; i < temp_it; i++) {
		cout << M[i] << endl;
	}
	free(M);
	fflush(stdin);
	getchar();
	return 0;
}

Here, the define _OVR simply overloads the addition operator, which I believe is what Tim P. suggested earlier, but again it makes little difference. The result is the same: all real parts are wrong and all imaginary parts are correct. This is also without parallel execution; the issue is in the automatic vectorisation of the code, which is simply wrong.

I did check the values of the arrays, although if that went wrong that would, quite frankly, be even worse. Luckily, they are correct. 

The only way to get the correct results is by adding 

#pragma novector

This however is extremely problematic for two reasons:
a) If you don't know about this bug, your program will produce incorrect results for no apparent reason. What is worse, if this is part of a larger program you might not even notice that the results are wrong, as the compiler doesn't issue any warning (because as far as it's concerned, nothing is wrong). As one of my colleagues commented, these are the kinds of things that bring planes down. :) (Hopefully not for real, of course.)
b) Even if you do know about this and add the pragma, it means that you cannot take advantage of vectorisation, which is unfortunate to say the least.

Best regards,
Marko

MalReddy_Y_Intel
Employee

Hi Marko,

Please let us know the exact command line options used to compile the above source code; I'm facing many compilation errors with the command line options mentioned above.

Thanks,

Reddy

 

marko_l_
Beginner

Hi Reddy, 

The command line as shown in VS2015 Update 3 using Intel Parallel Studio XE 2016 Update 3 & accompanying compiler:

/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\Test.pch" 

It hasn't changed (except for the last option, because I have created a new project). I copied the code from the forum back in to make sure I didn't forget anything, but it behaves exactly the same as before, and I am having no difficulties compiling and running it, even when forcing a full rebuild.

What kind of errors are you getting?

You can try a simplified command line:

/GS- /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /D "_MBCS" /Qipo /Zc:forScope /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\Test.pch" 

This should work if the offloads were causing your errors (could be if you're running on an older CPU, unlikely on anything newer than mine, but you never know). The results don't change at all, as the code in question (see the assembly) doesn't use any of that. 

To clarify, the target is a 64-bit application, as evident from the x64\ directories in the paths within the command line.

Best regards, 
Marko

marko_l_
Beginner

I have now copied the source code (attached to this post), sent it to a Linux cluster, and compiled it with the following command lines:

icpc -static -std=c++11 -Wall -O3 -c test.cpp -o test.o

icpc test.o -o test

And received the same (wrong) result. There were no issues with the compilation. Forcing #pragma novector produces the correct result, much like on my local machine. 

The Linux machines are also running the 2016 compiler, but I'm not sure about the update nor how to check that quickly.

Best regards,
Marko

jimdempseyatthecove
Honored Contributor III

Marko,

Your problem is this statement:

  temp1[ld + i] = (Complex16*)calloc(temp_it, sizeof(Complex16));

It is not performing an aligned allocation. Consider:

  temp1[ld + i] = (Complex16*)_aligned_malloc(temp_it * sizeof(Complex16), 64);
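That is the MSVC routine from <malloc.h>; note that unlike calloc it does not zero the memory, and it must be released with the matching routine rather than free:

  _aligned_free(temp1[ld + j]);  // pairs with _aligned_malloc; plain free() is not valid here

On Linux, posix_memalign or MKL's own mkl_malloc/mkl_free would serve the same purpose.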

Jim Dempsey

marko_l_
Beginner

Hi Jim!

That is not the case: if you look at the initial code provided above (first post), as well as the generated assembly (also posted above), it is clear that alignment plays no role in this error.

Also, even if that were the case, the compiler's automatic optimization should not make assumptions about the alignment of the data throughout the code unless hinted otherwise; at the very least it should issue a warning. But as I said, that is not the issue here, as shown by the initial, aligned version.

Best regards, 
Marko

marko_l_
Beginner

I can't seem to locate any information about bugs that have been fixed in the compiler. Is this issue fixed in the latest release?

Best regards,
Marko

marko_l_
Beginner

If anyone is interested, the problem still exists in Intel C++ 17; the code attached to a previous post produces wrong results on both platforms (Windows & Linux).

Best regards,
Marko
