The following code should produce a rather simple output of increasing real parts and decreasing imaginary parts, something along the lines of:
0+0i
1-1i
2-2i
3-3i
4-4i
5-5i
6-6i
etc...
99-99i
The code in question:
#include <iostream>
#include <iomanip>
#include <new>
#include <cstdlib>
#include <omp.h>
#include <mkl/mkl.h>

using namespace std;
typedef long long ll;

ostream& operator<<(ostream& lhs, MKL_Complex16& rhs) {
    lhs << fixed << setprecision(20) << rhs.real << '+'
        << fixed << setprecision(20) << rhs.imag << "*I";
    return lhs;
}

int main(void) {
    omp_set_num_threads(8);
    volatile ll ena = 1;
    //#define ena 1
    ll temp_it = 100;
    ll ld = 2;
    MKL_Complex16** temp1 = new MKL_Complex16*[ld + ld]();
    for (ll i = 0; i < ld; i++) {
        temp1[ld + i] = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
        for (ll j = 0; j < temp_it; j++) {
            temp1[ld + i][j].real = i * j;
            temp1[ld + i][j].imag = -i * j;
        }
    }
    MKL_Complex16* M = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
    #pragma omp parallel for
    for (ll i = 0; i < temp_it; i++) {
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
    }
    for (ll j = 0; j < ld; j++) {
        mkl_free(temp1[ld + j]);
    }
    delete[] temp1;
    for (ll i = 0; i < temp_it; i++) {
        cout << M[i] << endl;
    }
    mkl_free(M);
    system("pause");
    return 0;
}
And this is indeed what the code above produces. The generated assembly for the addition is then:
    for (ll i = 0; i < temp_it; i++) {
00007FF769FB1530  inc       r11
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF769FB1533  mov       rbp,qword ptr [r10]
00007FF769FB1536  shl       rbp,3
00007FF769FB153A  movsd     xmm0,mmword ptr [rcx+r9]
00007FF769FB1540  lea       rbp,[rbp+r14*8]
00007FF769FB1545  mov       rbp,qword ptr [r8+rbp]
00007FF769FB1549  addsd     xmm0,mmword ptr [rbp+r9]
00007FF769FB1550  movsd     mmword ptr [rax+r9],xmm0
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
00007FF769FB1556  mov       rbp,qword ptr [r10]
00007FF769FB1559  shl       rbp,3
00007FF769FB155D  movsd     xmm1,mmword ptr [rcx+r9+8]
00007FF769FB1564  lea       rbp,[rbp+r14*8]
00007FF769FB1569  mov       rbp,qword ptr [r8+rbp]
00007FF769FB156D  addsd     xmm1,mmword ptr [rbp+r9+8]
00007FF769FB1574  movsd     mmword ptr [rax+r9+8],xmm1
That looks about right, as does the result.
However, suppose we switch "ena" to the commented-out definition, like so:
//volatile ll ena = 1;
#define ena 1
Doing so means the compiler no longer has to check the value of "ena" at every step, so the code should be ever so slightly faster but yield the same result. Instead, we get the following:
0.00000000000000000000+0.00000000000000000000*I
0.00000000000000000000+-1.00000000000000000000*I
0.00000000000000000000+-2.00000000000000000000*I
0.00000000000000000000+-3.00000000000000000000*I
0.00000000000000000000+-4.00000000000000000000*I
0.00000000000000000000+-5.00000000000000000000*I
0.00000000000000000000+-6.00000000000000000000*I
0.00000000000000000000+-7.00000000000000000000*I
8.00000000000000000000+-8.00000000000000000000*I
9.00000000000000000000+-9.00000000000000000000*I
10.00000000000000000000+-10.00000000000000000000*I
11.00000000000000000000+-11.00000000000000000000*I
12.00000000000000000000+-12.00000000000000000000*I
0.00000000000000000000+-13.00000000000000000000*I
0.00000000000000000000+-14.00000000000000000000*I
0.00000000000000000000+-15.00000000000000000000*I
0.00000000000000000000+-16.00000000000000000000*I
0.00000000000000000000+-17.00000000000000000000*I
0.00000000000000000000+-18.00000000000000000000*I
0.00000000000000000000+-19.00000000000000000000*I
0.00000000000000000000+-20.00000000000000000000*I
21.00000000000000000000+-21.00000000000000000000*I
22.00000000000000000000+-22.00000000000000000000*I
23.00000000000000000000+-23.00000000000000000000*I
24.00000000000000000000+-24.00000000000000000000*I
25.00000000000000000000+-25.00000000000000000000*I
0.00000000000000000000+-26.00000000000000000000*I
0.00000000000000000000+-27.00000000000000000000*I
0.00000000000000000000+-28.00000000000000000000*I
0.00000000000000000000+-29.00000000000000000000*I
0.00000000000000000000+-30.00000000000000000000*I
0.00000000000000000000+-31.00000000000000000000*I
0.00000000000000000000+-32.00000000000000000000*I
0.00000000000000000000+-33.00000000000000000000*I
34.00000000000000000000+-34.00000000000000000000*I
35.00000000000000000000+-35.00000000000000000000*I
36.00000000000000000000+-36.00000000000000000000*I
37.00000000000000000000+-37.00000000000000000000*I
38.00000000000000000000+-38.00000000000000000000*I
0.00000000000000000000+-39.00000000000000000000*I
0.00000000000000000000+-40.00000000000000000000*I
0.00000000000000000000+-41.00000000000000000000*I
0.00000000000000000000+-42.00000000000000000000*I
0.00000000000000000000+-43.00000000000000000000*I
0.00000000000000000000+-44.00000000000000000000*I
0.00000000000000000000+-45.00000000000000000000*I
0.00000000000000000000+-46.00000000000000000000*I
47.00000000000000000000+-47.00000000000000000000*I
48.00000000000000000000+-48.00000000000000000000*I
49.00000000000000000000+-49.00000000000000000000*I
50.00000000000000000000+-50.00000000000000000000*I
51.00000000000000000000+-51.00000000000000000000*I
0.00000000000000000000+-52.00000000000000000000*I
0.00000000000000000000+-53.00000000000000000000*I
0.00000000000000000000+-54.00000000000000000000*I
0.00000000000000000000+-55.00000000000000000000*I
0.00000000000000000000+-56.00000000000000000000*I
0.00000000000000000000+-57.00000000000000000000*I
0.00000000000000000000+-58.00000000000000000000*I
0.00000000000000000000+-59.00000000000000000000*I
60.00000000000000000000+-60.00000000000000000000*I
61.00000000000000000000+-61.00000000000000000000*I
62.00000000000000000000+-62.00000000000000000000*I
63.00000000000000000000+-63.00000000000000000000*I
0.00000000000000000000+-64.00000000000000000000*I
0.00000000000000000000+-65.00000000000000000000*I
0.00000000000000000000+-66.00000000000000000000*I
0.00000000000000000000+-67.00000000000000000000*I
0.00000000000000000000+-68.00000000000000000000*I
0.00000000000000000000+-69.00000000000000000000*I
0.00000000000000000000+-70.00000000000000000000*I
0.00000000000000000000+-71.00000000000000000000*I
72.00000000000000000000+-72.00000000000000000000*I
73.00000000000000000000+-73.00000000000000000000*I
74.00000000000000000000+-74.00000000000000000000*I
75.00000000000000000000+-75.00000000000000000000*I
0.00000000000000000000+-76.00000000000000000000*I
0.00000000000000000000+-77.00000000000000000000*I
0.00000000000000000000+-78.00000000000000000000*I
0.00000000000000000000+-79.00000000000000000000*I
0.00000000000000000000+-80.00000000000000000000*I
0.00000000000000000000+-81.00000000000000000000*I
0.00000000000000000000+-82.00000000000000000000*I
0.00000000000000000000+-83.00000000000000000000*I
84.00000000000000000000+-84.00000000000000000000*I
85.00000000000000000000+-85.00000000000000000000*I
86.00000000000000000000+-86.00000000000000000000*I
87.00000000000000000000+-87.00000000000000000000*I
0.00000000000000000000+-88.00000000000000000000*I
0.00000000000000000000+-89.00000000000000000000*I
0.00000000000000000000+-90.00000000000000000000*I
0.00000000000000000000+-91.00000000000000000000*I
0.00000000000000000000+-92.00000000000000000000*I
0.00000000000000000000+-93.00000000000000000000*I
0.00000000000000000000+-94.00000000000000000000*I
0.00000000000000000000+-95.00000000000000000000*I
96.00000000000000000000+-96.00000000000000000000*I
97.00000000000000000000+-97.00000000000000000000*I
98.00000000000000000000+-98.00000000000000000000*I
99.00000000000000000000+-99.00000000000000000000*I
As you can see, we get a repeating pattern of 8 wrong answers (only the real parts are wrong) followed by 5 correct ones. Setting the number of threads to a lower value only makes this worse, with more of the results being wrong: namely 96 wrong and 4 correct, in that order. The calculation of the imaginary part doesn't change much in this case, although the two assignments (which start out in the same loop) end up in two different loops. The following is the part that computes most of the real values (but not all of them, and not the ones that come out correct):
    for (ll i = 0; i < temp_it; i++) {
00007FF68B4D14F0  add       r13,8
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF68B4D14F4  mov       rdx,qword ptr [r8+rsi*8]
00007FF68B4D14F8  add       rdx,r14
00007FF68B4D14FB  movsd     xmm1,mmword ptr [rdx+r15]
00007FF68B4D1501  movhpd    xmm1,qword ptr [rdx+r15+8]
00007FF68B4D1508  movsd     xmm0,mmword ptr [rdx+r15+10h]
00007FF68B4D150F  movaps    xmm2,xmm1
00007FF68B4D1512  movhpd    xmm0,qword ptr [rdx+r15+18h]
00007FF68B4D1519  unpcklpd  xmm2,xmm0
00007FF68B4D151D  unpckhpd  xmm1,xmm0
00007FF68B4D1521  addpd     xmm2,xmm1
00007FF68B4D1525  movsd     mmword ptr [r15+r11],xmm2
00007FF68B4D152B  movhpd    qword ptr [r15+r11+10h],xmm2
00007FF68B4D1532  mov       rdx,qword ptr [r8+rsi*8]
00007FF68B4D1536  add       rdx,r14
00007FF68B4D1539  movsd     xmm4,mmword ptr [rdx+r15+20h]
00007FF68B4D1540  movhpd    xmm4,qword ptr [rdx+r15+28h]
00007FF68B4D1547  movsd     xmm3,mmword ptr [rdx+r15+30h]
00007FF68B4D154E  movaps    xmm5,xmm4
00007FF68B4D1551  movhpd    xmm3,qword ptr [rdx+r15+38h]
00007FF68B4D1558  unpcklpd  xmm5,xmm3
00007FF68B4D155C  unpckhpd  xmm4,xmm3
00007FF68B4D1560  addpd     xmm5,xmm4
00007FF68B4D1564  movsd     mmword ptr [r15+r11+20h],xmm5
00007FF68B4D156B  movhpd    qword ptr [r15+r11+30h],xmm5
00007FF68B4D1572  mov       rdx,qword ptr [r8+rsi*8]
00007FF68B4D1576  add       rdx,r14
00007FF68B4D1579  movsd     xmm1,mmword ptr [rdx+r15+40h]
00007FF68B4D1580  movhpd    xmm1,qword ptr [rdx+r15+48h]
00007FF68B4D1587  movsd     xmm0,mmword ptr [rdx+r15+50h]
00007FF68B4D158E  movaps    xmm2,xmm1
00007FF68B4D1591  movhpd    xmm0,qword ptr [rdx+r15+58h]
00007FF68B4D1598  unpcklpd  xmm2,xmm0
00007FF68B4D159C  unpckhpd  xmm1,xmm0
00007FF68B4D15A0  addpd     xmm2,xmm1
00007FF68B4D15A4  movsd     mmword ptr [r15+r11+40h],xmm2
00007FF68B4D15AB  movhpd    qword ptr [r15+r11+50h],xmm2
00007FF68B4D15B2  mov       rdx,qword ptr [r8+rsi*8]
00007FF68B4D15B6  add       rdx,r14
00007FF68B4D15B9  movsd     xmm4,mmword ptr [rdx+r15+60h]
00007FF68B4D15C0  movhpd    xmm4,qword ptr [rdx+r15+68h]
00007FF68B4D15C7  movsd     xmm3,mmword ptr [rdx+r15+70h]
00007FF68B4D15CE  movaps    xmm0,xmm4
00007FF68B4D15D1  movhpd    xmm3,qword ptr [rdx+r15+78h]
00007FF68B4D15D8  unpcklpd  xmm0,xmm3
00007FF68B4D15DC  unpckhpd  xmm4,xmm3
00007FF68B4D15E0  addpd     xmm0,xmm4
00007FF68B4D15E4  movsd     mmword ptr [r15+r11+60h],xmm0
00007FF68B4D15EB  movhpd    qword ptr [r15+r11+70h],xmm0
To the best of my understanding, the loop is unrolled several times, although rdx is reloaded every time one of these... operations... takes place (there is no reason to do that; the array isn't going anywhere). The mind-blowing part, however, is what a single such operation does (I may have gotten a detail wrong, but the result is consistent with my reading of this code).
It starts off by loading one element of temp1[ld + 0] or temp1[ld + 1] (I can't quite tell which) into xmm1, real part in the lower half of the register and imaginary part in the upper half. So far so good. It then loads the next element of the SAME array (this is no longer good...) into xmm0. At this point each of the two registers holds a complete complex number, just not the pair we'd want to add. It then makes a copy of xmm1 in xmm2. The next move is somewhat amazing: the following two unpack instructions separate these values so that xmm2 ends up holding the two real parts (the one from xmm1 in the lower half, the one from xmm0 in the upper) and xmm1 ends up holding the two imaginary parts in the same order. It then adds the two registers together, so each sum is a real part plus an imaginary part, something that rarely makes sense (and which, given the way we filled the initial arrays, is always zero), and stores the results into two consecutive real parts of the array M.
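If I read this correctly, the net effect, written out as scalar code, is roughly the following (a hypothetical reconstruction of what the generated code computes, not anything that appears in the source):

    // Hypothetical scalar equivalent of what the vectorized loop appears to compute.
    // Only ONE of the two input arrays is read, and its real and imaginary parts are
    // summed into the real part of M; the imaginary parts of M are filled in correctly
    // by a separate, non-vectorized loop.
    MKL_Complex16* src = temp1[ld + 0];   // or temp1[ld + 1], I can't tell which
    for (ll i = 0; i < temp_it; i++) {
        M[i].real = src[i].real + src[i].imag;   // always 0 with these inputs
    }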
This was tested on an i7-4720HQ with the following command line:
/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\DMRG.pch"
Adding AVX2 support simply vectorizes some of these operations, but the logic behind them (or rather the lack of it) remains the same, as does the result.
Changing the order within the code (imaginary first) also has no effect on the end results.
For reference, I am using Windows 10 with Visual Studio 2015 Update 3 and Intel Parallel Studio XE 2016 Update 3 (so, the latest compiler).
Thank you for your time.
Evidently, your volatile definition prevents the compiler from vectorizing. Does setting vectorization off explicitly (e.g. by -Qvec- or by #pragma novector) produce a correct result? Do you target SSE3? Can you simply assign each M[] in one step rather than by real and imaginary parts?
You could turn off unrolling by -Qunroll:0 or #pragma unroll(0).
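For example, something along these lines (an untested sketch of what I mean by assigning each element in one step):

    // Untested sketch: build each complex value once and store it as a whole struct,
    // instead of separate .real/.imag stores.
    #pragma omp parallel for
    for (ll i = 0; i < temp_it; i++) {
        MKL_Complex16 a = temp1[ld + 0][i];
        MKL_Complex16 b = temp1[ld + ena][i];
        MKL_Complex16 s = { a.real + b.real, a.imag + b.imag };
        M[i] = s;
    }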
I'm having difficulty saving your code in compilable form.
The volatile was added to test the hypothesis that the issue is with the +1; using volatile forces the compiler to avoid hardcoding
temp1[i+1]
The optimization was set either to AVX2 (using /QxCORE-AVX2 and /QaxCORE-AVX2) or to None, which I assumed would mean no vectorization.
Adding AVX2 produces the following (worse) results; this is with the #define, of course:
0.00000000000000000000+0.00000000000000000000*I
0.00000000000000000000+-1.00000000000000000000*I
0.00000000000000000000+-2.00000000000000000000*I
0.00000000000000000000+-3.00000000000000000000*I
0.00000000000000000000+-4.00000000000000000000*I
0.00000000000000000000+-5.00000000000000000000*I
0.00000000000000000000+-6.00000000000000000000*I
0.00000000000000000000+-7.00000000000000000000*I
0.00000000000000000000+-8.00000000000000000000*I
0.00000000000000000000+-9.00000000000000000000*I
0.00000000000000000000+-10.00000000000000000000*I
0.00000000000000000000+-11.00000000000000000000*I
12.00000000000000000000+-12.00000000000000000000*I
0.00000000000000000000+-13.00000000000000000000*I
0.00000000000000000000+-14.00000000000000000000*I
0.00000000000000000000+-15.00000000000000000000*I
0.00000000000000000000+-16.00000000000000000000*I
0.00000000000000000000+-17.00000000000000000000*I
0.00000000000000000000+-18.00000000000000000000*I
0.00000000000000000000+-19.00000000000000000000*I
0.00000000000000000000+-20.00000000000000000000*I
0.00000000000000000000+-21.00000000000000000000*I
0.00000000000000000000+-22.00000000000000000000*I
0.00000000000000000000+-23.00000000000000000000*I
0.00000000000000000000+-24.00000000000000000000*I
25.00000000000000000000+-25.00000000000000000000*I
0.00000000000000000000+-26.00000000000000000000*I
0.00000000000000000000+-27.00000000000000000000*I
0.00000000000000000000+-28.00000000000000000000*I
0.00000000000000000000+-29.00000000000000000000*I
0.00000000000000000000+-30.00000000000000000000*I
0.00000000000000000000+-31.00000000000000000000*I
0.00000000000000000000+-32.00000000000000000000*I
0.00000000000000000000+-33.00000000000000000000*I
0.00000000000000000000+-34.00000000000000000000*I
0.00000000000000000000+-35.00000000000000000000*I
0.00000000000000000000+-36.00000000000000000000*I
0.00000000000000000000+-37.00000000000000000000*I
38.00000000000000000000+-38.00000000000000000000*I
0.00000000000000000000+-39.00000000000000000000*I
0.00000000000000000000+-40.00000000000000000000*I
0.00000000000000000000+-41.00000000000000000000*I
0.00000000000000000000+-42.00000000000000000000*I
0.00000000000000000000+-43.00000000000000000000*I
0.00000000000000000000+-44.00000000000000000000*I
0.00000000000000000000+-45.00000000000000000000*I
0.00000000000000000000+-46.00000000000000000000*I
0.00000000000000000000+-47.00000000000000000000*I
0.00000000000000000000+-48.00000000000000000000*I
0.00000000000000000000+-49.00000000000000000000*I
0.00000000000000000000+-50.00000000000000000000*I
51.00000000000000000000+-51.00000000000000000000*I
0.00000000000000000000+-52.00000000000000000000*I
0.00000000000000000000+-53.00000000000000000000*I
0.00000000000000000000+-54.00000000000000000000*I
0.00000000000000000000+-55.00000000000000000000*I
0.00000000000000000000+-56.00000000000000000000*I
0.00000000000000000000+-57.00000000000000000000*I
0.00000000000000000000+-58.00000000000000000000*I
0.00000000000000000000+-59.00000000000000000000*I
0.00000000000000000000+-60.00000000000000000000*I
0.00000000000000000000+-61.00000000000000000000*I
0.00000000000000000000+-62.00000000000000000000*I
0.00000000000000000000+-63.00000000000000000000*I
0.00000000000000000000+-64.00000000000000000000*I
0.00000000000000000000+-65.00000000000000000000*I
0.00000000000000000000+-66.00000000000000000000*I
0.00000000000000000000+-67.00000000000000000000*I
0.00000000000000000000+-68.00000000000000000000*I
0.00000000000000000000+-69.00000000000000000000*I
0.00000000000000000000+-70.00000000000000000000*I
0.00000000000000000000+-71.00000000000000000000*I
0.00000000000000000000+-72.00000000000000000000*I
0.00000000000000000000+-73.00000000000000000000*I
0.00000000000000000000+-74.00000000000000000000*I
0.00000000000000000000+-75.00000000000000000000*I
0.00000000000000000000+-76.00000000000000000000*I
0.00000000000000000000+-77.00000000000000000000*I
0.00000000000000000000+-78.00000000000000000000*I
0.00000000000000000000+-79.00000000000000000000*I
0.00000000000000000000+-80.00000000000000000000*I
0.00000000000000000000+-81.00000000000000000000*I
0.00000000000000000000+-82.00000000000000000000*I
0.00000000000000000000+-83.00000000000000000000*I
0.00000000000000000000+-84.00000000000000000000*I
0.00000000000000000000+-85.00000000000000000000*I
0.00000000000000000000+-86.00000000000000000000*I
0.00000000000000000000+-87.00000000000000000000*I
0.00000000000000000000+-88.00000000000000000000*I
0.00000000000000000000+-89.00000000000000000000*I
0.00000000000000000000+-90.00000000000000000000*I
0.00000000000000000000+-91.00000000000000000000*I
0.00000000000000000000+-92.00000000000000000000*I
0.00000000000000000000+-93.00000000000000000000*I
0.00000000000000000000+-94.00000000000000000000*I
0.00000000000000000000+-95.00000000000000000000*I
0.00000000000000000000+-96.00000000000000000000*I
0.00000000000000000000+-97.00000000000000000000*I
0.00000000000000000000+-98.00000000000000000000*I
0.00000000000000000000+-99.00000000000000000000*I
The imaginary part is still correct:
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
00007FF7361017FC  mov       rbx,qword ptr [rax+rsi*8]
00007FF736101800  mov       rax,qword ptr [rax+rsi*8+8]
00007FF736101805  vmovsd    xmm0,qword ptr [r8+rbx-8]
00007FF73610180C  vaddsd    xmm1,xmm0,mmword ptr [r8+rax-8]
00007FF736101813  vmovsd    qword ptr [r8+rcx-8],xmm1
The real part is now vectorized further using the larger registers:
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF7361014F7  mov          r12,qword ptr [rax+rsi*8]
00007FF7361014FB  add          r12,r10
00007FF7361014FE  vmovupd      ymm2,ymmword ptr [r12+r8]
00007FF736101504  vmovupd      ymm1,ymmword ptr [r12+r8+20h]
00007FF73610150B  vperm2f128   ymm4,ymm2,ymm1,20h
00007FF736101511  vperm2f128   ymm5,ymm2,ymm1,31h
00007FF736101517  vmovupd      ymm2,ymmword ptr [r12+r8+40h]
00007FF73610151E  vmovupd      ymm1,ymmword ptr [r12+r8+60h]
00007FF736101525  vunpcklpd    ymm3,ymm4,ymm5
00007FF736101529  vunpckhpd    ymm0,ymm4,ymm5
00007FF73610152D  vperm2f128   ymm5,ymm2,ymm1,20h
00007FF736101533  vperm2f128   ymm2,ymm2,ymm1,31h
00007FF736101539  vunpcklpd    ymm4,ymm5,ymm2
00007FF73610153D  vunpckhpd    ymm5,ymm5,ymm2
00007FF736101541  vaddpd       ymm0,ymm3,ymm0
00007FF736101545  vaddpd       ymm4,ymm4,ymm5
00007FF736101549  vextractf128 xmm3,ymm0,1
00007FF73610154F  vmovsd       qword ptr [r8+r9],xmm0
00007FF736101555  vmovhpd      qword ptr [r8+r9+10h],xmm0
00007FF73610155C  vextractf128 xmm0,ymm4,1
00007FF736101562  vmovsd       qword ptr [r8+r9+20h],xmm3
00007FF736101569  vmovhpd      qword ptr [r8+r9+30h],xmm3
00007FF736101570  vmovsd       qword ptr [r8+r9+40h],xmm4
00007FF736101577  vmovhpd      qword ptr [r8+r9+50h],xmm4
00007FF73610157E  vmovsd       qword ptr [r8+r9+60h],xmm0
00007FF736101585  vmovhpd      qword ptr [r8+r9+70h],xmm0
00007FF73610158C  mov          r14,qword ptr [rax+rsi*8]
00007FF736101590  add          r14,r10
00007FF736101593  vmovupd      ymm3,ymmword ptr [r14+r8+80h]
00007FF73610159D  vmovupd      ymm2,ymmword ptr [r14+r8+0A0h]
00007FF7361015A7  vmovupd      ymm0,ymmword ptr [r14+r8+0C0h]
00007FF7361015B1  vmovupd      ymm4,ymmword ptr [r14+r8+0E0h]
00007FF7361015BB  vperm2f128   ymm1,ymm3,ymm2,20h
00007FF7361015C1  vperm2f128   ymm5,ymm3,ymm2,31h
00007FF7361015C7  vunpcklpd    ymm3,ymm1,ymm5
00007FF7361015CB  vunpckhpd    ymm2,ymm1,ymm5
00007FF7361015CF  vperm2f128   ymm1,ymm0,ymm4,20h
00007FF7361015D5  vperm2f128   ymm5,ymm0,ymm4,31h
00007FF7361015DB  vunpcklpd    ymm0,ymm1,ymm5
00007FF7361015DF  vunpckhpd    ymm1,ymm1,ymm5
00007FF7361015E3  vaddpd       ymm2,ymm3,ymm2
00007FF7361015E7  vaddpd       ymm0,ymm0,ymm1
00007FF7361015EB  vextractf128 xmm3,ymm2,1
00007FF7361015F1  vextractf128 xmm1,ymm0,1
00007FF7361015F7  vmovsd       qword ptr [r8+r9+80h],xmm2
00007FF736101601  vmovhpd      qword ptr [r8+r9+90h],xmm2
00007FF73610160B  vmovsd       qword ptr [r8+r9+0A0h],xmm3
00007FF736101615  vmovhpd      qword ptr [r8+r9+0B0h],xmm3
00007FF73610161F  vmovsd       qword ptr [r8+r9+0C0h],xmm0
00007FF736101629  vmovhpd      qword ptr [r8+r9+0D0h],xmm0
00007FF736101633  vmovsd       qword ptr [r8+r9+0E0h],xmm1
00007FF73610163D  vmovhpd      qword ptr [r8+r9+0F0h],xmm1
Using the volatile integer still gives the correct result.
Adding #pragma novector before the #pragma omp parallel for also solves the issue. So the problem lies in the automatic vectorization of the real part only; the imaginary part never gets vectorized (perhaps due to alignment?).
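That is, with the #define in place, the loop only behaves correctly when written as:

    #pragma novector
    #pragma omp parallel for
    for (ll i = 0; i < temp_it; i++) {
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
    }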
As for getting it running: I copied the code into a new project, set the compiler options, and it compiles fine. You do have to link the MKL libraries for MKL_Complex16, mkl_calloc and mkl_free, although it is possible that these are not relevant.
The following code without MKL produces the exact same error (which makes sense):
#include <iostream>
#include <iomanip>
#include <new>
#include <cstdlib>
#include <omp.h>

using namespace std;
typedef long long ll;

#define mkl_calloc(X,Y,Z) calloc(X,Y)
#define mkl_free(X) free(X)

struct MKL_Complex16 {
    double real;
    double imag;
};

ostream& operator<<(ostream& lhs, MKL_Complex16& rhs) {
    lhs << fixed << setprecision(20) << rhs.real << '+'
        << fixed << setprecision(20) << rhs.imag << "*I";
    return lhs;
}

int main(void) {
    omp_set_num_threads(8);
    //volatile ll ena = 1;
    #define ena 1
    ll temp_it = 100;
    ll ld = 2;
    MKL_Complex16** temp1 = new MKL_Complex16*[ld + ld]();
    for (ll i = 0; i < ld; i++) {
        temp1[ld + i] = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
        for (ll j = 0; j < temp_it; j++) {
            temp1[ld + i][j].real = i * j;
            temp1[ld + i][j].imag = -i * j;
        }
    }
    MKL_Complex16* M = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
    #pragma omp parallel for
    for (ll i = 0; i < temp_it; i++) {
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
    }
    for (ll j = 0; j < ld; j++) {
        mkl_free(temp1[ld + j]);
    }
    delete[] temp1;
    for (ll i = 0; i < temp_it; i++) {
        cout << M[i] << endl;
    }
    mkl_free(M);
    system("pause");
    return 0;
}
The only real difference is that the arrays aren't aligned. Still, adding #pragma novector in front of the #pragma omp parallel for does solve the issue. But considering that this shouldn't be necessary, and since the compiler produces no warnings, something is going terribly wrong.
Best regards,
Marko
Interestingly enough, removing #pragma omp parallel for did previously help with this issue (while this was part of a larger project); at this point, however, removing parallelization appears to cause ALL of the real parts to become 0.
As far as I can tell, the assembly code remains similar, adding the most unusual things together, for reasons unknown.
However, perhaps the issue is not with OpenMP; I can't say for sure, as I have not run that many tests.
Best regards,
Marko
It seems to have been a while since the last reply. Have you made any progress trying to compile the code?
Best regards,
Marko
Try this:
Using the volatile ena
...
volatile ll ena = 1;
...
    for (ll j = 0; j < temp_it; j++) {
        temp1[ld + i][j].real = i * j * ena;
        temp1[ld + i][j].imag = -i * j * ena;
    }
...
}
#pragma omp parallel for
for (ll i = 0; i < temp_it; i++) {
    M[i].real = temp1[ld + 0][i].real + temp1[ld + 1][i].real;   // +1 not +ena
    M[i].imag = temp1[ld + 0][i].imag + temp1[ld + 1][i].imag;
}
...
If the above corrects the problem, then this is indicative of the compiler optimization being overly aggressive (and erroneous) in trying to make the temp1 array loop invariant with known values for use in the parallel for loop. Initializing the temp1 array as above may lead the compiler to treat temp1 as having indeterminable values, and thus not available for loop-invariant substitution.
An alternate test would be to use the original code without the volatile, but print out the contents of temp1 _before_ you run the parallel for loop.
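For example (an untested sketch, reusing the operator<< you already have):

    // Dump the inputs right before the parallel loop to rule out bad initialization.
    for (ll i = 0; i < ld; i++) {
        for (ll j = 0; j < temp_it; j++) {
            cout << temp1[ld + i][j] << endl;
        }
    }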
Jim Dempsey
Hi Jim!
Adding the volatile, as you suggested, makes no difference, which makes sense given the assembly generated by the compiler. The current code, which still easily reproduces the error, is:
#include <iostream>
#include <iomanip>
#include <new>
#include <cstdlib>
#include <cstdio>

//#define _OVR

using namespace std;
typedef long long ll;

struct Complex16 {
    double real;
    double imag;
};

#ifdef _OVR
Complex16 operator+(Complex16& lhs, Complex16& rhs) {
    return{ lhs.real + rhs.real, lhs.imag + rhs.imag };
}
#endif

ostream& operator<<(ostream& lhs, Complex16& rhs) {
    lhs << fixed << setprecision(20) << rhs.real << '+'
        << fixed << setprecision(20) << rhs.imag << "*I";
    return lhs;
}

int main(void) {
    ll temp_it = 100;
    ll ld = 2;
    Complex16** temp1 = new Complex16*[ld + ld]();
    for (ll i = 0; i < ld; i++) {
        temp1[ld + i] = (Complex16*)calloc(temp_it, sizeof(Complex16));
        for (ll j = 0; j < temp_it; j++) {
            temp1[ld + i][j].real = i * j;
            temp1[ld + i][j].imag = -i * j;
        }
    }
    Complex16* M = (Complex16*)calloc(temp_it, sizeof(Complex16));
    //#pragma novector
    for (ll i = 0; i < temp_it; i++) {
#ifndef _OVR
        M[i].real = temp1[ld + 0][i].real + temp1[ld + 1][i].real;
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + 1][i].imag;
#else
        M[i] = temp1[ld + 0][i] + temp1[ld + 1][i];
#endif
    }
    for (ll j = 0; j < ld; j++) {
        free(temp1[ld + j]);
    }
    delete[] temp1;
    for (ll i = 0; i < temp_it; i++) {
        cout << M[i] << endl;
    }
    free(M);
    fflush(stdin);
    getchar();
    return 0;
}
Here, the define _OVR simply overloads the addition operator, which I believe is what Tim P. suggested earlier, but again, it makes little difference. The result is the same: all the real parts are wrong and all the imaginary parts are correct. This is also without parallel execution; the issue is in the automatic vectorisation of the code, which is simply wrong.
I did check the values of the arrays, although if that went wrong that would, quite frankly, be even worse. Luckily, they are correct.
The only way to get the correct results is by adding
#pragma novector
This however is extremely problematic for two reasons:
a) If you don't know about this bug, your program will produce incorrect results for no apparent reason. What is worse, if this is part of a larger program, you might not even notice that the results are wrong, as the compiler doesn't issue any warning (because, as far as it's concerned, nothing is wrong). As one of my colleagues commented, these are the kinds of things that bring planes down. :) (Hopefully not for real, of course.)
b) Even if you do know about the bug and add the pragma, it means you cannot take advantage of vectorisation, which is unfortunate to say the least.
Best regards,
Marko
Hi Marko,
Please let us know the exact command-line options for compiling the above source code; I'm facing many compilation errors with the above-mentioned command-line options.
Thanks,
Reddy
Hi Reddy,
The command line as shown in VS2015 Update 3 using Intel Parallel Studio XE 2016 Update 3 & accompanying compiler:
/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\Test.pch"
It hasn't changed (except for the last option, because I have created a new project). I copied the code from the forum back in to make sure I didn't forget anything, but it works exactly the same as before, and I am having no difficulties compiling and running the code, even when forcing a full rebuild.
What kind of errors are you getting?
You can try a simplified command line:
/GS- /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /D "_MBCS" /Qipo /Zc:forScope /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\Test.pch"
This should work if the offloads were causing your errors (could be if you're running on an older CPU, unlikely on anything newer than mine, but you never know). The results don't change at all, as the code in question (see the assembly) doesn't use any of that.
To clarify, the target is a 64-bit application, as evident from the x64\ in the paths within the command line.
Best regards,
Marko
I have now copied the source code (now attached to the post), sent it to a Linux cluster, and compiled it with the following command lines:
icpc -static -std=c++11 -Wall -O3 -c test.cpp -o test.o
icpc test.o -o test
And received the same (wrong) result. There were no issues with the compilation. Forcing #pragma novector produces the correct result, much like on my local machine.
The Linux machines are also running the 2016 compiler, but I'm not sure about the update version or how to check that quickly.
Best regards,
Marko
Marko,
Your problem is this statement:
temp1[ld + i] = (Complex16*)calloc(temp_it, sizeof(Complex16));
It is not performing an aligned allocation. Consider:
temp1[ld + i] = (Complex16*)aligned_malloc(64,temp_it*sizeof(Complex16));
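(One concrete way to spell such an aligned allocation, shown purely as an illustration, is _mm_malloc/_mm_free; note that, unlike calloc, it does not zero the memory.)

    // Illustration only: a 64-byte-aligned allocation for temp1[ld + i].
    // _mm_malloc / _mm_free come from <xmmintrin.h> (or <malloc.h> with MSVC/ICL).
    temp1[ld + i] = (Complex16*)_mm_malloc(temp_it * sizeof(Complex16), 64);
    // ... and in the cleanup loop, use _mm_free instead of free:
    _mm_free(temp1[ld + j]);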
Jim Dempsey
Hi Jim!
That is not the case: if you look at the initial code provided above (first post) as well as the generated assembly (also posted above), it is clear that alignment plays no role in this error.
Also, even if that were the case, the automatic optimization performed by the compiler should not make assumptions about the alignment of the data throughout the code unless hinted otherwise; at the very least it should produce a warning message. But as I said, that is not the issue here, as proven by the initial, aligned, version.
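(For reference, an explicit alignment hint to the Intel compiler would look roughly like the snippet below; this is purely an illustration of what "hinted otherwise" means and is not used anywhere in the code above.)

    // Illustration only: telling the Intel compiler that data is 64-byte aligned.
    void scale(double* p, long long n) {
        __assume_aligned(p, 64);   // icc/icl builtin: promise that p is 64-byte aligned
    #pragma vector aligned         // assert aligned accesses for the following loop
        for (long long i = 0; i < n; i++) {
            p[i] *= 2.0;
        }
    }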
Best regards,
Marko
I can't seem to locate any information about which bugs have been fixed in the compiler; is this issue fixed in the latest release?
Best regards,
Marko
If anyone is interested, the problem still exists in Intel C++ 17: the code attached to a previous post produces wrong results on both platforms (Windows and Linux).
Best regards,
Marko
