Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

strange performance with c++ 11.1 compiler and std::vector

bmort
Beginner
690 Views

After successfully installing the Intel C++ compiler under Ubuntu 9.10 64-bit using the guide found at http://software.intel.com/en-us/articles/using-intel-compilers-for-linux-with-ubuntu/, I've been seeing some very strange results from a test code comparing the performance of STL vector to C-style arrays. The results from my i5 750 using the Intel C++ 11.1 compiler (and g++ 4.4.1 for comparison) are as follows:

Intel c++ compiler v11.1

===========================================================

Average time using C-style arrays: 0.64 seconds (3108.8 MFLOPS).
Average time using C++ vector: 3.64 seconds (549.5 MFLOPS).

gnu g++ 4.4.1

============================================================

Average time using C-style arrays: 1.25 seconds (1600.0 MFLOPS).
Average time using C++ vector: 1.23 seconds (1630.4 MFLOPS).

As you can see, while the performance for C-style arrays is very favourable with the Intel compiler, something is going seriously wrong when using std::vector containers. So far I've had no luck resolving this issue, so any pointers as to what might be going wrong, or any advice, would be very welcome.

(Attached is the code used for this test)
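Since the attachment isn't reproduced here, below is a minimal sketch of the kind of kernel being compared: the same naive triple loop written once over malloc'd arrays and once over std::vector, using the flattened indexing implied by the c[ij] += a[ki] * b[kj] statement quoted later in the thread. The function names follow the thread; the timing method, initialisation and MFLOPS formula are assumptions, not the original attachment.

[bash]#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

// c += a^T * b over n x n matrices stored as flat 1-D buffers.
static void test_array(int n)
{
    double* a = (double*)std::malloc(n * n * sizeof(double));
    double* b = (double*)std::malloc(n * n * sizeof(double));
    double* c = (double*)std::calloc(n * n, sizeof(double));
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; }

    std::clock_t t0 = std::clock();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i * n + j] += a[k * n + i] * b[k * n + j];
    double secs = double(std::clock() - t0) / CLOCKS_PER_SEC;
    std::printf("C-style arrays: %.2f s (%.1f MFLOPS)\n",
                secs, 2.0 * n * n * n / secs / 1e6);
    std::free(a); std::free(b); std::free(c);
}

// Same loop nest, but the buffers are std::vector and every access
// goes through operator[].
static void test_vector(int n)
{
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    std::clock_t t0 = std::clock();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i * n + j] += a[k * n + i] * b[k * n + j];
    double secs = double(std::clock() - t0) / CLOCKS_PER_SEC;
    std::printf("std::vector:    %.2f s (%.1f MFLOPS)\n",
                secs, 2.0 * n * n * n / secs / 1e6);
}

int main(int argc, char** argv)
{
    int n = (argc > 1) ? std::atoi(argv[1]) : 500;
    test_array(n);
    test_vector(n);
    return 0;
}[/bash]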
0 Kudos
12 Replies
Brandon_H_Intel
Employee
690 Views
Where are you getting your cblas.h from? It doesn't look like it's from MKL.
0 Kudos
bmort
Beginner
690 Views

Thanks for the prompt reply ;) You are quite right: the code uses the GNU cblas for the functions test_cblas_array() and test_cblas_vector(), and while I'm confident that using MKL would improve their performance, it is in fact the functions test_array() and test_vector() that are giving me the results (posted above) I can't explain.

If, however, I'm missing the point and I should be using some MKL version of std::vector, I'd be very grateful to find out.

0 Kudos
bmort
Beginner
690 Views

I did a bit more testing with -vec-report and -no-vec, and it turns out that the reason for the difference between the GNU and Intel compilers for the C-style array function (test_array()) is that the vectoriser successfully vectorises the inner loop only with the Intel compiler.

Sadly, this did not help explain the performance of icc/icpc on the test_vector() code, other than to show it has nothing directly to do with the vectoriser.
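For reference, the command lines were along these lines (the exact options and file name here are assumptions rather than copies of the original post):

[bash]# vectorisation enabled, with a report of which loops vectorise
icpc -O2 -vec-report2 vector_test.cpp -o test_intel_vec

# vectorisation disabled
icpc -O2 -no-vec vector_test.cpp -o test_intel_novec

# g++ for comparison; auto-vectorisation is requested at -O3
g++ -O3 vector_test.cpp -o test_gnu[/bash]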

Results from an Intel Core 2 Duo CPU E6750 @ 2.66 GHz, using the same code as attached to the original post:

-- intel compiler with vectorisation disabled ---

Average time using C-style arrays: 0.15 seconds (1704.5 MFLOPS).
Average time using C++ vector: 0.43 seconds (576.9 MFLOPS).

-- intel compiler with vectorisation enabled ---

Average time using C-style arrays: 0.08 seconds (3125.0 MFLOPS).
Average time using C++ vector: 0.43 seconds (576.9 MFLOPS).

-- gnu compiler with or without vectorisation (makes no difference) ---

Average time using C-style arrays: 0.15 seconds (1666.7 MFLOPS).
Average time using C++ vector: 0.15 seconds (1630.4 MFLOPS).

0 Kudos
AdamB1
Beginner
690 Views

The reason for the difference between C-style arrays and C++ vectors with the Intel compiler on Windows (and, I would assume, on Linux as well) is directly related to the number of function calls each makes. The statement

c[ij] += a[ki] * b[kj];

resolves to one line of code for the C-style array, while it resolves to five for the C++ vector. For the vector, you get the following calls for every []:

size() - returns the size of the vector
a test condition
size() - returns the size of the vector
statement to return the value

From here, I think it is compiler optimization. Running those statements with a matrix size of 1000 and debugging turned off results in 1.22 seconds for the C-style arrays and 2.64 for the C++ vector. You really don't want to run the test with debugging information included (no -g option in g++) because it turns off a lot of the optimization for the series of instructions used to resolve the [] operators.

Note that adding a range check to the C-style array on my computer (the line if(ki < matrixSize && kj < matrixSize) right above the c[ij] statement) changes the times to 2.62 for the C-style array and 2.64 for the C++ vector. Just for comparison, Microsoft's compiler gave 3.95 for C-style arrays and 5.22 with the range check included, so different optimization is definitely being done depending on the compiler.

So if I had to guess what is causing the actual difference, I would say that g++ is not checking that the indices passed to the vector are actually inside the vector bounds during the loop. If they are linking against the same library, then some optimization in g++ would either be doing the check at compile time or removing it altogether. The easy way to test that would be to try to index past the end of the vector and see what happens - although that is not the safest option. It would be the next thing I would check if I were you. Unfortunately, since you have to turn debugging information off to get the code to optimize correctly, you can't just step through the code with gdb.
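For what it's worth, a minimal version of that (admittedly unsafe) probe might look like the sketch below; at() is included only to contrast the checked and unchecked accessors:

[bash]#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> v(10, 1.0);
    std::printf("%f\n", v[10]);    // unchecked operator[]: out-of-range access is undefined behaviour
    std::printf("%f\n", v.at(10)); // checked access: throws std::out_of_range
    return 0;
}[/bash]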

0 Kudos
TimP
Honored Contributor III
690 Views
You are pushing the limits of optimization in several ways. Apparently, icc sees the arrays created by separate malloc() calls and relies on them not overlapping, while g++, if you are using a recent version for which -O3 requests auto-vectorization, does not.
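A common way to hand icc the same no-overlap guarantee in the vector version is to hoist raw pointers out of the vectors before the hot loops. This is only a sketch of that idea, not the attached code, and it uses the non-standard __restrict__ extension that both icc and g++ accept:

[bash]#include <vector>

// c += a^T * b with the vector data accessed through restrict-qualified
// raw pointers, so the compiler may assume the three buffers do not overlap.
void multiply(const std::vector<double>& av, const std::vector<double>& bv,
              std::vector<double>& cv, int n)
{
    const double* __restrict__ a = &av[0];
    const double* __restrict__ b = &bv[0];
    double*       __restrict__ c = &cv[0];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i * n + j] += a[k * n + i] * b[k * n + j];
}[/bash]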
0 Kudos
Brandon_H_Intel
Employee
690 Views

Hi bmort,

While bringing up MKL was not unintentional, I was mostly just trying to determine where your cblas.h was coming from so I could build and run your test case. That being said, I downloaded GSL from http://www.gnu.org/software/gsl/ but it prepends all its headers with gsl_, so I get gsl_cblas.h. Is that the right package, or am I missing something?

0 Kudos
bmort
Beginner
690 Views

Thanks for the replies and the interesting feedback. I'm going to do a bit of testing with the debugger and look into the source code of the STL vector class template to see if I can narrow this down further, and I'll post again once I've done a bit of digging.

Anyway, apologies for the confusion over cblas.h; the one I'm using is the version from the Ubuntu 9.10 repository, which after a bit of searching around I've found to be the ATLAS cblas package (http://www.netlib.org/atlas/).

0 Kudos
bmort
Beginner
690 Views

While the problem is quite likely something to do with the way g++ is managing to optimise, after a bit of searching through the stl_vector.h code I think I can eliminate range checking as a possible cause of the speed difference, since the code for the [] operator is unchecked:

---- snip stl_vector.h ---------------------------------------------------------

/**
* @brief Subscript access to the data contained in the %vector.
* @param n The index of the element for which data should be
* accessed.
* @return Read-only (constant) reference to data.
*
* This operator allows for easy, array-style, data access.
* Note that data access with this operator is unchecked and
* out_of_range lookups are not defined. (For checked lookups
* see at().)
*/
const_reference
operator[](size_type __n) const
{ return *(this->_M_impl._M_start + __n); }

--------------------------------------------------------------------------------


Anyway, the problem does seem to be some optimisation that g++ 4.4 is managing and icc 11.1 is not. Compiling the code with different optimisation flags and a matrix size of 500 (running on the Core 2 E6750) I get:

---- g++ with -O0 ----

Average time using C-style arrays: 1.01 seconds (246.7 MFLOPS).
Average time using C++ vector: 1.91 seconds (131.1 MFLOPS).

---- icc with -O0 ----

Average time using C-style arrays: 1.02 seconds (245.1 MFLOPS).
Average time using C++ vector: 2.00 seconds (125.2 MFLOPS).

--- g++ with -O1 ---

Average time using C-style arrays: 0.19 seconds (1339.3 MFLOPS).
Average time using C++ vector: 0.43 seconds (581.4 MFLOPS).

--- icc with -O1 ---

Average time using C-style arrays: 0.14 seconds (1744.2 MFLOPS).
Average time using C++ vector: 0.43 seconds (581.4 MFLOPS).

--- g++ with -O2 ---

Average time using C-style arrays: 0.15 seconds (1630.4 MFLOPS).
Average time using C++ vector: 0.16 seconds (1595.7 MFLOPS).

--- icc with -O2 ---

Average time using C-style arrays: 0.08 seconds (3125.0 MFLOPS).
Average time using C++ vector: 0.43 seconds (581.4 MFLOPS).

(Note: -O3 makes no difference for either the g++ or icc compilers, so I've omitted those results.)

So it would seem there is some optimisation between -O1 and -O2/-O3 in the GNU compiler that the Intel compiler is not finding for this code when using the std::vector [] operator. I'm going to keep playing around with different flags, but this is the best I can find so far.

0 Kudos
AdamB1
Beginner
690 Views

O2 is instruction ordering. That is actually why I said I thought it might be an in-range check in the [] operator. With debugging on, the Windows versions of the vector STL library do include range checking. However, as I mentioned in my first post, debugging needs to be turned off to avoid that. When I did that and checked the assembly, I got the following segments for the two loops:

C++ style (the actual inner loop - there are some other instructions as well, but they are fairly equal between the C and C++ versions):

[bash]000000013FB5173B lea eax,[r8+rdx]
000000013FB5173F lea edi,[r14+rdx]
000000013FB51743 movsxd rax,eax
000000013FB51746 cmp rax,qword ptr [rbp+100h]
000000013FB5174D jae test_vector+36Dh (13FB518EDh)
000000013FB51753 movsxd rdi,edi
000000013FB51756 cmp rdi,r11
000000013FB51759 jae test_vector+36Dh (13FB518EDh)
000000013FB5175F movsd xmm0,mmword ptr [r13+rax*8]
000000013FB51766 mulsd xmm0,mmword ptr [r12+rdi*8]
000000013FB5176C inc edx
000000013FB5176E cmp edx,ebx
000000013FB51770 addsd xmm1,xmm0
000000013FB51774 movaps xmm0,xmm1
000000013FB51777 jl test_vector+1BBh (13FB5173Bh)[/bash]


For the C-style array I got the following:

[bash]000000013FB51185 lea rcx,[rbx+r9*8]
000000013FB51189 movsd xmm0,mmword ptr [rcx+rdx*8]
000000013FB5118E mulsd xmm0,mmword ptr [r11+rdx*8]
000000013FB51194 inc rdx
000000013FB51197 cmp rdx,rsi
000000013FB5119A addsd xmm1,xmm0
000000013FB5119E jl main+189h (13FB51189h)[/bash]

The real difference between these two is a couple of compare statements. If you add the if statement from my first post, it actually turns the C assembly into something like this:

[bash]000000013FD610E8 lea r15d,[r10+rdx] 
000000013FD610EC lea eax,[r13+rdx] 
000000013FD610F1 cmp r15d,edi 
000000013FD610F4 jge main+112h (13FD61112h) 
000000013FD610F6 cmp eax,edi 
000000013FD610F8 jge main+112h (13FD61112h)
000000013FD610FA movsxd r15,r15d 
000000013FD610FD movsd xmm0,mmword ptr [rsi+r15*8] 
000000013FD61103 movsxd rax,eax 
000000013FD61106 mulsd xmm0,mmword ptr [rbx+rax*8] 
000000013FD6110B addsd xmm1,xmm0 
000000013FD6110F movaps xmm0,xmm1 
000000013FD61112 inc edx 
000000013FD61114 cmp edx,r12d 
000000013FD61117 jl main+0E8h (13FD610E8h) [/bash]

That is actually very similar to the C++ assembly, so I tried to chase it down. The two compare statements (cmp rdi,r11 and cmp rax,qword ptr [rbp+100h]) are both comparing the current counter with the dimension of the matrix (which also happens to be the length of the arrays). That is where I came up with range checking as the possible culprit. On Windows at least, this seems reasonable, as the code for the [] operator is:

[bash]reference operator[](size_type _Pos)
    {   // subscript mutable sequence
#if _HAS_ITERATOR_DEBUGGING
    if (size() <= _Pos)
        {
        _DEBUG_ERROR("vector subscript out of range");
        _SCL_SECURE_OUT_OF_RANGE;
        }
#endif /* _HAS_ITERATOR_DEBUGGING */
    _SCL_SECURE_VALIDATE_RANGE(_Pos < size());
    return (*(_Myfirst + _Pos));
    }[/bash]

I believe that on Windows the culprit is the _SCL_SECURE_VALIDATE_RANGE statement. If I am reading the code correctly, that is what adds the in-range checks to the [] operator. So that brings us back to your situation on Linux. After you pointed out the code, I looked at the stl_vector header and came up with the same thing you did: the operator seems to just index into a C-style array. I tried checking whether either the argument to [] or _M_impl._M_start could resolve as iterators, since that would cause the issue, but both seem to resolve only to the types you would want them to.
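If that is the case, the usual knob on that generation of Visual C++ is the _SECURE_SCL macro (an assumption about the toolchain in use here, not something verified in this thread); defining it to 0 before any standard header removes the release-mode check:

[bash]// Disable the VC++ release-mode range validation in operator[]
// before including any standard library header.
#define _SECURE_SCL 0
#include <vector>[/bash]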

I'm afraid I can't be of too much further use at this point. I am pretty sure the difference on my machine is being caused by the range checking in the [] operator. As it looks like that is almost definitely not the case on Linux, and I don't have ready access to the Intel compiler on my Linux boxes, I would only be guessing.

0 Kudos
TimP
Honored Contributor III
690 Views
I struggled recently with a case where the code didn't optimize (except with a run-time failure), so I gave up and used a cblas_mkl function call.
0 Kudos
JenniferJ
Moderator
690 Views

It may be related to the inlining optimization. I'll send it to the compiler engineers to find out why.

I found the following interesting results on Windows (on Linux there's no change between -inline-level=1 and 2):

>>s-icl-O2-Ob0.exe 500

Average time using C-style arrays: 0.28 seconds (909.1 MFLOPS).

Average time using C++ vector: 5.49 seconds (45.5 MFLOPS).

>>s-icl-O2-Ob1.exe 500

Average time using C-style arrays: 0.28 seconds (909.1 MFLOPS).

Average time using C++ vector: 1.18 seconds (211.5 MFLOPS).

If I hear any news, I'll let you know.

Thanks for the interesting testcase.

Jennifer

0 Kudos
JenniferJ
Moderator
689 Views

It's not an inlining issue. It's an optimization issue related to aliasing.

Try adding the option "-ansi-alias". It helps some, but the performance still isn't great. But wait for our next compiler generation; there's a big improvement for this case.
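For example (the source file name is assumed):

[bash]icpc -O2 -ansi-alias vector_test.cpp -o vector_test[/bash]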


Jennifer
0 Kudos