Compile this for an AVX system (Intel C++) and compare the runtimes of the two loops.
The first loop (not using dvec.h) generates nice vector code, but incorporates vinsertf128/vextractf128 instructions in the main computational section of the loop. This shows as 7 memory references.
The second loop (using dvec.h) also generates nice vector code, and does not use vinsert/vextract in the main computational section of the loop. This shows as 4 memory references.
*** Yet the first loop runs faster, by about 2x!
Looking at the disassembly, the compiler interleaves the reads (without unrolling) in the first loop but not in the second.
This may be a good example for your compiler optimization team to examine for optimization opportunities.
[cpp]// Felix.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "dvec.h"
#include "omp.h"
#include <iostream>
#define USE_AVX
#ifdef USE_AVX
struct aosoa
{
F32vec8 a1;
F32vec8 a2;
F32vec8 a3;
F32vec8 a4;
};
const int VecWidth = 8;
#else
struct aosoa
{
F32vec4 a1;
F32vec4 a2;
F32vec4 a3;
F32vec4 a4;
};
const int VecWidth = 4;
#endif
const int N = 64*1024*1024;
const int Nvecs = N / VecWidth;
float* a1; // [N];
float* a2; // [N];
float* a3; // [N];
float* a4; // [N];
aosoa* s; // [Nvecs];
float coeff1 = 1.2345f;
float coeff2 = .987654321f;
int _tmain(int argc, _TCHAR* argv[])
{
a1 = new float[N];
a2 = new float[N];
a3 = new float[N];
a4 = new float[N];
s = new aosoa[Nvecs];
std::cout << "a1 " << &a1[0] << std::endl;
std::cout << "a2 " << &a2[0] << std::endl;
std::cout << "a3 " << &a3[0] << std::endl;
std::cout << "a4 " << &a4[0] << std::endl;
std::cout << "s " << &s[0] << std::endl;
for(int i=0; i<N; ++i)
{
a1[i] = 1.0f / float(i);
a2[i] = 2.0f / float(i);
a3[i] = 3.0f / float(i);
a4[i] = 4.0f / float(i);
}
for(int j=0; j<Nvecs; ++j)
{
for(int i=0; i<VecWidth; ++i)
{
s[j].a1[i] = (float(i) + 1.0f) / float(j*VecWidth + i);
}
}
double totAOS = 0.0;
double totPAOS = 0.0;
for(int iRep=0; iRep < 50; ++iRep)
{
// test 1
double t0 = omp_get_wtime();
for (int i=0; i<N; ++i)
{
a3[i] = a2[i]*a2[i]*coeff1*a3[i] - a2[i] + coeff2*a1[i];
}
double t1 = omp_get_wtime();
totAOS += t1 - t0;
// test 2
for (int i=0; i<Nvecs; ++i)
{
s[i].a3 = s[i].a2*s[i].a2*coeff1*s[i].a3 - s[i].a2 + s[i].a1*coeff2;
}
double t2 = omp_get_wtime();
totPAOS += t2 - t1;
}
std::cout << totAOS << " " << totPAOS << std::endl;
return 0;
}
[/cpp]
Jim Dempsey
Hello Jim,
I see you've already posted a link to this thread in the compiler forum:
As it's about the C++ class libraries, it fits better in the compiler forum. Let's continue the discussion there.
I'll close the thread at hand.
Best regards,
Georg Zitzlsberger