Compile this for an AVX system (Intel C++) and compare the runtimes of the two loops.
The first loop (not using dvec.h) generates nice vector code, but incorporates vinsertf128/vextractf128 instructions in the main computational section of the loop. This shows as 7 memory references.
The second loop (using dvec.h) also generates nice vector code, and does not use vinsert/vextract in the main computational section of the loop. This shows as 4 memory references.
*** Yet the first loop runs faster, by about 2x!
Looking at the disassembly, the compiler interleaves the reads (without unrolling) in the first loop but not in the second.
This may be a good example for your compiler optimization team to examine for optimization opportunities.
[cpp]// Felix.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "dvec.h"
#include "omp.h"
#include <iostream>
#define USE_AVX
#ifdef USE_AVX
struct aosoa
{
F32vec8 a1;
F32vec8 a2;
F32vec8 a3;
F32vec8 a4;
};
const int VecWidth = 8;
#else
struct aosoa
{
F32vec4 a1;
F32vec4 a2;
F32vec4 a3;
F32vec4 a4;
};
const int VecWidth = 4;
#endif
const int N = 64*1024*1024;
const int Nvecs = N / VecWidth;
float* a1; // [N];
float* a2; // [N];
float* a3; // [N];
float* a4; // [N];
aosoa* s; // [Nvecs];
float coeff1 = 1.2345f;
float coeff2 = .987654321f;
int _tmain(int argc, _TCHAR* argv[])
{
a1 = new float[N];
a2 = new float[N];
a3 = new float[N];
a4 = new float[N];
s = new aosoa[Nvecs];
std::cout << "a1 " << &a1[0] << std::endl;
std::cout << "a2 " << &a2[0] << std::endl;
std::cout << "a3 " << &a3[0] << std::endl;
std::cout << "a4 " << &a4[0] << std::endl;
std::cout << "s " << &s[0] << std::endl;
for(int i=0; i<N; ++i)
{
a1[i] = 1.0f / float(i);
a2[i] = 2.0f / float(i);
a3[i] = 3.0f / float(i);
a4[i] = 4.0f / float(i);
}
for(int j=0; j<Nvecs; ++j)
{
for(int i=0; i<VecWidth; ++i)
{
s[j].a1[i] = (float(i) + 1.0f) / float(j*VecWidth + i);
}
}
double totAOS = 0.0;
double totPAOS = 0.0;
for(int iRep=0; iRep < 50; ++iRep)
{
// test 1
double t0 = omp_get_wtime();
for (int i=0; i<N; ++i)
{
a3[i] = a2[i]*a2[i]*coeff1*a3[i] - a2[i] + coeff2*a1[i];
}
double t1 = omp_get_wtime();
totAOS += t1 - t0;
// test 2
for (int i=0; i<Nvecs; ++i)
{
s[i].a3 = s[i].a2*s[i].a2*coeff1*s[i].a3 - s[i].a2 + s[i].a1*coeff2;
}
double t2 = omp_get_wtime();
totPAOS += t2 - t1;
}
std::cout << totAOS << " " << totPAOS << std::endl;
return 0;
}
[/cpp]
Jim Dempsey
Hello Jim,
I see you've already posted a link to this thread in the compiler forum:
As it's about the C++ class libraries, it fits better in the compiler forum. Let's continue the discussion there.
I'll close the thread at hand.
Best regards,
Georg Zitzlsberger