Intel® ISA Extensions

AVX is not effective

peryli

My program solves lower triangular systems of equations. I tested the serial version and the AVX version, and the results are:

Matrix Order : 10000x4
Matrix Band : 5000x4
Data Layout : 4x4 blocks, column-major
Data Type : double
CPU : Core i7 2600
memory size : 16 GB
platform : VS2010

serial version: 0.357240s ( averaged over 100 runs )

SIMD (AVX) version : 0.338360s ( averaged over 100 runs )

My question is: why does the AVX version give almost no speedup?

Here is my code:

[bash]inline void solveL_pivot( double* x, const double* L )
{
    // Forward substitution on one 4x4 diagonal block stored column-major:
    // L[ 0 ], L[ 5 ], L[ 10 ], L[ 15 ] are the diagonal entries.
    double e0=x[ 0 ];
    double e1=x[ 1 ];
    double e2=x[ 2 ];
    double e3=x[ 3 ];

    e0/=L[ 0 ]; e1-=e0*L[ 1 ]; e2-=e0*L[ 2 ]; e3-=e0*L[ 3 ];
    e1/=L[ 5 ]; e2-=e1*L[ 6 ]; e3-=e1*L[ 7 ];
    e2/=L[ 10 ]; e3-=e2*L[ 11 ];
    e3/=L[ 15 ];

    x[ 0 ]=e0;
    x[ 1 ]=e1;
    x[ 2 ]=e2;
    x[ 3 ]=e3;
}

void vsolveL_band( double* o, const double* c, unsigned int order, unsigned int band )
{
    double* xproxy;
    double* Lproxy;
    __m256d ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
    double* x=o;
    double* L=( double* )c;
    unsigned int const stride=band<<4;   // band * 16 doubles
    unsigned int const d=order-band;
    unsigned int n=band-1;

    // From here on the listing is garbled in the post: the text between
    // '<' and '>' appears to be missing. The fragments are kept as posted.
    for( unsigned int i=0; id ){ --n; }
    xproxy=x;
    Lproxy=( double* )L;
    solveL_pivot( xproxy, Lproxy );
    ymm0=_mm256_broadcast_sd( &xproxy[ 0 ] );
    ymm1=_mm256_broadcast_sd( &xproxy[ 1 ] );
    ymm2=_mm256_broadcast_sd( &xproxy[ 2 ] );
    ymm3=_mm256_broadcast_sd( &xproxy[ 3 ] );
    for( unsigned int k=0; kd ){ --n; }
    solveL_pivot( o, c );
    solveL_update( o, c+16, n );
    o+=4;
    c+=stride;
    }
    solveL_pivot( o, c );
}
[/bash]
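For readers following the listing: the body of `solveL_update` did not survive in the post, so the sketch below is only an assumption about what one update step for a single 4x4 column-major block might look like, namely `y -= B * e`, where `e` holds the four values just produced by `solveL_pivot`. The name `update_block` and its signature are hypothetical and do not come from the original code. The AVX path is compiled only when the compiler defines `__AVX__`; otherwise a scalar fallback with the same result is used.

```cpp
#include <cassert>
#include <cstddef>
#ifdef __AVX__
#include <immintrin.h>
#endif

// Hypothetical helper (not from the original post): y[0..3] -= B * e[0..3].
// B is one 4x4 block stored column-major, so B[ c*4 + r ] is row r, column c.
inline void update_block( double* y, const double* B, const double* e )
{
    assert( y != nullptr && B != nullptr && e != nullptr );
#ifdef __AVX__
    // Broadcast each solved value across a 256-bit register.
    __m256d e0 = _mm256_broadcast_sd( &e[ 0 ] );
    __m256d e1 = _mm256_broadcast_sd( &e[ 1 ] );
    __m256d e2 = _mm256_broadcast_sd( &e[ 2 ] );
    __m256d e3 = _mm256_broadcast_sd( &e[ 3 ] );

    // Each column of B is 4 contiguous doubles; accumulate B * e column by column.
    __m256d acc = _mm256_mul_pd( _mm256_loadu_pd( B ), e0 );
    acc = _mm256_add_pd( acc, _mm256_mul_pd( _mm256_loadu_pd( B + 4 ), e1 ) );
    acc = _mm256_add_pd( acc, _mm256_mul_pd( _mm256_loadu_pd( B + 8 ), e2 ) );
    acc = _mm256_add_pd( acc, _mm256_mul_pd( _mm256_loadu_pd( B + 12 ), e3 ) );

    // y -= acc (unaligned loads and stores, so no alignment is assumed).
    _mm256_storeu_pd( y, _mm256_sub_pd( _mm256_loadu_pd( y ), acc ) );
#else
    // Scalar fallback: same accumulation order, one row at a time.
    for( std::size_t r = 0; r < 4; ++r ) {
        double acc = 0.0;
        for( std::size_t c = 0; c < 4; ++c ) {
            acc += B[ c*4 + r ] * e[ c ];
        }
        y[ r ] -= acc;
    }
#endif
}
```

Per block this is 16 multiplies and 16 additions or subtractions against 128 bytes of matrix data, about 0.25 floating-point operations per byte loaded, so each matrix element is used exactly once.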
Brijender_B_Intel
Hi,
Can you please confirm two things (I haven't looked at the code yet):
1. Is it Windows 7 64-bit or 32-bit?
2. Is your VS2010 at SP1? If not, install SP1. VS2010 without SP1 had AVX performance issues.
peryli
Windows 7 64-bit
VS2010 with SP1

AVX gave a better speedup in the decomposition step. Could it be that in the back substitution there is not enough arithmetic to overlap the latency of data access? I'm at a loss.
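One way to test the data-access hypothesis is to estimate how much matrix data a single solve has to read and what bandwidth the measured times imply. The sketch below mirrors the loop shape visible in the listing (`d=order-band`, `n=band-1`, `if( i>d ){ --n; }`, one pivot plus `n` update blocks per step). It rests on several assumptions the post does not confirm: `order` and `band` are counted in 4x4 blocks, the outer loop runs for `i = 0 .. order-2`, every block is read from memory exactly once, and the reported times are per solve. The function names `blocks_read` and `effective_gb_per_s` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Count the 4x4 blocks read by one solve, following the loop shape in the listing.
// Assumptions: order and band are in units of 4x4 blocks; loop bound is order-1.
inline std::uint64_t blocks_read( std::uint64_t order, std::uint64_t band )
{
    assert( band >= 1 && band <= order );
    std::uint64_t const d = order - band;
    std::uint64_t n = band - 1;          // off-diagonal blocks updated per step
    std::uint64_t total = 0;
    for( std::uint64_t i = 0; i + 1 < order; ++i ) {
        if( i > d ) { --n; }             // the band shrinks near the end of the matrix
        total += 1 + n;                  // one pivot block plus n update blocks
    }
    return total + 1;                    // final pivot block after the loop
}

// Implied bandwidth in GB/s if each block (16 doubles = 128 bytes) is read
// from memory once during a solve that takes `seconds`.
inline double effective_gb_per_s( std::uint64_t order, std::uint64_t band, double seconds )
{
    double const bytes = static_cast<double>( blocks_read( order, band ) ) * 128.0;
    return bytes / seconds / 1e9;
}
```

With the figures from the post, `blocks_read( 10000, 5000 )` gives 37,502,500 blocks, about 4.8 GB per solve, so 0.357240 s and 0.338360 s would correspond to roughly 13.4 GB/s and 14.2 GB/s. If the assumptions hold, both versions already stream the matrix at a large fraction of the nominal peak of about 21 GB/s for dual-channel DDR3-1333 (itself an assumption about this machine), which would be consistent with wider arithmetic giving almost no speedup here.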