Beginner
108 Views

using icc intrinsics to do alignment

Hi,

I am vectorizing the following matrix multiplication nested loop

----------------------------------------------------------------------
int col[100];
float a[100], x[100], t[100];
int i,j,k;
int n = 100;

for (i=0; i<n; i++) {
  t[i] = 0.0;
}

for (j=0; j<n; j++) {
  for (k = col[j]; k<col[j+1]; k++) {
    t[k] = t[k] + a[k] * x[j];
  }
}
-----------------------------------------------
As you can see, icc will have to deal with alignment issues in the inner loop. I would like to use icc intrinsics to handle this alignment, but since I have never done it I am at a loss as to how to go about it. Any pointers? (I am running this in VS 2008.)

Thanks

sachs
5 Replies
Black Belt
icc should do as well by auto-vectorization of the C source as you could do with intrinsics, and would save you from writing another version a year from now for AVX. If icc thinks the loop is too short for vectorization, try a larger problem or #pragma vector always or something like #pragma loop count(18).
You wouldn't use intrinsics for the alignment. You simply break the loop into 3 stages: up to 3 scalar iterations until an aligned boundary in t[] is reached, then a loop with parallel instructions until only a few iterations remain, finishing up with scalar iterations.
For Barcelona and Core i7 you would not worry about the alignment of a[]; simply use an unaligned load intrinsic. For earlier Intel CPUs, you might make 2 versions: one for the case where a[] is aligned relative to t[], so the multiply intrinsic can take a memory operand, and the other using split or unaligned loads.
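
A minimal sketch of the three-stage loop described above, written with SSE intrinsics. It assumes the inner loop is an update of the form t[k] += a[k] * s over a range [lo, hi), which is a guess at its shape; the function name axpy_range and its parameters are hypothetical, not taken from the original code. t[] is peeled to a 16-byte boundary and then accessed with aligned loads and stores, while a[] is always read with an unaligned load.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, ... */

/* Hypothetical helper: t[k] += a[k] * s for every k in [lo, hi). */
static void axpy_range(float *t, const float *a, float s, size_t lo, size_t hi)
{
    size_t k = lo;
    __m128 vs = _mm_set1_ps(s);  /* s broadcast to all 4 lanes */

    assert(lo <= hi);

    /* Stage 1: scalar iterations until &t[k] sits on a 16-byte boundary.
       For a naturally aligned float array this takes at most 3 steps.
       The pointer-to-size_t cast assumes a flat address space. */
    while (k < hi && ((size_t)(t + k) & 15u) != 0) {
        t[k] += a[k] * s;
        ++k;
    }

    /* Stage 2: 4 floats per iteration. t[] is aligned here; a[] may not
       be, so it is read with an unaligned load. */
    for (; k + 4 <= hi; k += 4) {
        __m128 va = _mm_loadu_ps(a + k);
        __m128 vt = _mm_load_ps(t + k);
        vt = _mm_add_ps(vt, _mm_mul_ps(va, vs));
        _mm_store_ps(t + k, vt);
    }

    /* Stage 3: scalar iterations for the 0 to 3 leftover elements. */
    for (; k < hi; ++k) {
        t[k] += a[k] * s;
    }
}
```

Calling axpy_range(t, a, x[j], lo, hi) once per outer iteration would replace the inner loop under that assumption. Timing it against the auto-vectorized C source is a reasonable way to check whether the hand-written version gains anything.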

Beginner
Thanks tim18. I agree with you, but in this case I want to use this example as a learning project for icc intrinsics, since it is quite simple. I am not sure how to go about it, though, so a short step-by-step guide would be much appreciated. I will then follow up with a comparison against auto-vectorization.

Thanks

sachs
Black Belt
I have an example of explicit alignment for SSE intrinsics in my posted C code at

Beginner
Try this:

__declspec(align(16)) int col[1000] ;
__declspec(align(16)) float a[1000] ;
__declspec(align(16)) float x[1000] ;
__declspec(align(16)) float t[1000] ;

You'll also have to use some variant of the -msse compiler switch, and make sure you are using a fairly recent compiler. I've noticed that older compilers didn't always vectorize well.

-Jeff
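
A small sketch of what the aligned declarations above make possible, assuming 16-byte alignment is what is wanted for SSE. The ALIGN16 macro and the names is_aligned16 and mul_aligned are hypothetical, introduced only for illustration; the macro expands to __declspec(align(16)) under Visual Studio and to the GCC-style attribute elsewhere. Once every array is known to be aligned, the plain aligned load and store intrinsics can be used without any peeling.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical portability macro for a 16-byte aligned declaration. */
#if defined(_MSC_VER)
#  define ALIGN16 __declspec(align(16))
#else
#  define ALIGN16 __attribute__((aligned(16)))
#endif

static ALIGN16 float a[1000];
static ALIGN16 float x[1000];
static ALIGN16 float t[1000];

/* Returns 1 if p is on a 16-byte boundary (assumes a flat address space). */
static int is_aligned16(const void *p)
{
    return ((size_t)p & 15u) == 0;
}

/* dst[i] = p[i] * q[i] for i in [0, n). All three pointers must be
   16-byte aligned and n must be a multiple of 4. */
static void mul_aligned(float *dst, const float *p, const float *q, size_t n)
{
    size_t i;
    assert(is_aligned16(dst) && is_aligned16(p) && is_aligned16(q));
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        __m128 vp = _mm_load_ps(p + i);  /* aligned load */
        __m128 vq = _mm_load_ps(q + i);
        _mm_store_ps(dst + i, _mm_mul_ps(vp, vq));  /* aligned store */
    }
}
```

Note that aligning the arrays only guarantees that index 0 is aligned. A loop that starts at an arbitrary index k still needs either scalar peeling or unaligned loads.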
Beginner
Thanks Jeff and tim18. I should be good to go with all these examples.

sachs