Hi,
I am vectorizing the following matrix multiplication nested loop
----------------------------------------------------------------------
int col[100];
float a[100], x[100], t[100];
int i,j,k;
int n = 100;
for (i=0; i<n; i++) {
    t[i] = 0.0;
}
for (j=0; j<n; j++) {
    for (k = col[j]; k < ...; k++) {
        t[k] = t[k] + a[k] * x[j];
    }
}
-----------------------------------------------
As you can see, icc will have to deal with alignment issues in the inner loop. I would like to use icc intrinsics to handle this alignment, but since I have never done it, I am at a loss as to how to go about it. Any pointers? (I am running this in VS 2008.)
Thanks
sachs
5 Replies
icc should do as well by auto-vectorizing the C source as you could do with intrinsics, and that would save you from writing another version a year from now for AVX. If icc thinks the loop is too short to vectorize, try a larger problem, #pragma vector always, or something like #pragma loop count(18).
You wouldn't use intrinsics for the alignment. You simply break the loop into 3 stages: up to 3 scalar iterations until an aligned boundary in t[] is reached, then a loop with parallel instructions until only a few iterations remain, then scalar iterations to finish.
For Barcelona and Core i7 you would not worry about the alignment of a[]; simply use the unaligned load intrinsic. For earlier Intel CPUs, you might make 2 versions: one for the case where a[] is aligned relative to t[] and the multiply intrinsic can take a memory operand, the other using split or unaligned loads.
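A minimal sketch of that three-stage structure with SSE intrinsics, assuming the inner loop has the form t[k] += a[k] * x[j] over a contiguous range of k. The helper name axpy_range and its parameters are hypothetical and not taken from the original post; this is one way to lay it out, not a tuned implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: t[k] += a[k] * xj for k in [begin, end). */
static void axpy_range(float *t, const float *a, float xj,
                       size_t begin, size_t end)
{
    size_t k = begin;
    __m128 vx = _mm_set1_ps(xj);   /* broadcast xj to all four lanes */

    /* Stage 1: scalar iterations until &t[k] is 16-byte aligned
       (at most 3, since a float is 4 bytes). */
    while (k < end && ((uintptr_t)(t + k) & 15u) != 0) {
        t[k] += a[k] * xj;
        ++k;
    }

    /* Stage 2: four floats per iteration. t is aligned here;
       a may not be, so it is read with an unaligned load. */
    for (; k + 4 <= end; k += 4) {
        __m128 vt = _mm_load_ps(t + k);    /* aligned load   */
        __m128 va = _mm_loadu_ps(a + k);   /* unaligned load */
        vt = _mm_add_ps(vt, _mm_mul_ps(va, vx));
        _mm_store_ps(t + k, vt);           /* aligned store  */
    }

    /* Stage 3: scalar remainder (at most 3 iterations). */
    for (; k < end; ++k) {
        t[k] += a[k] * xj;
    }
}
```

Under that assumption, the outer loop would call something like axpy_range(t, a, x[j], first, last) once per j, with first and last taken from whatever bounds the inner loop uses.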
Quoting - tim18
Thanks
sachs
I have an example of explicit alignment for SSE intrinsics in my posted C code at
http://sites.google.com/site/tprincesite/levine-callahan-dongarra-vectors
Quoting - sacharina
Try this:
__declspec(align(16)) int col[1000];
__declspec(align(16)) float a[1000];
__declspec(align(16)) float x[1000];
__declspec(align(16)) float t[1000];
You'll also have to use some variant of the -msse compiler switch, and make sure you are using a reasonably recent compiler. I've noticed that older compilers did not always vectorize well.
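As a rough illustration of checking that such declarations took effect, the sketch below wraps the alignment specifier in a macro and tests the array addresses at run time. The ALIGN16 macro and the is_aligned16 helper are hypothetical names introduced here for illustration; the non-Windows branch assumes a GCC-style compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portability macro: __declspec on Microsoft-compatible
   compilers, the GCC-style attribute elsewhere. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

static ALIGN16 int   col[1000];
static ALIGN16 float a[1000];
static ALIGN16 float x[1000];
static ALIGN16 float t[1000];

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Note that this only guarantees the alignment of the start of each array. An inner loop that begins at an arbitrary index k can still start at an unaligned address.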
-Jeff
Quoting - jeff_keasler
sachs