I am creating a simple matrix multiplication procedure that runs on the Intel Xeon Phi architecture.
I am using aligned data. However, if the matrices are allocated dynamically (with posix_memalign), the computation slows down severely: for FTYPE=float and 512x512 matrices it takes ~0.55s in the dynamic case versus ~0.07s in the static case.
On a different architecture (Intel Xeon E5-2650 @ 2.00GHz) the problem changes: the statically allocated case doesn't calculate the matrix (it gives me all zeros when I print a random position of C), I think because of the #pragma simd. The dynamically allocated case, however, takes about 0.08s.
Here is the code; I have also attached the optimization reports for the static and dynamic cases:
#define ROW 512
#define COLWIDTH 512
#define REPEATNTIMES 512

#include <sys/time.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define FTYPE float
#define ALIGNMENT 128

double clock_it(void)
{
    double duration = 0.0;
    struct timeval start;
    gettimeofday(&start, NULL);
    duration = (double)(start.tv_sec + start.tv_usec/1000000.0);
    return duration;
}

int main()
{
    double execTime = 0.0;
    double startTime, endTime;
    int k, size1, size2, i, j;

#ifdef STACK
    printf("Using Stack!\n");
    FTYPE a[ROW][COLWIDTH];
    FTYPE b[ROW][COLWIDTH];
    FTYPE c[ROW][COLWIDTH];
    for(i=0; i<ROW; i++){
        for(j=0; j<COLWIDTH; j++){
            a[i][j] = 1.0f;
            b[i][j] = 1.0f;
            c[i][j] = 0.0f;
        }
    }
#else
    printf("Using Heap!\n");
    FTYPE **a;
    posix_memalign((void **) &a, ALIGNMENT, sizeof(FTYPE*)*ROW);
    FTYPE **b;
    posix_memalign((void **) &b, ALIGNMENT, sizeof(FTYPE*)*ROW);
    FTYPE **c;
    posix_memalign((void **) &c, ALIGNMENT, sizeof(FTYPE*)*ROW);
    for(i=0; i<ROW; i++){
        posix_memalign((void **) &a[i], ALIGNMENT, sizeof(FTYPE)*COLWIDTH);
        posix_memalign((void **) &b[i], ALIGNMENT, sizeof(FTYPE)*COLWIDTH);
        posix_memalign((void **) &c[i], ALIGNMENT, sizeof(FTYPE)*COLWIDTH);
        for(j=0; j<COLWIDTH; j++){
            a[i][j] = 1.0f;
            b[i][j] = 1.0f;
            c[i][j] = 0.0f;
        }
    }
#endif
    size1 = ROW;
    size2 = COLWIDTH;
    printf("\nROW:%d COL: %d\n",ROW,COLWIDTH);

    //start timing the matrix multiply code
    startTime = clock_it();
#ifndef STACK
    __assume_aligned(a, ALIGNMENT);
    __assume_aligned(b, ALIGNMENT);
    __assume_aligned(c, ALIGNMENT);
#endif
#pragma vector aligned
    for (i = 0; i < REPEATNTIMES; i++) {
#pragma vector aligned
        for (k = 0; k < size1; k++) {
#pragma simd
#pragma vector aligned
            for (j = 0; j < size2; j++) {
#ifndef STACK
                __assume_aligned(a[i], ALIGNMENT);
                __assume_aligned(b[k], ALIGNMENT);
                __assume_aligned(c[i], ALIGNMENT);
#endif
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    endTime = clock_it();
    execTime = endTime - startTime;
    printf("Execution time is %2.3f seconds\n", execTime);
    printf("GigaFlops = %f\n", (((double)REPEATNTIMES * (double)COLWIDTH * (double)ROW * 2.0) / (double)(execTime))/1000000000.0);
    printf("Random c_i,j %f\n", c[rand()%512][rand()%512]);
    return 0;
}
Any help is appreciated!
Is the assignment at line 83 to array c what you were intending?
It seems perhaps the use of "i" might not have been intended since that is associated with REPEATNTIMES whereas the arrays are sized based on ROWS and COLWIDTH; thus I wonder if line 83 maybe should be:
c += a * b ;
I see a difference in execution times with dynamic allocation when running natively on the coprocessor so I will investigate further and post again after I know more.
Kevin Davis (Intel) wrote:
Is the assignment at line 83 to array c what you were intending?
It seems perhaps the use of "i" might not have been intended since that is associated with REPEATNTIMES whereas the arrays are sized based on ROWS and COLWIDTH; thus I wonder if line 83 maybe should be:
c += a * b ;
Thanks for answering. Maybe I didn't understand your question, but that is exactly what I intended (the i-k-j order is just the usual optimization of the naive i-j-k algorithm for matrix multiplication). The loop-bound names you're referring to are legacy names inherited from the Intel sample vectorization code in the composerxe folder, which I started from after my old code began to run very slowly.
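To make the loop-order point concrete, here is a minimal sketch of the two orderings side by side. The function names `matmul_ijk` and `matmul_ikj` and the small size `N` are illustrative assumptions, not taken from the posted program, and the sketch leaves out the pragmas and alignment hints.

```c
#include <stddef.h>

#define N 4  /* small illustrative size (assumption), not the 512 used in the post */

/* Naive i-j-k order: the innermost loop (k) steps down a column of b,
 * so consecutive iterations touch b with a stride of N floats. */
void matmul_ijk(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* i-k-j order: the innermost loop (j) steps along a row of b and of c,
 * so consecutive iterations touch adjacent floats. */
void matmul_ikj(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both functions accumulate into c, so c has to be zeroed before the call. For each element c[i][j] the products are added in the same order of k in both versions, so the two results should match exactly; only the memory access pattern differs.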
Kevin Davis (Intel) wrote:
I see a difference in execution times with dynamic allocation when running natively on the coprocessor so I will investigate further and post again after I know more.
I'm starting to think that my icc compiler isn't working well. Is it possible that it doesn't align the data?
Am I missing something?
P.S. The #define ALIGNMENT at the top is set to 64, not 128; I pasted it wrong.
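One way to rule out the alignment question at run time is to test the addresses that posix_memalign returns. The sketch below assumes two hypothetical helpers, `is_aligned` and `alloc_aligned_floats`, which are not part of the posted program. It only checks the pointer values; it says nothing about whether the compiler actually emitted aligned vector instructions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* makes posix_memalign visible under strict -std= modes */
#endif
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: allocates `count` floats on an `alignment`-byte boundary.
 * Returns NULL on failure; posix_memalign reports errors through its return
 * value, so that value is checked here. */
float *alloc_aligned_floats(size_t count, size_t alignment)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, count * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}
```

In the heap variant, calling `is_aligned(a[i], 64)` on each row after allocation (and likewise for b and c) would show whether every row pointer really sits on a 64-byte boundary.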
I was just noting that where ROWS = REPEATNTIMES there isn't a concern, but where REPEATNTIMES > ROWS the loop accesses beyond the array row dimension for a and c.
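The bounds concern can be sketched as a guarded version of the i-k-j loop. This is an illustration under assumptions, not the posted code: the function name `matmul_ikj_checked` is hypothetical, and it uses flat row-major arrays rather than the row-pointer layout from the original program.

```c
#include <stddef.h>

/* Hypothetical sketch: a is rows x inner, b is inner x cols, c is rows x cols,
 * all stored row-major in flat arrays. `outer` plays the role of the bound on
 * the i loop (REPEATNTIMES in the post). Returns -1 without touching c when
 * `outer` would index past the last row of a and c, and 0 otherwise. */
int matmul_ikj_checked(size_t outer, size_t rows, size_t inner, size_t cols,
                       const float *a, const float *b, float *c)
{
    if (outer > rows)
        return -1;  /* i would run past the row dimension of a and c */

    for (size_t i = 0; i < outer; i++)
        for (size_t k = 0; k < inner; k++)
            for (size_t j = 0; j < cols; j++)
                c[i * cols + j] += a[i * inner + k] * b[k * cols + j];
    return 0;
}
```

With outer equal to rows the guard never fires, which corresponds to the case where the two constants are both 512.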