Dear all,
I recently started using Xeon Phi cards for parallel programming, so I am still a newbie in this field.
I wrote the code below as a simple example to start understanding this fascinating world, but I was surprised when I looked at the execution times.
When I run the code on the host, the execution time is 0.08 s. When I run it with the offload and omp parallel for pragmas added, the execution time increases to 9 s!
I compiled both versions with -O3 optimization.
Is there something I am missing?
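For reference, I built both versions with the Intel compiler using commands of this form (the file names here are placeholders):

icc -O3 test.c -o test_host            # host-only version, no pragmas
icc -O3 -openmp test.c -o test_mic     # version with the offload + OpenMP pragmas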
Thanks for your help,
Flavio
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* offload buffer management */
#define ALLOC  alloc_if(1) free_if(0)   /* allocate on MIC, keep after offload */
#define RETAIN alloc_if(0) free_if(0)   /* reuse the existing MIC buffer */
#define FREE   alloc_if(0) free_if(1)   /* free the MIC buffer */
#define LD long double
#define MAX 100000

int main(int argc, char **argv)
{
    int i, j;
    LD *M = NULL;
    __declspec(target(mic)) int cycles = 240;

    printf("array length: %d\n", cycles);

    /* start time: encode the time of day as "seconds,microseconds" */
    char fmt[64], buf1[64], buf2[64];
    struct timeval tv;
    struct tm *tm;
    gettimeofday(&tv, NULL);
    if ((tm = localtime(&tv.tv_sec)) != NULL) {
        strftime(fmt, sizeof fmt, "((%H*3600)+(%M*60)+%S,%%06u)", tm);
        snprintf(buf1, sizeof buf1, fmt, tv.tv_usec);
    }

    /* array creation */
    M = (LD *)calloc(cycles, sizeof(LD));

    /* allocating space on the MIC */
    #pragma offload target(mic) in(M:length(cycles) ALLOC)
    {}

    for (i = 0; i < MAX; i++) {
        #pragma offload target(mic) inout(M:length(cycles) RETAIN) \
                                    in(cycles)
        {
            #pragma omp parallel for private(j)
            #pragma ivdep
            for (j = 0; j < cycles; j++)
                M[j] += 1;
        } /* offload */
    } /* for */

    /* freeing space on the MIC */
    #pragma offload target(mic) nocopy(M:length(0) FREE)
    {}

    printf("number of cycles: %LG\n", M[0]);

    /* final time */
    gettimeofday(&tv, NULL);
    if ((tm = localtime(&tv.tv_sec)) != NULL) {
        strftime(fmt, sizeof fmt, "=((%H*3600)+(%M*60)+%S,%%06u)", tm);
        snprintf(buf2, sizeof buf2, fmt, tv.tv_usec);
        printf("%s-%s\n", buf2, buf1);
    }

    free(M);
    return 0;
} /* main */
Some issues:
1) Any section of code you want to consider for parallelization should have a sufficient amount of work per thread that the gain from running in parallel at least exceeds the overhead of setting up the parallel region. In your sample program above, assuming you are running on 60 cores with 4 threads per core (240 threads), your inner loop of 240 cycles reduces the work per thread to a single issue of "M[j] += 1", which is nowhere near enough to amortize that overhead.
2) long double on your host processor may be 8 bytes and supported in hardware, whereas long double on the MIC may be 16 bytes and not supported by hardware (it requires software emulation).
3) Depending on the compiler's optimization whims, the "M[j] += 1" nested loops in the host version may be collapsed (e.g., into a single "M[j] += MAX"), in which case the host timing is not measuring the work you think it is.
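If you want to confirm points 1 and 2 directly, a quick check (a sketch, untested on my end) is to print the thread count and the long double size from both sides:

/* host side */
printf("sizeof(long double) on host: %d\n", (int)sizeof(long double));

/* coprocessor side; printf output inside an offload is proxied back to the host */
#pragma offload target(mic)
{
    printf("sizeof(long double) on MIC: %d\n", (int)sizeof(long double));
#ifdef _OPENMP
    printf("MIC threads available: %d\n", omp_get_max_threads());
#endif
}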
I suggest you perform two experiments:
a) Define LD as "double", set cycles to 2400000, and set MAX to 10 (in other words, reduce the outer loop by a factor of 10,000 and increase the inner loop by the same factor)
b) Define LD as "long double", set cycles to 2400000, set MAX to 10, and add printf("sizeof(LD) = %d\n", (int)sizeof(LD)); inside the offload immediately before your "#pragma omp..."
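In code, the changes amount to the following (a sketch of just the affected lines, not a complete program):

#define LD double            /* "long double" for experiment (b) */
#define MAX 10               /* outer loop reduced by a factor of 10,000 */

__declspec(target(mic)) int cycles = 2400000;  /* inner loop increased by the same factor */

#pragma offload target(mic) inout(M:length(cycles) RETAIN) in(cycles)
{
    printf("sizeof(LD) = %d\n", (int)sizeof(LD));  /* the experiment (b) check */
    #pragma omp parallel for private(j)
    for (j = 0; j < cycles; j++)
        M[j] += 1;
}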
Then post your results back here.
Jim Dempsey
Hi Jim,
thank you very much for your quick reply and for your help.
I have performed the experiments you suggested, and here are the results:
a) - serial CPU: 0.024275 s
array length: 2400000
number of cycles: 7.87677E-1759
- offload (240 threads): 0.668659 s
array length: 2400000
number of cycles: 1.02798E-2264
b) array length: 2400000
number of cycles: 10
0.700117 s
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
Kind regards,
Flavio
What is the size of LD on the CPU side?
Looking at the execution times for doubles, the offload was ~27.5 times slower (0.668659 s vs 0.024275 s).
I think I know what is going on: your timed interval includes the very first offload, and that first offload includes the time to copy the MIC version of the application onto the coprocessor and to establish the initial OpenMP thread pool.
I suggest you place a loop around the timed interval, say 3 iterations, and observe the per-iteration execution times to see what the overhead of the application copy plus the OpenMP pool initialization is. Do the same for the host version of the program so that you can factor out the host's OpenMP pool initialization as well.
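In outline (a sketch; omp_get_wtime() from omp.h is a convenient wall-clock source, though your gettimeofday pair works just as well):

double t0, t1;
int rep;
for (rep = 0; rep < 3; rep++) {
    t0 = omp_get_wtime();
    /* ... the offload loop being measured ... */
    t1 = omp_get_wtime();
    printf("rep %d: %f s\n", rep, t1 - t0);
}
/* rep 0 absorbs the one-time MIC application copy and OpenMP pool creation;
   reps 1 and 2 should show the steady-state time. */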
Jim Dempsey
