Hi Charles Congdon,
I am using a Xeon Phi coprocessor for a project. When I run this function on the CPU (24 cores), it takes about 8.5 ms, but on the Phi it takes about 700 ms. To figure out why, I print the time at the beginning of the OpenMP part of the offload code:
#pragma omp parallel for
for( size_t y = 0; y < 192; y++)
{
    tmp[omp_get_thread_num()] = When();
    /* ... */
}

__attribute__((target(mic))) double When()
{
#ifndef _WIN32
    static struct timeval tp;
    gettimeofday(&tp, NULL);
    double t = (double)tp.tv_sec;
    double t1 = (double) tp.tv_usec;
    return (t + t1 * 1e-6);
#else
    clock_t start = clock( );
    double duration = (double)start / CLOCKS_PER_SEC;
    return duration;
#endif
}

The resulting times are:
id = 1, time = 1408126111.008337
id = 2, time = 1408126111.008337
id = 3, time = 1408126111.000000
id = 4, time = 1408126111.007646
id = 5, time = 1408126111.000000
id = 6, time = 1408126111.000000
id = 7, time = 1408126111.008337
id = 8, time = 1408126111.007646
id = 9, time = 1408126119.337340
id = 10, time = 1408126111.000007
id = 11, time = 1408126111.000000
id = 12, time = 1408126119.337340
id = 13, time = 1408126111.000007
id = 14, time = 1408126111.000007
id = 15, time = 1408126111.000000
id = 16, time = 1408126119.337340
id = 17, time = 1408126111.008337
id = 18, time = 1408126111.000007
id = 19, time = 1408126111.000007
id = 20, time = 1408126111.008337
id = 21, time = 1408126111.008337
id = 22, time = 1408126111.008337
id = 23, time = 1408126111.015979
id = 24, time = 1408126111.000015
id = 25, time = 1408126111.007645
id = 26, time = 1408126111.015983
id = 27, time = 1408126111.000000
id = 28, time = 1408126119.337340
id = 29, time = 1408126111.000000
id = 30, time = 1408126111.007644
id = 31, time = 1408126111.008337
id = 32, time = 1408126111.015964
id = 33, time = 1408126111.000007
id = 34, time = 1408126111.000007
id = 35, time = 1408126111.000000
id = 36, time = 1408126111.000007
id = 37, time = 1408126111.000000
id = 38, time = 1408126111.000000
id = 39, time = 1408126111.000000
id = 40, time = 1408126111.007639
id = 41, time = 1408126111.000000
id = 42, time = 1408126111.000000
id = 43, time = 1408126111.008337
id = 44, time = 1408126111.007643
id = 45, time = 1408126111.007644
id = 46, time = 1408126111.007644
id = 47, time = 1408126111.015980
id = 48, time = 1408126111.008337
id = 49, time = 1408126111.000000
id = 50, time = 1408126111.015983
id = 51, time = 1408126111.000007
id = 52, time = 1408126119.337340
id = 53, time = 1408126111.007640
id = 54, time = 1408126111.000000
id = 55, time = 1408126111.000000
id = 56, time = 1408126111.000000
id = 57, time = 1408126111.008337
id = 58, time = 1408126111.000007
id = 59, time = 1408126111.008337
id = 60, time = 1408126111.015960
id = 61, time = 1408126119.337340
id = 62, time = 1408126111.008337
id = 63, time = 1408126111.000000
id = 64, time = 1408126111.007646
id = 65, time = 1408126111.000000
id = 66, time = 1408126111.000000
id = 67, time = 1408126111.008337
id = 68, time = 1408126111.007646
id = 69, time = 1408126119.337340
id = 70, time = 1408126111.000007
id = 71, time = 1408126111.000000
id = 72, time = 1408126111.008337
id = 73, time = 1408126111.000007
id = 74, time = 1408126111.000000
id = 75, time = 1408126111.008337
id = 76, time = 1408126111.008337
id = 77, time = 1408126111.008337
id = 78, time = 1408126111.000007
id = 79, time = 1408126111.000007
id = 80, time = 1408126119.337340
id = 81, time = 1408126119.337340
id = 82, time = 1408126111.008337
id = 83, time = 1408126111.015979
id = 84, time = 1408126111.015964
id = 85, time = 1408126111.007645
id = 86, time = 1408126111.015983
id = 87, time = 1408126111.000000
id = 88, time = 1408126111.008337
id = 89, time = 1408126111.000007
id = 90, time = 1408126111.000000
id = 91, time = 1408126111.008337
id = 92, time = 1408126111.000015
id = 93, time = 1408126111.007646
id = 94, time = 1408126111.000007
id = 95, time = 1408126111.000000
id = 96, time = 1408126111.007646
id = 97, time = 1408126111.000000
id = 98, time = 1408126111.000000
id = 99, time = 1408126111.000000
id = 100, time = 1408126111.007639
id = 101, time = 1408126111.000000
id = 102, time = 1408126111.000000
id = 103, time = 1408126111.008337
id = 104, time = 1408126111.007643
id = 105, time = 1408126111.007644
id = 106, time = 1408126111.007644
id = 107, time = 1408126111.015980
id = 108, time = 1408126119.337340
id = 109, time = 1408126111.000000
id = 110, time = 1408126111.015983
id = 111, time = 1408126111.000007
id = 112, time = 1408126111.008337
id = 113, time = 1408126111.007640
id = 114, time = 1408126111.000000
id = 115, time = 1408126111.000000
id = 116, time = 1408126111.000000
id = 117, time = 1408126111.008337
id = 118, time = 1408126111.000007
id = 119, time = 1408126111.008337
id = 120, time = 1408126119.337340
id = 121, time = 1408126119.337340
id = 122, time = 1408126111.008337
id = 123, time = 1408126111.000000
id = 124, time = 1408126111.007646
id = 125, time = 1408126111.000000
id = 126, time = 1408126111.000000
id = 127, time = 1408126111.008337
id = 128, time = 1408126111.007646
id = 129, time = 1408126119.337340
id = 130, time = 1408126111.000007
id = 131, time = 1408126111.000000
id = 132, time = 1408126111.008337
id = 133, time = 1408126111.000007
id = 134, time = 1408126111.000007
id = 135, time = 1408126111.008337
id = 136, time = 1408126111.008337
id = 137, time = 1408126119.337340
id = 138, time = 1408126111.007646
id = 139, time = 1408126111.007646
id = 140, time = 1408126111.008337
id = 141, time = 1408126111.008337
id = 142, time = 1408126111.008337
id = 143, time = 1408126111.015979
id = 144, time = 1408126111.000015
id = 145, time = 1408126111.007645
id = 146, time = 1408126111.015983
id = 147, time = 1408126111.000000
id = 148, time = 1408126111.008337
id = 149, time = 1408126111.000000
id = 150, time = 1408126111.000000
id = 151, time = 1408126111.000000
id = 152, time = 1408126111.000015
id = 153, time = 1408126111.000007
id = 154, time = 1408126111.000007
id = 155, time = 1408126111.000000
id = 156, time = 1408126111.000007
id = 157, time = 1408126111.000000
id = 158, time = 1408126111.000007
id = 159, time = 1408126111.000000
id = 160, time = 1408126111.007639
id = 161, time = 1408126111.000000
id = 162, time = 1408126111.000000
id = 163, time = 1408126111.000000
id = 164, time = 1408126111.007643
id = 165, time = 1408126111.007644
id = 166, time = 1408126111.007644
id = 167, time = 1408126111.015980
id = 168, time = 1408126111.008337
id = 169, time = 1408126111.000007
id = 170, time = 1408126111.015983
id = 171, time = 1408126111.000000
id = 172, time = 1408126111.008337
id = 173, time = 1408126111.007640
id = 174, time = 1408126111.000000
id = 175, time = 1408126111.000000
id = 176, time = 1408126111.000000
id = 177, time = 1408126111.008337
id = 178, time = 1408126111.000007
id = 179, time = 1408126111.008337
id = 180, time = 1408126111.000007
id = 181, time = 1408126111.008337
id = 182, time = 1408126111.008337
id = 183, time = 1408126111.000000
id = 184, time = 1408126111.007646
id = 185, time = 1408126111.000000
id = 186, time = 1408126111.000000
id = 187, time = 1408126111.008337
id = 188, time = 1408126111.007646
id = 189, time = 1408126119.337340
id = 190, time = 1408126111.000007
id = 191, time = 1408126111.000000
You can see that at ids 9, 28, 61, 69, 121, ... the time is 1408126119.337340, which is obviously wrong. Could you tell me what happened with these timestamps? If possible, could you also let me know why the function is so much slower on the Phi?
My email is Xin.Chen@hermes-microvision.com. I really need your help!
Xin
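One possible contributor to the odd timestamps, offered as an assumption rather than a confirmed diagnosis: `When()` declares `tp` as `static`, so every OpenMP thread writes into the same `struct timeval`, and one thread can read `tv_sec` and `tv_usec` while another thread is updating them. A minimal sketch of a variant with no shared state follows; the name `when_local` is hypothetical and not part of the posted code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, like When() in the post, but with the
   timeval on the caller's stack so that concurrent calls from several
   threads cannot overwrite each other's reading. */
double when_local(void)
{
    struct timeval tp;                  /* not static: one copy per call */
    int rc = gettimeofday(&tp, NULL);
    assert(rc == 0);
    (void)rc;                           /* silence unused warning under NDEBUG */
    return (double)tp.tv_sec + (double)tp.tv_usec * 1e-6;
}
```

For offload use, the function would still need the `__attribute__((target(mic)))` qualifier shown in the post. Note also that the posted loop stores into `tmp[omp_get_thread_num()]`, so the printed index is a thread number, not the loop index `y`.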
Hi Charles and Jeff,
Thank you for your suggestions and for the resources you referred me to.
My work is to design an imaging system, not a numerical computation library. Of course, performance on the Phi is very important, but right now I want to transfer data from host to device at the 5 GB per second mentioned in your document.
I designed a system based on this number. However, I cannot reach even ten percent of it with any method I have tried. This breaks my design and has stalled my project, because data transfer slows down our system. The performance of the whole system is lower than what I designed for.
Let me simplify my question: could you give me a way to reach 5 GB/s bandwidth for data copies between host and device?
Thanks a lot.
Xin,
You might want to read through the white papers at: http://www.colfax-intl.com/nd/resources/whitepapers.aspx
One of them may help you improve your performance.
Jim Dempsey
To answer this question: "Let me simplify my question: could you give me a way to reach 5 GB/s bandwidth for data copies between host and device?"
1. Align the CPU data on at least a 64-byte boundary.
2. Preallocate the buffer on the MIC into which you will transfer the data.
3. Transfer as much data in a single transfer as is feasible.
Using these rules, you can achieve more than 5 GiB/s of bandwidth for transfers larger than about 1 MB. The attached program is an example.
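The three rules above can be sketched as follows. This is an illustration under stated assumptions, not the attached program: `alloc_aligned`, `is_aligned`, and `transfer_preallocated` are hypothetical names, C11 `aligned_alloc` stands in for the `_mm_malloc` calls used elsewhere in this thread so the sketch builds without Intel headers, and `__INTEL_OFFLOAD` is assumed to be the macro the Intel compiler defines when offload is enabled. The pragma forms are copied from the code posted later in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Rule 1: host buffer aligned to `alignment` bytes (a power of two).
   Returns NULL on failure; release with free(). */
void *alloc_aligned(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* C11 aligned_alloc expects size to be a multiple of alignment. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}

/* Returns nonzero when p sits on an `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Rules 2 and 3: allocate the coprocessor buffer once, then move the
   whole buffer in a single transfer that reuses that allocation.
   Compiles to a no-op when offload support is not available. */
void transfer_preallocated(char *buf, size_t n)
{
#ifdef __INTEL_OFFLOAD
    /* First transfer also allocates on the coprocessor and keeps the buffer. */
    #pragma offload_transfer target(mic:0) in(buf:length(n) alloc_if(1) free_if(0))
    /* Later transfers reuse the existing allocation. */
    #pragma offload_transfer target(mic:0) in(buf:length(n) alloc_if(0) free_if(0))
    /* Free the coprocessor buffer when it is no longer needed. */
    #pragma offload_transfer target(mic:0) nocopy(buf:length(n) alloc_if(0) free_if(1))
#else
    (void)buf;
    (void)n;
#endif
}
```

Only the second transfer, which reuses the allocation, is representative of steady-state bandwidth; the first one also pays for the allocation on the coprocessor.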
Hi Rajiv Deodhar,
Thank you for sharing your code. Yes, you are right if we only transfer data. Frankly speaking, I ran similar tests before. But when I run functions in offload mode, I see a period of latency. I think it should be counted as part of the data transfer time, which lowers the effective bandwidth.
I am not an expert on the Phi, but I will try my best to understand it. Could you give me complete sample code that I can follow, instead of a code fragment? I'm working on a real system, not writing a paper.
The following is the offload report:
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 116
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 0.031176(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 9437184 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.000147(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 0 (bytes)
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 117
[Offload] [MIC 0] [Tag] Tag 1
[Offload] [HOST] [Tag 1] [CPU Time] 0.031529(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data] 9437184 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time] 0.000036(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data] 0 (bytes)
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 118
[Offload] [MIC 0] [Tag] Tag 2
[Offload] [HOST] [Tag 2] [CPU Time] 0.042867(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time] 0.000031(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data] 9437184 (bytes)
Transfer data with allocation =0.105715
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 121
[Offload] [MIC 0] [Tag] Tag 3
[Offload] [HOST] [Tag 3] [CPU Time] 0.001436(seconds)
[Offload] [MIC 0] [Tag 3] [CPU->MIC Data] 9437184 (bytes)
[Offload] [MIC 0] [Tag 3] [MIC Time] 0.000000(seconds)
[Offload] [MIC 0] [Tag 3] [MIC->CPU Data] 0 (bytes)
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 122
[Offload] [MIC 0] [Tag] Tag 4
[Offload] [HOST] [Tag 4] [CPU Time] 0.001418(seconds)
[Offload] [MIC 0] [Tag 4] [CPU->MIC Data] 9437184 (bytes)
[Offload] [MIC 0] [Tag 4] [MIC Time] 0.000000(seconds)
[Offload] [MIC 0] [Tag 4] [MIC->CPU Data] 0 (bytes)
Transfer data =0.002895
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 132
[Offload] [MIC 0] [Tag] Tag 5
[Offload] [HOST] [Tag 5] [CPU Time] 1.562599(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data] 8 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time] 1.443725(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data] 8 (bytes)
Runing at 0 =1.562582
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 132
[Offload] [MIC 0] [Tag] Tag 6
[Offload] [HOST] [Tag 6] [CPU Time] 1.282831(seconds)
[Offload] [MIC 0] [Tag 6] [CPU->MIC Data] 8 (bytes)
[Offload] [MIC 0] [Tag 6] [MIC Time] 1.185438(seconds)
[Offload] [MIC 0] [Tag 6] [MIC->CPU Data] 8 (bytes)
Runing at 1 =1.282819
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 132
[Offload] [MIC 0] [Tag] Tag 7
[Offload] [HOST] [Tag 7] [CPU Time] 1.276666(seconds)
[Offload] [MIC 0] [Tag 7] [CPU->MIC Data] 8 (bytes)
[Offload] [MIC 0] [Tag 7] [MIC Time] 1.179740(seconds)
[Offload] [MIC 0] [Tag 7] [MIC->CPU Data] 8 (bytes)
Runing at 2 =1.276653
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 132
[Offload] [MIC 0] [Tag] Tag 8
[Offload] [HOST] [Tag 8] [CPU Time] 1.278277(seconds)
[Offload] [MIC 0] [Tag 8] [CPU->MIC Data] 8 (bytes)
[Offload] [MIC 0] [Tag 8] [MIC Time] 1.181233(seconds)
[Offload] [MIC 0] [Tag 8] [MIC->CPU Data] 8 (bytes)
Runing at 3 =1.278265
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 132
[Offload] [MIC 0] [Tag] Tag 9
[Offload] [HOST] [Tag 9] [CPU Time] 1.272324(seconds)
[Offload] [MIC 0] [Tag 9] [CPU->MIC Data] 8 (bytes)
[Offload] [MIC 0] [Tag 9] [MIC Time] 1.175730(seconds)
[Offload] [MIC 0] [Tag 9] [MIC->CPU Data] 8 (bytes)
Runing at 4 =1.272312
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 171
[Offload] [MIC 0] [Tag] Tag 10
[Offload] [HOST] [Tag 10] [CPU Time] 0.001377(seconds)
[Offload] [MIC 0] [Tag 10] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 10] [MIC Time] 0.000000(seconds)
[Offload] [MIC 0] [Tag 10] [MIC->CPU Data] 9437184 (bytes)
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 174
[Offload] [MIC 0] [Tag] Tag 11
[Offload] [HOST] [Tag 11] [CPU Time] 0.001388(seconds)
[Offload] [MIC 0] [Tag 11] [CPU->MIC Data] 24 (bytes)
[Offload] [MIC 0] [Tag 11] [MIC Time] 0.000073(seconds)
[Offload] [MIC 0] [Tag 11] [MIC->CPU Data] 0 (bytes)
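As a rough cross-check, the transfer entries in the report above can be converted to effective bandwidth. The sketch assumes that the host-side "CPU Time" covers the whole transfer; `bandwidth_gib_per_s` is a hypothetical helper name. With the reported numbers, Tag 3 (9437184 bytes in 0.001436 s, buffer already allocated) works out to about 6.1 GiB/s, while Tag 0 (the same size in 0.031176 s, including allocation) works out to about 0.28 GiB/s.

```c
#include <assert.h>
#include <stddef.h>

/* Effective bandwidth in GiB/s for `bytes` moved in `seconds`. */
double bandwidth_gib_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

The gap between the two figures is consistent with the advice to preallocate the coprocessor buffer before timing transfers.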
The code is:
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include <sys/shm.h>
#include <omp.h>
#include <stdlib.h>
#ifdef OFFLOAD
#include <offload.h>
#endif

#if WIN32
#define ALIGN(x) __declspec(align(x))
#else
#define ALIGN(x) __attribute__ ((aligned (x)))
#endif

#define LOOPNUM 200

//CPU buffer
__declspec(target(mic)) char *imgDev1;
__declspec(target(mic)) static char *imgDev2;
__declspec(target(mic)) static char *imgOut;

/* buffer alignment */
static int align = 2*1024*1024;

#ifdef OFFLOAD
__attribute__((target (mic))) double When()
{
#ifndef _WIN32
    static struct timeval tp;
    gettimeofday(&tp, NULL);
    double t = (double)tp.tv_sec;
    double t1 = (double) tp.tv_usec;
    return (t + t1 * 1e-6);
#else
    clock_t start = clock( );
    double duration = (double)start / CLOCKS_PER_SEC;
    return duration;
#endif
}
#endif

double WhenCPU()
{
#ifndef _WIN32
    static struct timeval tp;
    gettimeofday(&tp, NULL);
    double t = (double)tp.tv_sec;
    double t1 = (double) tp.tv_usec;
    return (t + t1 * 1e-6);
#else
    clock_t start = clock( );
    double duration = (double)start / CLOCKS_PER_SEC;
    return duration;
#endif
}

void ImageAdd(unsigned char *img1, unsigned char *img2, unsigned char* outImage, unsigned int width, unsigned int height)
{
    //OpenMP test part
    double t0 = WhenCPU();
    const size_t iCPUNum = omp_get_max_threads();
    const size_t ySegment = height/iCPUNum;
    printf("LoopNum = %d\n", LOOPNUM);

    #pragma omp parallel for
    for (size_t n = 0; n < iCPUNum; n++)
    {
        // NOTE: n * ySegment was probably intended here
        const size_t starty = n * width;
        size_t endy = starty + ySegment;
        // NOTE: '=' assigns to n; '==' was probably intended
        if(n = (iCPUNum -1)) endy = height;
        for (size_t y = starty; y < endy; y++)
        {
            for (size_t nn = 0; nn < LOOPNUM; nn++)
            {
                for (size_t x = 0; x < width; x++)
                {
                    outImage[y*width + x] = img1[y*width+x]*0.5f + img2[y*width+x]*0.5f;
                }
            }
        }
    }//end of n<iCPUNum
    double t1 = WhenCPU();
    printf("OpenMP duration time at CPU =%f\n", t1-t0);

    unsigned int dataSize= width * height * sizeof(char);
    imgDev1 = (char*)_mm_malloc(dataSize+align, align);
    if (imgDev1 ==NULL)
    {
        printf("Cannot open imgDev1 memory\n");
        abort();
    }
    imgDev2 = (char*)_mm_malloc(dataSize+align, align);
    if (imgDev2 ==NULL)
    {
        printf("Cannot open imgDev2 memory\n");
        abort();
    }
    imgOut = (char*)_mm_malloc(dataSize+align, align);
    if (imgOut ==NULL)
    {
        printf("Cannot open imgOut memory\n");
        abort();
    }
    memcpy(imgDev1, img1, dataSize);
    memcpy(imgDev2, img2, dataSize);

    double t2 = WhenCPU();
    #pragma offload_transfer target(mic:0) in(imgDev1:length(dataSize) alloc_if(1) free_if(0))
    #pragma offload_transfer target(mic:0) in(imgDev2:length(dataSize) alloc_if(1) free_if(0))
    #pragma offload_transfer target(mic:0) out(imgOut:length(dataSize) alloc_if(1) free_if(0))
    double t3 = WhenCPU();
    printf("Transfer data with allocation =%f\n", t3-t2);

    #pragma offload_transfer target(mic:0) in(imgDev1:length(dataSize) alloc_if(0) free_if(0))
    #pragma offload_transfer target(mic:0) in(imgDev2:length(dataSize) alloc_if(0) free_if(0))
    double t4 = WhenCPU();
    printf("Transfer data =%f\n", t4-t3);

    //phi test part
    for (size_t n = 0; n < 5; n++)
    {
        double t40 = WhenCPU();
        #pragma offload target(mic:0) nocopy(imgDev1,imgDev2, imgOut:length(dataSize) alloc_if(0) free_if(0))
        {
            const size_t iCPUNum = omp_get_max_threads();
            const size_t ySegment = height/iCPUNum;
            #pragma omp parallel for
            for (size_t n = 0; n < iCPUNum; n++)
            {
                // NOTE: same partitioning as the CPU loop above, with the same two suspect lines
                const size_t starty = n * width;
                size_t endy = starty + ySegment;
                if(n = (iCPUNum -1)) endy = height;
                unsigned char tmpArray1[width];
                unsigned char tmpArray2[width];
                unsigned char tmpArrayout[width];
                for (size_t y = starty; y < endy; y++)
                {
                    memcpy(tmpArray1, &imgDev1[y*width], width*sizeof(char));
                    memcpy(tmpArray2, &imgDev2[y*width], width*sizeof(char));
                    for (size_t nn = 0; nn < LOOPNUM; nn++)
                    {
                        for (size_t x = 0; x < width; x++)
                        {
                            //' imgOut[y*width + x] = imgDev1[y*width+x]*0.5f + imgDev2[y*width+x]*0.5f;
                            tmpArrayout[x] = tmpArray1[x]*0.5f + tmpArray2[x]*0.5f;
                        }
                    }
                    memcpy(&imgOut[y*width], tmpArrayout, width*sizeof(char));
                }
            }//end of n<iCPUNum
        }
        double t50 = WhenCPU();
        printf("Runing at %d =%f\n", n, t50-t40);
    }
    //printf("Runing at %d =%f\n", n, t5-t4);

    #pragma offload_transfer target(mic:0) out(imgOut:length(dataSize) alloc_if(0) free_if(0))
    memcpy(outImage, imgOut, dataSize);
    #pragma offload_transfer target(mic:0) nocopy(imgDev1,imgDev2, imgOut:length(dataSize) alloc_if(0) free_if(1))

    if (imgDev1 != NULL) _mm_free(imgDev1);
    if (imgDev2 != NULL) _mm_free(imgDev2);
    if (imgOut != NULL) _mm_free(imgOut);
}
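Two lines in the row partitioning of the posted `ImageAdd` look unintended: `starty = n * width` scales by the row width rather than by the segment height, and `if(n = (iCPUNum -1))` assigns to the loop variable instead of comparing it. A minimal sketch of a partition that covers each row exactly once, assuming that was the intent (`row_range` and `rows_for_worker` are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

/* Half-open row range [start, end) assigned to one worker. */
typedef struct {
    size_t start;
    size_t end;
} row_range;

/* Split `height` rows into `workers` contiguous segments.
   The last worker also takes the remainder rows. */
row_range rows_for_worker(size_t worker, size_t workers, size_t height)
{
    assert(workers > 0 && worker < workers);
    const size_t segment = height / workers;
    row_range r;
    r.start = worker * segment;
    r.end = (worker == workers - 1) ? height : r.start + segment;
    return r;
}
```

Inside the parallel loop, each iteration would call `rows_for_worker(n, iCPUNum, height)` and loop `y` from `start` to `end`. When `height` is not a multiple of the thread count, the last worker gets noticeably more rows, so looping directly over rows with `#pragma omp parallel for` may balance the work better.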
@Rajiv Deodhar: thanks for the sample program. I ran it on two different boxes and got very different results:
HostA: Xeon E5 2695 v2 w/ Xeon Phi 7100: I consistently get 6+ GB/s for both send and receive (for buffers that are large enough).
HostB: Xeon E5 2620 w/ Xeon Phi 5100: for send I get a consistent 6+ GB/s, just like HostA. However, for receive I get very different results depending on whether the env var MIC_USE_2MB_BUFFERS=2 is set. With it set, I get 6+ GB/s receive speed as well. If it is NOT set, the receive speed varies wildly:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)    Send(GiB/sec)    Receive(GiB/sec)
1024           0.12             0.14
2048           0.24             0.30
4096           0.44             0.59
8192           0.85             1.07
16384          1.50             1.79
32768          2.34             2.83
65536          3.49             3.84
131072         4.39             4.88
262144         5.21             5.59
524288         5.72             6.00
1048576        5.99             6.08
1048576        6.04             6.09
2097152        5.91             5.99
4194304        6.14             6.20
8388608        5.95             1.59
16777216       6.10             1.02
33554432       6.24             0.91
67108864       6.31             0.66
134217728      6.36             0.51
268435456      6.39             0.37
536870912      6.40             0.36
1073741824     6.41             0.37
Now this is a nice reminder for me to always set this magic env var, but what I do not understand is why the Xeon Phi 7100 is not affected by this. The software environments are 99% identical (EL 6 clone, MPSS 3.3.2 software stack).
Hopefully you can shed some light on this :)
Hi Jan,
Thank you for your comments and kind reminder. I have set the environment variable.
Runing at 3 =2.294204
[Offload] [MIC 0] [File] imageaddtrans.cpp
[Offload] [MIC 0] [Line] 133
[Offload] [MIC 0] [Tag] Tag 9
[Offload] [HOST] [Tag 9] [State] Start Offload
[Offload] [HOST] [Tag 9] [State] Initialize function __offload_entry_imageaddtrans_cpp_133ImageAdd_6aa841318d398d4e5c43322e08aeba3dicpc54729644sB3YKY
[Offload] [HOST] [Tag 9] [State] Send pointer data
[Offload] [HOST] [Tag 9] [State] CPU->MIC pointer data 0
[Offload] [HOST] [Tag 9] [State] Gather copyin data
[Offload] [HOST] [Tag 9] [State] CPU->MIC copyin data 8
[Offload] [HOST] [Tag 9] [State] Compute task on MIC
[Offload] [HOST] [Tag 9] [State] Receive pointer data
[Offload] [HOST] [Tag 9] [State] MIC->CPU pointer data 0
[Offload] [MIC 0] [Tag 8] [State] MIC->CPU copyout data 0
[Offload] [MIC 0] [Tag 9] [State] Start target function __offload_entry_imageaddtrans_cpp_133ImageAdd_6aa841318d398d4e5c43322e08aeba3dicpc54729644sB3YKY
[Offload] [MIC 0] [Tag 9] [Var] imgDev1_V$2 NOCOPY
[Offload] [MIC 0] [Tag 9] [Var] imgDev2_V$3 NOCOPY
[Offload] [MIC 0] [Tag 9] [Var] imgOut_V$4 NOCOPY
[Offload] [MIC 0] [Tag 9] [Var] height_1208_V$f IN
[Offload] [MIC 0] [Tag 9] [Var] width_1208_V$e IN
[Offload] [MIC 0] [Tag 9] [State] Scatter copyin data
offload
number of cores = 240
[Offload] [HOST] [Tag 9] [State] Scatter copyout data
[Offload] [HOST] [Tag 9] [CPU Time] 2.299723(seconds)
[Offload] [MIC 0] [Tag 9] [CPU->MIC Data] 8 (bytes)
[Offload] [MIC 0] [Tag 9] [MIC Time] 2.125183(seconds)
[Offload] [MIC 0] [Tag 9] [MIC->CPU Data] 0 (bytes)
The problem is still here. The CPU time is 2.299723 and the MIC time is 2.125183. Why are they different? The latency is about 0.175 sec. As you can see, there is almost no data transfer in this part (only 8 bytes).
Thank you again!
Xin
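The gap described above can be computed directly from the two report fields. This is a small sketch with a hypothetical helper name, `offload_overhead`. For Tag 9 in this report, 2.299723 - 2.125183 = 0.174540 s, which is about 7.6% of the host-side time.

```c
#include <assert.h>

/* Host-observed time minus coprocessor-reported time for one offload,
   i.e. the part of the call not spent computing on the MIC. */
double offload_overhead(double cpu_time, double mic_time)
{
    assert(cpu_time >= mic_time);
    return cpu_time - mic_time;
}

/* The same gap as a fraction of the host-observed time. */
double offload_overhead_fraction(double cpu_time, double mic_time)
{
    assert(cpu_time > 0.0);
    return offload_overhead(cpu_time, mic_time) / cpu_time;
}
```

The report alone does not say what this gap consists of. It only shows that the gap is present even though just 8 bytes are copied in.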
Hi Jan,
May I ask which version of the compiler you used on HostB, the machine where you got the poor results? Thank you.
Hi loc-nguyen,
HostB has the Intel compiler suite 2015 installed. I'm moving this to a new thread, as it is getting off-topic for this one.
Jan Just,
Please post the URL to your new forum post so that others reading this post can find the new thread.
Regards
--
Taylor
Xin,
Are you still in need of help? If so, please continue your thread.
Regards
--
Taylor