Software Archive
Read-only legacy content

Performance Problem on Phi

Chen__Xin
Beginner
4,350 Views

Hi Charles Congdon,

I am using the Phi coprocessor for a project. When I ran this function on the CPU with 24 cores, the time was about 8.5 msec, but it took about 700 msec on the Phi. To figure it out, I print the time at the beginning of the OpenMP part of the offload code:

#pragma omp parallel for
 for (size_t y = 0; y < 192; y++)
 {
   tmp[omp_get_thread_num()] = When();
   /* ... rest of the offload loop body ... */
 }

__attribute__((target(mic))) double When()
{
#ifndef _WIN32
 /* tp must be local: a static here is shared by all OpenMP threads
    and races when When() is called concurrently. */
 struct timeval tp;
 gettimeofday(&tp, NULL);
 double t = (double)tp.tv_sec;
 double t1 = (double)tp.tv_usec;
 return (t + t1 * 1e-6);
#else
 clock_t start = clock();
 double duration = (double)start / CLOCKS_PER_SEC;
 return duration;
#endif
}

 

The timing results are:

id = 1, time = 1408126111.008337

id = 2, time = 1408126111.008337

id = 3, time = 1408126111.000000

id = 4, time = 1408126111.007646

id = 5, time = 1408126111.000000

id = 6, time = 1408126111.000000

id = 7, time = 1408126111.008337

id = 8, time = 1408126111.007646

id = 9, time = 1408126119.337340

id = 10, time = 1408126111.000007

id = 11, time = 1408126111.000000

id = 12, time = 1408126119.337340

id = 13, time = 1408126111.000007

id = 14, time = 1408126111.000007

id = 15, time = 1408126111.000000

id = 16, time = 1408126119.337340

id = 17, time = 1408126111.008337

id = 18, time = 1408126111.000007

id = 19, time = 1408126111.000007

id = 20, time = 1408126111.008337

id = 21, time = 1408126111.008337

id = 22, time = 1408126111.008337

id = 23, time = 1408126111.015979

id = 24, time = 1408126111.000015

id = 25, time = 1408126111.007645

id = 26, time = 1408126111.015983

id = 27, time = 1408126111.000000

id = 28, time = 1408126119.337340

id = 29, time = 1408126111.000000

id = 30, time = 1408126111.007644

id = 31, time = 1408126111.008337

id = 32, time = 1408126111.015964

id = 33, time = 1408126111.000007

id = 34, time = 1408126111.000007

id = 35, time = 1408126111.000000

id = 36, time = 1408126111.000007

id = 37, time = 1408126111.000000

id = 38, time = 1408126111.000000

id = 39, time = 1408126111.000000

id = 40, time = 1408126111.007639

id = 41, time = 1408126111.000000

id = 42, time = 1408126111.000000

id = 43, time = 1408126111.008337

id = 44, time = 1408126111.007643

id = 45, time = 1408126111.007644

id = 46, time = 1408126111.007644

id = 47, time = 1408126111.015980

id = 48, time = 1408126111.008337

id = 49, time = 1408126111.000000

id = 50, time = 1408126111.015983

id = 51, time = 1408126111.000007

id = 52, time = 1408126119.337340

id = 53, time = 1408126111.007640

id = 54, time = 1408126111.000000

id = 55, time = 1408126111.000000

id = 56, time = 1408126111.000000

id = 57, time = 1408126111.008337

id = 58, time = 1408126111.000007

id = 59, time = 1408126111.008337

id = 60, time = 1408126111.015960

id = 61, time = 1408126119.337340

id = 62, time = 1408126111.008337

id = 63, time = 1408126111.000000

id = 64, time = 1408126111.007646

id = 65, time = 1408126111.000000

id = 66, time = 1408126111.000000

id = 67, time = 1408126111.008337

id = 68, time = 1408126111.007646

id = 69, time = 1408126119.337340

id = 70, time = 1408126111.000007

id = 71, time = 1408126111.000000

id = 72, time = 1408126111.008337

id = 73, time = 1408126111.000007

id = 74, time = 1408126111.000000

id = 75, time = 1408126111.008337

id = 76, time = 1408126111.008337

id = 77, time = 1408126111.008337

id = 78, time = 1408126111.000007

id = 79, time = 1408126111.000007

id = 80, time = 1408126119.337340

id = 81, time = 1408126119.337340

id = 82, time = 1408126111.008337

id = 83, time = 1408126111.015979

id = 84, time = 1408126111.015964

id = 85, time = 1408126111.007645

id = 86, time = 1408126111.015983

id = 87, time = 1408126111.000000

id = 88, time = 1408126111.008337

id = 89, time = 1408126111.000007

id = 90, time = 1408126111.000000

id = 91, time = 1408126111.008337

id = 92, time = 1408126111.000015

id = 93, time = 1408126111.007646

id = 94, time = 1408126111.000007

id = 95, time = 1408126111.000000

id = 96, time = 1408126111.007646

id = 97, time = 1408126111.000000

id = 98, time = 1408126111.000000

id = 99, time = 1408126111.000000

id = 100, time = 1408126111.007639

id = 101, time = 1408126111.000000

id = 102, time = 1408126111.000000

id = 103, time = 1408126111.008337

id = 104, time = 1408126111.007643

id = 105, time = 1408126111.007644

id = 106, time = 1408126111.007644

id = 107, time = 1408126111.015980

id = 108, time = 1408126119.337340

id = 109, time = 1408126111.000000

id = 110, time = 1408126111.015983

id = 111, time = 1408126111.000007

id = 112, time = 1408126111.008337

id = 113, time = 1408126111.007640

id = 114, time = 1408126111.000000

id = 115, time = 1408126111.000000

id = 116, time = 1408126111.000000

id = 117, time = 1408126111.008337

id = 118, time = 1408126111.000007

id = 119, time = 1408126111.008337

id = 120, time = 1408126119.337340

id = 121, time = 1408126119.337340

id = 122, time = 1408126111.008337

id = 123, time = 1408126111.000000

id = 124, time = 1408126111.007646

id = 125, time = 1408126111.000000

id = 126, time = 1408126111.000000

id = 127, time = 1408126111.008337

id = 128, time = 1408126111.007646

id = 129, time = 1408126119.337340

id = 130, time = 1408126111.000007

id = 131, time = 1408126111.000000

id = 132, time = 1408126111.008337

id = 133, time = 1408126111.000007

id = 134, time = 1408126111.000007

id = 135, time = 1408126111.008337

id = 136, time = 1408126111.008337

id = 137, time = 1408126119.337340

id = 138, time = 1408126111.007646

id = 139, time = 1408126111.007646

id = 140, time = 1408126111.008337

id = 141, time = 1408126111.008337

id = 142, time = 1408126111.008337

id = 143, time = 1408126111.015979

id = 144, time = 1408126111.000015

id = 145, time = 1408126111.007645

id = 146, time = 1408126111.015983

id = 147, time = 1408126111.000000

id = 148, time = 1408126111.008337

id = 149, time = 1408126111.000000

id = 150, time = 1408126111.000000

id = 151, time = 1408126111.000000

id = 152, time = 1408126111.000015

id = 153, time = 1408126111.000007

id = 154, time = 1408126111.000007

id = 155, time = 1408126111.000000

id = 156, time = 1408126111.000007

id = 157, time = 1408126111.000000

id = 158, time = 1408126111.000007

id = 159, time = 1408126111.000000

id = 160, time = 1408126111.007639

id = 161, time = 1408126111.000000

id = 162, time = 1408126111.000000

id = 163, time = 1408126111.000000

id = 164, time = 1408126111.007643

id = 165, time = 1408126111.007644

id = 166, time = 1408126111.007644

id = 167, time = 1408126111.015980

id = 168, time = 1408126111.008337

id = 169, time = 1408126111.000007

id = 170, time = 1408126111.015983

id = 171, time = 1408126111.000000

id = 172, time = 1408126111.008337

id = 173, time = 1408126111.007640

id = 174, time = 1408126111.000000

id = 175, time = 1408126111.000000

id = 176, time = 1408126111.000000

id = 177, time = 1408126111.008337

id = 178, time = 1408126111.000007

id = 179, time = 1408126111.008337

id = 180, time = 1408126111.000007

id = 181, time = 1408126111.008337

id = 182, time = 1408126111.008337

id = 183, time = 1408126111.000000

id = 184, time = 1408126111.007646

id = 185, time = 1408126111.000000

id = 186, time = 1408126111.000000

id = 187, time = 1408126111.008337

id = 188, time = 1408126111.007646

id = 189, time = 1408126119.337340

id = 190, time = 1408126111.000007

id = 191, time = 1408126111.000000

 

You can see that at id 9, 28, 61, 69, 121, and several others, the time is 1408126119.337340. Obviously, that is wrong. Could you tell me what happened with these timestamps? And if possible, could you let me know why the function is so much slower on Phi?

My email is Xin.Chen@hermes-microvision.com. I really need your help!

 

Xin

31 Replies
Chen__Xin
Beginner

Hi Charles and Jeff,

Thank you for your suggestions and the resources you referred me to.

My work is to design an imaging system, not a numerical computation library. Of course, performance on Phi is very important, but currently I need to transfer data from host to device at the 5 GB per second mentioned in your document.

 

I designed a system based on this number. However, I cannot reach even ten percent of it with any method I have tried. This breaks my design and stalls my project, because the slow data transfer drags everything down: the performance of the whole system is lower than my design target.

 

Let me simplify my question: could you give me a way to reach 5 GB/s bandwidth for data copies between host and device?

 

Thanks a lot.

 

jimdempseyatthecove
Honored Contributor III

Xin,

You might want to read through the white papers at: http://www.colfax-intl.com/nd/resources/whitepapers.aspx

One of those may be helpful in aiding you to improve your performance.

Jim Dempsey

Rajiv_D_Intel
Employee

To answer this question: "Could you give me a way to reach 5 GB/s bandwidth for data copies between host and device?"

1. Align the CPU data on at least a 64-byte boundary.

2. Preallocate the buffers on the MIC into which you will do the data transfer.

3. Transfer as much data in a single transfer as is feasible.

Using these rules you can achieve >5 GiB/s bandwidth for transfers bigger than about 1 MB. The attached program is an example.

 

Chen__Xin
Beginner

Hi Rajiv Deodhar,

Thank you for sharing your code. Yes, you are right if we only transfer data; frankly speaking, I did similar tests before. But when I run functions in offload mode, there is a period of latency. I think it has to be counted as part of the data transfer time, so the effective bandwidth drops.

I don't claim to be an expert on Phi, but I will try my best to understand it. Could you give me a complete sample program I can follow, instead of a code fragment? I'm working on a real system, not writing a paper.

The following is the report:

[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            116
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        0.031176(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   9437184 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        0.000147(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   0 (bytes)

[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            117
[Offload] [MIC 0] [Tag]             Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        0.031529(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   9437184 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        0.000036(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   0 (bytes)

[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            118
[Offload] [MIC 0] [Tag]             Tag 2
[Offload] [HOST]  [Tag 2] [CPU Time]        0.042867(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time]        0.000031(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data]   9437184 (bytes)

Transfer data  with allocation =0.105715
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            121
[Offload] [MIC 0] [Tag]             Tag 3
[Offload] [HOST]  [Tag 3] [CPU Time]        0.001436(seconds)
[Offload] [MIC 0] [Tag 3] [CPU->MIC Data]   9437184 (bytes)
[Offload] [MIC 0] [Tag 3] [MIC Time]        0.000000(seconds)
[Offload] [MIC 0] [Tag 3] [MIC->CPU Data]   0 (bytes)

[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            122
[Offload] [MIC 0] [Tag]             Tag 4
[Offload] [HOST]  [Tag 4] [CPU Time]        0.001418(seconds)
[Offload] [MIC 0] [Tag 4] [CPU->MIC Data]   9437184 (bytes)
[Offload] [MIC 0] [Tag 4] [MIC Time]        0.000000(seconds)
[Offload] [MIC 0] [Tag 4] [MIC->CPU Data]   0 (bytes)

Transfer data  =0.002895
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            132
[Offload] [MIC 0] [Tag]             Tag 5
[Offload] [HOST]  [Tag 5] [CPU Time]        1.562599(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data]   8 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time]        1.443725(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data]   8 (bytes)

Runing  at 0  =1.562582
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            132
[Offload] [MIC 0] [Tag]             Tag 6
[Offload] [HOST]  [Tag 6] [CPU Time]        1.282831(seconds)
[Offload] [MIC 0] [Tag 6] [CPU->MIC Data]   8 (bytes)
[Offload] [MIC 0] [Tag 6] [MIC Time]        1.185438(seconds)
[Offload] [MIC 0] [Tag 6] [MIC->CPU Data]   8 (bytes)

Runing  at 1  =1.282819
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            132
[Offload] [MIC 0] [Tag]             Tag 7
[Offload] [HOST]  [Tag 7] [CPU Time]        1.276666(seconds)
[Offload] [MIC 0] [Tag 7] [CPU->MIC Data]   8 (bytes)
[Offload] [MIC 0] [Tag 7] [MIC Time]        1.179740(seconds)
[Offload] [MIC 0] [Tag 7] [MIC->CPU Data]   8 (bytes)

Runing  at 2  =1.276653
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            132
[Offload] [MIC 0] [Tag]             Tag 8
[Offload] [HOST]  [Tag 8] [CPU Time]        1.278277(seconds)
[Offload] [MIC 0] [Tag 8] [CPU->MIC Data]   8 (bytes)
[Offload] [MIC 0] [Tag 8] [MIC Time]        1.181233(seconds)
[Offload] [MIC 0] [Tag 8] [MIC->CPU Data]   8 (bytes)

Runing  at 3  =1.278265
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            132
[Offload] [MIC 0] [Tag]             Tag 9
[Offload] [HOST]  [Tag 9] [CPU Time]        1.272324(seconds)
[Offload] [MIC 0] [Tag 9] [CPU->MIC Data]   8 (bytes)
[Offload] [MIC 0] [Tag 9] [MIC Time]        1.175730(seconds)
[Offload] [MIC 0] [Tag 9] [MIC->CPU Data]   8 (bytes)

Runing  at 4  =1.272312
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            171
[Offload] [MIC 0] [Tag]             Tag 10
[Offload] [HOST]  [Tag 10] [CPU Time]        0.001377(seconds)
[Offload] [MIC 0] [Tag 10] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 10] [MIC Time]        0.000000(seconds)
[Offload] [MIC 0] [Tag 10] [MIC->CPU Data]   9437184 (bytes)

[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            174
[Offload] [MIC 0] [Tag]             Tag 11
[Offload] [HOST]  [Tag 11] [CPU Time]        0.001388(seconds)
[Offload] [MIC 0] [Tag 11] [CPU->MIC Data]   24 (bytes)
[Offload] [MIC 0] [Tag 11] [MIC Time]        0.000073(seconds)
[Offload] [MIC 0] [Tag 11] [MIC->CPU Data]   0 (bytes)

 

The code is:

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include <sys/shm.h>
#include <omp.h>
#include <stdlib.h>


#ifdef OFFLOAD
#include <offload.h>
#endif

#ifdef _WIN32
#define ALIGN(x) __declspec(align(x))
#else
#define ALIGN(x) __attribute__ ((aligned (x)))
#endif

#define LOOPNUM 200
// Buffers mirrored on the host and the coprocessor
__declspec(target(mic))
 char *imgDev1;
__declspec(target(mic))
static char *imgDev2;
__declspec(target(mic))
static char *imgOut;

/* buffer alignment */
static int align = 2*1024*1024;


#ifdef OFFLOAD
__attribute__((target (mic))) double When()
{
#ifndef _WIN32
 struct timeval tp;  /* must be local: a static races between threads */
 gettimeofday(&tp, NULL);
 double t = (double)tp.tv_sec;
 double t1 = (double) tp.tv_usec;
 return (t + t1 * 1e-6);
#else
 clock_t start = clock( );
 double duration = (double)start / CLOCKS_PER_SEC;
 return duration;
#endif
}
#endif

double WhenCPU()
{
#ifndef _WIN32
 struct timeval tp;  /* must be local: a static races between threads */
 gettimeofday(&tp, NULL);
 double t = (double)tp.tv_sec;
 double t1 = (double) tp.tv_usec;
 return (t + t1 * 1e-6);
#else
 clock_t start = clock( );
 double duration = (double)start / CLOCKS_PER_SEC;
 return duration;
#endif
}

void ImageAdd(unsigned char *img1, unsigned char *img2, unsigned char* outImage, unsigned int width, unsigned int height)
{
  //OpenMP test part
  double t0 = WhenCPU();
    const size_t iCPUNum = omp_get_max_threads();
    const size_t ySegment = height/iCPUNum;
    printf("LoopNum = %d\n", LOOPNUM);
#pragma omp parallel for
    for (size_t n = 0; n < iCPUNum; n++)
      {
 const size_t starty = n * ySegment;
 size_t endy = starty + ySegment;
 if (n == (iCPUNum - 1)) endy = height;
 
 for (size_t y = starty; y < endy; y++)
   {
     for (size_t nn = 0; nn < LOOPNUM; nn++)
       {
  for (size_t x = 0; x < width; x++)
    {
      outImage[y*width + x] = img1[y*width+x]*0.5f + img2[y*width+x]*0.5f;
    }
       }
   }
      }//end of n<iCPUNum
    double t1 = WhenCPU();
    printf("OpenMP duration time at CPU  =%f\n", t1-t0);
    unsigned int dataSize= width * height * sizeof(char);
    imgDev1 = (char*)_mm_malloc(dataSize+align, align);
    if (imgDev1 ==NULL)
      {
 printf("Cannot open imgDev1 memory\n");
 abort();
      }

    imgDev2 = (char*)_mm_malloc(dataSize+align, align);
    if (imgDev2 ==NULL)
      {
 printf("Cannot open imgDev2 memory\n");
 abort();
      }

    imgOut = (char*)_mm_malloc(dataSize+align, align);
    if (imgOut ==NULL)
      {
 printf("Cannot open imgOut memory\n");
 abort();
      }
    memcpy (imgDev1, img1, dataSize);
    memcpy(imgDev2, img2, dataSize);
    double t2 = WhenCPU();
#pragma offload_transfer target(mic:0) in(imgDev1:length(dataSize) alloc_if(1) free_if(0))
#pragma offload_transfer target(mic:0) in(imgDev2:length(dataSize) alloc_if(1) free_if(0))
#pragma offload_transfer target(mic:0) out(imgOut:length(dataSize) alloc_if(1) free_if(0))
    double t3 = WhenCPU();
    printf("Transfer data  with allocation =%f\n", t3-t2);
#pragma offload_transfer target(mic:0) in(imgDev1:length(dataSize) alloc_if(0) free_if(0))
#pragma offload_transfer target(mic:0) in(imgDev2:length(dataSize) alloc_if(0) free_if(0))
    double t4 = WhenCPU();
  printf("Transfer data  =%f\n", t4-t3);
 
   
    //phi test part
  for (size_t n = 0; n < 5; n++)
    {
 double t40 = WhenCPU();
 
#pragma offload target(mic:0) nocopy(imgDev1,imgDev2, imgOut:length(dataSize) alloc_if(0) free_if(0))
  {
     
    const size_t iCPUNum = omp_get_max_threads();
    const size_t ySegment = height/iCPUNum;
#pragma omp parallel for
    for (size_t n = 0; n < iCPUNum; n++)
      {
 const size_t starty = n * ySegment;
 size_t endy = starty + ySegment;
 if (n == (iCPUNum - 1)) endy = height;
 unsigned char  tmpArray1[width];
 unsigned char  tmpArray2[width];
 unsigned char  tmpArrayout[width];
 
 for (size_t y = starty; y < endy; y++)
   {
     memcpy(tmpArray1, &imgDev1[y*width], width*sizeof(char));
     memcpy(tmpArray2, &imgDev2[y*width], width*sizeof(char));

     for (size_t nn = 0; nn < LOOPNUM; nn++)
       {
  for (size_t x = 0; x < width; x++)
    {
      //' imgOut[y*width + x] = imgDev1[y*width+x]*0.5f + imgDev2[y*width+x]*0.5f;
      tmpArrayout[x] = tmpArray1[x]*0.5f + tmpArray2[x]*0.5f;

    }
       }
     memcpy( &imgOut[y*width], tmpArrayout,width*sizeof(char));

   }
      }//end of n<iCPUNum

  }
  double t50 = WhenCPU();
  printf("Runing  at %d  =%f\n", n, t50-t40);
    }
  //printf("Runing  at %d  =%f\n", n, t5-t4);
#pragma offload_transfer target(mic:0) out(imgOut:length(dataSize) alloc_if(0) free_if(0))
  memcpy(outImage, imgOut, dataSize);

#pragma offload_transfer target(mic:0) nocopy(imgDev1,imgDev2, imgOut:length(dataSize) alloc_if(0) free_if(1))
    if (imgDev1 != NULL)
      _mm_free(imgDev1);  

    if (imgDev2 != NULL)
      _mm_free(imgDev2);  

    if (imgOut != NULL)
      _mm_free(imgOut);  

}
JJK
New Contributor III

@Rajiv Deodhar : thanks for the sample program; I ran it on two different boxes and got very different results:

HostA: xeon E5 2695 v2 w/ Xeon Phi 7100 : I consistently get 6+ GB/s for both send and receive (for buffers that are large enough)

HostB: xeon E5 2620 w/ Xeon Phi 5100: for send I get a consistent 6+ GB/s, just like HostA. However, for 'receive' the results depend on whether the env var MIC_USE_2MB_BUFFERS=2 is set. With the MIC_USE_2MB_BUFFERS env var set I get 6+ GB/s receive speed as well. However, if it is NOT set, the receive speed varies wildly:

Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.

         Size(Bytes) Send(GiB/sec) Receive(GiB/sec)
                1024     0.12             0.14
                2048     0.24             0.30
                4096     0.44             0.59
                8192     0.85             1.07
               16384     1.50             1.79
               32768     2.34             2.83
               65536     3.49             3.84
              131072     4.39             4.88
              262144     5.21             5.59
              524288     5.72             6.00
             1048576     5.99             6.08
             1048576     6.04             6.09
             2097152     5.91             5.99
             4194304     6.14             6.20
             8388608     5.95             1.59
            16777216     6.10             1.02
            33554432     6.24             0.91
            67108864     6.31             0.66
           134217728     6.36             0.51
           268435456     6.39             0.37
           536870912     6.40             0.36
          1073741824     6.41             0.37

 

Now this is a nice reminder for me to always set this magic env var, but what I do not understand is why the Xeon Phi 7100 is not affected by it. The software environments are 99% identical (EL 6 clone, mpss 3.3.2 software stack).

Hopefully you can shed some light on this :)
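For reference, the variable is set in the environment before launching the offload application. My understanding is that it takes a size threshold, with offload pointer buffers at or above that size backed by 2 MB pages on the coprocessor; the `64K` value and the application name below are illustrative assumptions, not values from this thread, so check the MPSS/compiler documentation for the accepted syntax:

```shell
# Assumption: MIC_USE_2MB_BUFFERS takes a size threshold; offload
# buffers >= this size get 2 MB pages on the coprocessor.
# "64K" and "./offload_app" are illustrative, not from this thread.
export MIC_USE_2MB_BUFFERS=64K
./offload_app
```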

 

Chen__Xin
Beginner

Hi Jan,

Thank you for your comments and kind reminder. I have set the ENV.

 

Runing  at 3  =2.294204
[Offload] [MIC 0] [File]            imageaddtrans.cpp
[Offload] [MIC 0] [Line]            133
[Offload] [MIC 0] [Tag]             Tag 9
[Offload] [HOST]  [Tag 9] [State]   Start Offload
[Offload] [HOST]  [Tag 9] [State]   Initialize function __offload_entry_imageaddtrans_cpp_133ImageAdd_6aa841318d398d4e5c43322e08aeba3dicpc54729644sB3YKY
[Offload] [HOST]  [Tag 9] [State]   Send pointer data
[Offload] [HOST]  [Tag 9] [State]   CPU->MIC pointer data 0
[Offload] [HOST]  [Tag 9] [State]   Gather copyin data
[Offload] [HOST]  [Tag 9] [State]   CPU->MIC copyin data 8
[Offload] [HOST]  [Tag 9] [State]   Compute task on MIC
[Offload] [HOST]  [Tag 9] [State]   Receive pointer data
[Offload] [HOST]  [Tag 9] [State]   MIC->CPU pointer data 0
[Offload] [MIC 0] [Tag 8] [State]   MIC->CPU copyout data   0
[Offload] [MIC 0] [Tag 9] [State]   Start target function __offload_entry_imageaddtrans_cpp_133ImageAdd_6aa841318d398d4e5c43322e08aeba3dicpc54729644sB3YKY
[Offload] [MIC 0] [Tag 9] [Var]     imgDev1_V$2  NOCOPY
[Offload] [MIC 0] [Tag 9] [Var]     imgDev2_V$3  NOCOPY
[Offload] [MIC 0] [Tag 9] [Var]     imgOut_V$4  NOCOPY
[Offload] [MIC 0] [Tag 9] [Var]     height_1208_V$f  IN
[Offload] [MIC 0] [Tag 9] [Var]     width_1208_V$e  IN
[Offload] [MIC 0] [Tag 9] [State]   Scatter copyin data
offload
number of cores = 240
[Offload] [HOST]  [Tag 9] [State]   Scatter copyout data
[Offload] [HOST]  [Tag 9] [CPU Time]        2.299723(seconds)
[Offload] [MIC 0] [Tag 9] [CPU->MIC Data]   8 (bytes)
[Offload] [MIC 0] [Tag 9] [MIC Time]        2.125183(seconds)
[Offload] [MIC 0] [Tag 9] [MIC->CPU Data]   0 (bytes)

The problem is still here. The CPU time is 2.299723 s and the MIC time is 2.125183 s. Why the difference? The latency is roughly 0.17 sec, and you can see that for this part there is no data transfer.

 

Thank you again!

Xin

Loc_N_Intel
Employee

Hi Jan,

May I ask what version of the compiler you used on HostB, the machine where you got the poorer results? Thank you.

JJK
New Contributor III

Hi loc-nguyen,

HostB has the Intel compiler suite 2015 installed. I'm turning this into a new thread, as it is getting off-topic for this original thread.

 

TaylorIoTKidd
New Contributor I

Jan Just,

Please post the URL to your new forum post so that others reading this post can find the new thread.

Regards
--
Taylor
 

TaylorIoTKidd
New Contributor I

Xin,

Are you still in need of help? If so, please continue your thread.

Regards
--
Taylor
 

JJK
New Contributor III

Hi Taylor,

sure, the new thread is https://software.intel.com/en-us/forums/topic/534808
