
Problems with Xeon Phi data transfers

lilae
Beginner
1,002 Views

Hello everyone,

I am new to Xeon Phi coprocessors and I am trying to adapt a program previously written for the CPU only so that it offloads to the Xeon Phi. While doing this I have found some strange results in the time necessary to transfer data to the coprocessor. I am using simple offload pragmas with in clauses to transfer an array of floats to the device. For example, a data transfer of 2GB takes 3.08 seconds (677 MB/s) from the CPU to the device, while the other way around it takes 0.315 s (6499 MB/s). We are using PCIe 2.0 x16, so the theoretical bandwidth would be 8GB/s. When getting data from the device we get almost the ideal bandwidth, but when sending data to it the bandwidth is much worse. I am thinking that maybe the lanes of the bus are divided, with only 2 lanes dedicated to host-to-device transfers and the remaining ones dedicated to device-to-host transfers.

Also, to try to overcome this problem, since the main array is empty at first, we decided not to transfer the array to the device but to create it directly on the device and return it to the host at the end. I have tried this in a number of ways using the offload pragmas (with the into keyword) but always get the same error, telling me that it cannot find the data associated with a pointer. I need the pointer to this dynamically allocated memory region on the MIC to be global, since it is shared between different offload calls. I'm wondering if anyone has experienced and successfully overcome this problem.

Thanks.
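
Editorial note on the bandwidth figures in the question: PCIe lanes are full duplex, so the lanes are not split between directions and the 8 GB/s figure for PCIe 2.0 x16 applies to each direction separately (5 GT/s per lane with 8b/10b encoding gives 500 MB/s per lane per direction). A minimal sketch of the arithmetic follows. The function names are hypothetical, and the choice of 1 MB = 2^20 bytes for the measured rate is an assumption, since the post does not say which unit was used.

```c
/* Measured bandwidth in MB/s, assuming 1 MB = 2^20 bytes (an assumption;
   the original post does not state the unit). */
double effective_mb_per_s(double bytes, double seconds)
{
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* Theoretical PCIe 2.0 rate per direction in decimal MB/s.
   5 GT/s per lane with 8b/10b encoding leaves 4 Gbit/s = 500 MB/s
   of data per lane, in each direction at the same time. */
double pcie2_mb_per_s_per_direction(int lanes)
{
    return 500.0 * lanes;
}
```

Comparing the two results for each direction shows how far a measured transfer is from the link limit.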

0 Kudos
8 Replies
jimdempseyatthecove
Honored Contributor III

If the main array is empty, then your application has not "touched" this memory, and thus the pages in which it lies have not been mapped.

To verify this (sketch)

[cpp]
// needs <stdlib.h> for malloc and <omp.h> for omp_get_wtime
const int ArraySize = 5000000;
float* Array = (float*)malloc(ArraySize*sizeof(float));
for(int I=0; I < ArraySize; ++I)
  Array[I] = 0.0f; // touch as well as zero
...
double tBegin = omp_get_wtime();
YourFunctionWithOffloadToPhiHere();
double tEnd = omp_get_wtime();
double tElapse = tEnd - tBegin;
[/cpp]

Do not count the cost of mapping virtual memory to the page file as part of the offload speed.

Jim Dempsey
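
Editorial note: the first-touch effect described above can be observed on the host alone, without any offload. The sketch below fills a freshly allocated array twice and times each pass; the first pass includes the cost of mapping the pages, the second does not. The function names are hypothetical, and clock_gettime is used here instead of omp_get_wtime only so that the sketch does not depend on OpenMP.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Writes every element of a[0..n) and returns the elapsed seconds.
   volatile keeps the compiler from discarding the stores. */
static double timed_fill(volatile float *a, size_t n, float value)
{
    double t0 = wall_seconds();
    for (size_t i = 0; i < n; ++i)
        a[i] = value;
    return wall_seconds() - t0;
}

/* Allocates n floats, fills them twice and reports both timings.
   Returns 0 on success, -1 on a bad size or failed allocation. */
int compare_first_and_second_touch(size_t n, double *first, double *second)
{
    if (n == 0)
        return -1;
    float *a = (float *)malloc(n * sizeof *a);
    if (a == NULL)
        return -1;
    *first  = timed_fill(a, n, 0.0f); /* pages are mapped during this pass */
    *second = timed_fill(a, n, 1.0f); /* pages are already mapped */
    free(a);
    return 0;
}
```

If the first timing is clearly larger than the second, the same page-mapping cost would be hidden inside a timed offload of an untouched array.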

Ravi_N_Intel
Employee

Your transfer of data from host to card includes the setup of the memory.  The transfer from card to host may not include the setup cost depending on how you used the pragma.

Try creating the memory in a separate offload pragma before you measure the host -> MIC transfer, by doing the following:
#pragma offload_transfer target(mic) nocopy(A:length(size) alloc_if(1) free_if(0))
Then use the created memory in the pragma offload you are measuring, with an in(A:length(size) alloc_if(0)) clause.
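
Editorial note: put together, the suggestion above might look like the following sketch. It assumes the Intel compiler with offload support; other compilers ignore the pragmas, so the code still builds but measures nothing useful. The function names are hypothetical. The free_if(0) on the timed transfer and the final release step are additions that are not in the post, included so the buffer is freed explicitly.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the seconds spent copying a[0..n) from host to coprocessor,
   with the coprocessor buffer allocated beforehand so that the
   allocation cost is not part of the measurement. */
double time_transfer_without_alloc(float *a, size_t n)
{
    /* Step 1 (untimed): allocate on the card, copy nothing, keep the buffer. */
    #pragma offload_transfer target(mic : 0) nocopy(a : length(n) alloc_if(1) free_if(0))

    /* Step 2 (timed): copy into the existing buffer, no allocation. */
    double t0 = wall_seconds();
    #pragma offload_transfer target(mic : 0) in(a : length(n) alloc_if(0) free_if(0))
    double elapsed = wall_seconds() - t0;

    /* Step 3 (untimed): release the coprocessor buffer. */
    #pragma offload_transfer target(mic : 0) nocopy(a : length(n) alloc_if(0) free_if(1))

    return elapsed;
}
```

Note that length(n) counts elements, not bytes, so the bandwidth is n * sizeof(float) divided by the returned time.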

lilae
Beginner

Thank you for your answers. In the end, yes, the problem was the setup of the memory. Why does it cost that much? Apart from allocating memory on the device, what does this setup involve? We had measured the cost of allocating memory on the device and it didn't seem so costly.

Anyway, I am also intrigued by the second problem described before, retrieving data that was allocated only on the Xeon Phi. If anyone could post basic code that simply allocates memory for an array on the Xeon Phi in one function and gets it back to the host in another function, I would appreciate it very much. I have tried everything and couldn't find a similar example on the web.

Kevin_D_Intel
Employee

The Effective Use of Compiler Features for Offloading article discusses techniques for data persistence including pointers local to the coprocessor.

I'm having trouble publishing changes to that article at the moment, including adding this additional example:

When you need to pass the MIC-local pointer between different routines on the CPU, you have to bring it back to the CPU, pass it around as you wish, and send it back to the coprocessor when the data needs to be accessed there again. A void* pointer type allows coprocessor pointers to be sent back and forth. Of course, a coprocessor pointer has no meaning on the CPU, so the programmer must take care to only store it and pass it around on the CPU, and to dereference it only on the MIC. See the example below.

 [cpp]#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        float *mic_ptr3 = 0;
        void *device_ptr;

        // return pointer to allocated memory on target
        #pragma offload target(mic : 0) nocopy(mic_ptr3) out(device_ptr)
        {
                mic_ptr3 = (float *)malloc(100 * sizeof(float));
                mic_ptr3[0:100] = -1.0;
                device_ptr = (void*)mic_ptr3;
        }

        float sum;

        // reuse pointer to allocated memory on target
        #pragma offload target(mic : 0) nocopy(mic_ptr3) in(device_ptr) out(sum)
        {
                int i;
                sum=0;
                mic_ptr3 = (float*)device_ptr;
                for (i=0; i < 100; i++){
                        sum += mic_ptr3[i];
                }
        }
        printf("%f\n", sum);

        #pragma offload target(mic : 0) nocopy(mic_ptr3)
        {
                free(mic_ptr3); mic_ptr3 = NULL; // go away
        }
}[/cpp]
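
Editorial note: the same void* technique can be split across separate host functions, which is what the question asked for. The following is a minimal sketch under that assumption, not a verified implementation. The function names are hypothetical, and the in/out clauses for the void* handle follow the example above. With a compiler that does not support offload the pragmas are ignored and everything runs on the host.

```c
#include <stdlib.h>

/* Allocates n floats in coprocessor memory, fills them with value, and
   returns the coprocessor address as an opaque handle. The handle must
   not be dereferenced on the host. */
void *mic_alloc_floats(size_t n, float value)
{
    void *device_ptr = NULL;
    #pragma offload target(mic : 0) in(n, value) out(device_ptr)
    {
        float *p = (float *)malloc(n * sizeof *p);
        if (p != NULL)
            for (size_t i = 0; i < n; ++i)
                p[i] = value;
        device_ptr = (void *)p;
    }
    return device_ptr;
}

/* Copies n floats from the coprocessor allocation behind device_ptr
   into host_buf, which must have room for n floats. */
void mic_fetch_floats(void *device_ptr, float *host_buf, size_t n)
{
    #pragma offload target(mic : 0) in(device_ptr, n) out(host_buf : length(n))
    {
        const float *p = (const float *)device_ptr;
        for (size_t i = 0; i < n; ++i)
            host_buf[i] = p[i];
    }
}

/* Frees the coprocessor allocation behind device_ptr. */
void mic_free(void *device_ptr)
{
    #pragma offload target(mic : 0) in(device_ptr)
    {
        free(device_ptr);
    }
}
```

The handle returned by mic_alloc_floats can be stored in an ordinary host global and passed to any later offload, which covers the case of a pointer shared between different offload calls.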

jimdempseyatthecove
Honored Contributor III

Kevin,

Shouldn't the second offload include out(sum)?

Jim Dempsey

Kevin_D_Intel
Employee

Yes, it could, so I added it. In its absence, sum defaults to INOUT.

Sorry about the extra blank lines. I haven't figured out the new magic to prevent those.

Ravi_N_Intel
Employee

"In the end, yes, the problem was the setup of the memory. Why does it cost that much?"

The cost of loading the image and creating the process on the MIC is part of the first offload. You can factor out this cost by setting

export OFFLOAD_INIT=on_start

which sets up the MIC process when the host process is loaded.
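
Editorial note: a different way to keep the one-time initialization out of a measurement is to issue an empty warm-up offload before the timed section. This is a swapped-in technique, not the OFFLOAD_INIT mechanism described above, and the function name is hypothetical. The sketch assumes the Intel compiler with offload support; other compilers ignore the pragma and the block does nothing.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs an offload region with an empty body and returns the elapsed
   seconds. On the first call this is expected to include loading the
   image and creating the coprocessor process. */
double time_empty_offload(void)
{
    double t0 = wall_seconds();
    #pragma offload target(mic : 0)
    {
        /* intentionally empty: only the offload overhead is measured */
    }
    return wall_seconds() - t0;
}
```

Calling it twice and comparing the two results gives a rough idea of how much of the first offload is initialization.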

lilae
Beginner

Thank you for answering. First of all, I haven't had time to try the pointer technique that you explained, but I really hope it works.

Regarding the memory setup, the offload in which this problem appears is not the first offload of the program. In fact we execute these offloads in a loop and it always happens at the same point. I only see this problem when the pragma includes an in or nocopy clause and alloc_if is set to true. That is why I am so surprised that it takes so long, since I suppose the main thing it does is allocate the memory, and that should not take that much time. Doing a simple malloc on the MIC is much faster, and that's why I will try to work with what Kevin has posted.

Anyway, has anyone noticed this strange behaviour while using the pragmas?
