Hello everyone,
I am new to Xeon Phi coprocessors and am adapting a CPU-only program to offload work to the Xeon Phi. While doing this I have found some strange results for the time needed to transfer data to the coprocessor. I am using simple offload pragmas with in clauses to transfer an array of floats to the device. For example, transferring 2 GB from the host to the device takes 3.08 seconds (677 MB/s), while the same transfer in the other direction takes 0.315 s (6499 MB/s). We are using PCIe 2.0 x16, so the theoretical bandwidth is 8 GB/s. When getting data from the device we come close to the ideal bandwidth, but when sending data to the device the bandwidth is much worse. I am wondering whether the lanes of the bus are divided so that only 2 lanes are dedicated to host-to-device transfers and the remaining ones to device-to-host transfers.
To work around this problem, since the main array is initially empty, we decided not to transfer the array to the device but to create it directly on the device and return it to the host at the end. I have tried this in a number of ways using the offload pragmas (with the into keyword) but always get the same error, telling me that it cannot find the data associated with a pointer. I need the pointer to this dynamically allocated memory on the MIC to be global, since it is shared between different offload calls. Has anyone run into this problem and successfully overcome it?
Thanks.
If the main array is empty, then your application has not "touched" this memory, and thus the pages in which it lies have not been mapped.
To verify this (sketch):
[cpp]
	const int ArraySize = 5000000;
	float* Array = (float*)malloc(ArraySize*sizeof(float));
	for(int I=0; I < ArraySize; ++I)
	  Array[I] = 0.0f; // touch as well as zero
	...
	double tBegin = omp_get_wtime();
	YourFunctionWithOffloadToPhiHere();
	double tEnd = omp_get_wtime();
	double tElapse = tEnd - tBegin; // pages were already mapped before timing started
	[/cpp]
Do not attribute the cost of mapping virtual memory pages (which happens on first touch) to the offload speed.
Jim Dempsey
Your transfer of data from host to card includes the setup of the memory. The transfer from card to host may not include the setup cost depending on how you used the pragma.
Try creating the memory in a separate offload pragma before you measure the host -> MIC transfer, like this:
	#pragma offload_transfer target(mic) nocopy(A : length(size) alloc_if(1) free_if(0))
Then use the created memory in the pragma offload you are measuring, with an in(A : length(size) alloc_if(0)) clause.
Thank you for your answers. In the end, yes, the problem was the setup of the memory. Why does it cost that much? Apart from allocating memory on the device, what does this setup involve? We had measured the cost of allocating memory on the device and it didn't seem so costly.
Anyway, I am still intrigued by the second problem I raised earlier: retrieving data that was allocated only on the Xeon Phi. If anyone could post basic code that simply allocates memory for an array on the Xeon Phi in one function and gets it back to the host in another function, I would appreciate it very much. I have tried everything and haven't found a similar example on the web.
The Effective Use of Compiler Features for Offloading article discusses techniques for data persistence including pointers local to the coprocessor.
I'm having trouble publishing changes to that article at the moment, including adding this additional example:
When needing to pass the MIC local pointer between different routines on the CPU, you have to bring it back to the CPU, pass it around as you wish, and send it back to coprocessor when the data needs to be accessed on coprocessor again. A void* pointer type allows coprocessor pointers to be sent back and forth. Of course, a coprocessor pointer has no meaning on the CPU, so the programmer must take care to keep it on the CPU and pass it around, but use it only on MIC. See the example below.
[cpp]
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        float *mic_ptr3 = 0;
        void *device_ptr;

        // return pointer to allocated memory on target
        #pragma offload target(mic : 0) nocopy(mic_ptr3) out(device_ptr)
        {
                mic_ptr3 = (float *)malloc(100 * sizeof(float));
                mic_ptr3[0:100] = -1.0; // array notation: set all 100 elements
                device_ptr = (void*)mic_ptr3;
        }

        float sum;

        // reuse pointer to allocated memory on target
        #pragma offload target(mic : 0) nocopy(mic_ptr3) in(device_ptr) out(sum)
        {
                int i;
                sum=0;
                mic_ptr3 = (float*)device_ptr;
                for (i=0; i < 100; i++){
                        sum += mic_ptr3[i];
                }
        }
        printf("%f\n", sum);

        #pragma offload target(mic : 0) nocopy(mic_ptr3)
        {
                free(mic_ptr3); mic_ptr3 = NULL; // go away
        }
        return 0;
}
[/cpp]
Kevin,
Shouldn't the second offload include out(sum)?
Jim Dempsey
Yes, it could, so I added it. In its absence, sum defaults to INOUT.
Sorry about the extra blank lines. I haven't figured out the new magic to prevent those.
"In the end, yes, the problem was the setup of the memory. Why does it cost that much?"
The cost of loading the image and creating the process on MIC is part of the first offload. You can factor out this cost by setting
export OFFLOAD_INIT=on_start
which sets up the MIC process when the host process is loaded.
Thank you for answering. First of all, I haven't had time to try the pointer approach you explained, but I really hope it works.
Regarding the memory setup, the offload in which this problem occurs is not the first offload of the program. In fact, we are executing these offloads in a loop and it always happens at the same point. I only see this problem when the pragma includes an in or nocopy clause and alloc_if is set to true. That is why I am so surprised that it takes so long: I assume the main thing it does is allocate the memory, and that should not take that much time. Doing a simple malloc on the MIC is much faster, which is why I will try to work with what Kevin has posted.
Anyway, has anyone else noticed this strange behaviour while using the pragmas?