
Intel Xeon Phi coprocessor - Offloading problem

H__Kamil
Beginner
501 Views

Hi everyone,

I am writing because I have run into a problem with the offload programming model. I am trying to use all available resources of my platform: 2 CPUs + 2 MICs. I observed that when the CPUs use the same buffers that are copied to the accelerators, the computation time increases. For example, the normal execution time is 75 seconds, but when I copy the buffers to the MICs it grows to ~90 seconds. My platform uses the Intel Xeon Phi 7120P, the icpc compiler (v 17.0.2), the Red Hat 3.8 operating system, and Intel MPSS 3.7.2.

What could be the problem?
Thanks for your help.

0 Kudos
5 Replies
jimdempseyatthecove
Honored Contributor III

Kamil,

You should be aware (it is obvious once you think about it) that the first offload in the execution of your program incurs:

a) The overhead of copying the MIC code from the host into the coprocessor.
b) The overhead of starting up the thread pool (64, 128, 192 or 256 threads), since the coprocessor code is inevitably multi-threaded (e.g. OpenMP, TBB, Cilk, pthreads, ...).
c) The overhead of "first touch" of any allocated memory
d) The overhead of loading any shared libraries in the MIC (the .so files)
e) other...

While I do not think that this overhead will be 15 seconds, it will be significant, and it should be accounted for when timing your program. Usually, one runs the test in a loop: the first iteration incurs this overhead, and the remaining iterations give you a better estimate of the runtime.
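A minimal sketch of that timing approach, assuming the work can be wrapped in a callable. The helper names time_repeats and steady_state_mean are made up for this illustration and are not part of any library:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` `repeats` times and returns the wall-clock seconds of each run.
// Element 0 carries any one-time startup cost (code upload, thread pool
// creation, first touch of memory), so it is reported separately by callers.
template <typename Work>
std::vector<double> time_repeats(Work&& work, std::size_t repeats) {
    assert(repeats > 0);
    std::vector<double> seconds;
    seconds.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}

// Mean of all runs except the first (warm-up) run.
// With a single run there is nothing to discard, so that run is returned.
double steady_state_mean(const std::vector<double>& seconds) {
    assert(!seconds.empty());
    if (seconds.size() == 1) return seconds[0];
    double sum = 0.0;
    for (std::size_t i = 1; i < seconds.size(); ++i) sum += seconds[i];
    return sum / static_cast<double>(seconds.size() - 1);
}
```

You would pass your offload or compute call as the callable, for example time_repeats([&] { run_once(); }, 5), where run_once is a placeholder for your own routine, and then compare element 0 against steady_state_mean of the result.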

This said, you may have an application that only runs an offload once. There are a few things you can do to mitigate this to some extent. You might want to read this: https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

Read the section on Initialization Overhead and how to reduce the initial offload.

Also look at the other sections. Without seeing your program, it would be difficult for the readers of this post to make any meaningful suggestions. Read the article above first, experiment second, then come back here with your additional questions.

A small variation on the instructions for reducing Initialization Overhead would be to programmatically overlap the initialization overhead with your host application initialization. While you can use "OFFLOAD_INIT=on_start", it is not clear from the documentation whether this occurs asynchronously to the remainder of your program (up to the first offload of the application). To provide asynchronous initialization, you may experiment with not setting OFFLOAD_INIT (or setting it to OFFLOAD_INIT=on_offload) and then:

#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        if ((omp_get_num_threads() == 1) || (omp_get_thread_num() == 1))
        {
            // do this once by one thread other than master
            // (unless master is the only OpenMP thread on the host)
            // empty transfer, intended to trigger the coprocessor initialization
            #pragma offload_transfer target(mic)
        }
        #pragma omp master
        {
            ... program initialization here
        }
    } // end initialization parallel region
    // everything initialized here
    ... start computing

I haven't actually tried the above, but I see no reason why it should not work.

Jim Dempsey

H__Kamil
Beginner

Dear Jim,

First of all, thank you for your interest in my problem.

I understand your point about initialization overhead, but I do not think it is the source of the problem. I created two tests for my application. In both, I allocate memory for the data, transfer the data to the MICs, and, once the transfers have finished, perform computations using the CPUs only. Finally, I deallocate the memory. The CPU computations are repeated several times. The code of the first test looks similar to the code below.

//Input arrays
double* a1 = ....;
double* a2 = ....;
...
double* an = ....;

//Allocation of memory in cards
#pragma offload target(mic:0) \
	in(a1 : length(size) alloc_if(1) free_if(0)) \
	in(a2 : length(size) alloc_if(1) free_if(0)) \
	...
	in(an: length(size) alloc_if(1) free_if(0))
	{ 	}

#pragma offload target(mic:1) \
	in(a1 : length(size) alloc_if(1) free_if(0)) \
	in(a2 : length(size) alloc_if(1) free_if(0)) \
	...
	in(an: length(size) alloc_if(1) free_if(0))
	{ 	}

//Parallel computations performed by CPU using arrays a1, a2, ..., an
for(i=0; i<5; ++i)
{
	double start = get_time();
	// OpenMP parallel computations here
        start_computations_cpu_only(a1, a2, ..., an);
	double stop = get_time();
	
	cout << "Time: " << stop - start << endl;
}

/* Rest of code */

For this code, the CPU computation time is always close to 90 seconds (88~90), so the performance overhead occurs here. In this application, the computations are performed using 20 arrays. In this case, the buffers I copy to the accelerators are the same ones the CPUs use in their computations. The total amount of data transferred to the accelerators is close to 8 GB.

Now consider the second test. I wrote a version in which the CPU and MIC buffers are independent. This means that I create two copies of the buffers: the first for the CPUs and the second for the MICs. I transfer the mic_* buffers to the coprocessors. See the code below.

//Input arrays
double* a1 = ....;
double* a2 = ....;
...
double* an = ....;
double* mic_a1 = ....;
double* mic_a2 = ....;
...
double* mic_an = ....;

//Allocation of memory in cards
#pragma offload target(mic:0) \
	in(mic_a1 : length(size) alloc_if(1) free_if(0)) \
	in(mic_a2 : length(size) alloc_if(1) free_if(0)) \
	...
	in(mic_an: length(size) alloc_if(1) free_if(0))
	{ 	}

#pragma offload target(mic:1) \
	in(mic_a1 : length(size) alloc_if(1) free_if(0)) \
	in(mic_a2 : length(size) alloc_if(1) free_if(0)) \
	...
	in(mic_an: length(size) alloc_if(1) free_if(0))
	{ 	}

//Parallel computations performed by CPU using arrays a1, a2, ..., an
for(i=0; i<5; ++i)
{
	double start = get_time();
	// OpenMP parallel computations here
        start_computations_cpu_only();
	double stop = get_time();
	
	cout << "Time: " << stop - start << endl;
}

/* Rest of code */

In this case, I have two copies of the same data. For this test, the execution time of the CPU code is normal: it is always close to 75 seconds, which is the expected value. This resolves my performance problem, but it requires rewriting the application source code and increases its complexity.

Based on the results of these tests, I observed that the overhead occurs when the CPUs and MICs use the same buffers. I know it is very strange. Do you have any idea what the problem could be?

Best regards.

 

jimdempseyatthecove
Honored Contributor III

Dear Kamil,

If the content of start_computations_cpu_only(); did not change between case 1 and case 2 (other than the offload code using mic_a*) and you see a performance difference of this significance (75:89), then one possible source of the difference is array alignment. That is to say, the allocations are not requested to be aligned, and by chance the second test had more favorable alignment than the first test.

Have you addressed data alignment issues?
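One quick way to check this on the host is to print or assert the alignment of each buffer right after allocation in both tests. The following is only a sketch; is_aligned is a helper name invented for this example, and 64 bytes is assumed as the target because it matches the 512-bit vector width:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `ptr` is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two (e.g. 64 for 512-bit vectors).
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

Calling is_aligned(a1, 64) for every array in both tests would show whether the two runs really differ in alignment.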

These might be of assistance:

https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-html/#optimized-offload-code

>>Based on the results of these test, I observed that the overheads occur when CPU and MIC use the same buffers.

Even though the buffers have the same name, they exist in different processes on different CPUs with different instruction sets. The heap allocations performed on each system do not necessarily have the same virtual addresses, and even if by chance they did, they are in different systems. Therefore the name of the variable is immaterial.

The Xeon Phi, especially the 7100 series, is used optimally with code in which the majority of instruction sequences are 512-bit SIMD (with aligned data). Code that is predominantly scalar is not as effective. What is the coding style (mostly scalar or mostly vector/SIMD)?

Jim Dempsey

H__Kamil
Beginner

Dear Jim,

in both cases, I allocate memory using:

double* a1 = (double *)_mm_malloc(size*sizeof(double), 64);

For both the CPU and MIC code I rely on the compiler's automatic vectorization.

In both tests the MICs do not perform any computations. I only allocate memory on the coprocessors and transfer the input data using the #pragma offload target() clause. Next I perform the computations using the CPUs only, and finally I deallocate the memory. What I find strangest is that transferring data to the MICs has an impact on the performance of the CPU computations.

jimdempseyatthecove
Honored Contributor III

Oh... I thought your code did computation in the MIC.

A few potential causes of runtime variations like the ones you are experiencing:

1) One scenario is operating with Turbo-Boost enabled and the other is not.
2) In one scenario your start_computations_cpu_only() is performing allocations to virtual memory never touched before in your process (encountering page faults to the OS to map the virtual addresses to the page file and physical RAM), and in the other scenario it is not.
3) Your host system has 2 CPUs, and likely has the potential for 2 NUMA nodes. If (when) your system is configured for multiple NUMA nodes and your allocations and first touch are not NUMA aware, then it is possible that the CPU(s) that use the data (or parts of the data) are not in the same NUMA node as the data in one case and are in the same NUMA node in the other case.
4) (related to 3) If you are not pinning your threads with affinity settings, then the OpenMP threads that first touched the data (and so mapped it to what was then their NUMA node) could migrate to the other CPU (in one case and not the other).
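To illustrate points 3 and 4, here is a rough sketch of NUMA-aware first touch with OpenMP. The function names alloc_first_touch and sum_static are made up for this example, and std::malloc stands in for whatever aligned allocator you actually use. The idea is that the initialization loop uses the same static schedule as the compute loop, so each thread first touches (and, on a typical Linux first-touch policy, places in its local NUMA node) the part of the array it will later work on:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Allocates `n` doubles and initializes them in parallel.
// With schedule(static), thread t writes a fixed contiguous chunk, so the
// pages of that chunk are first touched by thread t.
double* alloc_first_touch(std::size_t n, double value) {
    assert(n > 0);
    double* data = static_cast<double*>(std::malloc(n * sizeof(double)));
    if (data == nullptr) return nullptr;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        data[i] = value;
    }
    return data;
}

// Example compute loop that uses the same static schedule, so each thread
// reads the chunk it initialized above (as long as threads stay pinned).
double sum_static(const double* data, std::size_t n) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        total += data[i];
    }
    return total;
}
```

This only helps if the threads do not migrate afterwards. The standard OpenMP environment variables OMP_PROC_BIND and OMP_PLACES (or KMP_AFFINITY with the Intel runtime) can be used to pin them; which setting is best for your machine is something you would have to experiment with.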

Jim Dempsey

 
