Intel LEO multiple copies of the same variable on MIC

Antonio_B_
Beginner
595 Views

Dear all,

I am extending the SNU NPB OpenMP version to use LEO, and I ran into a problem while converting the IS application. I have only a couple of weeks of practice with LEO, so I cannot tell whether this is a compiler bug, a missing feature, or something else. The problem is that, with the offload pragma, the in/out/inout clauses are not always respected, depending on the code inside the offload region (I am speaking about the C version): multiple copies of the same variable are created on the MIC, and those copies are not kept consistent. Here is an example:

#pragma offload target(mic) in(test_rank_array) in(test_index_array) inout(key_array) inout(key_buff2) inout(partial_verify_vals) inout(passed_verification) inout(key_buff1_aptr[0:(MAX_KEY*num_procs)-1] : alloc_if(1) free_if(1))
{
    printf("%s: in rank(1)_start: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    rank( 1 );
    printf("%s: in rank(1)_end: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
}
    passed_verification = 0;
#pragma offload target(mic) in(test_rank_array) in(test_index_array) inout(key_array) inout(key_buff2) inout(partial_verify_vals) inout(passed_verification) inout(key_buff1_aptr[0:(MAX_KEY*num_procs)-1] : alloc_if(1) free_if(1))
{
    printf("%s: in iteration_start: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
    {
        if( CLASS != 'S' ) printf( "        %d\n", iteration );
        rank( iteration );
        printf("%s: in iteration_for: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    }
    printf("%s: in iteration_end: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
}

Note that the problem here is passed_verification. The rank function uses passed_verification and, during computation, also prints its address. This is a typical output:

main: in rank(1)_start: 0 0x11223344
rank: compute: 1 0x11223344
main: in rank(1)_end: 1 0x11223344
main: in iteration_start: 0 0x55667788
rank: compute: 2 0x11223344
main: in iteration_for: 0 0x55667788
main: in iteration_for: 0 0x55667788
...

Therefore, I guess there are different copies of the same variable per scope, and they are not synchronized with each other. I am not aware of any keyword that solves this problem; I would be grateful to learn of one. Note that manually unrolling the for loop leads to a correct result, i.e., only one copy of the variable is created. For example:

#pragma offload target(mic) in(test_rank_array) in(test_index_array) inout(key_array) inout(key_buff2) inout(partial_verify_vals) inout(passed_verification) inout(key_buff1_aptr[0:(MAX_KEY*num_procs)-1] : alloc_if(1) free_if(1))
{
    printf("%s: in iteration_start: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    rank( 1 );
    ...
    rank( MAX_ITERATIONS );
    printf("%s: in iteration_end: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
}

I am using MPSS 3.2.3 on Linux 2.6.38.8 (MIC) and Linux 2.6.32.431 (Xeon Phi), with ICC 14.0.3. The development platform comes directly from Intel.

Antonio

PS The code is not optimized to minimize data transfers.

4 Replies
Kevin_D_Intel
Employee

I cannot reproduce what is shown using a simplified test case that only considers a function-scope scalar passed_verification, *except* in two situations: when I force the two offloads onto different coprocessors, or when I enclose the code that uses target(mic) within an OpenMP parallel region with multiple threads and do not use shared(passed_verification).

With multiple coprocessors, I expect the default scheduling to land each offload on card 0, so the same context should be used. However, you can try restricting the offloads to the same coprocessor by changing the target clause to something like: target(mic:0)
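
A minimal sketch of that suggestion follows. It is not taken from the benchmark: rank_stub and run_two_offloads are hypothetical names, and OFFLOAD_ATTR is an illustrative macro that applies the Intel target(mic) attribute only when the compiler defines __INTEL_OFFLOAD. Both offload regions name mic:0 explicitly, so they should run in the same coprocessor context. A compiler without offload support ignores the offload pragmas and runs everything on the host.

```c
#include <stdio.h>

/* Illustrative macro (an assumption, not part of the original code):
   mark data and functions for the coprocessor only under Intel offload. */
#ifdef __INTEL_OFFLOAD
#define OFFLOAD_ATTR __attribute__((target(mic)))
#else
#define OFFLOAD_ATTR
#endif

/* File-scope counter, like passed_verification in the IS benchmark. */
OFFLOAD_ATTR int passed_verification = 0;

/* Hypothetical stand-in for rank(): bumps the counter once per call. */
OFFLOAD_ATTR void rank_stub(int iteration)
{
    (void)iteration;
    passed_verification++;
}

/* Runs two offload regions pinned to coprocessor 0 and returns the
   counter value copied back to the host after the second region. */
int run_two_offloads(int iterations)
{
    #pragma offload target(mic:0) inout(passed_verification)
    {
        rank_stub(1);
        printf("first region:  %d %p\n", passed_verification,
               (void *)&passed_verification);
        fflush(0);
    }

    passed_verification = 0;  /* reset on the host between the regions */

    #pragma offload target(mic:0) inout(passed_verification)
    {
        int i;
        for (i = 1; i <= iterations; i++)
            rank_stub(i);
        printf("second region: %d %p\n", passed_verification,
               (void *)&passed_verification);
        fflush(0);
    }

    return passed_verification;
}
```

If both regions operate on the same copy of the variable, run_two_offloads(3) should return 3 and both printf calls should report the same address.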

If you have a single card but OpenMP is in play somewhere at an outer level, the variable may need to be declared shared.

If neither of these seems related, I would need more context, in the form of a complete reproducing example, to know whether you are hitting a compiler or run-time defect.

Antonio_B_
Beginner

Dear Kevin,

Thank you very much for your quick answer! My bad, I didn't attach a tarball with the code. The buggy version is attached to this message; please forgive me that the code is not cleaned up. The code can be built with ./compile.sh and run with ./is-offload. I am using a single Xeon Phi (I have no plans to run on multiple Xeon Phis for now).

I tried adding shared(passed_verification) at the end of the #pragma offload directive, but nothing changed. Simply substituting inout with shared does not work either. By the way, I tried two-dimensional array transfers with the special notation from https://software.intel.com/en-us/articles/xeon-phi-coprocessor-data-transfer-array-of-pointers-using-language-extensions-for-offload and they are not working. Which is the first version of ICC that supports this feature?

thanks,
Antonio

Kevin_D_Intel
Employee

Thanks for the example code. I will have a look at that.

shared() is a #pragma omp clause, not a #pragma offload clause.
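
To illustrate where each clause belongs, here is a small sketch with hypothetical names (count_with_shared and hits are not from the attached code). The shared() clause sits on the OpenMP directive, while the offload directive uses the in/out/inout data-transfer clauses. A compiler without OpenMP or offload support ignores the pragmas and runs the code serially on the host.

```c
/* Counts n loop iterations under OpenMP, then adds 1 in an offload
   region. hits is a hypothetical function-scope scalar. */
int count_with_shared(int n)
{
    int hits = 0;
    int i;

    /* shared() is an OpenMP data-sharing clause, so it goes here.
       All threads update the single host copy of hits. */
    #pragma omp parallel for shared(hits)
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        hits++;
    }

    /* The offload pragma describes host/coprocessor transfers instead:
       inout copies hits to the card before the region and back after. */
    #pragma offload target(mic:0) inout(hits)
    {
        hits += 1;
    }

    return hits;
}
```

With this layout, count_with_shared(10) returns 11 regardless of the number of OpenMP threads, because the atomic update keeps the shared counter consistent.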

Data transfers for arrays of pointers debuted in the Intel® Parallel Studio XE 2015 Composer Edition (15.0 compiler). The 14.0.3 compiler you indicated using does not have this support.

Kevin_D_Intel
Employee

Thank you for the reproducer, that was very helpful. I confirmed there is an underlying defect in the 14.x compiler associated with the file-scope variable passed_verification. The optimizer (at -O2) mishandles the variable association inside the offload regions when the rank( 1 ) call is active in the first of those two offload constructs that call rank().

I confirmed the issue is fixed in our latest Parallel Studio XE 2015 Release (15.0 compiler).

I also confirmed the issue is avoidable by compiling at -O1 with the 14.0.3 compiler you currently have. You could continue using this compiler with -O1 while you explore adding offload, but since this is a benchmark, upgrading to the latest Parallel Studio XE 2015 Update 2 release is advised for measuring optimized performance.

Our apologies for the defect, lost time, and confusion.
