Intel LEO multiple copies of the same variable on MIC

Antonio_B_ · ‎02-27-2015

Dear all,

I am extending the SNU NPB OpenMP version to use LEO, and I ran into a problem while converting the IS application. Since I have only a couple of weeks of practice with LEO, I cannot tell whether this is a compiler bug, a missing feature, or something else. The problem is that when the offload pragma is used, in/out/inout are not always respected, depending on the underlying code (I am speaking about the C version): multiple copies of the same variable are created on the MIC, and those copies are not consistent. Here is an example:

#pragma offload target(mic) in(test_rank_array) in(test_index_array) inout(key_array) inout(key_buff2) inout(partial_verify_vals) inout(passed_verification) inout(key_buff1_aptr[0:(MAX_KEY*num_procs)-1] : alloc_if(1) free_if(1))
{
    printf("%s: in rank(1)_start: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    rank( 1 );
    printf("%s: in rank(1)_end: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
}
    passed_verification = 0;
#pragma offload target(mic) in(test_rank_array) in(test_index_array) inout(key_array) inout(key_buff2) inout(partial_verify_vals) inout(passed_verification) inout(key_buff1_aptr[0:(MAX_KEY*num_procs)-1] : alloc_if(1) free_if(1))
{
    printf("%s: in iteration_start: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
    {
        if( CLASS != 'S' ) printf( "        %d\n", iteration );
        rank( iteration );
        printf("%s: in iteration_for: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    }
    printf("%s: in iteration_end: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
}

Note that the problem here is passed_verification. The rank function uses passed_verification and, during the computation, also prints its address. This is typical output:

main: in rank(1)_start: 0 0x11223344
rank: compute: 1 0x11223344
main: in rank(1)_end: 1 0x11223344
main: in iteration_start: 0 0x55667788
rank: compute: 2 0x11223344
main: in iteration_for: 0 0x55667788
main: in iteration_for: 0 0x55667788
...

Therefore, I guess there is a different copy of the same variable per scope and the copies are not synchronized with each other. I am not aware of any keyword that solves this problem; I would be grateful to learn of one. Note that manually unrolling the for loop leads to a correct result, i.e., only one copy of the variable is created. For example:

#pragma offload target(mic) in(test_rank_array) in(test_index_array) inout(key_array) inout(key_buff2) inout(partial_verify_vals) inout(passed_verification) inout(key_buff1_aptr[0:(MAX_KEY*num_procs)-1] : alloc_if(1) free_if(1))
{
    printf("%s: in iteration_start: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
    rank( 1 );
    ...
    rank( MAX_ITERATIONS );
    printf("%s: in iteration_end: %d %p\n", __func__, passed_verification, &passed_verification); fflush(0);
}

I am using MPSS3.2.3, on Linux 2.6.38.8 (MIC), Linux 2.6.32.431 (Xeon Phi). ICC 14.0.3. The development platform comes directly from Intel.

Antonio

PS The code is not optimized to minimize data transfers.

Kevin_D_Intel · ‎03-02-2015

I cannot reproduce what is shown using a simplified test case that only considers a function-scope scalar passed_verification, *except* when I force the two offloads onto different coprocessors, or enclose the code (that uses target(mic)) within an OMP parallel region with multiple threads and do not use shared(passed_verification).

I expect that with multiple coprocessors the default scheduling will land each offload on card 0, so the same context should be used; however, you can try restricting the offloads to the same coprocessor by changing the target clause to something like: target(mic:0)
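
As a rough illustration of that suggestion, here is a minimal sketch with both offloads pinned to card 0. The function name and the counter variable are hypothetical stand-ins, not taken from the IS source. A compiler without offload support ignores the pragmas and runs everything on the host, so the result is the same either way.

```c
/* Hypothetical helper: runs two consecutive offloads, both pinned to
 * coprocessor 0, with a host-side update in between. */
int run_two_pinned_offloads(void)
{
    int counter = 0;  /* stand-in for a function-scope passed_verification */

    /* First offload: counter is copied in, incremented, and copied back. */
    #pragma offload target(mic:0) inout(counter)
    {
        counter += 1;
    }

    counter += 10;  /* host-side update between the two offloads */

    /* Second offload: same card, so the updated host value is copied in. */
    #pragma offload target(mic:0) inout(counter)
    {
        counter += 1;
    }

    return counter;
}
```

If the inout transfers behave as expected, the function returns 12.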

If you have a single card but there’s OpenMP in play at an outer level, the variable may need to be declared shared.

If neither of these seems related then I would need more context in the form of a complete reproducing example to know whether you are hitting any sort of compiler or run-time defect or not.

Antonio_B_ · ‎03-02-2015

Dear Kevin,

thank you very much for your quick answer! My bad, I didn't attach a tarball with the code. The buggy version is attached to this email; please forgive me that the code is not cleaned up. The code can be built with ./compile.sh and run with ./is-offload. I am using a single Xeon Phi (I do not have plans to run on multiple Xeon Phis for now).

I tried adding shared(passed_verification) at the end of the #pragma offload directive, but nothing changed. Simply substituting inout with shared does not work either. By the way, I tried two-dimensional array transfers with the special notation from https://software.intel.com/en-us/articles/xeon-phi-coprocessor-data-transfer-array-of-pointers-using-language-extensions-for-offload and they are not working. Which is the first version of ICC that supports this feature?

thanks,
Antonio

Kevin_D_Intel · ‎03-02-2015

Thanks for the example code. I will have a look at that.

shared() is a #pragma omp clause, not a #pragma offload clause.
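
To make the clause placement concrete, here is a small sketch in which shared() sits on the enclosing OpenMP directive and the data movement stays on the offload directive. The function and variable names are hypothetical, and offloading once per iteration inside a critical section is only for illustration, not something you would want for performance. Without OpenMP or offload support the pragmas are ignored and the loop simply runs serially on the host.

```c
/* Hypothetical helper: increments a counter n times, once per iteration. */
int count_passes(int n)
{
    int passed = 0;  /* stand-in for passed_verification */
    int i;

    /* shared() is an OpenMP data-sharing clause, so it goes here. */
    #pragma omp parallel for shared(passed)
    for (i = 0; i < n; i++) {
        /* Serialize the updates so the threads do not race on passed. */
        #pragma omp critical
        {
            /* in/out/inout are the offload clauses that move the data. */
            #pragma offload target(mic:0) inout(passed)
            {
                passed += 1;
            }
        }
    }

    return passed;
}
```

Calling count_passes(8) should return 8 regardless of the number of threads.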

The data transfers for the array of pointers debuted in the Intel® Parallel Studio XE 2015 Composer Edition (15.0 compiler). The 14.0.3 you indicated using does not have this support.

Kevin_D_Intel · ‎03-03-2015

Thank you for the reproducer, that was very helpful. I confirmed there is an underlying defect in the 14.x compiler associated with the file-scope variable passed_verification. The optimizer (at -O2) mishandles the variable association inside the offload regions when the rank( 1 ) call is active in the first of those two offload constructs that call rank().
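
For anyone who wants to check a given compiler and optimization level for this kind of mismatch, a self-check along the following lines may help. It is only a sketch of the pattern, not the actual IS code: verification_count, bump() and offloads_agree() are hypothetical names, and the guard assumes __INTEL_OFFLOAD is defined when the Intel compiler builds with offload enabled. Other compilers ignore the pragmas and run the whole thing on the host.

```c
#include <stdint.h>

/* Assumption: __INTEL_OFFLOAD is defined when offload is enabled. */
#ifdef __INTEL_OFFLOAD
#define OFFLOAD_GLOBAL __attribute__((target(mic)))
#else
#define OFFLOAD_GLOBAL
#endif

/* Hypothetical file-scope stand-in for passed_verification. */
OFFLOAD_GLOBAL int verification_count = 0;

/* Hypothetical stand-in for rank(): updates the file-scope variable. */
OFFLOAD_GLOBAL void bump(void)
{
    verification_count++;
}

/* Returns 1 when both offload regions and the callee see one variable. */
int offloads_agree(void)
{
    uintptr_t addr_first = 0, addr_second = 0;
    int seen = 0;

    verification_count = 0;

    #pragma offload target(mic:0) inout(verification_count) out(addr_first)
    {
        bump();
        addr_first = (uintptr_t)&verification_count;
    }

    #pragma offload target(mic:0) inout(verification_count) out(addr_second) out(seen)
    {
        bump();
        seen = verification_count;  /* value as seen in the region itself */
        addr_second = (uintptr_t)&verification_count;
    }

    return addr_first == addr_second && seen == 2 && verification_count == 2;
}
```

A return value of 0 would point to the same symptom as in the original report: the offload region and the function it calls disagreeing about the variable.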

I confirmed the issue is fixed in our latest Parallel Studio XE 2015 Release (15.0 compiler).

I also confirmed the issue is avoidable by compiling at -O1 with the 14.0.3 compiler you currently have. You could continue using this compiler with -O1 while you explore adding offload, but given that this is a benchmark, upgrading to the latest Parallel Studio XE 2015 Update 2 release is advised for measuring optimized performance.

Our apologies for the defect, lost time, and confusion.