Details of Overhead and Data Copy Time

Amlesh_K_ · ‎08-19-2016

Hi,

Please consider the following piece of code -

program abcde
use omp_lib

integer thread_id
real*8 :: start, end_t, transfer_t, offload_t
integer , parameter :: m = 1000, n = 1000, k = 200
type phys
   integer , dimension(m,n) :: arr1
end type phys
type phys_copy
   integer , dimension(m,n) :: arr2
end type phys_copy

type(phys) :: array(k)
type(phys_copy) :: array_copy(k)

integer :: a, b, c, i, j, l

   do i = 1,k
      do j = 1,m
        do l = 1,n
          array(i)%arr1(j,l) = l
          array_copy(i)%arr2(j,l) = array(i)%arr1(j,l)
        end do
      end do
   end do

   do a = 1,3

   start = omp_get_wtime()
   !dir$ offload_transfer target(mic:0) in(array_copy)
   end_t = omp_get_wtime()

   print *,"Transfer Time = ",end_t - start

   start = omp_get_wtime()
    !dir$ offload begin target(mic:0) inout(array)

       do i = 1,k
         do j = 1,m
           do l = 1,n
              array(i)%arr1(j,l) = i
           end do
         end do
       end do

    !dir$ end offload
   end_t = omp_get_wtime()

   print *,"Offload Time = ",end_t - start

   end do

end program abcde

If I compile and run the above program, I receive the following output (say OUT1):

[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    31
[Offload] [MIC 0] [Tag]                     Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        4.330932(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        1.579959(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   0 (bytes)

 Transfer Time =    4.34344100952148
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    37
[Offload] [MIC 0] [Tag]                     Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        3.142083(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        2.027888(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   800000000 (bytes)

 Offload Time =    3.15102696418762
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    31
[Offload] [MIC 0] [Tag]                     Tag 2
[Offload] [HOST]  [Tag 2] [CPU Time]        1.792265(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time]        1.117805(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data]   0 (bytes)

 Transfer Time =    1.79741597175598
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    37
[Offload] [MIC 0] [Tag]                     Tag 3
[Offload] [HOST]  [Tag 3] [CPU Time]        2.583275(seconds)
[Offload] [MIC 0] [Tag 3] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 3] [MIC Time]        1.566910(seconds)
[Offload] [MIC 0] [Tag 3] [MIC->CPU Data]   800000000 (bytes)

 Offload Time =    2.59064912796021
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    31
[Offload] [MIC 0] [Tag]                     Tag 4
[Offload] [HOST]  [Tag 4] [CPU Time]        1.796799(seconds)
[Offload] [MIC 0] [Tag 4] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 4] [MIC Time]        1.119486(seconds)
[Offload] [MIC 0] [Tag 4] [MIC->CPU Data]   0 (bytes)

 Transfer Time =    1.80196404457092
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    37
[Offload] [MIC 0] [Tag]                     Tag 5
[Offload] [HOST]  [Tag 5] [CPU Time]        2.584165(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time]        1.569251(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data]   800000000 (bytes)

 Offload Time =    2.59154081344604

Please note that OFFLOAD_REPORT=2.

Now, when I comment out the code section inside the offload region, i.e., between the offload begin and end offload directives, I get the following output (say OUT2):

[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    31
[Offload] [MIC 0] [Tag]                     Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        4.117244(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        1.392524(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   0 (bytes)

 Transfer Time =    4.12900996208191
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    37
[Offload] [MIC 0] [Tag]                     Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        2.757605(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        1.644087(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   800000000 (bytes)

 Offload Time =    2.76546621322632
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    31
[Offload] [MIC 0] [Tag]                     Tag 2
[Offload] [HOST]  [Tag 2] [CPU Time]        1.591789(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time]        0.929934(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data]   0 (bytes)

 Transfer Time =    1.59638905525208
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    37
[Offload] [MIC 0] [Tag]                     Tag 3
[Offload] [HOST]  [Tag 3] [CPU Time]        2.176596(seconds)
[Offload] [MIC 0] [Tag 3] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 3] [MIC Time]        1.188246(seconds)
[Offload] [MIC 0] [Tag 3] [MIC->CPU Data]   800000000 (bytes)

 Offload Time =    2.18282485008240
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    31
[Offload] [MIC 0] [Tag]                     Tag 4
[Offload] [HOST]  [Tag 4] [CPU Time]        1.616834(seconds)
[Offload] [MIC 0] [Tag 4] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 4] [MIC Time]        0.952322(seconds)
[Offload] [MIC 0] [Tag 4] [MIC->CPU Data]   0 (bytes)

 Transfer Time =    1.62149286270142
[Offload] [MIC 0] [File]                    code1.f90
[Offload] [MIC 0] [Line]                    37
[Offload] [MIC 0] [Tag]                     Tag 5
[Offload] [HOST]  [Tag 5] [CPU Time]        2.192675(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data]   800000000 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time]        1.207339(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data]   800000000 (bytes)

 Offload Time =    2.19895482063293

Following are the queries -

1. What is MIC Time?

2. If MIC Time is what is described here, then why is it non-zero in places where it should be zero? (It should be zero everywhere in OUT2 and at the offload transfers in OUT1.)

3. In both OUT1 and OUT2, the MIC Time and CPU Time for the first offload transfer and the first offload are higher than for the subsequent ones (the later times seem to settle at a steady value). Why are they higher the first time? Is it just that the data copy somehow takes longer, or is there some other hidden overhead? If there is a hidden overhead, is it present in the subsequent offloads too, or only in the first? Also, what is the breakdown of the CPU Time (i.e., is it something like MIC Time + Data Copy Time + Hidden Overheads)?

4. Let's say my CPU has 16 cores. Which core will handle the above synchronous data copy? Also, if I use an asynchronous data copy, what will handle it? Is it one of these 16 cores? If yes, then it's not purely asynchronous, right? Please note that running the mpssinfo command gave the following output:

        System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-358.el6.x86_64
                Driver Version          : 3.2.1-1
                MPSS Version            : 3.2.1
                Host Physical Memory    : 132052 MB


Device No: 0, Device Name: mic0

        Version
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.2.1
                Device Serial Number     : ADKC32603320

Device No: 1, Device Name: mic1

        Version
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.2.1
                Device Serial Number     : ADKC32603344


Other details removed.

The answers to above questions will be extremely helpful and enhance my understanding further. Please let me know of any more details needed.

Thanks in advance.

jimdempseyatthecove · ‎08-19-2016

1) MIC time is the execution time of the code run inside the MIC. For any timing runs, you should loop your test code to run at least 3 times. The first offload target will experience a significant hit in transferring the code of the offload into the MIC (including Fortran runtime code, and loading of any necessary libXXX.so files). You can use the difference in execution times of the first offload, and the third to estimate startup overhead for your application. This may or may not be of interest to you ("is" if you only have one offload, "is not" if you have many offloads).

By the way, in Fortran, when you index Array(i, j), nest your loops so that the innermost loop runs over the leftmost index ("i" in this case) and the outermost loop runs over the rightmost index. In your sample code, with optimizations enabled, the compiler should have swapped the loop order (but do not rely on this).
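
A minimal sketch of that loop ordering, using a small stand-in for the derived type from the original program. The program name and the reduced sizes are illustrative assumptions. Fortran stores arrays in column-major order, so the innermost loop should run over the leftmost index (j here):

```fortran
program loop_order_sketch
  implicit none
  ! Small illustrative sizes; the original post uses m = n = 1000, k = 200.
  integer, parameter :: m = 4, n = 3, k = 2
  type phys
     integer, dimension(m,n) :: arr1
  end type phys
  type(phys) :: array(k)
  integer :: i, j, l

  do i = 1, k
     do l = 1, n          ! rightmost index: outer loop
        do j = 1, m       ! leftmost index: innermost loop, unit stride
           array(i)%arr1(j,l) = l
        end do
     end do
  end do

  ! Each column l of arr1 now holds the value l.
  print *, array(1)%arr1(:,1)
  print *, array(k)%arr1(:,n)
end program loop_order_sketch
```

Compared with the posted code, only the order of the j and l loops is swapped; the values stored are the same.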

2) I don't know why the time was non-zero. Run the test with 1 iteration (get timing), then with 2 iterations (get timing), then with 3 iterations (get timing). The time difference between the 1-iteration run and the 2-iteration run can be inferred as initialization overhead + initial run time versus secondary run time. The time difference between the 2-iteration run and the 3-iteration run can be inferred as the offload code execution time.

3) Any first offload transfer will incur the initialization overhead, including the first-touch overhead of mapping the process's virtual addresses (on the MIC) to physical addresses.

4) The data copy is not handled by any core. It is handled by the hardware using DMA transfers. There will be some MPSS overhead (on both host and MIC) in setting up the transfer as well as in handling the transfer-completion interrupts. On the MIC, this is usually performed by a pinned thread on the last core. On the host, I am not sure whether the MPSS (COI) management thread runs as a pinned thread (you can determine this by writing a test program).
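
For the asynchronous case asked about, a minimal sketch of overlapping host work with a transfer might look like the following. The `signal`/`wait` clauses and the `offload_wait` directive are assumed from the Intel LEO documentation and should be checked against your compiler version; the names `buf`, `sig` and `host_sum` are hypothetical. With a compiler that does not support LEO, the directives are plain comments and the program runs on the host only.

```fortran
program async_transfer_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: buf(n)
  integer :: sig        ! tag variable, used only to pair signal with wait
  integer :: i, host_sum

  buf = 1
  sig = 1

  ! Start the transfer and return to the host immediately (assumed LEO syntax).
  !dir$ offload_transfer target(mic:0) in(buf) signal(sig)

  ! Host work that can overlap with the DMA transfer.
  host_sum = 0
  do i = 1, n
     host_sum = host_sum + i
  end do

  ! Block until the transfer tagged with sig has completed.
  !dir$ offload_wait target(mic:0) wait(sig)

  print *, 'host_sum =', host_sum
end program async_transfer_sketch
```

The host thread still spends some time issuing the transfer and later waiting on it, which is the MPSS/COI overhead mentioned above; the bulk copy itself is what the DMA engine takes over.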

Jim Dempsey

Amlesh_K_ · ‎08-22-2016

Hi Jim,

Thanks for the above info.

Let's assume I have two Xeon Phi cards. Will the data transfers to both cards then run simultaneously using DMA? Also, will the same MPSS thread handle both data transfers (in which case there will be some serialized work)?

Can you please point me to some articles where I can read more about MPSS and DMA?

Thanks.

jimdempseyatthecove · ‎08-22-2016

https://software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf

Page 28 states that there are 8 DMA channels (KNC).

But please note, with multiple transfers (with or without multiple cards) you may saturate the PCI Express bus(es).
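
To judge how close a run is to saturating the bus, one can work out the effective transfer rate from the OFFLOAD_REPORT figures. The sketch below uses the Tag 0 and Tag 2 values from OUT1 in the first post; treating the Tag 2 CPU time as the steady-state transfer time is an assumption, and the program and variable names are illustrative.

```fortran
program transfer_rate_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Values copied from OUT1 in the first post.
  real(dp), parameter :: nbytes   = 800000000.0_dp  ! CPU->MIC data per transfer
  real(dp), parameter :: t_first  = 4.330932_dp     ! Tag 0 CPU time (s)
  real(dp), parameter :: t_steady = 1.792265_dp     ! Tag 2 CPU time (s)
  real(dp) :: startup, rate_mb

  ! One-off cost seen only on the first transfer.
  startup = t_first - t_steady
  ! Effective host-side rate in MB/s (1 MB = 10**6 bytes).
  rate_mb = nbytes / t_steady / 1.0e6_dp

  print '(a,f8.3,a)', 'Estimated startup overhead: ', startup, ' s'
  print '(a,f8.1,a)', 'Effective transfer rate:    ', rate_mb, ' MB/s'
end program transfer_rate_sketch
```

With these inputs the startup estimate comes to about 2.5 s and the effective rate to roughly 446 MB/s.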

Jim Dempsey

Amlesh_K_ · ‎08-25-2016

Thanks a lot Jim for the above information.

Amlesh_K_ · ‎08-25-2016

Hi,

Please consider the following -

I have a module variable in Fortran (the module has the keywords save and private), say "global_var" (this is statically allocated). I also have a variable local to a function, say "local_var" (on the stack). Now, to transfer "global_var" from the Xeon to the Xeon Phi, I need to mark it with !dir$ attributes offload : mic :: global_var, but I don't need to do any such thing with "local_var".

Transferring the "global_var" from Xeon to Xeon Phi incurs very small overheads (as statically allocated). Transferring "local_var" incurs high overheads (as it's a stack variable).

Now, when I mark this "local_var" as save (hence statically allocated), the overhead of transferring is increased. But, when I add !dir$ attributes offload : mic :: local_var (along with the save), the overhead of transferring is significantly reduced. (Please note that if I remove the save, the overhead is high again).
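
For reference, the three variants described above can be laid out as in the following sketch. It is a reconstruction from the description, not the actual code: the module name, the array size and the subroutine name are assumptions. With a compiler that does not support LEO, the directives are plain comments.

```fortran
module data_mod
  implicit none
  private
  save
  integer, parameter, public :: n = 1000
  ! Module (static) variable, marked for offload as described in the post.
  integer, public :: global_var(n)
  !dir$ attributes offload : mic :: global_var
end module data_mod

program variants_sketch
  use data_mod
  implicit none

  global_var = 1
  !dir$ offload_transfer target(mic:0) in(global_var)
  call transfer_local()
  print *, 'done'

contains

  subroutine transfer_local()
    ! Variant 1: plain stack variable          -> integer :: local_var(n)
    ! Variant 2: static, not marked            -> integer, save :: local_var(n)
    ! Variant 3: static and marked for offload -> as written below
    integer, save :: local_var(n)
    !dir$ attributes offload : mic :: local_var

    local_var = 2
    !dir$ offload_transfer target(mic:0) in(local_var)
  end subroutine transfer_local

end program variants_sketch
```

Swapping the declaration of `local_var` between the three variants, and timing the transfer in each case, reproduces the comparison described above.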

Please help me understand what's going on. I was following this article (mostly till "Persistence : Heap allocated data"), but it doesn't seem to be very helpful.

Thanks.

jimdempseyatthecove · ‎08-27-2016

I think what you may be observing is one (or both) of two things:

a) Memory alignment differences. Try aligning the variables that will transfer to 64 byte boundary, then 4096 byte boundary. (this is for the module vs SAVE)
b) "First touch" for heap/stack allocated data. This generally will happen on the first (few) iterations, so run your test code a few times in a loop to check the differences. Try alignment there too.
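
A minimal sketch of the alignment experiment, assuming the Intel-specific `!dir$ attributes align` directive (other compilers treat the line as a comment, so the printed remainders show whatever alignment the array happens to get). The array name and size are illustrative.

```fortran
program align_sketch
  use iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer, parameter :: n = 1000
  integer, target, save :: buf(n)
  ! Intel-specific alignment request; change 64 to 4096 for the second test.
  !dir$ attributes align : 64 :: buf
  integer(c_intptr_t) :: addr

  buf = 0
  ! Reinterpret the C address of buf as an integer.
  addr = transfer(c_loc(buf), addr)

  ! A remainder of 0 means buf starts on that boundary.
  print *, 'address mod 64   =', mod(addr, 64_c_intptr_t)
  print *, 'address mod 4096 =', mod(addr, 4096_c_intptr_t)
end program align_sketch
```

Printing the remainders confirms whether the requested alignment actually took effect before any transfer timings are compared.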

Good work on trying to pin down the circumstances.

Jim Dempsey

Amlesh_K_ · ‎09-14-2016

Hi Jim,

Sorry for replying late. Some issues with the compiler license.

I tried different alignments. They seem to make no difference to the timings.

First touch shouldn't be a problem, as I run multiple iterations.

So, the question still remains -

* With statically allocated variables, why do we need to mark them with "!dir$ attributes offload : mic" for faster transfer rates?

I am getting this issue in a different module also.

Thanks.

Paulius_V_ · ‎10-03-2016

What compiler are you using? I've gone from using LEO to using OpenMP 4.0 (which calls LEO directives behind the scenes), and with the latest compiler I've noticed that I no longer need to mark anything with offload attributes. It would be interesting to compare your LEO implementation with an OpenMP one.
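
For comparison, a minimal OpenMP 4.0 sketch of an offloaded loop with an inout-style copy could look like this. The array name and size are illustrative assumptions, and whether the region actually runs on the coprocessor depends on the compiler and its offload support; without OpenMP enabled, the `!$omp` lines are comments and the loop runs on the host.

```fortran
program omp_target_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: a(n), i

  a = 0

  ! map(tofrom: a) copies a to the device before the region and back after it,
  ! which plays the role of inout(a) in the LEO version.
  !$omp target map(tofrom: a)
  do i = 1, n
     a(i) = i
  end do
  !$omp end target

  print *, a(1), a(n)
end program omp_target_sketch
```

No offload attribute is attached to `a` here; the `map` clause alone describes the data movement.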