Asynchronous data send taking time

aketh_t_
Beginner
402 Views

Hi all, I want to send data to the MIC asynchronously.

I used the code below:

    start_time = omp_get_wtime()
    !dir$ offload_transfer target(mic:0) in(TRACER) signal(1)
    end_time = omp_get_wtime()

    print *, "time taken is ", end_time - start_time

TRACER here is a global variable marked target which has been imported from a different module. 

However, the time taken here is unusually high, approximately:

    time taken is   8.975481986999512E-002

I do not see why it takes so long to send data asynchronously. It should just be a signal, which should take no more than a few milliseconds, as I have seen entire arrays transferred in even less time.

Am I doing something wrong here?

 

4 Replies
Rajiv_D_Intel
Employee

1. I suspect that the time to initialize the device is being included in this timing. Do an empty offload before this one to factor out the one-time initialization cost. For example:

    !dir$ offload begin target(mic)
    !dir$ end offload

That takes care of initialization time.

2. Next, if variable TRACER is statically allocated in your program, then it will be sent through a dynamically allocated buffer. This buffer creation time will be included in your timing. You want to avoid that.

3. Your program doesn't actually measure transfer time, just the time to *initiate* the transfer. Be aware of that.

4. In Fortran it is best to allocate dynamic arrays beforehand on the device with alloc_if(.true.) free_if(.false.) and then do the transfer using alloc_if(.false.) free_if(.false.) reusing the device buffers previously created. This will give the best time. But again, be aware of the difference between "transfer time" and "transfer initiation time". Measuring transfer time when doing asynchronous offloads is not generally possible because you won't be able to capture a time value at the precise moment the transfer completes.
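
Points 1 to 4 can be sketched together as follows. This is a minimal sketch, not a definitive implementation: the array `buf`, its size `n`, and the tag variable `sig` are hypothetical stand-ins, and the standard `system_clock` intrinsic is used in place of `omp_get_wtime` so the program also builds without OpenMP. The second timestamp gives the initiation time only. The third, taken after `offload_wait`, gives the time until the host observes completion, which is an upper bound on the transfer rather than the exact moment it finishes.

```fortran
program async_transfer_timing
    use, intrinsic :: iso_fortran_env, only: real64, int64
    implicit none
    integer, parameter :: n = 1000000       ! hypothetical array size
    real(real64), allocatable :: buf(:)     ! hypothetical stand-in for the data to send
    integer :: sig                          ! tag shared by signal() and wait() (assumed integer tag)
    integer(int64) :: t0, t1, t2, rate

    allocate(buf(n))
    buf = 1.0_real64
    sig = 1

    ! Point 1: an empty offload pays the one-time device initialization cost up front.
    !dir$ offload begin target(mic:0)
    !dir$ end offload

    ! Point 4: create the device buffer once, without copying any data.
    !dir$ offload_transfer target(mic:0) nocopy(buf: alloc_if(.true.) free_if(.false.))

    call system_clock(t0, rate)

    ! Start the asynchronous copy into the existing device buffer.
    !dir$ offload_transfer target(mic:0) in(buf: alloc_if(.false.) free_if(.false.)) signal(sig)
    call system_clock(t1)

    ! Block until the transfer tagged by sig has completed.
    !dir$ offload_wait target(mic:0) wait(sig)
    call system_clock(t2)

    ! Point 3: the first number is only the time to initiate the transfer.
    print *, "initiation time (s):         ", real(t1 - t0, real64) / real(rate, real64)
    print *, "initiation + completion (s): ", real(t2 - t0, real64) / real(rate, real64)

    ! Release the device buffer before freeing the host copy.
    !dir$ offload_transfer target(mic:0) nocopy(buf: alloc_if(.false.) free_if(.true.))
    deallocate(buf)
end program async_transfer_timing
```

With a compiler that does not recognize the `!dir$` offload directives, they are treated as comments and both timings only measure host-side overhead.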

 

aketh_t_
Beginner

Hi, I think the time went down after I tried what you suggested.

However, it is still as high as 2*10^-2 seconds.

I need it to be as low as 5*10^-3 seconds for the asynchronous transfer. Any help?

Here is the code:

    if (flag == 1) then
        allocate(TRCR(nx_block,ny_block,km,nt))
        !dir$ offload_transfer target(mic:0) nocopy(TRCR: alloc_if(.TRUE.) free_if(.FALSE.))

        allocate(WORK(nx_block,ny_block,km), WORKF(nx_block,ny_block,km), &
                 WORK3(nx_block,ny_block,km), WORK4(nx_block,ny_block,km))
        !dir$ offload_transfer target(mic:0) nocopy(WORK: alloc_if(.TRUE.) free_if(.FALSE.))
        !dir$ offload_transfer target(mic:0) nocopy(WORKF: alloc_if(.TRUE.) free_if(.FALSE.))
        !dir$ offload_transfer target(mic:0) nocopy(WORK3: alloc_if(.TRUE.) free_if(.FALSE.))
        !dir$ offload_transfer target(mic:0) nocopy(WORK4: alloc_if(.TRUE.) free_if(.FALSE.))

        flag = 2
    endif

    !if(my_task == master_task)then

    TRCR = TRACER(:,:,:,:,curtime,1)
    start_time = omp_get_wtime()
    !dir$ offload target(mic:0) in(TRCR: alloc_if(.FALSE.) free_if(.FALSE.)) out(WORKF,WORK3,WORK4,WORK: alloc_if(.FALSE.) free_if(.FALSE.)) signal(1)
    call my_state_advt(TRCR(:,:,:,1), TRCR(:,:,:,2), &
        RHOFULL=WORKF, RHOOUT_WORK4=WORK4, RHOOUT_WORK3=WORK3, RHOOUT_WORK=WORK)
    !!dir$ end offload
    end_time = omp_get_wtime()

    !endif

    print *, end_time - start_time

 

aketh_t_
Beginner

The MIC offload report says this:

[Offload] [MIC 0] [Tag 72] [State]           Start target entry: __offload_entry_baroclinic_F90_587baroclinic_mp_baroclinic_driver_ifort0101596643955Ee9p3L
[Offload] [MIC 0] [Tag 72] [Var]             trcr  IN
[Offload] [MIC 0] [Tag 72] [Var]             trcr  IN
[Offload] [MIC 0] [Tag 72] [Var]             work  OUT
[Offload] [MIC 0] [Tag 72] [Var]             work  OUT
[Offload] [MIC 0] [Tag 72] [Var]             work4  OUT
[Offload] [MIC 0] [Tag 72] [Var]             work4  OUT
[Offload] [MIC 0] [Tag 72] [Var]             work3  OUT
[Offload] [MIC 0] [Tag 72] [Var]             work3  OUT
[Offload] [MIC 0] [Tag 72] [Var]             workf  OUT
[Offload] [MIC 0] [Tag 72] [Var]             workf  OUT
[Offload] [MIC 0] [Tag 72] [State]           Target->host copyout data   0

 

Rajiv_D_Intel
Employee

The offload will be done using async transfer of IN data, chained to an async compute, chained to an async transfer of OUT data.

However, the setup of the async data transfers involves programming the DMA channels, and that is done by the issuing thread. So the time taken to issue this offload will be proportional to the amount of data transferred IN and OUT.

Perhaps the amount of data is large?
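
To check this, the data volume per offload can be estimated from the array shapes in the posted code: one 4-D IN array (TRCR) and four 3-D OUT arrays (WORK, WORKF, WORK3, WORK4). The sketch below is only illustrative. The dimension values are hypothetical placeholders, and it assumes the arrays hold double-precision reals, which the thread does not state.

```fortran
program offload_volume
    use, intrinsic :: iso_fortran_env, only: real64, int64
    implicit none
    ! Hypothetical block dimensions; substitute the real values from the model.
    integer, parameter :: nx_block = 100, ny_block = 100, km = 60, nt = 2
    integer(int64) :: elem_bytes, in_bytes, out_bytes
    real(real64) :: sample   ! assumed element type; only its storage size is queried

    ! storage_size returns bits, so divide by 8 to get bytes per element.
    elem_bytes = storage_size(sample) / 8

    ! IN: the single 4-D array TRCR(nx_block, ny_block, km, nt).
    in_bytes = elem_bytes * nx_block * ny_block * km * nt

    ! OUT: four 3-D arrays of shape (nx_block, ny_block, km).
    out_bytes = 4 * elem_bytes * nx_block * ny_block * km

    print *, "IN  bytes per offload:", in_bytes
    print *, "OUT bytes per offload:", out_bytes
end program offload_volume
```

If these totals are large, a correspondingly large time to issue the offload would be consistent with the DMA setup being done on the issuing thread.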
