Hi all, I want to send data into the MIC asynchronously.
I used the code below.
start_time = omp_get_wtime()
!dir$ offload_transfer target(mic:0)in(TRACER)signal(1)
end_time = omp_get_wtime()
print *,"time taken is ",end_time - start_time
TRACER here is a global variable marked target, imported from a different module.
However, the time taken is unusually high:
time taken is 8.975481986999512E-002 (seconds, so roughly 90 ms)
I do not see why it takes so long to send data asynchronously. It should just be a signal, which should take no more than a few milliseconds; I have seen entire arrays transferred in less time than that.
Am I doing something wrong here?
1. I suspect that the time to initialize the device is being included in this timing. Do an empty offload before this one to factor out the one-time initialization cost. For example:
!dir$ offload begin target(mic)
!dir$ end offload
That takes care of initialization time.
2. Next, if variable TRACER is statically allocated in your program, then it will be sent through a dynamically allocated buffer. This buffer creation time will be included in your timing. You want to avoid that.
3. Your program doesn't actually measure transfer time, just the time to *initiate* the transfer. Be aware of that.
4. In Fortran it is best to allocate dynamic arrays beforehand on the device with alloc_if(.true.) free_if(.false.) and then do the transfer using alloc_if(.false.) free_if(.false.) reusing the device buffers previously created. This will give the best time. But again, be aware of the difference between "transfer time" and "transfer initiation time". Measuring transfer time when doing asynchronous offloads is not generally possible because you won't be able to capture a time value at the precise moment the transfer completes.
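The four points above can be put together in a small timing sketch. It is only an illustration under stated assumptions: the array `buf`, its size `n`, and the tag variable `tag` are hypothetical stand-ins for the real data, and the second timing is an upper bound on the transfer, because the host only learns that the transfer finished at some point after it actually completed. With a compiler that does not support the `!dir$ offload` directives, they are treated as comments and the program just runs on the host.

```fortran
program async_transfer_timing
  use omp_lib                       ! provides omp_get_wtime
  implicit none
  ! Hypothetical array and size, standing in for the real data.
  integer, parameter :: n = 1000000
  double precision, allocatable :: buf(:)
  integer :: tag                    ! hypothetical signal tag for the async transfer
  double precision :: t0, t1, t2

  allocate(buf(n))
  buf = 1.0d0
  tag = 1

  ! Point 1: an empty offload pays the one-time device initialization cost.
  !dir$ offload begin target(mic:0)
  !dir$ end offload

  ! Point 4: create the device buffer once, without copying any data.
  !dir$ offload_transfer target(mic:0) nocopy(buf: alloc_if(.true.) free_if(.false.))

  t0 = omp_get_wtime()
  ! Start the asynchronous transfer, reusing the existing device buffer.
  !dir$ offload_transfer target(mic:0) in(buf: alloc_if(.false.) free_if(.false.)) signal(tag)
  t1 = omp_get_wtime()              ! Point 3: this covers initiation only

  ! Block until the transfer associated with "tag" has completed.
  !dir$ offload_wait target(mic:0) wait(tag)
  t2 = omp_get_wtime()

  print *, "initiation time (s):          ", t1 - t0
  print *, "initiation + wait time (s):   ", t2 - t0

  ! Release the device buffer when it is no longer needed.
  !dir$ offload_transfer target(mic:0) nocopy(buf: alloc_if(.false.) free_if(.true.))
  deallocate(buf)
end program async_transfer_timing
```

Comparing the two printed values shows how much of the cost is paid by the issuing thread and how much is spent waiting for the transfer itself.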
Hi, the time did go down after I tried what you suggested.
However, it is still as high as 2*10^-2 s.
I need it to be as low as 5*10^-3 s for the asynchronous transfer. Any help?
Here is the code
if(flag == 1)then
  allocate(TRCR(nx_block,ny_block,km,nt))
  !dir$ offload_transfer target(mic:0) nocopy( TRCR:alloc_if(.TRUE.) free_if(.FALSE.) )
  allocate(WORK(nx_block,ny_block,km),WORKF(nx_block,ny_block,km),WORK3(nx_block,ny_block,km),WORK4(nx_block,ny_block,km))
  !dir$ offload_transfer target(mic:0) nocopy( WORK:alloc_if(.TRUE.)free_if(.FALSE.))
  !dir$ offload_transfer target(mic:0) nocopy( WORKF:alloc_if(.TRUE.)free_if(.FALSE.))
  !dir$ offload_transfer target(mic:0) nocopy( WORK3:alloc_if(.TRUE.)free_if(.FALSE.))
  !dir$ offload_transfer target(mic:0) nocopy( WORK4:alloc_if(.TRUE.)free_if(.FALSE.))
  flag = 2
endif
!if(my_task == master_task)then
TRCR = TRACER (:,:,:,:,curtime,1)
start_time = omp_get_wtime()
!dir$ offload target(mic:0)in(TRCR:alloc_if(.FALSE.) free_if(.FALSE.)) out(WORKF,WORK3,WORK4,WORK:alloc_if(.FALSE.) free_if(.FALSE.))signal(1)
call my_state_advt(TRCR(:,:,:,1),TRCR(:,:,:,2),&
  RHOFULL=WORKF,RHOOUT_WORK4=WORK4,RHOOUT_WORK3=WORK3,RHOOUT_WORK=WORK)
!!dir$ end offload
end_time = omp_get_wtime()
!endif
print *,end_time - start_time
The MIC offload report says this:
[Offload] [MIC 0] [Tag 72] [State] Start target entry: __offload_entry_baroclinic_F90_587baroclinic_mp_baroclinic_driver_ifort0101596643955Ee9p3L
[Offload] [MIC 0] [Tag 72] [Var] trcr IN
[Offload] [MIC 0] [Tag 72] [Var] trcr IN
[Offload] [MIC 0] [Tag 72] [Var] work OUT
[Offload] [MIC 0] [Tag 72] [Var] work OUT
[Offload] [MIC 0] [Tag 72] [Var] work4 OUT
[Offload] [MIC 0] [Tag 72] [Var] work4 OUT
[Offload] [MIC 0] [Tag 72] [Var] work3 OUT
[Offload] [MIC 0] [Tag 72] [Var] work3 OUT
[Offload] [MIC 0] [Tag 72] [Var] workf OUT
[Offload] [MIC 0] [Tag 72] [Var] workf OUT
[Offload] [MIC 0] [Tag 72] [State] Target->host copyout data 0
The offload will be done using async transfer of IN data, chained to an async compute, chained to an async transfer of OUT data.
However, the setup of the async data transfers involves programming the DMA channels, and that is done by the issuing thread. So the time taken to issue this offload will be proportional to the amount of data transferred IN and OUT.
Perhaps the amount of data is large?
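To check whether the amount of data is large, the bytes moved per offload can be estimated from the array shapes in the posted code. The sketch below is only a rough aid: the dimension values are hypothetical placeholders (the real `nx_block`, `ny_block`, `km` and `nt` are not given in the thread), and it assumes 8-byte reals.

```fortran
program offload_size_estimate
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integers for byte counts
  ! Hypothetical dimensions; substitute the real values from the model.
  integer, parameter :: nx_block = 100, ny_block = 100, km = 60, nt = 2
  ! Assumption: 8-byte reals; adjust if the arrays use a different kind.
  integer(i8), parameter :: bytes_per_elem = 8_i8
  integer(i8) :: in_bytes, out_bytes

  ! IN: TRCR(nx_block, ny_block, km, nt)
  in_bytes = int(nx_block, i8) * ny_block * km * nt * bytes_per_elem

  ! OUT: WORK, WORKF, WORK3 and WORK4, each (nx_block, ny_block, km)
  out_bytes = 4_i8 * nx_block * ny_block * km * bytes_per_elem

  print *, "IN  bytes per offload:", in_bytes
  print *, "OUT bytes per offload:", out_bytes
end program offload_size_estimate
```

If the issue time scales with these totals as the dimensions change, that would be consistent with the DMA setup cost described above.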