Using OFFLOAD_TRANSFER STATUS() and offload_status

holmz
New Contributor I

It amazes me when I see new stuff, which happened again today.
In the Fortran compiler it says under OFFLOAD:
 

use, intrinsic :: iso_c_binding
 
enum , bind (C)
 enumerator :: OFFLOAD_SUCCESS         = 0
 enumerator :: OFFLOAD_DISABLED        = 1  ! offload is disabled
 enumerator :: OFFLOAD_UNAVAILABLE     = 2  ! card is not available
 enumerator :: OFFLOAD_OUT_OF_MEMORY   = 3  ! not enough memory on device
 enumerator :: OFFLOAD_PROCESS_DIED    = 4  ! target process has died
 enumerator :: OFFLOAD_ERROR           = 5  ! unspecified error
end enum
 
type, bind (C) :: offload_status
 integer(kind=c_int) ::  result        = OFFLOAD_DISABLED   ! result, see enum above
 integer(kind=c_int) ::  device_number = -1  ! device number
 integer(kind=c_int) ::  data_sent     =  0  ! number of bytes sent to the target
 integer(kind=c_int) ::  data_received =  0  ! number of bytes received by host
end type offload_status

So I put the following into my code:

MODULE

TYPE(offload_status), PUBLIC, DIMENSION(60) :: MICSTATUS
!... more stuff
LOGICAL(KIND=4), PARAMETER :: Yes = .TRUE.
LOGICAL(KIND=4), PARAMETER :: No = .FALSE.
LOGICAL(KIND=4), PARAMETER :: Amy = No   !Or No no no
!... more stuff
END MODULE

In the main I have something like this:

!...
!DIR$ ALIGN:64 DataIn
REAL(KIND=4), DIMENSION(:,:), ALLOCATABLE :: DataIn

!...
ALLOCATE(DataIn(1024,60))
!...

! establish the allocation on the mic
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn: ALLOW_IF(YES) FREE_IF(NO) ) STATUS(MICStatus(1))
!DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(MICStatus(1))
WRITE(*,100) '100', 1, MICStatus(1).RESULT, MICStatus(1).DEVICE_NUMBER, MICStatus(1).DATA_SENT, MICStatus(1).DATA_RECEIVED
100 FORMAT(A,' MS(',I3,').RES=',I2,  ' Dev=',I3, ' Tx=',I15, ' Rx=',I15)

!...
!A bigger loop
DO I = 1, 60
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,I): ALLOW_IF(YES) FREE_IF(NO) ) STATUS(MICStatus(I))
!--- This stuff below was in a separate loop ...
!DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(MICStatus(I))
WRITE(*,100) '120',I,MICStatus(I).RESULT, MICStatus(I).DEVICE_NUMBER, MICStatus(I).DATA_SENT, MICStatus(I).DATA_RECEIVED
ENDDO
!End of a bigger loop
!...

! clean up the mic
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT(DataIn: ALLOW_IF(YES) FREE_IF(YES) ) STATUS(MICStatus(1))
!DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(MICStatus(1))
WRITE(*,100) '100', 1, MICStatus(1).RESULT, MICStatus(1).DEVICE_NUMBER, MICStatus(1).DATA_SENT, MICStatus(1).DATA_RECEIVED

DEALLOCATE(DataIn)

What I see is that only MICStatus(1) shows the results correctly.

SIZEOF(MICStatus(1)) is 24 bytes, whereas I was expecting 16 (4 x C_INT).
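
One possible explanation for the 24 bytes, offered as an assumption and not confirmed anywhere in this thread: if the compiler's real definition stores data_sent and data_received in a size_t-sized integer (c_size_t) instead of c_int, then on a host where c_int is 4 bytes and c_size_t is 8 bytes the type occupies 4 + 4 + 8 + 8 = 24 bytes. The sketch below defines two hypothetical local types (status_as_documented and status_with_size_t, neither is the compiler's own type) and prints their sizes with c_sizeof so the two layouts can be compared:

```fortran
program status_size_check
  use, intrinsic :: iso_c_binding
  implicit none

  ! Hypothetical type: the layout as quoted from the documentation (4 x c_int).
  type, bind(C) :: status_as_documented
    integer(kind=c_int) :: result
    integer(kind=c_int) :: device_number
    integer(kind=c_int) :: data_sent
    integer(kind=c_int) :: data_received
  end type status_as_documented

  ! Hypothetical type: ASSUMES the two byte counters are size_t-sized.
  type, bind(C) :: status_with_size_t
    integer(kind=c_int)    :: result
    integer(kind=c_int)    :: device_number
    integer(kind=c_size_t) :: data_sent
    integer(kind=c_size_t) :: data_received
  end type status_with_size_t

  type(status_as_documented) :: a
  type(status_with_size_t)   :: b

  ! c_sizeof reports the storage size in bytes of an interoperable entity.
  print '(A,I0)', 'four c_int fields        : ', c_sizeof(a)
  print '(A,I0)', 'two c_int + two c_size_t : ', c_sizeof(b)
end program status_size_check
```

If the second size matches what SIZEOF reports for the real type, that would suggest the byte counters are wider than the quoted documentation shows; it would not by itself explain why only MICStatus(1) is filled in.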

Then I tried doing the following:

MODULE

TYPE(offload_status), PUBLIC, DIMENSION(60) :: pMICSTATUS
TYPE(offload_status), PUBLIC         :: MICSTATUS1
TYPE(offload_status), PUBLIC         :: MICSTATUS2
TYPE(offload_status), PUBLIC         :: MICSTATUS3
TYPE(offload_status), PUBLIC         :: MICSTATUS4
...

!... more stuff
END MODULE

Followed by:

ALLOCATE(pMICSTATUS(60))

pMICSTATUS(1) => MICStatus1
pMICSTATUS(2) => MICStatus2

The last one failed as there are arguments for BIND(C) that require something... (??), and enumerator is a new one for me.

I just want to get the data moved to the mic and then start scheduling the work on the mic once the data is on it, so I need to know how to handle the status tags, either as an indexed array of the structure/type or with a pointer.

Rajiv_D_Intel
Employee

There appear to be several inconsistencies or typos in your code.

  1. The offload directive takes a modifier ALLOC_IF, but your example uses ALLOW_IF.
  2. The WAIT clause takes a signal as its argument, while your code is using a STATUS variable.

I don't think your code would compile and run as written. Can you provide the actual code?

 

holmz
New Contributor I

No, I cannot include the actual code, as there is no internet connection at work, and at home I have ifort on a Mac but there is no Xeon Phi available for a Mac. So I type it in from memory or from a piece of paper. (And my spelling is not too good.)

Basically, the first transfer sets up the allocation on the Phi.

The transfer in the loop moves the data onto the Phi asynchronously while the Phi should be doing work on existing data.

The last transfer releases the Phi memory.

The problem is the STATUS(MICSTATUS(J)) in the main loop.

It works if J = 1, or if I have a separate MICSTATUS# for each J value. However, it is not working with an array of MICStatus tags. It seems like it should be simple, but I am totally unfamiliar with ENUMERATOR and I do not often use BIND(C).

Frances_R_Intel
Employee

When you say the OFFLOAD_WAIT are in a separate loop, I hope what you mean is:

do big_loop=1,n
   do i=1,60
      start transfer
   enddo
   do i=1,60
      wait transfer
   enddo
enddo big_loop

and not

do i=1,60
   start transfer
   do j=1,n
      wait transfer
   enddo
enddo

You only get to wait once for each signal. If you meant the first, then what if you try:

DO BIG_LOOP=1,n
   DO I = 1, 60
      !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,I): ALLOC_IF(YES) FREE_IF(NO) ) STATUS(MICStatus) SIGNAL(I)
      !...check the status here – otherwise there is no point in putting the status clause on transfer
   ENDDO
   !... stuff happens
   DO I = 1, 60
      !DIR$ OFFLOAD_WAIT TARGET(mic:0) STATUS(MICStatus) WAIT(I)
      WRITE(*,100) '120',I,MICStatus.RESULT, MICStatus.DEVICE_NUMBER, MICStatus.DATA_SENT, MICStatus.DATA_RECEIVED
      !... more stuff happens
   ENDDO
ENDDO

I am only using one MICStatus variable. You process the offload directive, check the status result and move on.

I added a SIGNAL clause to the transfers - I'm not sure how you were getting asynchronous behavior without it - and set the tag for the SIGNAL and WAIT to the loop index. The important thing is that the integer value of the tag be the same for a matching signal/wait pair and different from every other signal/wait pair in use at the same time. This is one reason people often use the location of the data as the tag. But in this case, I am using 1 when transferring column 1; 2 when transferring column 2 and so on. The problem with using MICStatus for the signal or wait tag is that it is not an integer (although the first element is) and the value is not different for different signal/wait pairs.
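
The point about distinct integer tags can be sketched as a small self-contained program. This is only an illustration under stated assumptions: the tag array Sig is a hypothetical name, the directive syntax follows what is written in this thread (with ALLOC_IF spelled as Rajiv pointed out), and it has not been run against a coprocessor. A compiler that does not recognise the !DIR$ lines treats them as comments, so the program still builds and runs on the host alone.

```fortran
program signal_tag_sketch
  implicit none
  integer, parameter :: nrows = 1024, ncols = 60
  real(kind=4), allocatable :: DataIn(:,:)
  integer :: Sig(ncols)   ! hypothetical tag array: one distinct integer per transfer
  integer :: i

  allocate(DataIn(nrows, ncols))
  DataIn = 1.0

  ! Every tag must be initialised and unique among transfers in flight.
  Sig = [(i, i = 1, ncols)]

  ! Allocate the array on the card once and keep it there.
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn: ALLOC_IF(.TRUE.) FREE_IF(.FALSE.))

  ! Start one asynchronous transfer per column, each with its own tag.
  do i = 1, ncols
    !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,i): ALLOC_IF(.FALSE.) FREE_IF(.FALSE.)) SIGNAL(Sig(i))
  end do

  ! Wait exactly once for each tag that was signalled.
  do i = 1, ncols
    !DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(Sig(i))
    print '(A,I0,A,I0)', 'waited for column ', i, ' using tag ', Sig(i)
  end do

  ! Release the memory on the card.
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT(DataIn: ALLOC_IF(.FALSE.) FREE_IF(.TRUE.))
  deallocate(DataIn)
end program signal_tag_sketch
```

Leaving Sig uninitialised would give the tags whatever values happen to be in memory, so several transfers could share a tag or carry a meaningless one.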

I don't think I would do the allocate or free asynchronously; but if I did I would probably use a 1 for the signal/wait tag for those since there is no overlap with other asynchronous offload operations.

Finally, rather than start up all the transfers at once, I think I would try to keep just one ahead of where I wanted to be. In other words:

start transfer of first column
do i = 1,60
   if i not equal 60 start the transfer of the next column
   wait for previous column to finish transfer
   do some work
enddo

I haven't actually tried this out but that is what I would do if I did.
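
The pseudocode above could look roughly like the following in Fortran. It is a sketch under assumptions, untested on real hardware: the host-side SUM stands in for the real offloaded work, the variable names next and total are hypothetical, and the directive syntax follows the earlier examples in this thread. Without an offload-capable compiler the directives are plain comments and the loop simply runs on the host.

```fortran
program pipeline_sketch
  implicit none
  integer, parameter :: nrows = 1024, ncols = 60
  real(kind=4), allocatable :: DataIn(:,:)
  real(kind=4) :: total
  integer :: i, next

  allocate(DataIn(nrows, ncols))
  DataIn = 1.0
  total = 0.0

  ! One-time allocation on the card (as in the thread, this also copies the data).
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn: ALLOC_IF(.TRUE.) FREE_IF(.FALSE.))

  ! Start the transfer of the first column; the tag is the column index.
  next = 1
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,next): ALLOC_IF(.FALSE.) FREE_IF(.FALSE.)) SIGNAL(next)

  do i = 1, ncols
    ! Stay one column ahead: start the next transfer before waiting.
    if (i /= ncols) then
      next = i + 1
      !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,next): ALLOC_IF(.FALSE.) FREE_IF(.FALSE.)) SIGNAL(next)
    end if

    ! Wait for column i, whose transfer was tagged with the value i.
    !DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(i)

    ! Placeholder for the real work, which would be offloaded to the card.
    total = total + sum(DataIn(:, i))
  end do

  ! Release the memory on the card.
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT(DataIn: ALLOC_IF(.FALSE.) FREE_IF(.TRUE.))

  print '(A,F10.1)', 'sum over all columns: ', total
  deallocate(DataIn)
end program pipeline_sketch
```

Only two tags are ever in flight at once here, which keeps the signal/wait bookkeeping simple compared with launching all 60 transfers up front.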

holmz
New Contributor I

Yes Frances - your first part was exactly what I meant.

I see I forgot the ENDDO and subsequent DO.

Today I saw that the TARGET(MIC:0) or Target(MIC:1) always points to mic:0.

I cannot really transfer to the core #1, or #2 etc on the first mic device. Only the first Mic which is (mic:0).

So with that I am getting the impression that the SIGnal and the STATUS may really only be one per mic card.

A colleague, who is a long ways away, claims to get threaded transfer rates of ~30GB/sec.

I am now getting (I am not sure) either 750 MB of transfer/sec or 750 MB of 8 byte transfer/sec. I am 99% sure it is the former, as I am summing MICStatus.DATA_SENT and seeing an extra 80 bytes per transfer.
Basically an 8 byte variable array that is 1M long... +80 bytes for God and Intel know what.

So my first step is to know what I transfer onto the MIC, then what I can transfer on/off (duplex should be the same), and then what the processing is taking.

I appreciate your help Ms FR,
Cheers,
RH

Frances_R_Intel
Employee

I am not sure what you mean when you say "Today I saw that the TARGET(MIC:0) or Target(MIC:1) always points to mic:0." Are you saying you have two coprocessor cards, both of which are up and running but you can only use the first coprocessor card?

You say "I cannot really transfer to the core #1, or #2 etc on the first mic device. Only the first Mic which is (mic:0)." When you use any of the offload directives, you are offloading work to the coprocessor card, not to individual cores. Which cores get used depends on a number of things, including any affinity settings you used. Your program should use as many threads on as many cores as it can get useful work out of - which might or might not be all the cores and all the threads on each core you use.

You say "I am getting the impression that the SIGnal and the STATUS may really only be one per mic card". You may use multiple signals in a single process offloading to a single coprocessor card, as long as the integer value of the tag you use is different in each case; the integer value of the tag is the key used to track signals. The name of the variable holding that tag is irrelevant. As for the status option, think of it as you would an IOSTAT parameter on a Fortran open, read, write or close statement. It applies to the individual offload directive. When the directive returns control to the host processor, the status variable has been set to whatever it is going to be set to. You can check the status value returned, then reuse the status variable in another offload directive.
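
The IOSTAT analogy can be sketched as follows: one status variable, checked right after each directive and then reused. This is a minimal illustration under assumptions. The enum and the offload_status type are a local copy of the definitions quoted at the top of the thread so that the example is self-contained; with the Intel compiler one would presumably use the compiler-provided definition instead. The helper report is a hypothetical name. With a compiler that ignores the directives, st simply keeps its default value of OFFLOAD_DISABLED.

```fortran
program status_check_sketch
  use, intrinsic :: iso_c_binding
  implicit none

  ! Local copy of the definitions quoted earlier in the thread (ASSUMPTION:
  ! the real ones come from the compiler and may differ in detail).
  enum, bind(C)
    enumerator :: OFFLOAD_SUCCESS       = 0
    enumerator :: OFFLOAD_DISABLED      = 1
    enumerator :: OFFLOAD_UNAVAILABLE   = 2
    enumerator :: OFFLOAD_OUT_OF_MEMORY = 3
    enumerator :: OFFLOAD_PROCESS_DIED  = 4
    enumerator :: OFFLOAD_ERROR         = 5
  end enum

  type, bind(C) :: offload_status
    integer(kind=c_int) :: result        = OFFLOAD_DISABLED
    integer(kind=c_int) :: device_number = -1
    integer(kind=c_int) :: data_sent     = 0
    integer(kind=c_int) :: data_received = 0
  end type offload_status

  type(offload_status) :: st   ! a single status variable, reused for every directive
  real(kind=4), allocatable :: DataIn(:,:)

  allocate(DataIn(1024, 60))
  DataIn = 0.0

  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn: ALLOC_IF(.TRUE.) FREE_IF(.FALSE.)) STATUS(st)
  call report('allocate and send', st)

  ! The same variable is reused; its previous contents were already inspected.
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT(DataIn: ALLOC_IF(.FALSE.) FREE_IF(.TRUE.)) STATUS(st)
  call report('receive and free', st)

  deallocate(DataIn)

contains

  ! Hypothetical helper: inspect a status the way one would inspect IOSTAT.
  subroutine report(label, s)
    character(len=*), intent(in)     :: label
    type(offload_status), intent(in) :: s
    if (s%result == OFFLOAD_SUCCESS) then
      print '(A,A,I0,A,I0)', label, ': ok, sent=', s%data_sent, ' received=', s%data_received
    else
      print '(A,A,I0)', label, ': offload did not succeed, result=', s%result
    end if
  end subroutine report
end program status_check_sketch
```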

As far as timing, I would suggest you use OFFLOAD_REPORT to get more detailed information. You can find directions in Intel's Fortran reference manual. And, as I said before, it would be better to overlap the data transfer with offloaded work rather than overlap multiple data transfers, as I showed in the last bit of pseudocode.

holmz
New Contributor I

Hi Frances,

What I mean is that I can get just under 1 GByte/sec of DMA to the mic... ~950 MB/second.
I believe it all goes through mic:0 (or the 1st mic, which is the PCIe address of the 1st or zeroth mic). So how much transfer rate should one expect?

So I am not sure if/how I can get a higher transfer rate. I am <currently> transferring into a buffer in 1M sample chunks with a complex(kind=4) size. I will generally be doing real(kind=4) or complex(kind=4), so 4 or 8 bytes.

I solved the problem in my mind last week after reading your post about "different in each case"... and I thought, "did I initialise it?"
INTEGER(KIND=4), DIMENSION(60) :: Sig !?????

So I added INTEGER(KIND=4), DIMENSION(60) :: SIG = (/1,2,3,4,... 60/) !Yes !!!

Which solved the main issue ;( ... <hanging head in shame>

I am pondering your pseudo code....

The last question is that I am transferring 1M of COMPLEX(KIND=4) and I get 8M + 80 bytes. What are those 80 bytes?

Thanks,
Randal
