It amazes me when I see new stuff, which happened again today.
In the Fortran compiler it says under OFFLOAD:
use, intrinsic :: iso_c_binding

enum, bind(C)
    enumerator :: OFFLOAD_SUCCESS = 0
    enumerator :: OFFLOAD_DISABLED = 1       ! offload is disabled
    enumerator :: OFFLOAD_UNAVAILABLE = 2    ! card is not available
    enumerator :: OFFLOAD_OUT_OF_MEMORY = 3  ! not enough memory on device
    enumerator :: OFFLOAD_PROCESS_DIED = 4   ! target process has died
    enumerator :: OFFLOAD_ERROR = 5          ! unspecified error
end enum

type, bind(C) :: offload_status
    integer(kind=c_int) :: result = OFFLOAD_DISABLED  ! result, see enum above
    integer(kind=c_int) :: device_number = -1         ! device number
    integer(kind=c_int) :: data_sent = 0              ! number of bytes sent to the target
    integer(kind=c_int) :: data_received = 0          ! number of bytes received by host
end type offload_status
So I poked into my code the following:
MODULE
    TYPE(offload_status), PUBLIC, DIMENSION(60) :: MICSTATUS
    !... more stuff
    LOGICAL(KIND=4), PARAMETER :: Yes = .TRUE.
    LOGICAL(KIND=4), PARAMETER :: No = .FALSE.
    LOGICAL(KIND=4), PARAMETER :: Amy = No !Or No no no
    !... more stuff
END MODULE
In the main I have something like this:
!...
!DIR$ ALIGN:64 DataIn
REAL(KIND=4), DIMENSION(:,:), ALLOCATABLE :: DataIn
!...
ALLOCATE(DataIn(1024,60))
!...
! establish the allocation on the mic
!DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn: ALLOW_IF(YES) FREE_IF(NO) ) STATUS(MICStatus(1))
!DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(MICStatus(1))
WRITE(*,100) '100', 1, MICStatus(1).RESULT, MICStatus(1).DEVICE, MICStatus(1).DATA_SENT, MICStatus(1).DATA_RECEIVED
100 FORMAT(A,' MS(',I3,').RES=',I2, ' Dev=',I3, ' Tx=',I15, ' Rx=',I15)
!...
!A bigger loop
DO I = 1, 60
    !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,I): ALLOW_IF(YES) FREE_IF(NO) ) STATUS(MICStatus(I))
    !--- This stuff below was in a separate loop ...
    !DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(MICStatus(I))
    WRITE(*,100) '120',I,MICStatus(I).RESULT, MICStatus(I).DEVICE, MICStatus(I).DATA_SENT, MICStatus(I).DATA_RECEIVED
ENDDO !End of a bigger loop
!...
! clean up the mic
!DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT(DataIn: ALLOW_IF(YES) FREE_IF(YES) ) STATUS(MICStatus(1))
!DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(MICStatus(1))
WRITE(*,100) '100', 1, MICStatus(1).RESULT, MICStatus(1).DEVICE, MICStatus(1).DATA_SENT, MICStatus(1).DATA_RECEIVED
100 FORMAT(A,' MS(',I3,').RES=',I2, ' Dev=',I3, ' Tx=',I15, ' Rx=',I15)
DEALLOCATE(DataIn)
What I see is that only MICStatus(1) is showing the results correctly.
The sizeof(Status(1)) is 24 bytes, which I was expecting to be 16 (which is 4x C_INT).
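On the 24-versus-16 byte observation, a minimal sketch may help when checking layouts. It compares the four-`c_int` layout quoted above with a hypothetical variant in which the two byte counters are `c_size_t`. That variant is only an assumption about one way a 24-byte size could arise; nothing in this thread confirms what the installed definition actually contains. The module and type names are made up for illustration.

```fortran
module status_layout_demo
  use, intrinsic :: iso_c_binding, only: c_int, c_size_t, c_sizeof
  implicit none

  ! Layout exactly as quoted in the post: four c_int fields.
  type, bind(C) :: status_four_int
    integer(kind=c_int) :: result = 1
    integer(kind=c_int) :: device_number = -1
    integer(kind=c_int) :: data_sent = 0
    integer(kind=c_int) :: data_received = 0
  end type status_four_int

  ! Hypothetical variant (an assumption, not taken from the thread):
  ! the two byte counters are stored as c_size_t instead of c_int.
  type, bind(C) :: status_size_t_counts
    integer(kind=c_int)    :: result = 1
    integer(kind=c_int)    :: device_number = -1
    integer(kind=c_size_t) :: data_sent = 0
    integer(kind=c_size_t) :: data_received = 0
  end type status_size_t_counts
end module status_layout_demo

program show_status_sizes
  use status_layout_demo
  implicit none
  type(status_four_int)      :: a
  type(status_size_t_counts) :: b

  ! c_sizeof reports the size in bytes of an interoperable object.
  print '(a,i0)', 'four c_int fields, bytes:      ', c_sizeof(a)
  print '(a,i0)', 'two c_int + two c_size_t, bytes: ', c_sizeof(b)
end program show_status_sizes
```

On a typical 64-bit platform, where `c_int` is 4 bytes and `c_size_t` is 8, the first size would be 16 and the second 24. Running the same `c_sizeof` check on the real `offload_status` type is the way to see which layout the compiler is using.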
Then I tried doing the following:
MODULE
    TYPE(offload_status), PUBLIC, DIMENSION(60) :: pSTATUS
    TYPE(offload_status), PUBLIC :: MICSTATUS1
    TYPE(offload_status), PUBLIC :: MICSTATUS2
    TYPE(offload_status), PUBLIC :: MICSTATUS3
    TYPE(offload_status), PUBLIC :: MICSTATUS4
    ...
    !... more stuff
END MODULE
Followed by:
ALLOCATE(pMICSTATUS(60))
pMICSTATUS(1) => MICStatus1
pMICSTATUS(2) => MICStatus2
The last one failed as there are arguments for BIND(C) that require something... (??), and enumerator is a new one for me.
I just want to get the data moved to the mic and then start scheduling the work on the mic as soon as the data is on it, so I need to know how to handle the status tags as an indexed array structure/type or with a pointer.
There appear to be several inconsistencies or typos in your code.
- The offload directive takes a modifier ALLOC_IF, but your example uses ALLOW_IF.
- The WAIT clause takes a signal as its argument, while your code is using a STATUS variable.
I don't think your code would compile and run as written. Can you provide the actual code?
No, I cannot include the actual code, as there is no internet connection at work, and at home I have ifort on a Mac but there is no Xeon Phi available for a Mac. So I poke it in from memory or a piece of paper. (And my spelling is not too good.)
Basically, the first transfer sets up the allocation on the Phi.
The transfer in the loop moves the data onto the Phi asynchronously while the Phi should be doing work on existing data.
The last transfer releases the Phi memory.
The problem is the STATUS(MICSTATUS(J)) in the main loop.
If J = 1 it works, or if I have separate MICSTATUS# for each J value. However it is not working with an array of MICStatus tags. It seems like it should be simple, but I am totally unfamiliar with ENUMERATOR and I do not often use BIND(C) .
When you say the OFFLOAD_WAIT are in a separate loop, I hope what you mean is:
do big_loop=1,n
    do i=1,60
        start transfer
    enddo
    do i=1,60
        wait transfer
    enddo
enddo big_loop
and not
do i=1,60
    start transfer
    do j=1,n
        wait transfer
    enddo
enddo
You only get to wait once for each signal. If you meant the first, then what if you try:
DO BIG_LOOP=1,n
    DO I = 1, 60
        !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DataIn(:,I): ALLOC_IF(YES) FREE_IF(NO) ) STATUS(MICStatus) SIGNAL(I)
        !...check the status here - otherwise there is no point in putting the status clause on transfer
    ENDDO
    !... stuff happens
    DO I = 1, 60
        !DIR$ OFFLOAD_WAIT TARGET(mic:0) STATUS(MICStatus) WAIT(I)
        WRITE(*,100) '120',I,MICStatus.RESULT, MICStatus.DEVICE_NUMBER, MICStatus.DATA_SENT, MICStatus.DATA_RECEIVED
        !... more stuff happens
    ENDDO
ENDDO
I am only using one MICStatus variable. You process the offload directive, check the status result and move on.
I added a SIGNAL clause to the transfers - I'm not sure how you were getting asynchronous behavior without it - and set the tag for the SIGNAL and WAIT to the loop index. The important thing is that the integer value of the tag be the same for a matching signal/wait pair and different from every other signal/wait pair in use at the same time. This is one reason people often use the location of the data as the tag. But in this case, I am using 1 when transferring column 1; 2 when transferring column 2 and so on. The problem with using MICStatus for the signal or wait tag is that it is not an integer (although the first element is) and the value is not different for different signal/wait pairs.
I don't think I would do the allocate or free asynchronously; but if I did I would probably use a 1 for the signal/wait tag for those since there is no overlap with other asynchronous offload operations.
Finally, rather than start up all the transfers at once, I think I would try to keep just one ahead of where I wanted to be. In other words:
start transfer of first column
do i = 1,60
    if i not equal 60 start the transfer of the next column
    wait for previous column to finish transfer
    do some work
enddo
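The pseudocode above could be fleshed out roughly as follows. This is only a sketch and, like the pseudocode, has not been tried on a coprocessor: the names `buf`, `next` and `total` are made up, the `NOCOPY` allocate/free steps are an assumption about how the buffer would be set up and released, and the "work" is a host-side placeholder rather than an offloaded region. Compilers without offload support treat the `!DIR$` lines as comments, so the loop structure can still be compiled and stepped through.

```fortran
program pipeline_sketch
  implicit none
  integer, parameter :: nrows = 1024, ncols = 60
  real(kind=4), allocatable :: buf(:,:)
  real(kind=4) :: total
  integer :: i, next

  allocate(buf(nrows, ncols))
  buf = 1.0
  total = 0.0

  ! Allocate space on the card once, synchronously, and keep it (assumed setup step).
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) NOCOPY(buf : ALLOC_IF(.TRUE.) FREE_IF(.FALSE.))

  ! Start the transfer of the first column; the tag value is the column number.
  next = 1
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(buf(:,next) : ALLOC_IF(.FALSE.) FREE_IF(.FALSE.)) SIGNAL(next)

  do i = 1, ncols
    ! Keep exactly one transfer ahead: start column i+1 before waiting on column i.
    if (i /= ncols) then
      next = i + 1
      !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(buf(:,next) : ALLOC_IF(.FALSE.) FREE_IF(.FALSE.)) SIGNAL(next)
    end if

    ! Wait once, and only once, for the signal that belongs to column i.
    !DIR$ OFFLOAD_WAIT TARGET(mic:0) WAIT(i)

    ! Placeholder for the real work on column i (done on the host in this sketch).
    total = total + sum(buf(:, i))
  end do

  ! Release the card memory (assumed cleanup step).
  !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) NOCOPY(buf : ALLOC_IF(.FALSE.) FREE_IF(.TRUE.))

  print '(a,f0.1)', 'sum over all columns: ', total
  deallocate(buf)
end program pipeline_sketch
```

Each signal/wait pair uses a distinct integer value (the column index), and at most two transfers are outstanding at any time, which keeps the tag bookkeeping simple.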
I haven't actually tried this out but that is what I would do if I did.
Yes Francis - your first part was exactly what I meant.
I see I forgot the ENDDO and subsequent DO.
Today I saw that the TARGET(MIC:0) or Target(MIC:1) always points to mic:0.
I cannot really transfer to the core #1, or #2 etc on the first mic device. Only the first Mic which is (mic:0).
So with that I am getting the impression that the SIGnal and the STATUS may really only be one per mic card.
A colleague, who is a long ways away, claims to get threaded transfer rates of ~30GB/sec.
What I am getting now (I am not sure) is either 750 MB of transfer/sec, or 750 M of 8-byte transfers/sec. I am 99% sure it is the former, as I am summing the MICStatus.DATA_SENT and seeing an extra 80 bytes per transfer.
Basically an 8-byte variable array that is 1M long... +80 bytes for God and Intel know what.
So my first step is to know what I transfer onto the MIC, then what I can transfer on/off (duplex should be the same), and then know how long the processing is taking.
I appreciate your help Ms FR,
Cheers,
RH
I am not sure what you mean when you say "Today I saw that the TARGET(MIC:0) or Target(MIC:1) always points to mic:0." Are you saying you have two coprocessor cards, both of which are up and running but you can only use the first coprocessor card?
You say "I cannot really transfer to the core #1, or #2 etc on the first mic device. Only the first Mic which is (mic:0)." When you use any of the offload directives, you are offloading work to the coprocessor card, not to individual cores. Which cores get used depends on a number of things, including any affinity settings you used. Your program should use as many threads on as many cores as it can get useful work out of - which might or might not be all the cores and all the threads on each core you use.
You say "I am getting the impression that the SIGnal and the STATUS may really only be one per mic card". You may use multiple signals in a single process offloading to a single coprocessor card, as long as the integer value of the tag you use is different in each case; the integer value of the tag is the key used to track signals. The name of the variable holding that tag is irrelevant. As for the status option, think of it as you would an IOSTAT parameter on a Fortran open, read, write or close statement. It applies to the individual offload directive. When the directive returns control to the host processor, the status variable has been set to whatever it is going to be set to. You can check the status value returned, then reuse the status variable in another offload directive.
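The IOSTAT analogy can be made concrete with plain Fortran I/O. The sketch below reuses one integer status variable across several statements, checking it after each one, which is the same check-then-reuse pattern described for the STATUS clause. The file name is hypothetical and is assumed not to exist; no offload directives are involved.

```fortran
program iostat_reuse
  implicit none
  integer :: ios, unit_no

  ! First statement: opening a file that is assumed not to exist should fail,
  ! and ios is set by this statement alone.
  open(newunit=unit_no, file='no_such_file_for_demo.dat', status='old', &
       action='read', iostat=ios)
  if (ios /= 0) then
    print '(a,i0)', 'open (old file) failed, iostat = ', ios
  else
    close(unit_no)
  end if

  ! Second statement: the same variable is simply reused. Its old value is
  ! overwritten with the outcome of this open.
  open(newunit=unit_no, status='scratch', iostat=ios)
  if (ios == 0) then
    print '(a)', 'open (scratch) succeeded, iostat = 0'
    close(unit_no, iostat=ios)
  else
    print '(a,i0)', 'open (scratch) failed, iostat = ', ios
  end if
end program iostat_reuse
```

Nothing ties `ios` to a particular unit or file; it only reports on the statement that last set it, which is why one status variable is enough if it is checked before the next statement overwrites it.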
As far as timing, I would suggest you use OFFLOAD_REPORT to get more detailed information. You can find directions in Intel's Fortran reference manual. And, as I said before, it would be better to overlap the data transfer with offloaded work rather than overlap multiple data transfers, as I showed in the last bit of pseudocode.
Hi Francis,
What I mean is that I can get just under 1 GByte/sec of DMA to the mic... ~950 MB/second.
I believe it all goes through mic:0 (or the 1st mic which is the pcie address of the 1st or zeroth mic). So how much transfer rate should one expect?
So I am not sure if/how I get a higher transfer rate? I am <currently> transferring into a buffer in 1M sample chunks with a complex(kind=4) size. I will generally be doing real(kind=4) or complex(kind=4), so 4 or 8 bytes.
The answer to the problem I solved in my mind last week after reading your post about "different in each case"... And I thought, "Did I initialise it?"
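For reference, the arithmetic behind these chunk sizes can be written out. The sketch below assumes "1M" means 2**20 samples (with 10**6 the payload would be 8,000,000 bytes instead) and takes "just under 1 GByte/sec" as 950 MB/s; both are assumptions for illustration, and the variable names are made up.

```fortran
program transfer_arithmetic
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64), parameter :: samples = 1048576_int64     ! assumption: 1M = 2**20
  real(real64),   parameter :: rate_bytes_per_s = 950.0e6_real64  ! assumed rate
  complex(kind=4) :: z
  integer(int64)  :: bytes_per_sample, payload_bytes
  real(real64)    :: ms_per_chunk

  ! storage_size returns bits; a complex(kind=4) holds two 4-byte reals.
  bytes_per_sample = storage_size(z) / 8
  payload_bytes    = samples * bytes_per_sample
  ms_per_chunk     = 1000.0_real64 * real(payload_bytes, real64) / rate_bytes_per_s

  print '(a,i0)',     'bytes per sample:        ', bytes_per_sample
  print '(a,i0)',     'payload bytes per chunk: ', payload_bytes
  print '(a,f0.2,a)', 'time per chunk at the assumed rate: ', ms_per_chunk, ' ms'
end program transfer_arithmetic
```

Under these assumptions one chunk is 8,388,608 bytes and takes a little under 9 ms to move, so comparing the summed DATA_SENT against the product of chunk count and payload size gives a direct check on what is actually being transferred.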
INTEGER(KIND=4), DIMENSION(60) :: Sig !?????
So I added INTEGER(KIND=4), DIMENSION(60) :: SIG = (/1,2,3,4,... 60/) !Yes !!!
Which solved the main issue ;( ... <hanging head in shame>
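A compact way to write that kind of initialisation, assuming the tags are simply 1 through 60, is an implied-do array constructor. The sketch below also checks that every tag value is distinct, since the value is what identifies a signal/wait pair. The names `ntags` and `all_distinct` are made up for illustration.

```fortran
program init_signal_tags
  implicit none
  integer, parameter :: ntags = 60
  integer :: i, j, k
  ! Implied-do constructor: sig(1) = 1, sig(2) = 2, ..., sig(60) = 60.
  integer(kind=4), dimension(ntags) :: sig = [(k, k = 1, ntags)]
  logical :: all_distinct

  ! Verify that no two tags share a value.
  all_distinct = .true.
  do i = 1, ntags - 1
    do j = i + 1, ntags
      if (sig(i) == sig(j)) all_distinct = .false.
    end do
  end do

  print '(a,i0,a,i0)', 'first tag = ', sig(1), ', last tag = ', sig(ntags)
  print '(a,l1)', 'all tags distinct: ', all_distinct
end program init_signal_tags
```

An uninitialised tag array may hold arbitrary or repeated values, so several transfers can end up sharing one tag value; filling the array explicitly removes that possibility.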
I am pondering your pseudo code....
The last question is that I am transferring 1M of COMPLEX(KIND=4) and I get 8M + 80 bytes. What are those 80 bytes?
Thanks,
Randal