F90 SIMD/Parallel FFT

holmz · ‎08-10-2015

I found some related posts and links such as https://software.intel.com/en-us/node/507041#EEACAA31-C805-484F-B4AE-D8F49CEDD83D

However nothing that jumped out.

Most of what I do results in executables that run on a single core, and I have enough executables linked together that I end up with 'all' the stuff running most of the cores, such that they are relatively evenly loaded. However I also have some cases where the machine is idle and I want to run a single executable across all the cores. Additionally I also got access to a XEON-Phi. My previous efforts have mostly taken advantage of structuring code such that the vectorization report allows me to restructure to get plenty of automatic optimization without having to dig into OpenMP to extensively... But the times 'they are a changing'.

My goals are:

1) To get the code running the FFTs across multiple cores

2) To port to a MIC (XEON-Phi) which I have available to me, and use that to then move some other code to the MIC.

Here is where I am with a simple library (Since I am poking this in by hand, there may be typos!)...:

MODULE RHVECTMATH
  USE omp_lib
  IMPLICIT NONE
  PUBLIC :: RH$VSADD, RH$VSUM, 

  CONTAINS
...
...
  SUBROUTINE RH$RFFT(SIG, nFFT, FFT_Type)
  !DEC$ ATTRIBUTES DEFAULT, REFERENCE, ALIAS:'rh$rfft_'::rh$rfft
  USE MKL_DFTI
  IMPLICIT NONE
  REAL(KIND=4), DIMENSION(nFFT), INTENT(INOUT) :: Sig
  INTEGER(KIND=4)                             , INTENT(IN   ) :: nFFT
  INTEGER(KIND=4)                             , INTENT(IN   ) :: FFT_Type

  INTEGER(KIND=4)                                 :: Entires = 0
  INTEGER(KIND=4)                                 :: Status
  REAL(KIND=4), DIMENSION(nFFT)   :: SigCF
  COMPLEX(KIND=4), DIMENSION(nFFT)   :: SigCF
  LOGICAL(KIND=4)                                 :: Local_IO = .TRUE.

 TYPE(DFTI_DESCRIPTOR), POINTER :: My_Dec1_Handle
 Entries = Entries + 1
 Status = DftiCreateDescriptor(MyDesc1_Handle, DFTI_SINGLE, DFTI_REAL, 1 nFFT)
  IF(Local_IO == .TRUE. .AND. Entries <= 2) WRITE(11,*)'Threads:114  Status=',Status !<- SigDump  without
 Status = DftiSetValue(MyDesc1_Handle, DFTI_NOT_INPLACE)
!no  !$OMP PARALLEL DO SIMD PRIVATE(Entries) SHARED(Sig, SigSF)
 Status = DftiComputeForward(MyDesc1_Handle, Sig, SigSF)
!no  !$OMP END PARALLEL

 !$OMP PARALLEL DO  SHARED(Sig, SigSF)
Sig = SigSF
 !$OMP END PARALLEL

RETURN
END SUBROUTINE RH$RFFT

...
END MODULE RHVECTORMATH

Without the write at ^line #26^ it sigdumps whether I compile with -O0 or -O3.

Question #1: is what should I be doing differently with the MKL to avoid the sigdump without the write?

The library build is as follows:

DO_RH_DBG = -nod-lines
MIC = -D__MIC__ -mmic
NOMIC = -D__NOMIC__
MICOFFLOAD = -D__MIC__ -D__INTEL_OFFLOAD -offload-attribute-target=mic
F90_NO_WARN = ifort -132 ${DO_RH_DBG} -ip -inline-min-size=10 -inline-max-size=40000 inline-max-per-routine=100000 inline-max-total-size=2000 -inline-max-per-compile=500000 -recursive -fPIC -gen-interfaces source
F90_LINK = ifort -w1 -g -debug all -check-all -recursive -static-libcxa

RHINKS = -I$(<a path>)/inc
RHINKS3 = -I$(<a path>)/lib/rhvec
RHINKS4 = -I$(<another path>)/inc

LIB = ../../librhvec.a

rhobjects = threads.o mkl_dfti.o

all: $(LIB)

mkl_dfti.o: mkl_dfti.f90
   $F90_NO_WARN -O3 $(RHINCS) $(RHINCS4) -align -cpp -openmp -openmp-report2 -axSSE,AVX $NO_MIC -diag-file=mkl_dfti.compile_txt -opt-report 3 -opt-report-file=mkl_dfti.opt_rpt -Winline -vec-report3 -vec-guard-write $(RHINKS) -o $< -0 $@
...
threads.o: threads.f90 mkl_dfti.o
   $F90_NO_WARN -O3 $(RHINCS) $(RHINCS4) -align -cpp -openmp -openmp-report2 -axSSE,AVX $NO_MIC -diag-file=threads.compile_txt -opt-report 3 -opt-report-file=threads.opt_rpt -Winline -vec-report3 -vec-guard-write $(RHINKS) -o $< -0 $@
...


$(LIB): $(rhobjects)

  ar sr $(LIB) $(rhobjects)

...

The library compiles with only the warning:

ifort: command line warning #10152: option '-axSSE,AVX' not supported.

(This machine is using composer_xe_2013 and ifort is pointing to 14.1.106)

So question #2 is in which phase does the -axSSE get used? I am assuming it is not in preprocessor or linking, so it should be in the creation of the .o file?? From the documentation it appears that the -axSSE should work?? However none of the single -ax<> seem to work...

The calling code looks like:

!... reading in the data from file...
!buffer = Sig (N)
...
get = nFFT
CALL GRAB(<datasource>, Sig, get, got)
if(got == get)
Sig1 = Sig(
...
!DIR$ prefetch sig1
CALL RH$RFFT(Sig1, nFFT, 3)
endif
...
<more code>

The MKL implementation is giving the same results, but is 1/2 the speed of what was getting using .f77 code, and appears to be saturating a core.

Question #3:
I am assuming to run across many cores I will need something like the following... Or are there some MKL switches that enable the many cores to used?:

!... reading in the data from file...
!buffer = Sig (N * N_Cores)
...
get = nFFT * <N_Cores>
CALL GRAB(<datasource>, Sig, get, got)
...
!$OMP PARALLEL DO SIMD SHARED(SIG), PRIVATE(nFFT)
DO I = 1, N_Cores
  Start = ((I-1)*nFFT) + 1
  End = (I*nFFT)
  CALL RH$RFFT(Sig(Start:End), nFFT, 3)
ENDDO
!$OMP END PARALLEL DO
endif
...

Also it appears that there is no library or API for calling IPP FFT modules from Fortran??

Question #4:

What is the fastest way to run FFTs using the entire host (MKL, IPP)? (Are there any examples of .c++ interface wrappers?)

Am I correct in preferring OpenMP for spreading the processing across the cores? - Or are there MKL switches to achieve this?

Regardless of whether I am heading to the MIC/XEON-Phi, I would want the non-MIC machines to poke along faster that I am doing with the .f77 code. And understanding this should help me work through a totally different problem which I want to put onto the MIC.

Any insights appreciated.

TimP · ‎08-11-2015

Some of the questions seem to belong on the linux Fortran forum, but I'll take a start.

Your version of ifort is far too recent to have support for Pentium-III SSE. You must be looking at very old documentation. If you would set -axAVX, it would generate both SSE2 and AVX code paths, where it sees an advantage in AVX. Of course, when you set -mmic, the -ax option isn't valid.

I guess you are using the 32-bit ifort. If you are planning to use MIC, you will need to get used to the 64-bit "intel64" compilers. I don't think there is good support for MIC on host platforms which don't have AVX.

The option -align isn't much use without a specification, e.g. -align array32byte (or array64byte for MIC).

If you are looking to spread a single MKL function across all cores, the automatic threading built into MKL does that more effectively than use of OpenMP in your source code. Calling MKL inside omp parallel region puts the MKL function into single thread by default; the assumption is you will be specifying parallel operations in your source code. It may be quite difficult to get effective use of MIC if your MKL functions work on data sets which aren't of the size where MIC works best, but running multiple single thread MKL functions in parallel is not likely to be effective. The Automatic Offload feature of MKL is meant to ease the task of deciding when to use MIC.

Ying_H_Intel · ‎08-12-2015

Hi holmz,

Your issue seem related several questions. I try answer some of them.

1) Right, if possible, please try the latest version and 64bit on Xeon phi machine.

2) ) mic sample and how to build

MKL should provide some mic sample for example,

/opt/intel/composer_xe_2013/mkl/examples/mic_offload/dftf/source.

It provide build command. Could you please try them first and see if they can run ok on your machine.

3) the code itsefl

Status = DftiCreateDescriptor(MyDesc1_Handle, DFTI_SINGLE, DFTI_REAL, 1 , nFFT)

IF(Local_IO == .TRUE. .AND. Entries <= 2) WRITE(11,*)'Threads:114 Status=',Status !<- SigDump without

there is , between 1 and nFFT. Please check if it is missed?

After create, there should be commit operation. like if (0 == status) status = DftiCommitDescriptor( It seems messed in the code. Could you please add it before dft computing.

4) FFT in IPP and MKL , we have article to explain this.

https://software.intel.com/en-us/articles/mkl-ipp-choosing-an-fft/ Basically, i recommend MKL on Xeon and Xeon phi.

5) OpenMP thread and FFT thread. if you do in-place FFT, the code

	
				!$OMP PARALLEL DO  SHARED(Sig, SigSF)

	
				33
				Sig = SigSF

		
				34
				 !$OMP END PARALLEL
			
don't need,right?

there is some explaination in MKL manual, like Examples of Using Multi-Threading for FFT Computation
Using Parallel Mode with a Common Descriptor
C code for the example is as follows:
#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
// 4 OMP threads, each does 2D FFT 50x100 points
MKL_Complex8 x[4][50][100];
int nth = ARRAY_LEN(x);
MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
DFTI_DESCRIPTOR_HANDLE FFT;
int th;
DftiCreateDescriptor (&FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
DftiCommitDescriptor (FFT);
// assume x is initialized and do 2D FFTs
#pragma omp parallel for shared(FFT, x)
for (th = 0; th < nth; th++)
DftiComputeForward (FFT, x[th]);
DftiFreeDescriptor (&FFT);
return 0;

could you please package all of your code and makefile. so we can investigate the threading issue directly?

Best Regards,

Ying