- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I found some related posts and links such as https://software.intel.com/en-us/node/507041#EEACAA31-C805-484F-B4AE-D8F49CEDD83D
However nothing that jumped out.
Most of what I do results in executables that run on a single core, and I have enough executables linked together that I end up with 'all' the stuff running most of the cores, such that they are relatively evenly loaded. However I also have some cases where the machine is idle and I want to run a single executable across all the cores. Additionally I also got access to a XEON-Phi. My previous efforts have mostly taken advantage of structuring code such that the vectorization report allows me to restructure to get plenty of automatic optimization without having to dig into OpenMP to extensively... But the times 'they are a changing'.
My goals are:
1) To get the code running the FFTs across multiple cores
2) To port to a MIC (XEON-Phi) which I have available to me, and use that to then move some other code to the MIC.
Here is where I am with a simple library (Since I am poking this in by hand, there may be typos!)...:
MODULE RHVECTMATH USE omp_lib IMPLICIT NONE PUBLIC :: RH$VSADD, RH$VSUM, CONTAINS ... ... SUBROUTINE RH$RFFT(SIG, nFFT, FFT_Type) !DEC$ ATTRIBUTES DEFAULT, REFERENCE, ALIAS:'rh$rfft_'::rh$rfft USE MKL_DFTI IMPLICIT NONE REAL(KIND=4), DIMENSION(nFFT), INTENT(INOUT) :: Sig INTEGER(KIND=4) , INTENT(IN ) :: nFFT INTEGER(KIND=4) , INTENT(IN ) :: FFT_Type INTEGER(KIND=4) :: Entires = 0 INTEGER(KIND=4) :: Status REAL(KIND=4), DIMENSION(nFFT) :: SigCF COMPLEX(KIND=4), DIMENSION(nFFT) :: SigCF LOGICAL(KIND=4) :: Local_IO = .TRUE. TYPE(DFTI_DESCRIPTOR), POINTER :: My_Dec1_Handle Entries = Entries + 1 Status = DftiCreateDescriptor(MyDesc1_Handle, DFTI_SINGLE, DFTI_REAL, 1 nFFT) IF(Local_IO == .TRUE. .AND. Entries <= 2) WRITE(11,*)'Threads:114 Status=',Status !<- SigDump without Status = DftiSetValue(MyDesc1_Handle, DFTI_NOT_INPLACE) !no !$OMP PARALLEL DO SIMD PRIVATE(Entries) SHARED(Sig, SigSF) Status = DftiComputeForward(MyDesc1_Handle, Sig, SigSF) !no !$OMP END PARALLEL !$OMP PARALLEL DO SHARED(Sig, SigSF) Sig = SigSF !$OMP END PARALLEL RETURN END SUBROUTINE RH$RFFT ... END MODULE RHVECTORMATH
Without the write at ^line #26^ it sigdumps whether I compile with -O0 or -O3.
Question #1: is what should I be doing differently with the MKL to avoid the sigdump without the write?
The library build is as follows:
DO_RH_DBG = -nod-lines MIC = -D__MIC__ -mmic NOMIC = -D__NOMIC__ MICOFFLOAD = -D__MIC__ -D__INTEL_OFFLOAD -offload-attribute-target=mic F90_NO_WARN = ifort -132 ${DO_RH_DBG} -ip -inline-min-size=10 -inline-max-size=40000 inline-max-per-routine=100000 inline-max-total-size=2000 -inline-max-per-compile=500000 -recursive -fPIC -gen-interfaces source F90_LINK = ifort -w1 -g -debug all -check-all -recursive -static-libcxa RHINKS = -I$(<a path>)/inc RHINKS3 = -I$(<a path>)/lib/rhvec RHINKS4 = -I$(<another path>)/inc LIB = ../../librhvec.a rhobjects = threads.o mkl_dfti.o all: $(LIB) mkl_dfti.o: mkl_dfti.f90 $F90_NO_WARN -O3 $(RHINCS) $(RHINCS4) -align -cpp -openmp -openmp-report2 -axSSE,AVX $NO_MIC -diag-file=mkl_dfti.compile_txt -opt-report 3 -opt-report-file=mkl_dfti.opt_rpt -Winline -vec-report3 -vec-guard-write $(RHINKS) -o $< -0 $@ ... threads.o: threads.f90 mkl_dfti.o $F90_NO_WARN -O3 $(RHINCS) $(RHINCS4) -align -cpp -openmp -openmp-report2 -axSSE,AVX $NO_MIC -diag-file=threads.compile_txt -opt-report 3 -opt-report-file=threads.opt_rpt -Winline -vec-report3 -vec-guard-write $(RHINKS) -o $< -0 $@ ... $(LIB): $(rhobjects) ar sr $(LIB) $(rhobjects)
...
The library compiles with only the warning:
ifort: command line warning #10152: option '-axSSE,AVX' not supported.
(This machine is using composer_xe_2013 and ifort is pointing to 14.1.106)
So question #2 is in which phase does the -axSSE get used? I am assuming it is not in preprocessor or linking, so it should be in the creation of the .o file?? From the documentation it appears that the -axSSE should work?? However none of the single -ax<> seem to work...
The calling code looks like:
!... reading in the data from file... !buffer = Sig (N) ... get = nFFT CALL GRAB(<datasource>, Sig, get, got) if(got == get) Sig1 = Sig( ... !DIR$ prefetch sig1 CALL RH$RFFT(Sig1, nFFT, 3) endif ... <more code>
The MKL implementation is giving the same results, but is 1/2 the speed of what was getting using .f77 code, and appears to be saturating a core.
Question #3:
I am assuming to run across many cores I will need something like the following... Or are there some MKL switches that enable the many cores to used?:
!... reading in the data from file... !buffer = Sig (N * N_Cores) ... get = nFFT * <N_Cores> CALL GRAB(<datasource>, Sig, get, got) ... !$OMP PARALLEL DO SIMD SHARED(SIG), PRIVATE(nFFT) DO I = 1, N_Cores Start = ((I-1)*nFFT) + 1 End = (I*nFFT) CALL RH$RFFT(Sig(Start:End), nFFT, 3) ENDDO !$OMP END PARALLEL DO endif ...
Also it appears that there is no library or API for calling IPP FFT modules from Fortran??
Question #4:
What is the fastest way to run FFTs using the entire host (MKL, IPP)? (Are there any examples of .c++ interface wrappers?)
Am I correct in preferring OpenMP for spreading the processing across the cores? - Or are there MKL switches to achieve this?
Regardless of whether I am heading to the MIC/XEON-Phi, I would want the non-MIC machines to poke along faster that I am doing with the .f77 code. And understanding this should help me work through a totally different problem which I want to put onto the MIC.
Any insights appreciated.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Some of the questions seem to belong on the linux Fortran forum, but I'll take a start.
Your version of ifort is far too recent to have support for Pentium-III SSE. You must be looking at very old documentation. If you would set -axAVX, it would generate both SSE2 and AVX code paths, where it sees an advantage in AVX. Of course, when you set -mmic, the -ax option isn't valid.
I guess you are using the 32-bit ifort. If you are planning to use MIC, you will need to get used to the 64-bit "intel64" compilers. I don't think there is good support for MIC on host platforms which don't have AVX.
The option -align isn't much use without a specification, e.g. -align array32byte (or array64byte for MIC).
If you are looking to spread a single MKL function across all cores, the automatic threading built into MKL does that more effectively than use of OpenMP in your source code. Calling MKL inside omp parallel region puts the MKL function into single thread by default; the assumption is you will be specifying parallel operations in your source code. It may be quite difficult to get effective use of MIC if your MKL functions work on data sets which aren't of the size where MIC works best, but running multiple single thread MKL functions in parallel is not likely to be effective. The Automatic Offload feature of MKL is meant to ease the task of deciding when to use MIC.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi holmz,
Your issue seem related several questions. I try answer some of them.
1) Right, if possible, please try the latest version and 64bit on Xeon phi machine.
2) ) mic sample and how to build
MKL should provide some mic sample for example,
/opt/intel/composer_xe_2013/mkl/examples/mic_offload/dftf/source.
It provide build command. Could you please try them first and see if they can run ok on your machine.
3) the code itsefl
Status
= DftiCreateDescriptor(MyDesc1_Handle, DFTI_SINGLE, DFTI_REAL, 1 , nFFT)
IF
(Local_IO == .TRUE. .AND. Entries <= 2)
WRITE
(11,*)
'Threads:114 Status='
,
Status
!<- SigDump without
there is , between 1 and nFFT. Please check if it is missed?
After create, there should be commit operation. like if (0 == status) status = DftiCommitDescriptor( It seems messed in the code. Could you please add it before dft computing.
4) FFT in IPP and MKL , we have article to explain this.
https://software.intel.com/en-us/articles/mkl-ipp-choosing-an-fft/ Basically, i recommend MKL on Xeon and Xeon phi.
5) OpenMP thread and FFT thread. if you do in-place FFT, the code
!$OMP PARALLEL DO SHARED(Sig, SigSF) |
33 |
Sig = SigSF |
34 |
!$OMP END PARALLEL |
don't need,right?
there is some explaination in MKL manual, like Examples of Using Multi-Threading for FFT Computation
Using Parallel Mode with a Common Descriptor
C code for the example is as follows:
#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
// 4 OMP threads, each does 2D FFT 50x100 points
MKL_Complex8 x[4][50][100];
int nth = ARRAY_LEN(x);
MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
DFTI_DESCRIPTOR_HANDLE FFT;
int th;
DftiCreateDescriptor (&FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
DftiCommitDescriptor (FFT);
// assume x is initialized and do 2D FFTs
#pragma omp parallel for shared(FFT, x)
for (th = 0; th < nth; th++)
DftiComputeForward (FFT, x[th]);
DftiFreeDescriptor (&FFT);
return 0;
could you please package all of your code and makefile. so we can investigate the threading issue directly?
Best Regards,
Ying

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page