Beginner

MKL issue on MAC OS

I recently moved to a Mac Pro (2 x 2.26 GHz quad-core Intel Xeon, 16 GB DDR3).

I tried to run code that I had previously tested on a dual-core Intel PC running 32-bit Windows XP.

I noticed that the code has a serious problem on the Mac when it calls an MKL routine (djacobi). In fact, the code is significantly slower than on the Windows machine and does not give the same results.

The routine in question is djacobi, which evaluates the Jacobian matrix of a user-supplied function.

I suspect the problem is related to the compile and/or link options I used on the two machines.

The options I used in each case are listed below:

WINDOWS XP 32 bit (I'm using Visual Studio, so this is the command line generated by VS)

/nologo /debug:full /Od /heap-arrays0 /I"D:\Programmi\Intel\Compiler\11.0\066\fortran\mkl\include\ " /I"D:\Programmi\VNI\imsl\fnl600\IA32\include\STATIC\\" /gen-interfaces /warn:interfaces /module:"Debug\\" /object:"Debug\\" /traceback /check:bounds /libs:static /threads /dbglibs /c

------------------

MAC OS 64 bit (this is the makefile, since I'm using the command line)

INCLUDE=-I/opt/intel/Compiler/11.1/076/Frameworks/mkl/include/em64t/lp64
MKLINCLUDE=-I/opt/intel/Compiler/11.1/076/Frameworks/mkl/include
LIBDIR=/opt/intel/Compiler/11.1/076/lib/
LIBMKL=/opt/intel/Compiler/11.1/076/Frameworks/mkl/lib/em64t/
FC = ifort
FFLAGS1 =-O2 -heap-arrays -warn interfaces -check bounds -threads
#----------------------------------------------
TARGET:
@echo ".... Compiling DFTI MKL"
$(FC) /opt/intel/Compiler/11.1/076/Frameworks/mkl/include/mkl_dfti.f90 -c
@echo ".... Compiling NUFFT"
$(FC) $(FFLAGS1) Nufft.f90 -c
@echo ".... Compiling "
$(FC) $(FFLAGS1) Module1_FLAT.F90 Modulo_2.F90 SINTESI_MAIN.F90 -c
$(FC) $(FFLAGS1) -o sintesi.out *.o -L${LIBDIR} -L${LIBMKL} ${LIBMKL}/libmkl_intel_lp64.a ${LIBMKL}/libmkl_intel_thread.a ${LIBMKL}/libmkl_solver_lp64.a ${LIBMKL}/libmkl_core.a ${LIBDIR}libiomp5.a -lpthread
#----------------------------------------------

where the main program is SINTESI_MAIN.F90, which uses three modules: MODULE1_FLAT, Modulo_2 and NUFFT.f90 (which uses mkl_dfti.f90).

Thanks

Clodxp

13 Replies
Black Belt

/Od on Windows ia32 compiler invokes x87 extra precision to some extent. If the difference is associated with the difference between -O2 and -Od (-O0 for Mac), you may need double precision somewhere, or you may want some options to disable non-standard optimizations; e.g.

-assume protect_parens

If the question is associated with MKL, the MKL forum is more suitable.
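
To illustrate why an option such as -assume protect_parens can change results: floating-point addition is not associative, so a compiler that regroups an expression may produce a different value. The Python sketch below uses made-up values purely for illustration; it does not reproduce anything from the program discussed in this thread.

```python
# Floating-point addition is not associative: grouping changes the result.
# The values are hypothetical, chosen only to make the effect visible.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # a and b cancel first, then 1.0 is added
right = a + (b + c)   # 1.0 is lost when added to -1e16, then everything cancels

print(left)   # 1.0
print(right)  # 0.0
```

If an optimizer is allowed to regroup, either value could come out of the same source expression.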

Beginner

Hi Tim,

thank you for your answer.

Following your suggestions, I compiled with each of the following options separately:

-O0

-O2 -r8

-O2 -assume protect_parens

-O0 -assume protect_parens

Unfortunately, this did not solve the problem.

At the moment I'm trying to identify the variables involved in the gradient evaluation that might need double precision (up to now I have been using real variables instead of double, except in the routine called by djacobi, where I use double precision).
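
As a rough illustration of why mixed precision matters for a finite-difference gradient: if the perturbed input is copied into a single-precision variable, a small step can be rounded away entirely. The sketch below is Python, not the poster's Fortran; the value 3.5 and the step 1e-7 are assumptions chosen for illustration, and `to_f32` is a hypothetical helper that emulates storing a value in a single-precision real.

```python
import struct

def to_f32(x: float) -> float:
    """Round a double to the nearest IEEE single-precision value and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 3.5      # assumed magnitude of one unknown
eps = 1e-7   # assumed finite-difference step

# In double precision the perturbation survives.
step_double = (x + eps) - x
print(step_double)   # about 1e-7

# After conversion to single precision the perturbed and unperturbed
# values are identical, so any difference f(x + eps) - f(x) would be zero.
step_single = to_f32(x + eps) - to_f32(x)
print(step_single)   # 0.0
```

Near 3.5 the spacing between adjacent single-precision numbers is about 2.4e-7, so a step of 1e-7 falls below half a spacing and disappears.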

Thanks

C

Beginner


Are there any suggestions regarding this topic?
Thank you
Clodxp

Moderator

Clodxp,
quote:"In fact, the code is significantly slower than on the Windows machine and does not give the same results."
Do you mean that you get different results on Win64?
If so, I don't think the compiler options would affect the result in that way.
How can we check the problem on our side?
Can you please give us a reproducible test case?
--Gennady


Beginner

Dear Gennady,
thank you for your answer.
I have not tried it on Win64, but I can run a test on 64-bit Linux.
I'll let you know asap.
Moreover, I'll try to extract a smaller test code and upload it to the forum.
Thank you
Clodxp
Beginner

I may have found the cause of the malfunction.
It is related to multithreading. When I set OMP_NUM_THREADS=1 the program works fine on the Mac, whereas when I set OMP_NUM_THREADS>1 I get NaN in the function of interest, as discussed in the initial post.
On the other hand, with OMP_NUM_THREADS>1 and MKL_NUM_THREADS=1 the code works fine.
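
One way to make this kind of comparison repeatable is to run the same executable under each combination of the two environment variables and compare the outputs. The Python sketch below is only a hypothetical harness: the thread counts 1 and 4 are assumptions, `thread_settings` and `run_all` are illustrative names, and `./sintesi.out` is the executable name from the makefile in the first post.

```python
import itertools
import os
import subprocess

def thread_settings(omp_values=(1, 4), mkl_values=(1, 4)):
    """Return one environment dict per (OMP_NUM_THREADS, MKL_NUM_THREADS) pair."""
    envs = []
    for omp, mkl in itertools.product(omp_values, mkl_values):
        env = dict(os.environ)
        env["OMP_NUM_THREADS"] = str(omp)
        env["MKL_NUM_THREADS"] = str(mkl)
        envs.append(env)
    return envs

def run_all(executable="./sintesi.out"):
    """Run the executable once per setting and collect its standard output."""
    results = {}
    for env in thread_settings():
        key = (env["OMP_NUM_THREADS"], env["MKL_NUM_THREADS"])
        proc = subprocess.run([executable], env=env,
                              capture_output=True, text=True)
        results[key] = proc.stdout
    return results

if __name__ == "__main__":
    # Only attempt the runs if the executable is actually present.
    if os.path.exists("./sintesi.out"):
        for key, out in run_all().items():
            print("OMP, MKL =", key)
            print(out)
```

Running each setting more than once would also show whether a given configuration is deterministic.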
Thank you
Clodxp
Moderator

Hello Clodxp,
That's very strange. I don't understand how that could be. How can we check the problem on our side?
Can you upload the test?
Regards, Gennady
Beginner

Dear Gennady,
thank you for your answer.
Unfortunately the code is very long and requires a lot of data to run. I'll try to extract the relevant part and test it so that I can upload it.
In the meantime I can explain how the relevant part works. djacobi has to evaluate the Jacobian of an external function F. F evaluates the distance between two functions, say Q and W. Q depends on the unknowns to be found. W is the desired target function: the aim of the code is to find the unknowns that minimize the distance between Q and W.
Q is essentially obtained with an FFT (DftiComputeForward). As I understand from the MKL manual, the FFT can easily be parallelized by setting OMP_NUM_THREADS>1 or MKL_NUM_THREADS>1. I understand that, if MKL_NUM_THREADS is not specified, the number of threads used by threaded MKL functions equals OMP_NUM_THREADS. Consequently, I suspect that something goes wrong in the evaluation of Q when I set MKL_NUM_THREADS>1. It is as if some data remains "dirty" during the parallel execution of the FFT.
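
The structure described above can be sketched in Python to show one way "dirty" data is avoided: every evaluation of F builds its own work buffer, so evaluations that happen to run concurrently share no state. Everything here is an assumption made for illustration: the naive `dft` stands in for the library FFT, the particular form of Q is hypothetical, `distance` and `forward_gradient` are illustrative names, and nothing in this thread confirms whether djacobi evaluates the function from several threads.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft(values):
    """Naive O(n^2) forward DFT, a stand-in for the library FFT."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, v in enumerate(values))
            for k in range(n)]

def distance(x, target):
    """F(x): squared distance between a normalised |Q|^2 and the target W.
    The work buffer is local, so concurrent calls cannot overwrite each other."""
    buf = [cmath.exp(1j * xi) for xi in x]   # hypothetical Q input
    q = dft(buf)
    peak = max(abs(c) for c in q)
    return sum((abs(c) ** 2 / peak ** 2 - w) ** 2 for c, w in zip(q, target))

def forward_gradient(f, x, eps=1e-7, workers=4):
    """Forward-difference gradient, one perturbed point per task."""
    f0 = f(x)

    def partial(i):
        xp = list(x)      # private copy of the unknowns
        xp[i] += eps
        return (f(xp) - f0) / eps

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(partial, range(len(x))))
```

Because `distance` touches only its arguments and locals, the gradient is the same whether it is computed with one worker or several. A version that wrote into a single shared buffer would not have that guarantee.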
Thank you.
Let me know if you have any suggestion.
As soon as i can i will upload the part of interest of my code.
Ciao
Clodxp
Beginner

I extracted a sample program. It has no physical meaning, but I hope it is useful for checking what is wrong in my actual code. As in my full code, I have used a module.
MAIN PROGRAM +++++++++++++++++++++++++++++++++++++++++++++++++++
PROGRAM JACOBI_MATRIX
use module1_DJAC
IMPLICIT NONE
INCLUDE 'MKL_RCI.fi'
EXTERNAL FCN
INTEGER N,M
PARAMETER(N=10,M=1)
double precision Fun
double precision jac_eps,gnew(1,N)
real g(N),res
!-------------------------------------------------------------------------
! CREATE FFT DESCRIPTOR
dim1=N
dim2=1
allocate(data_in(dim1*dim2))
length_FFT(1)=dim1;length_FFT(2)=dim2
! Create Descriptor
Status_FFT=DftiCreateDescriptor( Desc_Handle_FFT, DFTI_DOUBLE,DFTI_COMPLEX,2,length_FFT)
! Commit Descriptor
Status_FFT=DftiCommitDescriptor(Desc_Handle_FFT)
! INITIALIZE X and yn
allocate(X(N),yn(N))
X(1:3)=0.05*70;X(4:7)=0.1*70;X(8:10)=0.15*70
yn=1e-7;yn(5:8)=1.
! EVALUATE FCN (EXTERNAL FUNCTION)
! FCN is the norm of the difference of the FFT of the term exp(j*X) and the vector yn
! i.e. FCN=norm(FFT(exp(j*X)),yn)
call FCN(M,N,X,Fun)
print *,'Starting Fun =',Fun
! EVALUATE GRADIENT
gnew=0;jac_eps=1D-7;
res=djacobi(FCN,N,1,gnew(1,1:N),X(1:N),jac_eps)
print *,gnew(1,1:N)
end program JACOBI_MATRIX
! ***********************************************************************
subroutine FCN(M,N,X_IN,F)
use module1_DJAC
implicit none
integer M,N,M_out
external sdot
double precision X_IN(N),F
real m1(N),ps1,sdot
complex out(N)
!-----------------------------------
M_out=N
! CALL EVAL FFT FUNCTION (That realizes the FFT of exp(j*X_IN))
call EVAL_FFT(X_IN,N,out,M_out)
out=out/maxval(cabs(out))
m1=((cabs(out))**2)-yn
ps1=sdot(M_out,m1,1,m1,1)
F=ps1
end subroutine FCN
+++++++++++++++++++++++++++++++++++++++++++++++++++
MODULE
+++++++++++++++++++++++++++++++++++++++++++++++++++
module module1_DJAC
Use MKL_DFTI
implicit none
! FFT MKL DECLARATIONS
double complex, dimension(:),allocatable :: data_in
type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_FFT
integer Status_FFT,length_FFT(2),dim1,dim2
real Scale_FFT
! FUNCTION YN
real,dimension(:),allocatable ::yn
double precision,dimension(:),allocatable ::X
contains
! ***********************************************************************
subroutine EVAL_FFT(X_IN,N_X,OUT_VETT,N_OUT)
implicit none
integer N_X,N_OUT
double precision X_IN(N_X)
complex imag
parameter (imag=(0.,1.))
real dummy(N_X)
complex OUT_VETT(N_OUT)
!-----------------------------------
! Data_in=exp(j*X_IN)
dummy(1:N_X)=X(1:N_X)
data_in=cexp(imag*dummy)
! OUT_VETT=FFT(data_in)
Status_FFT=DftiComputeForward(Desc_Handle_FFT,data_in)
OUT_VETT=data_in
end subroutine EVAL_FFT
! ***********************************************************************
end module module1_DJAC
+++++++++++++++++++++++++++++++++++++++++++++++++++
When I use OMP_NUM_THREADS=4 and MKL_NUM_THREADS=1, all components of the computed gradient are 0.
This is the output for MKL_NUM_THREADS=1:
>>Mac-Pro-di-apple:TEST_DJACOBI claudio$ ./djacobi2.out
Starting Fun = 4.81621265411377
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000
When I use MKL_NUM_THREADS=4:
>>Mac-Pro-di-apple:TEST_DJACOBI claudio$ export MKL_NUM_THREADS=4
Mac-Pro-di-apple:TEST_DJACOBI claudio$ ./djacobi2.out
Starting Fun = 4.81621265411377
-4442684.99101911 223754.474094936 1467580.79528809
0.000000000000000E+000 -1739650.11324201 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 -917188.099452427
0.000000000000000E+000
And I get different numbers each time I run the code.
I hope the explanation of the issue is clear.
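
One detail visible in the listing above: EVAL_FFT fills its buffer from the module array X rather than from its argument X_IN. Whether that explains the zero gradient depends on how djacobi perturbs its input, which this thread does not establish. The Python sketch below shows the general effect with hypothetical names and a hand-written forward difference that perturbs a private copy (an assumption): a function that ignores its argument has a zero finite-difference gradient.

```python
X_GLOBAL = [3.5, 7.0, 10.5]   # plays the role of a module-level array

def f_reads_global(x_in):
    """Ignores x_in and reads the global array instead."""
    return sum(v * v for v in X_GLOBAL)

def f_reads_argument(x_in):
    """Uses the values it was actually given."""
    return sum(v * v for v in x_in)

def forward_gradient(f, x, eps=1e-7):
    """Forward differences on a private copy of x (assumed behaviour)."""
    f0 = f(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        grad.append((f(xp) - f0) / eps)
    return grad

g_global = forward_gradient(f_reads_global, X_GLOBAL)
g_argument = forward_gradient(f_reads_argument, X_GLOBAL)
print(g_global)     # every component is 0.0
print(g_argument)   # close to [7.0, 14.0, 21.0]
```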
Thank you
Beginner

I've modified the test program (it no longer contains a module and is simpler) and attached the file.
I ran it on the 32-bit Windows machine (using Visual Studio) and on the Mac (Snow Leopard).
This is what I got:
>>> WINDOWS:
NUMBER OF ACTIVE THREADS ----> 2
Starting Fun = 3.33070492744446
-0.111034938267299 0.516687120710100 0.528608049665179
-0.496421541486468 -3.746577671595982E-003 3.916876656668527E-003
0.496421541486468 -0.528608049665178 -0.517254783993676
0.111489068894159
GRADIENT EVALUATION TIME= 4.699999999866122E-002
-------------------------------------------------------------------------------
>>> MAC (OMP_NUM_THREADS=4; MKL_NUM_THREADS=1) (even though I do not use OpenMP threads myself)
NUMBER OF ACTIVE THREADS ----> 4
Starting Fun = 3.33070489925336
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000
GRADIENT EVALUATION TIME= 3.499200000078417E-002
-------------------------------------------------------------------------------
>>> MAC (OMP_NUM_THREADS=4; MKL_NUM_THREADS=4)
NUMBER OF ACTIVE THREADS ----> 4
Starting Fun = 3.33070489925336
419.025421142578 -2416.71494075230 -3924.92464610509
696.506159646171 -227.673053741455 -2490.09626252311
-2120.22244930267 1211.30568640573 -1522.04831441243
-3601.87212626139
GRADIENT EVALUATION TIME= 8.075900000403635E-002
-------------------------------------------------------------------------------
I have also attached the makefile I use on the Mac.
Could someone please test it and post the results?
Thank you
Clodxp
Moderator

Clodxp,
Looking at your makefile:
$(FC) $(FFLAGS1) TEST_DJACOBI3.f90 -o djacobi3.out *.o -L${LIBDIR} -L${LIBMKL} ${LIBMKL}/libmkl_intel_lp64.a ${LIBMKL}/libmkl_intel_thread.a ${LIBMKL}/libmkl_solver_lp64.a ${LIBMKL}/libmkl_core.a -lpthread
Question: if you are linking with the threaded libs, why did you leave out liomp?
The end of your link line should look like: ... -liomp - lpthreads, shouldn't it?
--Gennady
Beginner



Gennady
I tried -liomp but got "library not found".
So I tried -liomp5 and that works, but the results do not change.
Thank you
C
Moderator

Sorry for the misprint.
OK, we need to check the problem on our side. I will get back to you if there is any news.
--Gennady