MKL issue on MAC OS

clodxp · ‎03-05-2010

Recently I moved to a Mac Pro (2 x 2.26 GHz Quad-Core Intel Xeon, 16 GB DDR3).

I tried to run a code previously tested on a dual-core Intel PC running 32-bit Windows XP.

I noticed that the code shows a serious issue on the Mac when calling an MKL routine (djacobi). In fact, the code is significantly slower than on the Windows machine and does not produce the same results.

The routine in question is djacobi, which evaluates the Jacobian matrix of a user-supplied function.

I guess the problem could be related to the compile and/or link options I used on the two machines.

Accordingly, I report the options used in both cases:

WINDOWS XP 32 bit (I'm using Visual Studio, so I report the command line generated by VS)

/nologo /debug:full /Od /heap-arrays0 /I"D:\Programmi\Intel\Compiler\11.0\066\fortran\mkl\include\ " /I"D:\Programmi\VNI\imsl\fnl600\IA32\include\STATIC\\" /gen-interfaces /warn:interfaces /module:"Debug\\" /object:"Debug\\" /traceback /check:bounds /libs:static /threads /dbglibs /c

------------------

MAC OS 64 bit (this is the makefile, since I'm using the command line)

INCLUDE=-I/opt/intel/Compiler/11.1/076/Frameworks/mkl/include/em64t/lp64
MKLINCLUDE=-I/opt/intel/Compiler/11.1/076/Frameworks/mkl/include
LIBDIR=/opt/intel/Compiler/11.1/076/lib/
LIBMKL=/opt/intel/Compiler/11.1/076/Frameworks/mkl/lib/em64t/
FC = ifort
FFLAGS1 =-O2 -heap-arrays -warn interfaces -check bounds -threads
#----------------------------------------------
TARGET:
	@echo ".... Compiling DFTI MKL"
	$(FC) /opt/intel/Compiler/11.1/076/Frameworks/mkl/include/mkl_dfti.f90 -c
	@echo ".... Compiling NUFFT"
	$(FC) $(FFLAGS1) Nufft.f90 -c
	@echo ".... Compiling "
	$(FC) $(FFLAGS1) Module1_FLAT.F90 Modulo_2.F90 SINTESI_MAIN.F90 -c
	$(FC) $(FFLAGS1) -o sintesi.out *.o -L${LIBDIR} -L${LIBMKL} ${LIBMKL}/libmkl_intel_lp64.a ${LIBMKL}/libmkl_intel_thread.a ${LIBMKL}/libmkl_solver_lp64.a ${LIBMKL}/libmkl_core.a ${LIBDIR}libiomp5.a -lpthread
#----------------------------------------------

where the main program is SINTESI_MAIN.F90, which uses three modules: Module1_FLAT.F90, Modulo_2.F90 and Nufft.f90 (which uses mkl_dfti.f90).

Thanks

Clodxp

TimP · ‎03-05-2010

/Od on the Windows ia32 compiler invokes x87 extra precision to some extent. If the discrepancy is associated with the difference between -O2 and /Od (-O0 on Mac), you may need double precision somewhere, or you may want some options to disable non-standard optimizations; e.g.

-assume protect_parens

If the question is associated with MKL, the MKL forum is more suitable.

clodxp · ‎03-08-2010

Hi Tim,

thank you for your answer.

Following your suggestions, I compiled with each of the following sets of options, separately:

-O0

-O2 -r8

-O2 -assume protect_parens

-O0 -assume protect_parens

Unfortunately, this did not solve the problem.

At the moment I'm trying to identify the variables involved in the gradient evaluation that might need double precision (up to now I've been using real variables instead of double, except in the routine called by djacobi, where I use double precision).

Thanks

C

clodxp · ‎03-23-2010

Are there any suggestions regarding this topic?
Thank you
Clodxp

Gennady_F_Intel · ‎03-23-2010

Clodxp,

quote:"Infact the code is significantly slower than the Windows machine an does not furnish the same results."

Do you mean on Win 64 you have different results?

If yes, I don't think the compiler options would affect the results in that way.

How can we check the problem on our side?

Can you please give us a reproducible test case?

--Gennady

clodxp · ‎03-24-2010

Dear Gennady,
thank you for your answer.
I have not tried it on Win64, but I can run a test on 64-bit Linux.
I'll let you know ASAP.
Moreover, I'll try to extract a smaller test code and upload it to the forum.
Thank you
Clodxp

clodxp · ‎04-19-2010

Maybe I've found the issue causing the malfunction.

It is due to multithreading. In fact, when I use OMP_NUM_THREADS=1 the program works fine on the Mac, while when I set OMP_NUM_THREADS>1 I get NaN in the function of interest, as discussed in the initial post.

On the other hand, when I use OMP_NUM_THREADS>1 and MKL_NUM_THREADS=1 the code works fine.
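
The thread configurations described in this post can be compared with a small wrapper. This is only a sketch and was not part of the original post: `run_with_threads` is a hypothetical helper name, it always sets both variables explicitly (an assumption made for clarity), and the demo command merely echoes the variables so that the script runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): run a command with
# the given OpenMP and MKL thread counts set only for that command.
run_with_threads() {
    rwt_omp=$1
    rwt_mkl=$2
    shift 2
    env OMP_NUM_THREADS="$rwt_omp" MKL_NUM_THREADS="$rwt_mkl" "$@"
}

# Demo command: echoes the two variables so the script runs anywhere.
# Replace it with the real executable (e.g. ./sintesi.out) to compare results.
for cfg in "1 1" "4 4" "4 1"; do
    set -- $cfg
    run_with_threads "$1" "$2" \
        sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"'
done
```

Using `env` keeps the settings local to each run, so one configuration cannot leak into the next.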

Thank you

Clodxp

Gennady_F_Intel · ‎04-19-2010

Hello Clodxp,

That's very strange. I don't understand how it could be. How can we check the problem on our side?

Can you upload the test?

Regards, Gennady

clodxp · ‎04-20-2010

Dear Gennady,

thank you for your answer.

Unfortunately, the code is very long and requires a lot of data to run. I'll try to extract the relevant part and test it, so that I can upload it.

For now I can try to explain how the relevant part works. The djacobi function has to evaluate the Jacobian of an external function F. This function F evaluates the distance between two functions, say Q and W. The first function, Q, depends on the unknowns to be found. The second function, W, is a desired target function: the aim of the code is to find the unknowns that minimize the distance between Q and W.

The function Q is essentially obtained by using an FFT (DftiComputeForward). As I understand from the MKL manual, the FFT can easily be parallelized by setting OMP_NUM_THREADS>1 or MKL_NUM_THREADS>1. I also understand that, if MKL_NUM_THREADS is not specified, the number of threads used by the threaded MKL functions is equal to OMP_NUM_THREADS. Consequently, I guess that something goes wrong in the evaluation of the function Q when I set MKL_NUM_THREADS>1. It is as if some data remains "dirty" during the parallel execution of the FFT.
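
The fallback described above (MKL_NUM_THREADS if set, otherwise OMP_NUM_THREADS) can be mimicked in the shell to see which value a given environment would request for MKL. This is a sketch only: `effective_mkl_threads` is a hypothetical helper, it does not query MKL itself, and when neither variable is set it prints "unset" rather than guessing the library's default.

```shell
#!/bin/sh
# Hypothetical helper: print the thread count the environment would request
# for MKL, following the precedence described in the post. It does not call MKL.
effective_mkl_threads() {
    printf '%s\n' "${MKL_NUM_THREADS:-${OMP_NUM_THREADS:-unset}}"
}

# Each case runs in a subshell so the settings do not leak.
( unset MKL_NUM_THREADS; OMP_NUM_THREADS=4; effective_mkl_threads )   # prints 4
( OMP_NUM_THREADS=4; MKL_NUM_THREADS=1; effective_mkl_threads )       # prints 1
( unset OMP_NUM_THREADS MKL_NUM_THREADS; effective_mkl_threads )      # prints unset
```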

Thank you.

Let me know if you have any suggestion.

As soon as I can, I will upload the relevant part of my code.

Ciao

Clodxp

clodxp · ‎04-26-2010

I extracted a sample program. It has no physical meaning, but I hope it is useful for checking what's wrong in my actual code. As in my full code, I've used a module.

MAIN PROGRAM +++++++++++++++++++++++++++++++++++++++++++++++++++

PROGRAM JACOBI_MATRIX
  use module1_DJAC
  IMPLICIT NONE
  INCLUDE 'MKL_RCI.fi'
  EXTERNAL FCN
  INTEGER N,M
  PARAMETER(N=10,M=1)
  double precision Fun
  double precision jac_eps,gnew(1,N)
  real g(N),res
  !-------------------------------------------------------------------------
  ! CREATE FFT DESCRIPTOR
  dim1=N
  dim2=1
  allocate(data_in(dim1*dim2))
  length_FFT(1)=dim1;length_FFT(2)=dim2
  ! Create Descriptor
  Status_FFT=DftiCreateDescriptor( Desc_Handle_FFT, DFTI_DOUBLE,DFTI_COMPLEX,2,length_FFT)
  ! Commit Descriptor
  Status_FFT=DftiCommitDescriptor(Desc_Handle_FFT)

  ! INITIALIZE X and yn
  allocate(X(N),yn(N))
  X(1:3)=0.05*70;X(4:7)=0.1*70;X(8:10)=0.15*70
  yn=1e-7;yn(5:8)=1.

  ! EVALUATE FCN (EXTERNAL FUNCTION)
  ! FCN is the norm of the difference of the FFT of the term exp(j*X) and the vector yn
  ! i.e. FCN=norm(FFT(exp(j*X)),yn)
  call FCN(M,N,X,Fun)
  print *,'Starting Fun =',Fun

  ! EVALUATE GRADIENT
  gnew=0;jac_eps=1D-7;
  res=djacobi(FCN,N,1,gnew(1,1:N),X(1:N),jac_eps)
  print *,gnew(1,1:N)
end program JACOBI_MATRIX

! ***********************************************************************

subroutine FCN(M,N,X_IN,F)
  use module1_DJAC
  implicit none
  integer M,N,M_out
  external sdot
  double precision X_IN(N),F
  real m1(N),ps1,sdot
  complex out(N)
  !-----------------------------------
  M_out=N
  ! CALL EVAL FFT FUNCTION (That realizes the FFT of exp(j*X_IN))
  call EVAL_FFT(X_IN,N,out,M_out)
  out=out/maxval(cabs(out))
  m1=((cabs(out))**2)-yn
  ps1=sdot(M_out,m1,1,m1,1)
  F=ps1
end subroutine FCN

+++++++++++++++++++++++++++++++++++++++++++++++++++

MODULE

+++++++++++++++++++++++++++++++++++++++++++++++++++

module module1_DJAC
  Use MKL_DFTI
  implicit none

  ! FFT MKL DECLARATIONS
  double complex, dimension(:),allocatable :: data_in
  type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_FFT
  integer Status_FFT,length_FFT(2),dim1,dim2
  real Scale_FFT

  ! FUNCTION YN
  real,dimension(:),allocatable ::yn
  double precision,dimension(:),allocatable ::X

contains

  ! ***********************************************************************

  subroutine EVAL_FFT(X_IN,N_X,OUT_VETT,N_OUT)
    implicit none
    integer N_X,N_OUT
    double precision X_IN(N_X)
    complex imag
    parameter (imag=(0.,1.))
    real dummy(N_X)
    complex OUT_VETT(N_OUT)
    !-----------------------------------
    ! Data_in=exp(j*X_IN)
    dummy(1:N_X)=X(1:N_X)
    data_in=cexp(imag*dummy)
    ! OUT_VETT=FFT(data_in)
    Status_FFT=DftiComputeForward(Desc_Handle_FFT,data_in)
    OUT_VETT=data_in
  end subroutine EVAL_FFT

  ! ***********************************************************************

end module module1_DJAC

+++++++++++++++++++++++++++++++++++++++++++++++++++

When I use OMP_NUM_THREADS=4 and MKL_NUM_THREADS=1, all the components of the computed gradient are equal to 0.

THIS IS THE OUTPUT FOR MKL_NUM_THREADS=1

>>Mac-Pro-di-apple:TEST_DJACOBI claudio$ ./djacobi2.out

Starting Fun = 4.81621265411377

0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000

0.000000000000000E+000

When I use MKL_NUM_THREADS=4:

>>Mac-Pro-di-apple:TEST_DJACOBI claudio$ export MKL_NUM_THREADS=4

Mac-Pro-di-apple:TEST_DJACOBI claudio$ ./djacobi2.out

Starting Fun = 4.81621265411377

-4442684.99101911 223754.474094936 1467580.79528809

0.000000000000000E+000 -1739650.11324201 0.000000000000000E+000

0.000000000000000E+000 0.000000000000000E+000 -917188.099452427

0.000000000000000E+000

Moreover, different numbers are obtained each time I run the code.

I hope the explanation of the issue is clear.

Thank you

clodxp · ‎04-28-2010

I've modified the test program (it no longer contains a module and is simpler) and attached the file.

I ran it on the 32-bit Windows machine (using Visual Studio) and on the Mac (Mac OS X Snow Leopard).

This is what I got:

>>> WINDOWS:

NUMBER OF ACTIVE THREADS ----> 2

Starting Fun = 3.33070492744446

-0.111034938267299 0.516687120710100 0.528608049665179

-0.496421541486468 -3.746577671595982E-003 3.916876656668527E-003

0.496421541486468 -0.528608049665178 -0.517254783993676

0.111489068894159

GRADIENT EVALUATION TIME= 4.699999999866122E-002

-------------------------------------------------------------------------------

>>> MAC (OMP_NUM_THREADS=4; MKL_NUM_THREADS=1) (even if i do not use the OMP threads)

NUMBER OF ACTIVE THREADS ----> 4

Starting Fun = 3.33070489925336

0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000

0.000000000000000E+000

GRADIENT EVALUATION TIME= 3.499200000078417E-002

-------------------------------------------------------------------------------

>>> MAC (OMP_NUM_THREADS=4; MKL_NUM_THREADS=4)

NUMBER OF ACTIVE THREADS ----> 4

Starting Fun = 3.33070489925336

419.025421142578 -2416.71494075230 -3924.92464610509

696.506159646171 -227.673053741455 -2490.09626252311

-2120.22244930267 1211.30568640573 -1522.04831441243

-3601.87212626139

GRADIENT EVALUATION TIME= 8.075900000403635E-002

-------------------------------------------------------------------------------

I also attached the makefile I use on the Mac.

Could someone please test it and post the results?

Thank you

Clodxp

Gennady_F_Intel · ‎04-28-2010

Clodxp,

looking at your makefile:

$(FC) $(FFLAGS1) TEST_DJACOBI3.f90 -o djacobi3.out *.o -L${LIBDIR} -L${LIBMKL} ${LIBMKL}/libmkl_intel_lp64.a ${LIBMKL}/libmkl_intel_thread.a ${LIBMKL}/libmkl_solver_lp64.a ${LIBMKL}/libmkl_core.a -lpthread

Question: if you are linking with the threaded libraries, why did you leave out liomp?

The end of your link line should look like: ... -liomp -lpthread, shouldn't it?

--Gennady

clodxp · ‎04-28-2010

Gennady
I tried -liomp but I got "library not found".
So I tried -liomp5 and it links, but the results do not change.
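
The "library not found" message is what one would expect if no file named libiomp.* exists in the search path, since -lNAME is resolved to a file called libNAME with a library extension. A quick check along these lines can confirm which names are present; this is a sketch, `has_lib` is a hypothetical helper, and the default directory is taken from the makefile earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 holds a library that "-l$2"
# could resolve to (static .a, Mac OS .dylib, or Linux .so).
has_lib() {
    [ -e "$1/lib$2.a" ] || [ -e "$1/lib$2.dylib" ] || [ -e "$1/lib$2.so" ]
}

# Default taken from the makefile in this thread; override LIBDIR as needed.
LIBDIR=${LIBDIR:-/opt/intel/Compiler/11.1/076/lib}
for name in iomp iomp5; do
    if has_lib "$LIBDIR" "$name"; then
        echo "-l$name: found in $LIBDIR"
    else
        echo "-l$name: not found in $LIBDIR"
    fi
done
```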
Thank you
C

Gennady_F_Intel · ‎04-28-2010

Sorry for the misprint.

OK, we need to check the problem on our side. I will get back to you if there is any news.

--Gennady