ifort/MKL vs Julia/OpenBLAS: ifort/MKL using fewer cores

d_3 · ‎10-10-2018

I am new to Fortran and am comparing ifort's speed to Julia's. I would greatly appreciate any help or insight, and thank you for taking the time to answer my questions.

When running what appears to be similar code, calling LAPACK to diagonalize a large matrix, Julia/OpenBLAS blasts all 4 cores of my laptop (Intel i5), while ifort/MKL only uses 2 of the 4. If I set the MKL_NUM_THREADS and MKL_DYNAMIC variables, I can force ifort/MKL to use all 4 cores, but the performance actually goes down a little. Even with only 2 cores running, Fortran still beats Julia/OpenBLAS (not by much, though).

Questions: I am just naively watching the CPU activity in the Gnome system monitor. I understand it is possible the Lanczos routines may not be suited for parallel work, but why are 2 cores being used? I would expect either 1 or all cores. Is the CPU doing everything it can for the calculation? What is happening when I set MKL_DYNAMIC=false? Is data just being sloshed around the cores in a slow manner, making the cores stay at "100%" but not advancing the calculation?

Side note: Is it expected for MKL_DYNAMIC and other env variables to be undefined, even after sourcing ~/intel/parallel_studio_xe_2019.0.045/psxevars.sh?
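For reference, a minimal sketch of how to check whether these variables are set and how to override them for one run. The helper name show_var is made up for illustration. As far as the MKL documentation describes it, an unset MKL_DYNAMIC is treated as true and an unset MKL_NUM_THREADS lets MKL pick the thread count itself, so seeing both undefined after sourcing psxevars.sh should not be a problem in itself.

```shell
# show_var is a hypothetical helper: print an environment variable,
# or say that it is not set.
show_var() {
    if value=$(printenv "$1"); then
        printf '%s=%s\n' "$1" "$value"
    else
        printf '%s is not set\n' "$1"
    fi
}

show_var MKL_NUM_THREADS
show_var MKL_DYNAMIC

# To override both for a single run only (assuming the executable is
# ifort's default output name, a.out), prefix the command:
#   MKL_NUM_THREADS=4 MKL_DYNAMIC=false ./a.out
```

Prefixing the command this way leaves the rest of the shell session untouched, which makes it easier to compare the default behaviour against the forced 4-thread run.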

Other info: This is on a fresh install of Ubuntu 18.04. I also had to set ulimit -s unlimited to avoid segfaults. The Parallel Studio GUI install complained about 32-bit libraries; installing libc6-dev-i386 got rid of the warnings. Julia is the standard v1.0.1 build as described on their GitHub.

Below are the examples. The Julia eigen() function boils down to calling ?geev(x).

Fortran: compiled with ifort -mkl my_test.f90

program my_test
implicit none
integer::n,info,lwork
real(8) :: test(1)
real(8), allocatable:: M(:,:),ansV(:,:),ansE(:),work(:)
n = 10000
allocate(M(1:n,1:n))
allocate(ansV(1:n,1:n))
allocate(ansE(1:n))
! this first call is just to get the ideal workspace size
! lwork = -1, so dsyev figures out the workspace size (contained in test(1)) and exits
call dsyev('V','U',n,ansV,n,ansE,test,-1,info)
lwork = int(test(1))
allocate(work(1:lwork))
call random_number(M)
M = M + transpose(M)
ansV = M
call dsyev('V','U',n,ansV,n,ansE,work,lwork,info)
print*,ansE
print*,lwork
print*,info
deallocate(M)
deallocate(ansV)
deallocate(ansE)
deallocate(work)
end program my_test

Julia: run with julia my_test.jl

using LinearAlgebra
mat = rand(10000,10000)
mat = mat + transpose(mat)
u,v = eigen(mat)
print(u)

Andrew_Smith · ‎10-13-2018

Do you have 4 cores via hyperthreading? Most older CPUs don't benefit from using the hyperthreads for intensive maths. That is probably why Intel only uses the real cores by default.

d_3 · ‎10-13-2018

You are spot on: it turns out I have 2 cores, each of which hyperthreads. I had no idea, since Ubuntu's system monitor does not distinguish between the real and fake cores. I can see how 2 threads is the optimal choice, and the basic compilation options correctly pick that out. Thanks!
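For anyone hitting the same confusion, here is a rough sketch of how to tell physical cores from logical (hyperthreaded) CPUs on Linux. It assumes the x86 layout of /proc/cpuinfo, where every logical CPU lists a "physical id" and a "core id"; the function names are made up for illustration.

```shell
# Count logical CPUs: one "processor" entry per hardware thread.
count_logical() {
    grep -c '^processor' "$1"
}

# Count physical cores: each unique (physical id, core id) pair is one core.
count_physical() {
    awk -F': *' '
        /^physical id/ { pkg = $2 }
        /^core id/     { seen[pkg ":" $2] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    ' "$1"
}

cpuinfo=/proc/cpuinfo
if [ -r "$cpuinfo" ]; then
    echo "logical CPUs:   $(count_logical "$cpuinfo")"
    echo "physical cores: $(count_physical "$cpuinfo")"
fi
```

On a 2-core hyperthreaded i5 this should report 4 logical CPUs and 2 physical cores. lscpu gives the same information in its "Thread(s) per core" and "Core(s) per socket" lines. For a like-for-like timing, the Julia side can be limited to the same thread count by running it with OPENBLAS_NUM_THREADS=2 set.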