Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Including some OpenMP compiling and linking options makes matmul() crash

Ket_T_
Beginner
684 Views

Dear forum,

Compiling and linking the following simple code with OpenMP options produces a crash:

Compiling and linking options: -L/opt/intel/composerxe-2013_update2.2.146/mkl/lib/intel64 -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread

program hello
    implicit none
    integer, parameter :: n=2000
    real (kind=8),dimension (n,n) :: a,b,c
    integer :: i,j
    call random_number(a)
    call random_number(b)
    j=n/4
    !$omp PARALLEL DO

    DO i=1,4
        c(:,((i-1)*j+1):i*j)=MATMUL(a,b(:,((i-1)*j+1):i*j))
    ENDDO
    !$omp end PARALLEL DO

    !c=matmul(a,b)
    print *, 'Hello World!'

end program

You will never see a hello from the program, just a segmentation fault. My laptop has 4 GiB of RAM and the stack size is set to unlimited. If n is set to 1000, it compiles and runs fine. If I omit the OpenMP options and just use a plain matmul(a,b), it also runs fine.

gfortran can compile and finish the same program with n as large as 10000. Why can't ifort finish the program? I don't think n=2000 is too large. The strange thing is that ifort can finish much larger and much more complicated blocks of code, but cannot finish matmul.

6 Replies
TimP
Honored Contributor III

Several steps you should follow to check this out are described in

http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors

It seems that ifort's -opt-matmul and gfortran's -fexternal-blas (carefully used) would be appropriate.  Both of those would probably use your 4 cores more effectively inside the library matrix multiply function, so you would not want to use OpenMP explicitly.
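For the segfault itself, a common root cause with this exact pattern is thread stack overflow: MATMUL builds its result in a temporary array on the stack, and each OpenMP worker thread gets its own fairly small stack, independent of the main thread's ulimit. A sketch of the usual remedies, assuming bash; OMP_STACKSIZE and -opt-matmul are real, the source file name is hypothetical:

```shell
# MATMUL's temporary result lives on the stack, and OpenMP worker
# threads get their own small stacks (a few MB by default), independent
# of "ulimit -s unlimited" on the main thread.
ulimit -s unlimited          # main thread stack
export OMP_STACKSIZE=512M    # worker thread stacks (KMP_STACKSIZE also works with ifort)

# Alternatively, let ifort map MATMUL onto the threaded MKL code path:
ifort -opt-matmul hello.f90 -o hello
```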

Ket_T_
Beginner

Thank you very much for your swift reply.

Following your valuable advice, I used matmul from the libmatmul.a library, which seems to have the same speed as using ?gemm from the LAPACK library (are they the same?). The "matmul"ing code is significantly faster. However, it still does not utilize all 4 available threads (2 cores x 2 threads); only 2 threads are used. Is ifort aware of hyperthreading? It seems it does not use the 2 extra virtual cores (or maybe uses just one complete core), and CPU usage is stuck at around 53%.

TimP
Honored Contributor III

It's been a while since the subject of MKL usage of HyperThreading was discussed on the MKL forum.  To make it short, the MKL functions which don't benefit from HT will default to just one thread per core.  MKL gets nearly 100% use of the floating point units this way.

If your goal is to peg your performance meter at the expense of performance, the MKL_DYNAMIC environment variable should help.

The library function which supports opt-matmul seems to be a separate entry point in the MKL library, probably using some of the same internal functions as ?gemm.  With gfortran, if you call dgemm by the -fexternal-blas route, the "C" (3rd) matrix sent to dgemm is not initialized until dgemm does it (recognizing the beta==0. argument as a signal).
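To make the beta==0 point concrete: the loop body in the original post could call DGEMM directly instead of MATMUL. A sketch, assuming the LP64 MKL interface from the OP's link line, with n, j, a, b, c, i as in the original program:

```fortran
! Equivalent of c(:,((i-1)*j+1):i*j) = matmul(a, b(:,((i-1)*j+1):i*j)).
! DGEMM computes C := alpha*A*B + beta*C; with beta = 0.d0 the output
! columns of c need not be initialized beforehand.
call dgemm('N', 'N', n, j, n, 1.d0, a, n, &
           b(1,(i-1)*j+1), n, 0.d0, c(1,(i-1)*j+1), n)
```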

jimdempseyatthecove
Honored Contributor III

Ket T.,

Consider what TimP has to say "If your goal is to peg your performance meter at the expense of performance..."

A two-core processor with HT has only two floating point units (one per core). When one HT thread of a core can fully utilize the core's floating point bandwidth, adding the second HT thread of that core to the same work does not increase floating point throughput, and can often be counterproductive by causing unintended L1/L2 cache evictions, thus slowing down otherwise efficient code. MKL has some/many functions that have been determined to run fastest using one thread per core. Other applications (or MKL functions) that do not use 100% of a core's floating point capability may execute faster using all (more) threads of that core.

To rephrase TimP:

Do you want a CPU usage meter to show 100% usage?
.OR.
Do you want the function to execute in shorter time?
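If you want to experiment with this trade-off yourself, the MKL thread count can be pinned explicitly. A sketch, assuming bash; MKL_NUM_THREADS and MKL_DYNAMIC are the documented MKL environment variables:

```shell
export MKL_DYNAMIC=FALSE    # forbid MKL from lowering the thread count itself
export MKL_NUM_THREADS=4    # one thread per hardware thread, HT included
# Leaving MKL_DYNAMIC at its default (TRUE) lets MKL choose
# one thread per physical core, which is usually faster.
```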

Jim Dempsey

 

Ket_T_
Beginner

Thank you TimP and jimdempseyatthecove,

I always do many benchmarks, and let me admit something: I have been impressed by the speed of ifort and MKL. I am on a native, sleek Gentoo with a well-built ATLAS. I have code as follows:

Real(kind=8), dimension(3200,3200) :: a,b,c

......

.....

call solve(a,b) !solves ax=b (by dgesv)

c=matmul(a,b)

Gfortran 4.7.2 + ATLAS + OpenMP (best case): the first line finishes in 10.98 s, the second in 24.52 s; 35.5 seconds total.

ifort + MKL (-lmatmul -parallel `pkg-config --libs lapack`): 6.37 + 7.87 = 14.24 s (4 cores active, 100% CPU usage. Why? I did not set anything.)

ifort + MKL (-parallel `pkg-config --libs lapack`): 6.41 + 3.82 = 10.23 s (2 cores seem to be active)

Results:

1) Good job Intel. Really good job! ~350% faster. That's just *awesome*. How did you do that?!

2) Why did linking with -lmatmul decrease matmul performance? Why did it utilize HT? I did not use dynamic MKL. Weird.

pkg-config --libs lapack=-L/opt/intel/composerxe-2013_update2.2.146/mkl/lib/intel64 -Wl,--start-group -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread
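Incidentally, wall-clock timings like those above can be taken portably with SYSTEM_CLOCK. A minimal self-contained sketch (the array size here is reduced so it runs quickly anywhere; it times only the matmul line, not the solve):

```fortran
program bench
    implicit none
    integer, parameter :: n = 800
    real(kind=8), dimension(n,n) :: a, b, c
    integer(kind=8) :: t0, t1, rate
    call random_number(a)
    call random_number(b)
    call system_clock(t0, rate)   ! rate = counts per second
    c = matmul(a, b)
    call system_clock(t1)
    print '(a,f8.3,a)', 'matmul took ', real(t1-t0)/real(rate), ' s'
end program bench
```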

Ket_T_
Beginner

Any answer to my second question?
