Dear forum,
Compiling and linking the following simple code with OpenMP options produces a program that crashes at run time:
Compiling and linking options: -L/opt/intel/composerxe-2013_update2.2.146/mkl/lib/intel64 -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread
program hello
implicit none
integer, parameter :: n=2000
real (kind=8),dimension (n,n) :: a,b,c
integer :: i,j
call random_number(a)
call random_number(b)
j=n/4
!$omp PARALLEL DO
DO i=1,4
c(:,((i-1)*j+1):i*j)=MATMUL(a,b(:,((i-1)*j+1):i*j))
ENDDO
!$omp end PARALLEL DO
!c=matmul(a,b)
print *, 'Hello World!'
end program
You will never see a hello from the program, just a segmentation fault. My laptop has 4 GiB of RAM and the stack size is set to unlimited. If n is set to 1000, the program compiles and runs fine. If I omit the OpenMP options and just use plain matmul(a,b), it also compiles and runs fine.
gfortran can compile and finish the same program with n as large as 10000. Why can't ifort finish the program? I don't think n=2000 is too large. The strange thing is that ifort can handle much larger and much more complicated blocks of code, but cannot finish this matmul.
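For readers trying to reproduce this, it may help to see how large the pieces in the loop are. The sketch below only does the arithmetic for the two values of n mentioned above, assuming real(kind=8) occupies 8 bytes. One possible explanation, which is an assumption and not something confirmed in this thread, is that the compiler places the MATMUL result temporary on each OpenMP worker thread's stack, and that stack is sized by the OpenMP runtime (OMP_STACKSIZE) rather than by the shell's stack limit.

```fortran
program temp_sizes
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: sizes(2) = [1000, 2000]   ! the two n values from the post
  integer, parameter :: nblocks = 4               ! the loop runs i = 1..4
  integer :: k, n, j
  integer(int64) :: bytes_per_elem, block_bytes, matrix_bytes

  ! storage_size returns bits; assumption: real(kind=8) is a 64-bit real
  bytes_per_elem = storage_size(1.0_real64) / 8

  do k = 1, size(sizes)
    n = sizes(k)
    j = n / nblocks
    ! size of one MATMUL(a, b(:, block)) result: n rows by j columns
    block_bytes  = int(n, int64) * j * bytes_per_elem
    ! size of one full n-by-n matrix (a, b, or c)
    matrix_bytes = int(n, int64) * n * bytes_per_elem
    print '(a,i0,a,i0,a,i0,a)', 'n=', n, ': block temporary ', block_bytes, &
          ' bytes, full matrix ', matrix_bytes, ' bytes'
  end do
end program temp_sizes
```

If the per-thread stack really is the limit, raising OMP_STACKSIZE or making the arrays allocatable would be things to try; neither was tested here.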
Several steps you should follow to check this out are described in
http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors
It seems that ifort's -opt-matmul and gfortran's -fexternal-blas (carefully used) would be appropriate. Both of those would probably use your 4 cores more effectively inside the library matrix multiply function, so you would not want to use OpenMP explicitly.
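A minimal sketch of what that suggestion could look like: the explicit OpenMP loop is dropped and a single MATMUL is left for the compiler to map onto a library routine. The allocatable arrays are my own choice to keep the large matrices off the stack, not something from the original code, and whether the call is actually threaded depends on the compiler options and the libraries linked.

```fortran
program whole_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000
  ! allocatable arrays live on the heap (a choice made for this sketch)
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! One whole-matrix multiply, no explicit OpenMP directives.
  ! With ifort -opt-matmul or gfortran -fexternal-blas (plus a BLAS
  ! library at link time) this may be dispatched to a library routine.
  c = matmul(a, b)

  ! Use the result so the multiply cannot be optimized away
  print *, 'c(1,1) =', c(1,1)
  print *, 'Hello World!'
end program whole_matmul
```

The same source compiles without either option, which makes it easy to compare timings with and without the library path.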
Thank you very much for your swift reply.
Following your valuable advice, I used matmul from the libmatmul.a library, which seems to run at the same speed as calling ?gemm from the lapack library (are they the same?). The matrix multiplication is now significantly faster. However, it still does not use all 4 available hardware threads (2 cores x 2 threads); only 2 threads are used. Is ifort aware of hyperthreading? It seems that it does not use the 2 additional virtual cores (or maybe it just uses one complete core), and the CPU usage is stuck at around 53%.
It's been a while since the subject of MKL usage of HyperThreading was discussed on the MKL forum. To make it short, the MKL functions which don't benefit from HT will default to just one thread per core. MKL gets nearly 100% use of the floating point units this way.
If your goal is to peg your performance meter at the expense of performance, the MKL_DYNAMIC environment variable should help.
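For anyone who wants to try this from inside the program rather than through the environment, here is a sketch using MKL's service routines mkl_set_dynamic, mkl_set_num_threads and mkl_get_max_threads. It assumes the program is linked against a threaded MKL layer (for example -lmkl_intel_thread, as in the first post) and default 32-bit integers; with the sequential layer these calls should have no effect. The thread count of 4 is just the number of hardware threads mentioned above.

```fortran
program mkl_threads
  implicit none
  ! MKL service routines; declared external because no MKL module is used here
  integer, external :: mkl_get_max_threads
  external :: mkl_set_dynamic, mkl_set_num_threads

  print *, 'MKL max threads before:', mkl_get_max_threads()

  ! 0 disables dynamic adjustment, so MKL is allowed to use the
  ! requested number of threads instead of choosing fewer itself
  call mkl_set_dynamic(0)

  ! Request one thread per hardware thread (2 cores x 2 HT = 4, an assumption)
  call mkl_set_num_threads(4)

  print *, 'MKL max threads after: ', mkl_get_max_threads()
end program mkl_threads
```

As noted above, this is likely to raise the CPU meter rather than the speed, so it is worth timing both settings before keeping it.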
The library function which supports opt-matmul seems to be a separate entry point in the MKL library, probably using some of the same internal functions as ?gemm. With gfortran, if you call dgemm by the -fexternal-blas route, the "C" (3rd) matrix sent to dgemm is not initialized until dgemm does it (recognizing the beta==0. argument as a signal).
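To illustrate the beta==0 point, here is a small sketch that calls dgemm directly with the standard BLAS argument order and leaves c unset before the call. It needs a BLAS library (MKL, ATLAS, or the reference BLAS) at link time; the matrix size of 3 is arbitrary.

```fortran
program dgemm_beta_zero
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3
  real(dp) :: a(n,n), b(n,n), c(n,n)
  external :: dgemm

  call random_number(a)
  call random_number(b)

  ! c is deliberately not initialized: with beta = 0, dgemm computes
  ! c = alpha*a*b and does not read the incoming contents of c
  call dgemm('N', 'N', n, n, n, 1.0_dp, a, n, b, n, 0.0_dp, c, n)

  ! Compare against the intrinsic as a sanity check
  print *, 'max |c - matmul(a,b)| =', maxval(abs(c - matmul(a, b)))
end program dgemm_beta_zero
```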
Ket T.,
Consider what TimP has to say: "If your goal is to peg your performance meter at the expense of performance..."
A two-core processor with HT has only two floating point units (one per core). When one of the HT threads of a core can fully utilize its floating point bandwidth, then adding the second HT thread of the core, also using the floating point unit, does not increase floating point throughput. It can often be counterproductive by causing unintended L1 and L2 cache evictions, thus slowing down otherwise efficient code. MKL has many functions that have been determined to run fastest using one thread per core. Other applications (or MKL functions) that do not use 100% of the floating point capacity of a core may execute faster using more (or all) threads of that core.
To rephrase TimP:
Do you want a CPU usage meter to show 100% usage?
.OR.
Do you want the function to execute in shorter time?
Jim Dempsey
Thank you TimP and jimdempseyatthecove,
I always run many benchmarks, and let me admit something: I have been impressed by the speed of ifort and MKL. I am on a native, sleek Gentoo with a well-built ATLAS. I have code as follows:
Real(kind=8), dimension(3200,3200) :: a,b,c
......
.....
call solve(a,b) !solves ax=b (by dgesv)
c=matmul(a,b)
gfortran 4.7.2 + ATLAS + OpenMP (best case): the first line finishes in 10.98 s and the second in 24.52 s, 35.5 s in total.
ifort + MKL (-lmatmul -parallel `pkg-config --libs lapack`): 6.37 + 7.87 = 14.24 s (4 cores are active, 100% CPU usage. Why? I did not set anything.)
ifort + MKL (-parallel `pkg-config --libs lapack`): 6.41 + 3.82 = 10.23 s (2 cores seem to be active)
Results:
1) Good job Intel. Really good job! About 3.5 times as fast (35.5 s vs. 10.23 s). That's just *awesome*. How dddddid yyouu ddo thatttt??!
2) Why did adding -lmatmul decrease matmul performance? Why did it utilize HT? I did not use dynamic MKL. Weird.
pkg-config --libs lapack=-L/opt/intel/composerxe-2013_update2.2.146/mkl/lib/intel64 -Wl,--start-group -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread
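For anyone who wants to repeat the matmul part of this comparison, here is a sketch of one way to time it. How the numbers above were measured is not stated, so the use of system_clock is my assumption. Wall-clock time is used on purpose: cpu_time typically adds up the time of all threads, which would make a threaded run look slower than it is.

```fortran
program time_matmul
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3200            ! matrix size from the post
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(int64) :: t0, t1, rate

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! system_clock measures elapsed wall-clock ticks; rate is ticks per second
  call system_clock(t0, rate)
  c = matmul(a, b)
  call system_clock(t1)

  print '(a,f8.2,a)', 'matmul wall time: ', real(t1 - t0, dp) / real(rate, dp), ' s'
  ! Use the result so the multiply cannot be optimized away
  print *, 'c(1,1) =', c(1,1)
end program time_matmul
```

Building this once with and once without -lmatmul, and watching the thread count in each run, should show whether the two builds end up in different library routines.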
Any answer to my second question?
