Hello,
I started learning Fortran a few days ago. My question, however, concerns the Intel compiler, not the language itself. While playing with simple matrix multiplication code I noticed that parallelization sometimes breaks vectorization. Consider a small example that implements two routines for diagonal matrix multiplication: one from the left and one from the right. I also measure the time required for the multiplications.
When compiled without any special flags, the code is vectorized and runs pretty fast:
$ ifort -fpp diagmult.f90
diagmult.f90(26): (col. 8) remark: LOOP WAS VECTORIZED.
diagmult.f90(33): (col. 8) remark: LOOP WAS VECTORIZED.
$ ./a.out
Initialize = 2.67
Right Mult = 0.25
Left Mult = 0.25
When I add the '-parallel' flag the code is still vectorized; however, the initialization part (random_number()) is much slower:
$ ifort -parallel -fpp diagmult.f90
diagmult.f90(26): (col. 8) remark: LOOP WAS VECTORIZED.
diagmult.f90(33): (col. 8) remark: LOOP WAS VECTORIZED.
$ ./a.out
Initialize = 7.38
Right Mult = 0.24
Left Mult = 0.24
If I enable OpenMP parallelization, the code is not vectorized any more, and the whole program is much slower:
$ ifort -openmp -fpp diagmult.f90
diagmult.f90(26): (col. 8) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.f90(26): (col. 8) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
diagmult.f90(33): (col. 8) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.f90(33): (col. 8) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ ./a.out
Initialize = 7.42
Right Mult = 0.31
Left Mult = 0.45
Could you please explain this behavior? Is there any way to enforce vectorization in this example while still using OpenMP?
Thank you.
P.S.
To make it clear: I want operations of the type vector * scalar and vector * vector to be vectorized. These appear on lines 56 and 73.
[plain]
program diag_mult
   implicit none
   integer, parameter :: size = 10000
   real, allocatable :: A(:,:), v(:)
   integer(8) :: clock_start, clock_finish, count_rate

   allocate(A(size, size))
   allocate(v(size))

   call system_clock(clock_start)
   call random_number(A)
   call random_number(v)
   call system_clock(clock_finish, count_rate)
   write (*, 10) 'Initialize =', real(clock_finish - clock_start)/count_rate

   call system_clock(clock_start)
   call diag_right_mult(A, v)
   call system_clock(clock_finish, count_rate)
   write (*, 10) 'Right Mult =', real(clock_finish - clock_start)/count_rate

   call system_clock(clock_start)
   call diag_left_mult(A, v)
   call system_clock(clock_finish, count_rate)
   write (*, 10) 'Left Mult =', real(clock_finish - clock_start)/count_rate

10 format (A15, F5.2)

contains

   subroutine diag_right_mult(A, v)
      real, intent(inout) :: A(:,:)
      real, intent(in)    :: v(:)
      integer :: i, n
      n = ubound(A, 1)
      !$omp parallel default(shared)
      !$omp do
      do i = 1, n
#ifdef USE_MKL
         call sscal(n, v(i), A(:,i), 1)
#else
         A(:,i) = A(:,i) * v(i)
#endif
      end do
      !$omp end do
      !$omp end parallel
   end subroutine diag_right_mult

   subroutine diag_left_mult(A, v)
      real, intent(inout) :: A(:,:)
      real, intent(in)    :: v(:)
      integer :: i, n
      n = ubound(A, 2)
      !$omp parallel default(shared)
      !$omp do
      do i = 1, n
         A(:,i) = A(:,i) * v
      end do
      !$omp end do
      !$omp end parallel
   end subroutine diag_left_mult

end program diag_mult
[/plain]
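(A note for readers with a newer toolchain: OpenMP 4.0 later added a `simd` construct that requests threading and vectorization on the same loop explicitly. The 11.0 compiler discussed in this thread predates it, but with a 4.0-capable ifort the right-multiply loop could be sketched roughly as below; `diag_right_mult_simd` is an illustrative name, not part of the original program.)

[plain]
subroutine diag_right_mult_simd(A, v)
   real, intent(inout) :: A(:,:)
   real, intent(in)    :: v(:)
   integer :: i, j, n
   n = ubound(A, 1)            ! A is square in the test program
   !$omp parallel do private(j)   ! thread the outer (column) loop
   do i = 1, n
      !$omp simd                  ! explicitly vectorize the inner loop
      do j = 1, n
         A(j,i) = A(j,i) * v(i)
      end do
   end do
   !$omp end parallel do
end subroutine diag_right_mult_simd
[/plain]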
1 Solution
Quoting - eliosh
Tim, thank you for your help.
I tried all values of the 'inline-max-size' parameter, including 0, without any success. Actually, it is not clear to me how inlining affects vectorization, other than changing the line numbers reported by the compiler.
I use the 64 bit version on Linux.
$ ifort -V
Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
The good news is that my most recent copy of the 11.1 beta does restore vectorization when the inlining is removed. So you should be able to upgrade within the next 3 weeks and resolve this discrepancy.
I see an improvement in your "Initialize" time with the newer version, but little gain from adding vectorization to the parallel loops at higher CPU-to-memory clock-speed ratios.
In principle, on a NUMA system you would need to parallelize the initialization with the same static schedule and thread affinity as the calculations in order to see the full effect of these optimizations.
9 Replies
ifort -O3 -vec-report -openmp-report -openmp -inline-max-size=50 eli.F90
eli.F90(71): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(70): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(73): (col. 11) remark: LOOP WAS VECTORIZED.
eli.F90(51): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(50): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(56): (col. 8) remark: LOOP WAS VECTORIZED.
If you'd like to question the excessive default setting for inlining, you could file a problem report on premier.intel.com.
Quoting - tim18
ifort -O3 -vec-report -openmp-report -openmp -inline-max-size=50 eli.F90
eli.F90(71): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(70): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(73): (col. 11) remark: LOOP WAS VECTORIZED.
eli.F90(51): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(50): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(56): (col. 8) remark: LOOP WAS VECTORIZED.
If you'd like to question the excessive default setting for inlining, you could file a problem report on premier.intel.com.
$ ifort -O3 -vec-report -openmp-report -openmp -inline-max-size=50 diagmult.F90
diagmult.F90(71): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.F90(70): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
diagmult.F90(51): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.F90(50): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
Increasing the inline-max-size does not help.
Just in case, here is the compiler version I use:
$ ifort --version
ifort (IFORT) 11.0 20090318
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
Can you also explain the performance degradation of random_number(), and why vectorization is disabled when I use the OpenMP flag?
Thank you.
The multi-threaded random number generator must use a mutex and/or an interlocked compare-and-swap to protect updates to the random-number state. I suggest you write your own thread-safe random number generator.
Also, OpenMP (depending on the implementation) may create its thread pool on first entry to a parallel region. Insert an innocuous parallel region at the front of the program to eliminate this one-time overhead:
!$omp parallel
if(omp_get_thread_num() .eq. 9999) pause
!$omp end parallel
Also, to get a fair measurement on a system where your test app's threads may experience preemption, make at least three runs through each timed section and take the lowest run time.
Jim Dempsey
Quoting - jimdempseyatthecove
The multi-threaded random number generator must use a mutex and/or an interlocked compare-and-swap to protect updates to the random-number state. I suggest you write your own thread-safe random number generator.
Also, OpenMP (depending on the implementation) may create its thread pool on first entry to a parallel region. Insert an innocuous parallel region at the front of the program to eliminate this one-time overhead:
!$omp parallel
if(omp_get_thread_num() .eq. 9999) pause
!$omp end parallel
Also, to get a fair measurement on a system where your test app's threads may experience preemption, make at least three runs through each timed section and take the lowest run time.
Jim Dempsey
Jim, thank you for your reply.
You are right about the "critical sections" in the random number generator. However, the first OpenMP statement appears *after* the call to random_number().
Moreover, at the moment I am more interested in the vectorization of the matrix operations.
Thank you again.
Your report seems to indicate that you did suppress inlining, yet that didn't recover vectorization as it did for me.
If you need the equivalent of -inline-max-size=0, there are several ways mentioned in ifort -help, -fno-inline-functions for example. Increasing the maximum certainly won't help if the value you tried is already too high.
I suppose there may be a difference between the 32- and 64-bit ifort in the thresholds where vectorization is suppressed. It's more common for optimizations to be suppressed in the 32-bit version, due to its lower tolerance of code expansion, but it can happen the other way.
You need ifort -V to show which version you have.
Quoting - tim18
Your report seems to indicate that you did suppress inlining, yet that didn't recover vectorization as it did for me.
If you need the equivalent of -inline-max-size=0, there are several ways mentioned in ifort -help, -fno-inline-functions for example. Increasing the maximum certainly won't help if the value you tried is already too high.
I suppose there may be a difference between the 32- and 64-bit ifort in the thresholds where vectorization is suppressed. It's more common for optimizations to be suppressed in the 32-bit version, due to its lower tolerance of code expansion, but it can happen the other way.
You need ifort -V to show which version you have.
Tim, thank you for your help.
I tried all values of the 'inline-max-size' parameter, including 0, without any success. Actually, it is not clear to me how inlining affects vectorization, other than changing the line numbers reported by the compiler.
I use the 64 bit version on Linux.
$ ifort -V
Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
Quoting - eliosh
Tim, thank you for your help.
I tried all values of the 'inline-max-size' parameter, including 0, without any success. Actually, it is not clear to me how inlining affects vectorization, other than changing the line numbers reported by the compiler.
I use the 64 bit version on Linux.
$ ifort -V
Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
The good news is that my most recent copy of the 11.1 beta does restore vectorization when the inlining is removed. So you should be able to upgrade within the next 3 weeks and resolve this discrepancy.
I see an improvement in your "Initialize" time with the newer version, but little gain from adding vectorization to the parallel loops at higher CPU-to-memory clock-speed ratios.
In principle, on a NUMA system you would need to parallelize the initialization with the same static schedule and thread affinity as the calculations in order to see the full effect of these optimizations.
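(As a sketch of that suggestion: initialize with the same static schedule as the compute loops, giving each thread its own seed so random_number() need not serialize. The variable names below are illustrative, and whether the runtime keeps per-thread generator state is implementation-dependent.)

[plain]
! Hypothetical sketch: per-thread seeding plus a static schedule that
! matches the compute loops, so each thread first-touches the columns
! it will later operate on (relevant on NUMA systems).
integer :: i, nseed
integer, allocatable :: seed(:)
!$omp parallel private(seed, nseed)
call random_seed(size=nseed)
allocate(seed(nseed))
seed = 12345 + omp_get_thread_num()   ! distinct seed per thread
call random_seed(put=seed)
!$omp do schedule(static)
do i = 1, size
   call random_number(A(:,i))
end do
!$omp end do
!$omp end parallel
[/plain]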
Eli,
When using IVF 11.0.066 on Windows, on both 32-bit and 64-bit platforms, with a Release build, SSE3 or SSE4 enabled, and OpenMP enabled, I too see the lack of vectorization in your sample code.
If I modify your code to add my own vector_times_scalar subroutine (which does what sscal does, but with the addition that the source for this routine is a candidate for inlining), then that routine is vectorized.
[plain]
pure subroutine vector_times_scalar(V, S, n)
   integer, intent(in) :: n
   real, intent(in) :: S
   real, intent(inout), dimension(1:n) :: V
   V = V * S
end subroutine vector_times_scalar
[/plain]
Jim Dempsey
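(A sketch of how the !$omp do loop in diag_right_mult would then call such a helper, assuming the helper lives where it is visible for inlining, e.g. in the same contains section:)

[plain]
!$omp parallel default(shared)
!$omp do
do i = 1, n
   ! Passing an explicit-shape section; the inlined body V = V * S
   ! becomes a straight-line loop the vectorizer can handle.
   call vector_times_scalar(A(:,i), v(i), n)
end do
!$omp end do
!$omp end parallel
[/plain]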
Dear Jim and Tim,
Thank you for your help. I truly appreciate it.
