Hello,
I started learning Fortran a few days ago. My question, however, concerns the Intel compiler, not the language itself. While playing with simple matrix multiplication code I noticed that parallelization sometimes breaks vectorization. Consider a small example that implements two routines for diagonal matrix multiplication: one from the left and one from the right. I also measure the time required for the multiplications.
When compiled without any special flags, the code is vectorized and runs pretty fast:
$ ifort -fpp diagmult.f90
diagmult.f90(26): (col. 8) remark: LOOP WAS VECTORIZED.
diagmult.f90(33): (col. 8) remark: LOOP WAS VECTORIZED.
$ ./a.out
Initialize = 2.67
Right Mult = 0.25
Left Mult = 0.25
When I add the '-parallel' flag the code is still vectorized; however, the initialization part (random_number()) is much slower:
$ ifort -parallel -fpp diagmult.f90
diagmult.f90(26): (col. 8) remark: LOOP WAS VECTORIZED.
diagmult.f90(33): (col. 8) remark: LOOP WAS VECTORIZED.
$ ./a.out
Initialize = 7.38
Right Mult = 0.24
Left Mult = 0.24
If I enable OpenMP parallelization, the code is not vectorized any more, and the whole program is much slower:
$ ifort -openmp -fpp diagmult.f90
diagmult.f90(26): (col. 8) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.f90(26): (col. 8) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
diagmult.f90(33): (col. 8) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.f90(33): (col. 8) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ ./a.out
Initialize = 7.42
Right Mult = 0.31
Left Mult = 0.45
Could you please explain this behavior? Is there any way to enforce vectorization in this example while still using OpenMP?
Thank you.
P.S.
To make it clear: I want operations of the type vector * scalar and vector * vector to be vectorized. These appear on lines 56 and 73.
[plain]
program diag_mult
   implicit none
   integer, parameter :: size = 10000
   real, allocatable :: A(:,:), v(:)
   integer(8) :: clock_start, clock_finish, count_rate

   allocate(A(size, size))
   allocate(v(size))

   call system_clock(clock_start)
   call random_number(A)
   call random_number(v)
   call system_clock(clock_finish, count_rate)
   write (*, 10) 'Initialize =', real(clock_finish - clock_start)/count_rate

   call system_clock(clock_start)
   call diag_right_mult(A, v)
   call system_clock(clock_finish, count_rate)
   write (*, 10) 'Right Mult =', real(clock_finish - clock_start)/count_rate

   call system_clock(clock_start)
   call diag_left_mult(A, v)
   call system_clock(clock_finish, count_rate)
   write (*, 10) 'Left Mult =', real(clock_finish - clock_start)/count_rate

10 format (A15, F5.2)

contains

   subroutine diag_right_mult(A, v)
      real, intent(inout) :: A(:,:)
      real, intent(in)    :: v(:)
      integer :: i, n
      n = ubound(A, 1)
      !$omp parallel default(shared)
      !$omp do
      do i = 1, n
#ifdef USE_MKL
         call sscal(n, v(i), A(:,i), 1)
#else
         A(:,i) = A(:,i) * v(i)
#endif
      end do
      !$omp end do
      !$omp end parallel
   end subroutine diag_right_mult

   subroutine diag_left_mult(A, v)
      real, intent(inout) :: A(:,:)
      real, intent(in)    :: v(:)
      integer :: i, n
      n = ubound(A, 2)
      !$omp parallel default(shared)
      !$omp do
      do i = 1, n
         A(:,i) = A(:,i) * v
      end do
      !$omp end do
      !$omp end parallel
   end subroutine diag_left_mult

end program diag_mult
[/plain]
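(A note for readers with a newer toolchain: OpenMP 4.0 later added a `simd` construct that requests threading and vectorization on the same loop explicitly. The 11.0 compiler discussed in this thread predates it, but with a 4.0-capable ifort the right-multiply loop could be sketched roughly as below; `diag_right_mult_simd` is an illustrative name, not part of the original program.)

[plain]
subroutine diag_right_mult_simd(A, v)
   real, intent(inout) :: A(:,:)
   real, intent(in)    :: v(:)
   integer :: i, j, n
   n = ubound(A, 1)            ! A is square in the test program
   !$omp parallel do private(j)   ! thread the outer (column) loop
   do i = 1, n
      !$omp simd                  ! explicitly vectorize the inner loop
      do j = 1, n
         A(j,i) = A(j,i) * v(i)
      end do
   end do
   !$omp end parallel do
end subroutine diag_right_mult_simd
[/plain]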
1 Solution
Quoting - eliosh
Tim, thank you for your help.
I tried all values of the 'inline-max-size' parameter, including 0, without any success. Actually, it is not clear to me how inlining affects vectorization, other than changing the line numbers reported by the compiler.
I use the 64 bit version on Linux.
$ ifort -V
Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
The good news is that my most recent copy of the 11.1 beta does restore vectorization when the inlining is removed. So you should be able to upgrade within the next 3 weeks and resolve this discrepancy.
I see an improvement in your "Initialize" time with the newer version, but little gain from adding vectorization to the parallel loops at higher CPU-to-memory clock-speed ratios.
In principle, on a NUMA system you would need to parallelize the initialization with the same static schedule and thread affinity as the calculations in order to see the full effect of these optimizations.
9 Replies
ifort -O3 -vec-report -openmp-report -openmp -inline-max-size=50 eli.F90
eli.F90(71): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(70): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(73): (col. 11) remark: LOOP WAS VECTORIZED.
eli.F90(51): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(50): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(56): (col. 8) remark: LOOP WAS VECTORIZED.
If you'd like to question the excessive default setting for inlining, you could file a problem report on premier.intel.com.
Quoting - tim18
ifort -O3 -vec-report -openmp-report -openmp -inline-max-size=50 eli.F90
eli.F90(71): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(70): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(73): (col. 11) remark: LOOP WAS VECTORIZED.
eli.F90(51): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
eli.F90(50): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
eli.F90(56): (col. 8) remark: LOOP WAS VECTORIZED.
If you'd like to question the excessive default setting for inlining, you could file a problem report on premier.intel.com.
$ ifort -O3 -vec-report -openmp-report -openmp -inline-max-size=50 diagmult.F90
diagmult.F90(71): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.F90(70): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
diagmult.F90(51): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
diagmult.F90(50): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
Increasing the inline-max-size does not help.
Just in case, here is the compiler version I use:
$ ifort --version
ifort (IFORT) 11.0 20090318
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
Can you also explain the performance degradation of random_number(), and why vectorization is disabled when I use the OpenMP flag?
Thank you.
The multi-threaded random number generator must use a mutex and/or an interlocked compare-and-swap to protect updates to the random-number state. I suggest you write your own thread-safe random number generator.
Also, OpenMP (depending on the implementation) may create its thread pool on first entry to a parallel region. Insert an innocuous parallel region at the front of the program to eliminate this one-time overhead:
!$omp parallel
if(omp_get_thread_num() .eq. 9999) pause
!$omp end parallel
Also, to get a fair measurement on a system where your test app's threads may experience preemption, make at least three runs through each timed section and take the lowest run time.
Jim Dempsey
Quoting - jimdempseyatthecove
The multi-threaded random number generator must use a mutex and/or an interlocked compare-and-swap to protect updates to the random-number state. I suggest you write your own thread-safe random number generator.
Also, OpenMP (depending on the implementation) may create its thread pool on first entry to a parallel region. Insert an innocuous parallel region at the front of the program to eliminate this one-time overhead:
!$omp parallel
if(omp_get_thread_num() .eq. 9999) pause
!$omp end parallel
Also, to get a fair measurement on a system where your test app's threads may experience preemption, make at least three runs through each timed section and take the lowest run time.
Jim Dempsey
Jim, thank you for your reply.
You are right about the "critical sections" in the random number generator. However, the first OpenMP statement appears *after* the call to random_number().
Moreover, at the moment I am more interested in the vectorization of the matrix operations.
Thank you again.
Your report seems to indicate that you did suppress inlining, yet that didn't recover vectorization as it did for me.
If you need the equivalent of -inline-max-size=0, there are several ways mentioned in ifort -help, -fno-inline-functions for example. Increasing the maximum certainly won't help if the value you tried is already too high.
I suppose there may be a difference between the 32- and 64-bit ifort in the thresholds where vectorization is suppressed. It's more common for optimizations to be suppressed in the 32-bit version, due to its lower tolerance of code expansion, but it can happen the other way.
You need ifort -V to show which version you have.
Quoting - tim18
Your report seems to indicate that you did suppress inlining, yet that didn't recover vectorization as it did for me.
If you need the equivalent of -inline-max-size=0, there are several ways mentioned in ifort -help, -fno-inline-functions for example. Increasing the maximum certainly won't help if the value you tried is already too high.
I suppose there may be a difference between the 32- and 64-bit ifort in the thresholds where vectorization is suppressed. It's more common for optimizations to be suppressed in the 32-bit version, due to its lower tolerance of code expansion, but it can happen the other way.
You need ifort -V to show which version you have.
Tim, thank you for your help.
I tried all values of the 'inline-max-size' parameter, including 0, without any success. Actually, it is not clear to me how inlining affects vectorization, other than changing the line numbers reported by the compiler.
I use the 64 bit version on Linux.
$ ifort -V
Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
Quoting - eliosh
Tim, thank you for your help.
I tried all values of the 'inline-max-size' parameter, including 0, without any success. Actually, it is not clear to me how inlining affects vectorization, other than changing the line numbers reported by the compiler.
I use the 64 bit version on Linux.
$ ifort -V
Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
The good news is that my most recent copy of the 11.1 beta does restore vectorization when the inlining is removed. So you should be able to upgrade within the next 3 weeks and resolve this discrepancy.
I see an improvement in your "Initialize" time with the newer version, but little gain from adding vectorization to the parallel loops at higher CPU-to-memory clock-speed ratios.
In principle, on a NUMA system you would need to parallelize the initialization with the same static schedule and thread affinity as the calculations in order to see the full effect of these optimizations.
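(As a sketch of that suggestion: initialize with the same static schedule as the compute loops, giving each thread its own seed so random_number() need not serialize. The variable names below are illustrative, and whether the runtime keeps per-thread generator state is implementation-dependent.)

[plain]
! Hypothetical sketch: per-thread seeding plus a static schedule that
! matches the compute loops, so each thread first-touches the columns
! it will later operate on (relevant on NUMA systems).
integer :: i, nseed
integer, allocatable :: seed(:)
!$omp parallel private(seed, nseed)
call random_seed(size=nseed)
allocate(seed(nseed))
seed = 12345 + omp_get_thread_num()   ! distinct seed per thread
call random_seed(put=seed)
!$omp do schedule(static)
do i = 1, size
   call random_number(A(:,i))
end do
!$omp end do
!$omp end parallel
[/plain]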
Eli,
When using IVF 11.0.066 on Windows, on both 32-bit and 64-bit platforms, with a Release build, SSE3 or SSE4 enabled, and OpenMP enabled, I too see the lack of vectorization in your sample code.
If I modify your code to add my own vector_times_scalar subroutine (which does what sscal does, but with the addition that the source for this routine is a candidate for inlining), then that routine is vectorized.
[plain]
pure subroutine vector_times_scalar(V, S, n)
   integer, intent(in) :: n
   real, intent(in) :: S
   real, intent(inout), dimension(1:n) :: V
   V = V * S
end subroutine vector_times_scalar
[/plain]
Jim Dempsey
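(A sketch of how the !$omp do loop in diag_right_mult would then call such a helper, assuming the helper lives where it is visible for inlining, e.g. in the same contains section:)

[plain]
!$omp parallel default(shared)
!$omp do
do i = 1, n
   ! Passing an explicit-shape section; the inlined body V = V * S
   ! becomes a straight-line loop the vectorizer can handle.
   call vector_times_scalar(A(:,i), v(i), n)
end do
!$omp end do
!$omp end parallel
[/plain]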
Dear Jim and Tim,
Thank you for your help. I truly appreciate it.
