Performance issue

velvia · ‎05-17-2012

Hi,

The Intel Fortran compiler gives very different performance at one point in my program depending on the compiler version. With composer_xe_2011_sp1.9.293, I get a hot spot (20 s are spent) at this line:

pw = theCellState%pw(vk)

where theCellState is an object and pw is a pure method on that object, whose code is:

pure function pw(this,i)
class(cellStateArray), intent(in) :: this
integer, intent(in) :: i
real(fp) :: pw

pw = this%pw_(i)
end function pw

where this%pw_ is an array, member of the object.

The assembly code generated by composer_xe_2011_sp1.9.293 for this line is:

mov %rdx, %rsi
movq %r8, 0x3463b6(%rip)
movq %rdi, 0x3463b7(%rip)
...
and more than 100 lines of movq!

If I compile the program with composer_xe_2011_sp1.7.256 or composer_xe_2011_sp1.6.233, the time spent at this line drops to 0.036 s, and the assembly code is just

movq 0x3a0(%rsi), %r15
shl $0x3, %r15
neg %r15
addq 0x360(%rsi), %r15
movq %r12, 0x68(%rsp)

Do you have any idea where the problem comes from?

Best regards,
Francois

jimdempseyatthecove · ‎05-17-2012

Francois,

I suspect that in the slow case you have a Debug build with array bounds checking enabled,
and that in the other case bounds checking is disabled (and/or the function is inlined).

Jim Dempsey

velvia · ‎05-18-2012

Jim,

Both cases are compiled with the same Makefile and the same options:
-O3 -ipo -no-prec-div -xHost -assume realloc_lhs -g -opt-report
So I don't think the problem comes from the compilation options, including bounds checking.

Concerning inlining, as suggested, I asked the compiler for optimization reports with both versions, and the Intel compiler reaches the same conclusion in both cases. So it seems that everything is inlined in both cases.

What I found interesting is that I have made the following change:
*****
- Changed the member pw_ from private to public
- I have changed the line
pw = theCellState%pw(vk)
to the line
pw = theCellState%pw_(vk)
*****
And the hotspot disappears from this line (and moves to the next getter, and always only to the next one, not to the following ones).
So there really is something about accessing the value of a member through a getter. But it does not seem to be an inlining problem.

Maybe I can post the full assembly code for this line. Would it be helpful?

Very strange.
Francois

velvia · ‎05-18-2012

Here is the assembly for the code compiled with 12.1.3 (the version that runs slower than the others).

It is amazing that one line of code, calling a function that is just a getter and that is reported as inlined by -opt-report, can generate 440 lines of assembly code!

jimdempseyatthecove · ‎05-18-2012

The code is not inlined.... it is trashed.

Either the compiler generated bad code.....
Or your program overwrote the code.

Place a break point on the first instruction in PROGRAM

Open up the Disassembly window and examine

0x4cb578 (use Ctrl-G in the Disassembly window)
*** this assumes code still at 0x4cb578 as listed in the dump you sent
*** if not, find new current location, then restart with break at start of program and examine code

If the code matches what was listed (except for the offsets on jle 0x......), then it is a compiler error. Report it to Intel.

If the code looks like proper inlined or out-of-line code for your function, then your program is overwriting its own code. To find out where, set a data change break point on one of the locations that gets modified (set data break and enter the hex address).

Note, in optimized code, figuring out line numbers takes a little bit of detective work.

Jim Dempsey

velvia · ‎05-18-2012

Hi Jim,

Thank you very much for the help. I am a novice at reading assembly code, but here is what I did.

- I first ran VTune Amplifier on the code generated by ifort 12.1.3 (the one giving slower code). Then I looked at the strange hotspot and its corresponding assembly code (it is at a different line than before, as I have changed these getters into direct access to the members):
rhoKrMu = theCellState%rhow(vk)
xor %r8d, %r8d
mov $0x570070, %r14d
movq %rax, 0x3454b25%(rip)
... and then about 400 lines of movq instructions.
The first line of assembly code is at address 0x4cb596

- Then I ran idb and set a break point at the first line of my code. I ran the program until it stopped at the break point, then went to View-Assembler and set the starting address to 0x4cb596. Here are the lines I get:
xor r8d, r8d
mov r14d, 0x570070
mov qword ptr {rip+0x3454b2}, rax
mov qword ptr {rip+0x3454b3}, rdx
.... and many lines (at least all of the following 40 lines) beginning with mov qword ptr.

If I understand you correctly, it seems to be a bug in the compiler.

What can I do so that Intel can fix it? The problem appears on Mac OS X and Linux (I don't have Windows). And it also appears in 12.1.4 and 13.0.0 released today.

Francois

velvia · ‎05-18-2012

I ran some tests on Mac OS X, and the problem seems to first appear in 12.1.3:

The problem is not there on versions 12.1.0, 12.1.2 and it is there on versions 12.1.3, 12.1.4 and 13.0.0 (the one published today).

Francois

jimdempseyatthecove · ‎05-18-2012

Francois,

If you can make a simplified reproducer that would help.
Submit it together with information about your compiler version, OS, CPU, anything else to help Intel (or others) to replicate your problem.

Also, for this forum, can you post cellStateArray?

Jim Dempsey

jimdempseyatthecove · ‎05-18-2012

I forgot to ask:

When using the compiler version that produces the error, what happens when you lower the optimization level?

Also, with/without IPO (Inter-Procedural Optimization)?

Jim Dempsey

velvia · ‎05-18-2012

Hi Jim,

I made some progress:
- The problem is there whether or not IPO is enabled, and with both -O2 and -O3.
- I have realized that the problem always occurs at the first call to a method of theCellState that appears in the source code. Since in my code the first call is inside a loop, I get better performance if I first make a throwaway call to a method outside of the loop.
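
The second observation can be sketched as follows. This is only an illustration: the module cell_state_demo, the type cellState, its init routine and the variable warmup are hypothetical stand-ins, not my real code, and whether the extra call actually helps probably depends on the compiler version.

```fortran
module cell_state_demo
  implicit none
  private

  type, public :: cellState
    private
    real(8), allocatable :: pw_(:)
  contains
    procedure, public :: init
    procedure, public :: pw
  end type cellState

contains

  ! Hypothetical initializer: fills pw_ with ones.
  subroutine init(this, n)
    class(cellState), intent(inout) :: this
    integer, intent(in) :: n
    allocate(this%pw_(n))
    this%pw_ = 1.0d0
  end subroutine init

  ! Getter of the same shape as the one discussed in this thread.
  pure function pw(this, i)
    class(cellState), intent(in) :: this
    integer, intent(in) :: i
    real(8) :: pw
    pw = this%pw_(i)
  end function pw

end module cell_state_demo

program warmup_demo
  use cell_state_demo, only : cellState
  implicit none
  type(cellState) :: theCellState
  real(8) :: warmup, total
  integer :: vk

  call theCellState%init(1000)

  ! Throwaway call placed before the loop, so that the first call to a
  ! method of theCellState in the source is no longer inside the hot loop.
  warmup = theCellState%pw(1)

  total = 0.0d0
  do vk = 1, 1000
    total = total + theCellState%pw(vk)
  end do

  ! warmup is printed only so that the compiler cannot discard the extra call.
  write (*,*) warmup, total
end program warmup_demo
```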

Otherwise, I can't get a small case yet. This problem is very tricky to hunt.

By the way, what do you mean when you say "The code is trashed" ?

Francois

jimdempseyatthecove · ‎05-19-2012

>>I have realized that the problem is always with the first call to a method of theCellState that appears in the source code. As in my code, the first call is inside a loop, I get better performance if I first call a method outside of the loop for nothing.

Does your class have initialization code?
Is the object a dummy arg (subroutine arg) or local to subroutine/function?

>>By the way, what do you mean when you say "The code is trashed" ?

ArrayA = something

Where the array descriptor of ArrayA is not initialized or is no longer valid.
This often occurs in multi-threaded applications where the threads are not properly coordinated with respect to the lifetime of the array descriptor.

This can trash code, data, or unused memory, or cause a Seg Fault

Array(indexOutOfBounds) = something

! recursive or OpenMP or option in effect
! to place subroutine scoped array on stack
subroutine foo(...)
someType :: Array(N)
someOtherType, pointer :: p
...
p => somewhereValid
Array(N+1) = something ! trashes pointer p
p = something ! memory trashed

There are many other situations. The Debug runtime checks for use of uninitialized variables and for out-of-bounds subscripts catch most, but not all, of these coding errors.
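
As a small illustration of the last point, the sketch below contains a deliberate out-of-bounds write. The program and its variables are made up for this example, and the build line in the comment is an assumption for ifort on Linux or Mac OS X. With the bounds check enabled, the run should stop with a subscript error at the marked line; without it, the write silently lands in whatever follows the array.

```fortran
! Assumed build line (ifort on Linux / Mac OS X):
!   ifort -g -O0 -check bounds -traceback bounds_demo.f90
program bounds_demo
  implicit none
  integer, parameter :: n = 4
  real(8) :: a(n)
  integer :: i, last

  a = 0.0d0
  last = n + 1             ! deliberate off-by-one for the demonstration
  do i = 1, last
    a(i) = real(i, 8)      ! i = n + 1 writes past the end of a
  end do
  write (*,*) a
end program bounds_demo
```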

velvia · ‎05-19-2012

I finally got it.

Here is a simple case where the problem shows up.

[fortran]
program main
  use cellStateArray_module, only : cellStateArray
  implicit none

  type(cellStateArray), allocatable :: myCellState
  real(8), allocatable, dimension(:) :: myPw
  real(8) :: total
  integer :: n, k

  n = 50000
  allocate(myPw(1:n))
  myPw = 1.0
  allocate(myCellState)
  call myCellState%init(pw = myPw)

  total = 0.0
  do k = 1, 10000000
    total = total + essai(myCellState)
  end do
  write (*,*) total

contains

  pure function essai(theCellStateArray)
    type(cellStateArray), intent(in) :: theCellStateArray
    real(8) :: essai

    essai = theCellStateArray%pw(1)
  end function essai

end program main
[/fortran]

with the following module:

[fortran]
module cellStateArray_module
  implicit none
  private

  type, public :: cellStateArray
    private
    real(8), allocatable, dimension(:) :: pw_
    !real(8), allocatable, dimension(:) :: a_
    !real(8), allocatable, dimension(:) :: b_
    !real(8), allocatable, dimension(:) :: c_
    !real(8), allocatable, dimension(:) :: d_
    !real(8), allocatable, dimension(:) :: e_
    !real(8), allocatable, dimension(:) :: f_
    !real(8), allocatable, dimension(:) :: g_
    !real(8), allocatable, dimension(:) :: h_
    !real(8), allocatable, dimension(:) :: i_
    !real(8), allocatable, dimension(:) :: j_
    !real(8), allocatable, dimension(:) :: k_
    !real(8), allocatable, dimension(:) :: l_
    !real(8), allocatable, dimension(:) :: m_
    !real(8), allocatable, dimension(:) :: n_
    !real(8), allocatable, dimension(:) :: o_
    !real(8), allocatable, dimension(:) :: p_
    !real(8), allocatable, dimension(:) :: q_
    !real(8), allocatable, dimension(:) :: r_
    !real(8), allocatable, dimension(:) :: s_
    !real(8), allocatable, dimension(:) :: t_
    !real(8), allocatable, dimension(:) :: u_
    !real(8), allocatable, dimension(:) :: v_
    !real(8), allocatable, dimension(:) :: w_
    !real(8), allocatable, dimension(:) :: x_
    !real(8), allocatable, dimension(:) :: y_
    !real(8), allocatable, dimension(:) :: z_
  contains
    private
    procedure, public :: init
    ! Getters
    procedure, public :: pw
  end type cellStateArray

contains

  subroutine init(this, pw)
    class(cellStateArray), intent(inout) :: this
    real(8), allocatable, dimension(:), intent(in) :: pw

    allocate(this%pw_(1:size(pw)))
    this%pw_ = pw
  end subroutine init

  pure function pw(this, i)
    class(cellStateArray), intent(in) :: this
    integer, intent(in) :: i
    real(8) :: pw

    pw = this%pw_(i)
  end function pw

end module cellStateArray_module
[/fortran]
This program runs fine with ifort 12.1.3 and ifort 12.1.2. But as soon as you uncomment the members a_, ..., z_, the code generated by 12.1.3 (and later versions) for the statement essai = theCellStateArray%pw(1) becomes huge (its size seems to be proportional to the number of members of the class cellStateArray) and the run time is multiplied by 10, whereas with ifort 12.1.2 everything runs at the same speed.

I found this problem on Linux and Mac OS X.

Francois

velvia · ‎05-21-2012

Hello,

Is anyone at Intel able to reproduce this problem?

Best regards,

François

velvia · ‎05-23-2012

Hello,

I am sorry to insist, but I have spent 2 days of work providing a simple test case. This issue is a major performance problem for our code.

Therefore, I would be very disappointed if the Intel team did not react to this post.

François

Anonymous66 · ‎05-24-2012

Hello Francois,

Thank you for the small test case. I was able to reproduce the issue and have escalated it to the developers. The issue number is DPD200232355. I will keep you informed of any updates I receive on this issue.

Regards,
Annalee
Intel Developer Support

velvia · ‎05-24-2012

Hello Annalee,

Thanks for taking care of that.

Best regards,

François Fayard

mecej4 · ‎05-25-2012

Apart from all the points covered by Jim's and your posts in this thread, I notice something very suspicious in the lines

movq %r8, 0x3463b6(%rip)
movq %rdi, 0x3463b7(%rip)

Each instruction is writing eight bytes of memory. However, the second of these two instructions is overwriting the last seven bytes written by the first instruction.

This is rather unusual. I would have expected the target addresses to differ by at least the register size, namely, eight bytes -- had the second address been 0x3463be(%rip), that would have been more reasonable.
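
A caveat on this reading, offered as an assumption rather than something verified against the binary: on x86-64, a %rip-relative displacement is added to the address of the next instruction, not to a fixed base. If the two stores are adjacent and each is encoded in 7 bytes (REX prefix, opcode, ModRM byte, 32-bit displacement), then displacements that differ by 1 give targets that are 8 bytes apart. The sketch below does that arithmetic; the starting address base is arbitrary and hypothetical.

```fortran
program rip_targets
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  ! Assumed encoded length, in bytes, of "movq %reg, disp32(%rip)".
  integer(i8), parameter :: insn_len = 7_i8
  ! Hypothetical address of the first store; any value gives the same difference.
  integer(i8), parameter :: base = 4990000_i8
  integer(i8) :: disp1, disp2, target1, target2

  disp1 = int(z'3463b6', i8)
  disp2 = int(z'3463b7', i8)

  ! Effective address = address of the NEXT instruction + displacement.
  target1 = (base + insn_len) + disp1
  target2 = (base + 2 * insn_len) + disp2

  write (*,*) 'distance between targets in bytes:', target2 - target1
end program rip_targets
```

Under these assumptions the distance comes out as 8, which would mean the stores do not overlap.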

Please consider making up a small "reproducer" test program.

velvia · ‎05-30-2012

mecej4,

I've already made a "reproducer" program which has been posted above. Does it show the problem you were looking for?

François

mecej4 · ‎05-31-2012

Thanks, I can see the main problem (numerous moves between registers and memory, for all the members of cellStateArray, whether needed or not).

At present, I am running Windows 7, and on this platform the compiler does not produce any instructions that contain instruction-pointer relative memory references (%rip+...).

Anonymous66 · ‎06-06-2012

Hello François,

The developers have informed me that the performance regression is the result of an important bug fix. They may look into improving the performance in the future, but the changes that resulted in this performance regression will not be undone.

Regards,
Annalee
Intel Developer Support