Hi,
I have a problem with the Intel Fortran compilers: the performance at one point of my program differs depending on the compiler version. With composer_xe_2011_sp1.9.293, I get a hot spot (20 s are spent there) at this line:
pw = theCellState%pw(vk)
where theCellState is an object and pw is a pure method of that object, whose code is:
pure function pw(this,i)
class(cellStateArray), intent(in) :: this
integer, intent(in) :: i
real(fp) :: pw
pw = this%pw_(i)
end function pw
where this%pw_ is an array, member of the object.
If you look at the assembly code (generated by composer_xe_2011_sp1.9.293) for this line, I get:
mov %rdx, %rsi
movq %r8, 0x3463b6(%rip)
movq %rdi, 0x3463b7(%rip)
...
and more than 100 lines of movq!
If I compile the program with composer_xe_2011_sp1.7.256 or composer_xe_2011_sp1.6.233, the time spent at this line drops back to 0.036 s, and the assembly code is just
movq 0x3a0(%rsi), %r15
shl $0x3, %r15
neg %r15
addq 0x360(%rsi), %r15
movq %r12, 0x68(%rsp)
Do you have any idea where the problem comes from?
Best regards,
Francois
I suspect in one case (slow case) you are in Debug build and you have array bounds checking enabled.
And in the other case you do not have array bounds checking enabled. (and/or function is inlined).
Jim Dempsey
Both cases are compiled with the same Makefile and the same options:
-O3 -ipo -no-prec-div -xHost -assume realloc_lhs -g -opt-report
So I don't think the problem is related to the compilation options, including bounds checking.
Concerning inlining, as suggested, I asked both compiler versions for optimization reports, and they reach the same conclusion in both cases. So it seems that everything is inlined in both cases.
What I found interesting is that I have made the following change:
*****
- Changed the member pw_ from private to public
- I have changed the line
pw = theCellState%pw(vk)
to the line
pw = theCellState%pw_(vk)
*****
And the hotspot disappears from this line (and moves to the next getter; as always, only to the next one, not to the following ones).
So there really is something about accessing the value of a member through a getter, but it does not seem to be an inlining problem.
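For readers following along, the change described above can be sketched as follows. This is only a minimal illustration, not the original code: the module name `cell_state_sketch`, the program name `compare_access`, the array size, and the definition of the kind `fp` as double precision are all assumptions.

```fortran
module cell_state_sketch
   implicit none
   private
   ! Assumption: fp is double precision; the original definition is not shown.
   integer, parameter, public :: fp = kind(1.0d0)

   type, public :: cellStateArray
      ! Made public so that callers can bypass the getter.
      real(fp), allocatable, public :: pw_(:)
   contains
      procedure, public :: pw
   end type cellStateArray

contains

   ! Getter: returns element i of the pw_ component.
   pure function pw(this, i)
      class(cellStateArray), intent(in) :: this
      integer, intent(in) :: i
      real(fp) :: pw
      pw = this%pw_(i)
   end function pw

end module cell_state_sketch

program compare_access
   use cell_state_sketch, only: cellStateArray, fp
   implicit none
   type(cellStateArray) :: theCellState
   real(fp) :: viaGetter, viaMember
   integer :: vk

   allocate(theCellState%pw_(10))
   theCellState%pw_ = 1.0_fp
   vk = 3

   viaGetter = theCellState%pw(vk)    ! access through the type-bound getter
   viaMember = theCellState%pw_(vk)   ! direct access to the public component

   print *, viaGetter, viaMember
end program compare_access
```

Both assignments read the same element; only the access path differs, which is what makes the difference in generated code surprising.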
Maybe I can post the full assembly code for this line. Would it be helpful?
Very strange.
Francois
Either the compiler generated bad code.....
Or your program overwrote the code.
Place a break point on the first instruction in PROGRAM
Open up the Disassembly window and examine
0x4cb578 (use Ctrl-G in the Disassembly window)
*** this assumes code still at 0x4cb578 as listed in the dump you sent
*** if not, find new current location, then restart with break at start of program and examine code
If the code is as what was listed (excepting for offsets on jle 0x......) then compiler error. Report to Intel.
If code looks like proper code for inlined or outlined code of your function, then your program is overwriting your program. To find out where, set a data change break point in one of the locations that gets modified (set data break and enter hex address).
Note, in optimized code, figuring out line numbers takes a little bit of detective work.
Jim Dempsey
Thank you very much for the help. I am a novice with assembly code, but here is what I did.
- I first ran VTune Amplifier on the code generated by ifort 12.1.3 (the version that produces the slower code). Then I looked at the strange hotspot and its corresponding assembly code (it is at a different line than before, because I changed the earlier getters into direct accesses to the members):
rhoKrMu = theCellState%rhow(vk)
xor %r8d, %r8d
mov $0x570070, %r14d
movq %rax, 0x3454b25%(rip)
... and then about 400 lines of movq instructions.
The first line of assembly code is at address 0x4cb596
- Then I ran idb and set a breakpoint at the first line of my code. I ran the program until it stopped at the breakpoint, went to View-Assembler, and set the starting address to 0x4cb596. Here are the lines I get:
xor r8d, r8d
mov r14d, 0x570070
mov qword ptr {rip+0x3454b2}, rax
mov qword ptr {rip+0x3454b3}, rdx
.... and many more lines (at least the following 40) beginning with mov qword ptr.
If I understand you correctly, it seems to be a compiler bug.
What can I do so that Intel can fix it? The problem appears on Mac OS X and Linux (I don't have Windows). It also appears in 12.1.4 and in 13.0.0, released today.
Francois
If you can make a simplified reproducer that would help.
Submit it together with information about your compiler version, OS, CPU, anything else to help Intel (or others) to replicate your problem.
Also, for this forum, can you post cellStateArray?
Jim Dempsey
When using the compiler version that produces the error, what happens when you lower the optimization level?
Also, what happens with and without IPO (Inter-Procedural Optimization)?
Jim Dempsey
I made some progress:
- The problem is there whether or not IPO is enabled, and with both -O2 and -O3.
- I have realized that the problem always affects the first call to a method of theCellState that appears in the source code. Since in my code that first call is inside a loop, I get better performance if I first make a throwaway call to a method outside the loop.
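A minimal sketch of that workaround, under stated assumptions: the names `warmup_sketch`, `warmup_demo`, and `dummy`, as well as the array size, are invented for illustration, and the class is reduced to a single member. The only point is that the first getter call in source order sits outside the loop.

```fortran
module warmup_sketch
   implicit none
   private

   type, public :: cellStateArray
      private
      real(8), allocatable :: pw_(:)
   contains
      procedure, public :: init
      procedure, public :: pw
   end type cellStateArray

contains

   subroutine init(this, pw)
      class(cellStateArray), intent(inout) :: this
      real(8), intent(in) :: pw(:)
      allocate(this%pw_(size(pw)))
      this%pw_ = pw
   end subroutine init

   pure function pw(this, i)
      class(cellStateArray), intent(in) :: this
      integer, intent(in) :: i
      real(8) :: pw
      pw = this%pw_(i)
   end function pw

end module warmup_sketch

program warmup_demo
   use warmup_sketch, only: cellStateArray
   implicit none
   type(cellStateArray) :: theCellState
   real(8) :: total, dummy
   integer :: vk

   call theCellState%init(pw = [(1.0d0, vk = 1, 100)])

   ! Throwaway call: the first getter call in source order, outside the loop.
   dummy = theCellState%pw(1)

   total = 0.0d0
   do vk = 1, 100
      total = total + theCellState%pw(vk)
   end do

   ! dummy is printed so that the throwaway call is not optimized away.
   print *, total, dummy
end program warmup_demo
```

Whether this helps will depend on the compiler version; it only mirrors the observation above and is not a fix for the underlying problem.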
Apart from that, I have not managed to build a small test case yet. This problem is very tricky to hunt down.
By the way, what do you mean when you say "The code is trashed"?
Francois
Does your class have initialization code?
Is the object a dummy arg (subroutine arg) or local to subroutine/function?
>>By the way, what do you mean when you say "The code is trashed" ?
ArrayA = something
Where the array descriptor of ArrayA is not initialized or is no longer valid.
This often occurs in multi-threaded applications where the threads are not properly coordinated with respect to the lifetime of the array descriptor.
This can trash code, data, or unused memory, or cause a Seg Fault.
Array(indexOutOfBounds) = something
! recursive or OpenMP or option in effect
! to place subroutine scoped array on stack
subroutine foo(...)
someType :: Array(N)
someOtherType, pointer :: p
...
p => somewhereValid
Array(N+1) = something ! trashes pointer p
p = something ! memory trashed
There are many other situations. The Debug run-time checks (use of uninitialized variables, subscripts out of bounds) catch most of these coding errors, but not all of them.
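As a small illustration of the out-of-bounds case and of how the run-time checks catch it, here is a hedged sketch. The program name `bounds_demo` and the values are made up. With ifort, building with `-g -check bounds -traceback` should make the run stop at the bad subscript instead of silently overwriting neighboring memory.

```fortran
program bounds_demo
   implicit none
   integer, parameter :: n = 4
   real(8) :: array(n)
   integer :: idx

   array = 0.0d0

   ! command_argument_count() is 0 when run without arguments; it is used
   ! only so that idx is not a compile-time constant.
   idx = n + 1 + command_argument_count()

   ! Deliberate out-of-bounds write: without bounds checking this silently
   ! overwrites whatever happens to follow the array in memory.
   array(idx) = 1.0d0

   print *, array
end program bounds_demo
```

Without the checking option, the program may appear to run normally, which is exactly why this kind of error is hard to find in an optimized build.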
Here is a simple case where the problem shows up.
[fortran]
program main
  use cellStateArray_module, only : cellStateArray
  implicit none
  type(cellStateArray), allocatable :: myCellState
  real(8), allocatable, dimension(:) :: myPw
  real(8) :: total
  integer :: n, k

  n = 50000
  allocate(myPw(1:n))
  myPw = 1.0
  allocate(myCellState)
  call myCellState%init(pw = myPw)
  total = 0.0
  do k = 1, 10000000
    total = total + essai(myCellState)
  end do
  write (*,*) total

contains

  pure function essai(theCellStateArray)
    type(cellStateArray), intent(in) :: theCellStateArray
    real(8) :: essai
    essai = theCellStateArray%pw(1)
  end function essai

end program main
[/fortran]
with the following module
[fortran]
module cellStateArray_module
  implicit none
  private

  type, public :: cellStateArray
    private
    real(8), allocatable, dimension(:) :: pw_
    !real(8), allocatable, dimension(:) :: a_
    !real(8), allocatable, dimension(:) :: b_
    !real(8), allocatable, dimension(:) :: c_
    !real(8), allocatable, dimension(:) :: d_
    !real(8), allocatable, dimension(:) :: e_
    !real(8), allocatable, dimension(:) :: f_
    !real(8), allocatable, dimension(:) :: g_
    !real(8), allocatable, dimension(:) :: h_
    !real(8), allocatable, dimension(:) :: i_
    !real(8), allocatable, dimension(:) :: j_
    !real(8), allocatable, dimension(:) :: k_
    !real(8), allocatable, dimension(:) :: l_
    !real(8), allocatable, dimension(:) :: m_
    !real(8), allocatable, dimension(:) :: n_
    !real(8), allocatable, dimension(:) :: o_
    !real(8), allocatable, dimension(:) :: p_
    !real(8), allocatable, dimension(:) :: q_
    !real(8), allocatable, dimension(:) :: r_
    !real(8), allocatable, dimension(:) :: s_
    !real(8), allocatable, dimension(:) :: t_
    !real(8), allocatable, dimension(:) :: u_
    !real(8), allocatable, dimension(:) :: v_
    !real(8), allocatable, dimension(:) :: w_
    !real(8), allocatable, dimension(:) :: x_
    !real(8), allocatable, dimension(:) :: y_
    !real(8), allocatable, dimension(:) :: z_
  contains
    private
    procedure, public :: init
    ! Getters
    procedure, public :: pw
  end type cellStateArray

contains

  subroutine init(this, pw)
    class(cellStateArray), intent(inout) :: this
    real(8), allocatable, dimension(:), intent(in) :: pw
    allocate(this%pw_(1:size(pw)))
    this%pw_ = pw
  end subroutine init

  pure function pw(this, i)
    class(cellStateArray), intent(in) :: this
    integer, intent(in) :: i
    real(8) :: pw
    pw = this%pw_(i)
  end function pw

end module cellStateArray_module
[/fortran]
This program runs fine with ifort 12.1.3 and ifort 12.1.2. But as soon as you uncomment the members a_, ..., z_, the size of the code generated by 12.1.3 (and later versions) for the statement essai = theCellStateArray%pw(1) becomes huge (it seems to be proportional to the number of members of the class cellStateArray) and the run time is multiplied by 10, whereas with ifort 12.1.2 everything runs at the same speed.
I found this problem on Linux and Mac OS X.
Francois
Thank you for the small test case. I was able to reproduce the issue and have escalated it to the developers. The issue number is DPD200232355. I will keep you informed of any updates I receive on this issue.
Regards,
Annalee
Intel Developer Support
movq %r8, 0x3463b6(%rip)
movq %rdi, 0x3463b7(%rip)
Each instruction writes eight bytes of memory. However, the second of these two instructions overwrites the last seven bytes written by the first instruction.
This is rather unusual. I would have expected the target addresses to differ by at least the register size, namely eight bytes; had the second address been 0x3463be(%rip), that would have been more reasonable.
Please consider making up a small "reproducer" test program.
At present, I am running Windows 7, and on this platform the compiler does not produce any instructions that contain instruction-pointer relative memory references (%rip+...).
The developers have informed me that the performance regression is the result of an important bug fix. They may look into improving the performance in the future, but the changes that resulted in this performance regression will not be undone.
Regards,
Annalee
Intel Developer Support
