Intel® Fortran Compiler

Performance issue & operating on local sub-arrays vs using modules with large arrays

Ioannis_K_
New Contributor I

Hello,

I have a large program (with lots of numerical computations on arrays, etc.) whose performance is not as fast as expected. I recently realized that the various optimization options of my compiler do not significantly impact the performance.

One of the aspects of my program is that it includes many large arrays, whose components are to be used multiple times by various subroutines in the program (tens of thousands of times in a run, to be precise). The arrays are defined in modules and can be shared among routines. Instead of following this approach, I create small sub-arrays before calling the subroutines. For example, let us imagine that I have a large vector {A}, defined in a module, and several of its components (not necessarily a contiguous block) are to be used by a subroutine. What I do in my program is define a small subarray {a1}, into which I place the individual components of {A} that are to be used by the subroutine. The subroutine then operates on {a1}, and after it is done, I may update individual components of {A} using updated values of {a1}. 
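
Schematically, the pattern looks like this (a minimal sketch with made-up names; A, a1, idx and work_on are placeholders, not my actual code):

    ! Hypothetical sketch of the gather / operate / scatter pattern described above
    module big_data
       implicit none
       real :: A(100000) = 0.0          ! large array shared through a module
    end module big_data

    subroutine work_on(a1, n)           ! stands in for the real computation
       implicit none
       integer, intent(in)    :: n
       real,    intent(inout) :: a1(n)
       a1 = 2.0*a1 + 1.0
    end subroutine work_on

    program driver
       use big_data
       implicit none
       integer, parameter :: n = 4
       integer :: idx(n)
       real    :: a1(n)
       idx = [3, 17, 250, 4096]         ! scattered (non-contiguous) components of A
       a1  = A(idx)                     ! gather the needed components into a small local array
       call work_on(a1, n)              ! the subroutine only sees the local copy
       A(idx) = a1                      ! scatter the updated values back into A
    end program driver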

My question is whether my code could become faster if I directly operated on the large arrays, instead of working with small sub-arrays. In the context of the example that I provided, this would mean that my subroutine uses the module where {A} is defined, and it directly operates on the particular components of the large array {A} instead of the sub-array {a1}. I originally thought that the approach I am employing should be the fastest, but I am not 100% sure anymore (especially if my approach somehow prevents automatic optimization).
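
In other words, the alternative I am contemplating would look roughly like this (again with the placeholder names from the sketch above): the subroutine uses the module and updates the components of {A} in place, so the two copies disappear but the accesses remain non-contiguous.

    ! Hypothetical sketch of the alternative: operate directly on the module array
    subroutine work_on_A(idx, n)
       use big_data                     ! the module where A lives
       implicit none
       integer, intent(in) :: n
       integer, intent(in) :: idx(n)
       A(idx) = 2.0*A(idx) + 1.0        ! update the scattered components in place, no local copy
    end subroutine work_on_A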

Thanks in advance for any help/suggestions,

Yannis

jimdempseyatthecove
Honored Contributor III

>>My question is whether my code could become faster if I directly operated on the large arrays, instead of working with small sub-arrays.

This depends on whether the amount of work performed is significantly large relative to the copy overhead (which is incurred twice: once copying in, once copying back) ... and on whether operating on the compacted data yields a savings over operating directly on the data in {A}.

Copying can often be made unnecessary by structuring your data for efficient optimization as opposed to an abstract design. In other words, an OOP style is generally not efficient for optimization.
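
For example (a sketch reusing the placeholder names from the original post, and assuming the components each subroutine needs can be stored next to each other): if the data are laid out so that a subroutine's working set is one contiguous block of {A}, that block can be passed as an array section by reference, with no gather/scatter and no hidden temporary:

    ! Hypothetical sketch: a contiguous working set can be passed directly
    subroutine caller()
       use big_data                     ! placeholder module from the original post
       implicit none
       call work_on(A(1001:1004), 4)    ! unit-stride section: passed by reference, no copy
    end subroutine caller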

Jim Dempsey

andrew_4619
Honored Contributor II

Do you have VTune?

You can waste a lot of time guessing where the bottlenecks might be and trying things. VTune should identify where your program is spending most of its time. You can probably get a trial licence if you do not have it.
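
If it helps, a basic hotspots analysis can also be run from the VTune command line, roughly as follows (the executable and result-directory names are placeholders, and older VTune versions ship the CLI as amplxe-cl rather than vtune):

    vtune -collect hotspots -- myprogram.exe
    vtune -report hotspots -result-dir r000hs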

FortranFan
Honored Contributor II

Ioannis K. wrote:

.. I create small sub-arrays before calling the subroutines. For example, let us imagine that I have a large vector {A}, defined in a module, and several of its components (not necessarily a contiguous block) are to be used by a subroutine. What I do in my program is define a small subarray {a1}, into which I place the individual components of {A} that are to be used by the subroutine. The subroutine then operates on {a1}, and after it is done, I may update individual components of {A} using updated values of {a1}. ..

The above description suggests frequently working with discontiguous data, as well as copying data from a global workspace (possibly on the heap) to local memory (most likely on the stack), both of which could hinder performance. As suggested, performance analyzers such as VTune can provide better insight into the actual issues. At the very least, looking into options to work with contiguous memory and to minimize the copying of data might be worth considering.
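
As a hypothetical illustration (reusing the placeholder names from the original post), the way an actual argument is passed can decide whether the compiler creates a hidden temporary:

    ! Hypothetical sketch: argument contiguity determines hidden copies
    subroutine contiguity_examples()
       use big_data
       implicit none
       call work_on(A(11:14), 4)        ! unit-stride section: passed by reference, no copy
       call work_on(A(1:40:10), 4)      ! strided section: the compiler creates a copy-in/copy-out temporary
    end subroutine contiguity_examples

If memory serves, ifort can report such hidden temporaries at run time when the code is built with /check:arg_temp_created.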

Ioannis_K_
New Contributor I

Andrew,

Thank you for the suggestion. Unfortunately for me, I cannot get VTune to work properly. For some reason, the generated report does not list the ACTUAL function names. Instead, I get function names such as "func@0x48eb8f", so I have no clue where the bulk of the CPU time may be spent.

I am trying to create a report for a Release configuration. I keep the Debug option set to "Full", but it does not help.

I had started a topic in the Intel forums, but I still could not find a procedure to actually see the function names in the VTune report.

andrew_4619
Honored Contributor II

In reply to #5: does it help if the exe is built using the /traceback option?

jimdempseyatthecove
Honored Contributor III

When you compile the Release version for use with VTune, you should instruct the compiler to output the debug symbol table (and instruct the linker not to strip the debug information).
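
On the command line this would look something like the line below (the source file name is a placeholder; in Visual Studio the corresponding settings should be under Fortran > General > Debug Information Format and Linker > Debugging > Generate Debug Info):

    ifort /O2 /Zi myprogram.f90 /link /DEBUG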

Jim Dempsey
