Intel® Fortran Compiler

Performance issue & operating on local sub-arrays vs using modules with large arrays

Ioannis_K_
New Contributor I

Hello,

I have a large program (with lots of numerical computations on arrays, etc.) whose performance is not as fast as expected. I recently realized that the various optimization options of my compiler do not significantly impact the performance.

One of the aspects of my program is that it includes many large arrays, whose components are to be used multiple times by various subroutines in the program (tens of thousands of times in a run, to be precise). The arrays are defined in modules and can be shared among routines. Instead of following this approach, I create small sub-arrays before calling the subroutines. For example, let us imagine that I have a large vector {A}, defined in a module, and several of its components (not necessarily a contiguous block) are to be used by a subroutine. What I do in my program is define a small subarray {a1}, into which I place the individual components of {A} that are to be used by the subroutine. The subroutine then operates on {a1}, and after it is done, I may update individual components of {A} using updated values of {a1}. 
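
Schematically, the pattern looks like this (a minimal sketch with made-up names; A, a1, idx and work_on are placeholders, not my actual code):

    ! Hypothetical sketch of the gather / operate / scatter pattern described above
    module big_data
       implicit none
       real :: A(100000) = 0.0          ! large array shared through a module
    end module big_data

    subroutine work_on(a1, n)           ! stands in for the real computation
       implicit none
       integer, intent(in)    :: n
       real,    intent(inout) :: a1(n)
       a1 = 2.0*a1 + 1.0
    end subroutine work_on

    program driver
       use big_data
       implicit none
       integer, parameter :: n = 4
       integer :: idx(n)
       real    :: a1(n)
       idx = [3, 17, 250, 4096]         ! scattered (non-contiguous) components of A
       a1  = A(idx)                     ! gather the needed components into a small local array
       call work_on(a1, n)              ! the subroutine only sees the local copy
       A(idx) = a1                      ! scatter the updated values back into A
    end program driver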

My question is whether my code could become faster if I directly operated on the large arrays, instead of working with small sub-arrays. In the context of the example that I provided, this would mean that my subroutine uses the module where {A} is defined, and it directly operates on the particular components of the large array {A} instead of the sub-array {a1}. I originally thought that the approach I am employing should be the fastest, but I am not 100% sure anymore (especially if my approach somehow prevents automatic optimization).
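
In other words, the alternative I am contemplating would look roughly like this (again with the placeholder names from the sketch above): the subroutine uses the module and updates the components of {A} in place, so the two copies disappear but the accesses remain non-contiguous.

    ! Hypothetical sketch of the alternative: operate directly on the module array
    subroutine work_on_A(idx, n)
       use big_data                     ! the module where A lives
       implicit none
       integer, intent(in) :: n
       integer, intent(in) :: idx(n)
       A(idx) = 2.0*A(idx) + 1.0        ! update the scattered components in place, no local copy
    end subroutine work_on_A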

Thanks in advance for any help/suggestions,

Yannis

jimdempseyatthecove
Honored Contributor III

>>My question is whether my code could become faster if I directly operated on the large arrays, instead of working with small sub-arrays.

This depends on whether the amount of work performed is significantly large relative to the copy overhead (which is incurred twice: once copying in, once copying back) ... and on whether operating on the compacted data yields a savings over operating directly on the data in {A}.

Copying can often be made unnecessary by structuring your data for efficient optimization as opposed to an abstract design. In other words, an OOP style is generally not efficient for optimization.
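
For example (a sketch reusing the placeholder names from the original post, and assuming the components each subroutine needs can be stored next to each other): if the data are laid out so that a subroutine's working set is one contiguous block of {A}, that block can be passed as an array section by reference, with no gather/scatter and no hidden temporary:

    ! Hypothetical sketch: a contiguous working set can be passed directly
    subroutine caller()
       use big_data                     ! placeholder module from the original post
       implicit none
       call work_on(A(1001:1004), 4)    ! unit-stride section: passed by reference, no copy
    end subroutine caller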

Jim Dempsey

andrew_4619
Honored Contributor II

Do you have VTune?

You can waste a lot of time guessing where the bottlenecks might be and trying things. VTune should identify where your program is spending most of its time. You can probably get a trial licence if you do not have it.
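
If it helps, a basic hotspots analysis can also be run from the VTune command line, roughly as follows (the executable and result-directory names are placeholders, and older VTune versions ship the CLI as amplxe-cl rather than vtune):

    vtune -collect hotspots -- myprogram.exe
    vtune -report hotspots -result-dir r000hs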

FortranFan
Honored Contributor II

Ioannis K. wrote:

.. I create small sub-arrays before calling the subroutines. For example, let us imagine that I have a large vector {A}, defined in a module, and several of its components (not necessarily a contiguous block) are to be used by a subroutine. What I do in my program is define a small subarray {a1}, into which I place the individual components of {A} that are to be used by the subroutine. The subroutine then operates on {a1}, and after it is done, I may update individual components of {A} using updated values of {a1}. ..

The above description suggests frequently working with discontiguous data, as well as copying data from a global workspace (possibly on the heap) to local memory (most likely on the stack), both of which could hinder performance. As suggested, performance analyzers such as VTune can provide better insight into the actual issues. At the very least, looking into options to work with contiguous memory and to minimize the copying of data might be worth considering.
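
As a hypothetical illustration (reusing the placeholder names from the original post), the way an actual argument is passed can decide whether the compiler creates a hidden temporary:

    ! Hypothetical sketch: argument contiguity determines hidden copies
    subroutine contiguity_examples()
       use big_data
       implicit none
       call work_on(A(11:14), 4)        ! unit-stride section: passed by reference, no copy
       call work_on(A(1:40:10), 4)      ! strided section: the compiler creates a copy-in/copy-out temporary
    end subroutine contiguity_examples

If memory serves, ifort can report such hidden temporaries at run time when the code is built with /check:arg_temp_created.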

Ioannis_K_
New Contributor I

Andrew,

Thank you for the suggestion. Unfortunately for me, I cannot get VTune to work properly. For some reason, the generated report does not list the ACTUAL function names. Instead, I get function names such as "func@0x48eb8f", so I have no clue where the bulk of the CPU time may be spent.

I am trying to create a report for a Release configuration. I keep the Debug option set to "Full", but it does not help.

I had started a topic in the Intel forums, but I still could not find a procedure to actually see the function names in the VTune report.

andrew_4619
Honored Contributor II

In reply to #5: does it help if the exe is built using the /traceback option?

jimdempseyatthecove
Honored Contributor III

When you compile the Release version for use with VTune, you should instruct the compiler to output the debug symbol table (and instruct the linker not to strip the debug information).
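
On the command line this would look something like the line below (the source file name is a placeholder; in Visual Studio the corresponding settings should be under Fortran > General > Debug Information Format and Linker > Debugging > Generate Debug Info):

    ifort /O2 /Zi myprogram.f90 /link /DEBUG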

Jim Dempsey
