I'm working on optimizing a Fortran application and observing the effects in VTune Amplifier XE 2013 alongside. I'm experimenting with the precision of the operations in one of the lines in my code.
For instance,
abc(1:BATCH_SIZE) = exp (-r_arr(1:BATCH_SIZE) * (1.d0/(4.d0 * ri * r1_arr(outer_j:jend))))
contains only 64-bit operands. Now, if I downgrade the operands to 32-bit and rewrite my expression as:
abc(1:BATCH_SIZE) = exp (real(-r_arr(1:BATCH_SIZE)) * (1.0/(4.0 * real(ri) * real(r1_arr(outer_j:jend)))))
I get some reduction in the per-line CPU Time reported by VTune (both cases were profiled for the same elapsed time). I would also expect another metric to change: the execution count of this line should increase, because the 32-bit code will execute more quickly than the 64-bit code.
Which VTune metric should I be looking at for this comparison?
Guessing at your goal, I suppose you would set the same sampling rate and duration in your comparison runs and compare the number of samples taken. That could be an estimate of relative execution rate.
In my view, you should first compare the INST_RETIRED.ANY event counts of the two versions to check that they perform the same (or a similar) workload, then use the CPU_CLK_UNHALTED.THREAD event to measure execution time. If the INST_RETIRED.ANY counts differ, the version with the smaller count has the better algorithm (expression).
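To make the "same workload" condition concrete, here is a minimal standalone sketch that runs both variants of the expression for a fixed number of repetitions and times each with system_clock. The batch size, repetition count, input values and the small per-repetition change to ri are all assumptions made for illustration (the change to ri is only there to discourage the compiler from hoisting the expression out of the loop); none of them come from the original application.

```fortran
program precision_timing
  implicit none
  integer, parameter :: dp = kind(1.d0), sp = kind(1.0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: batch_size = 4096   ! hypothetical batch size
  integer, parameter :: nreps = 20000       ! hypothetical repetition count
  real(dp) :: r_arr(batch_size), r1_arr(batch_size), abc_dp(batch_size)
  real(sp) :: abc_sp(batch_size)
  real(dp) :: ri, checksum_dp, checksum_sp
  integer(i8) :: t0, t1, rate
  integer :: rep, idx

  ! Placeholder inputs; the real application supplies its own data.
  call random_number(r_arr)
  call random_number(r1_arr)
  r1_arr = r1_arr + 1.0_dp                  ! keep the divisor away from zero

  ! 64-bit variant: identical amount of work in every repetition.
  checksum_dp = 0.0_dp
  call system_clock(t0, rate)
  do rep = 1, nreps
     ri = 1.5_dp + 1.0e-6_dp * rep          ! varies per repetition (assumption)
     abc_dp = exp(-r_arr * (1.d0 / (4.d0 * ri * r1_arr)))
     idx = 1 + mod(rep, batch_size)
     checksum_dp = checksum_dp + abc_dp(idx)
  end do
  call system_clock(t1)
  print '(a,f8.3,a,es12.4)', 'double: ', real(t1 - t0, dp) / real(rate, dp), &
        ' s, checksum ', checksum_dp

  ! 32-bit variant: same repetitions, operands converted with real().
  checksum_sp = 0.0_dp
  call system_clock(t0, rate)
  do rep = 1, nreps
     ri = 1.5_dp + 1.0e-6_dp * rep
     abc_sp = exp(real(-r_arr) * (1.0 / (4.0 * real(ri) * real(r1_arr))))
     idx = 1 + mod(rep, batch_size)
     checksum_sp = checksum_sp + real(abc_sp(idx), dp)
  end do
  call system_clock(t1)
  print '(a,f8.3,a,es12.4)', 'single: ', real(t1 - t0, dp) / real(rate, dp), &
        ' s, checksum ', checksum_sp
end program precision_timing
```

Because both loops do the same number of repetitions, the elapsed times (or the CPU_CLK_UNHALTED.THREAD counts attributed to each loop when the program is profiled) can be compared directly. The two checksums also give a rough idea of how much accuracy the 32-bit version gives up.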
In addition to Peter's comment: if you are interested in FP performance, you can look at the SIMD FP metrics and compare both versions of the code. If your code was successfully vectorized, you should look at the counts of the packed 64-bit FP and packed 32-bit FP events.
The next step is to calculate CPI (cycles per instruction, i.e. CPU_CLK_UNHALTED.THREAD divided by INST_RETIRED.ANY) and compare it between the two versions of the code.
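As a small worked sketch of that last step, the program below computes CPI for both versions from the two event counts. The numbers are placeholders chosen only to make the arithmetic visible; they are not measurements, and the real values would be read from the VTune results for the line or loop in question.

```fortran
program cpi_compare
  implicit none
  integer, parameter :: dp = kind(1.d0)
  ! Placeholder event counts (NOT measured values); substitute the
  ! CPU_CLK_UNHALTED.THREAD and INST_RETIRED.ANY counts from VTune.
  real(dp), parameter :: clk_64 = 8.0e9_dp, inst_64 = 4.0e9_dp
  real(dp), parameter :: clk_32 = 4.5e9_dp, inst_32 = 3.0e9_dp

  print '(a,f6.3)', '64-bit CPI:          ', cpi(clk_64, inst_64)
  print '(a,f6.3)', '32-bit CPI:          ', cpi(clk_32, inst_32)
  ! The cycle ratio is what reflects the actual speedup.
  print '(a,f6.3)', 'cycle ratio (64/32): ', clk_64 / clk_32

contains

  ! CPI = unhalted core cycles / retired instructions
  pure function cpi(cycles, instructions) result(r)
    real(dp), intent(in) :: cycles, instructions
    real(dp) :: r
    r = cycles / instructions
  end function cpi

end program cpi_compare
```

CPI on its own can mislead when the two versions retire different numbers of instructions, for example because the 32-bit version adds conversion instructions or packs more elements into each vector instruction. It is safer to read it together with the raw cycle counts.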