I'm working on two machines: an Intel Core i5-6400 (2.7 GHz, Windows 8.1, 8 GB RAM) and an Intel Core i7-5930K (3.5 GHz, Windows 8.1, 32 GB RAM).
I need to build a roofline model for my application. For that purpose I use Intel Amplifier 2017 and Intel SDE.
With SDE I measure the arithmetic intensity by counting mem-read and mem-write, and the total number of GFLOP by counting elements_fp_<...>.
I use Intel Amplifier 2017 to measure the maximum bandwidth, taken from the HPC Performance Characterization test results (on my machine it shows 17 GB/s).
However, when I profile an MKL benchmark (matrix multiplication, the sgemm function), Intel Amplifier and Intel SDE report different FLOP counts: the Intel Amplifier 2017 result is twice as large as the SDE result. (This is not the case for the STREAM benchmark, where I get approximately the same values.) Moreover, the arithmetic intensity calculated with SDE, combined with the GFLOPS calculated by either SDE or Amplifier, gives a point that lies well outside the roofline model limits.
Are there any particular issues when using Intel Amplifier/SDE with MKL library functions?
Could you please let me know if I'm using Intel SDE and Amplifier in the right way to estimate FLOPs, Arithmetic Intensity and max bandwidth?
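For reference, the roofline check described above can be sketched as follows. This is only a minimal illustration: the 17 GB/s bandwidth is the value measured above, while the peak GFLOPS figure and the measured point are placeholder assumptions, not numbers from these machines, and the function names are hypothetical.

```python
def arithmetic_intensity(flop, bytes_read, bytes_written):
    """FLOP per byte of memory traffic (reads plus writes)."""
    return flop / (bytes_read + bytes_written)


def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, ai * bandwidth_gbs)


def within_roofline(measured_gflops, ai, peak_gflops, bandwidth_gbs, tol=0.05):
    """True if the measured point lies under the roof, with a small tolerance."""
    return measured_gflops <= attainable_gflops(ai, peak_gflops, bandwidth_gbs) * (1 + tol)


if __name__ == "__main__":
    BANDWIDTH_GBS = 17.0   # measured value quoted in the post
    PEAK_GFLOPS = 100.0    # placeholder assumption, replace with the real peak
    ai = arithmetic_intensity(flop=4.0e9, bytes_read=1.5e9, bytes_written=0.5e9)
    print("AI (FLOP/byte):", ai)
    print("Attainable GFLOPS:", attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS))
    print("Point inside roofline:", within_roofline(60.0, ai, PEAK_GFLOPS, BANDWIDTH_GBS))
```

A point that fails this check usually means that one of the three inputs (FLOP count, byte count, or bandwidth roof) was measured inconsistently with the others.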
If you want to build a roofline model, I would recommend using Intel Advisor 2017 (update 1), not VTune Amplifier. As of the recent update, Advisor now has a built-in roofline analysis feature.
You'll need to set an environment variable to use roofline: ADVIXE_EXPERIMENTAL=roofline
And you'll need to make sure that your project properties have the "Collect FLOPS" checkbox ticked for trip counts.
Many thanks for your advice. I will try to build it with Intel Advisor 2017.
As a matter of interest, why were Intel SDE and Intel Amplifier giving me different estimates? And what could be wrong with the Intel SDE estimate? I was precisely following these recommendations: http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/ and https://software.intel.com/en-us/articles/calculating-flop-using-intel-software-development-emulator...
Let me check one guess. The benchmark that gives different results in VTune and SDE most likely contains FMA instructions, and you wrote that you used the elements_fp_<...> counters to calculate FLOPS. The SDE documentation points out that you should also add the VFMADD213PD_* counts to count FLOPS correctly in the FMA case. Was that done in your case? If not, that explains the situation, since VTune counts FLOPS taking FMA into account: each FMA performs a multiply and an add, so a kernel dominated by FMAs, such as sgemm, would come out at roughly half the true count in SDE, which matches the factor of two you see. The STREAM benchmark most likely does not use FMA, so the results there are close.
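The correction can be sketched roughly like this. It assumes, as described above, that the elements_fp_<...> counters count each FMA instruction once per vector element, so the FMA element count has to be added a second time. The function name and the counts below are hypothetical and not taken from a real SDE report.

```python
def total_flop(elements_fp, fma_instructions):
    """Estimate FLOP from instruction counts.

    elements_fp:      {elements_per_instruction: instruction_count}, the
                      elements_fp_<...> style counters (FMAs counted once here).
    fma_instructions: {elements_per_instruction: fma_instruction_count},
                      the FMA instructions, keyed by vector width.
    """
    base = sum(width * count for width, count in elements_fp.items())
    # Each FMA does a multiply and an add, so add its elements once more.
    fma_extra = sum(width * count for width, count in fma_instructions.items())
    return base + fma_extra


if __name__ == "__main__":
    # Hypothetical counts: mostly 8-wide single-precision FMAs, as in a GEMM kernel.
    elements_fp = {1: 10, 8: 1000}
    fma_instructions = {8: 900}
    uncorrected = total_flop(elements_fp, {})
    corrected = total_flop(elements_fp, fma_instructions)
    print("Without FMA correction:", uncorrected)
    print("With FMA correction:   ", corrected)
    print("Ratio:", corrected / uncorrected)
```

With a kernel that is almost entirely FMA, the ratio approaches two, which is the discrepancy reported for sgemm.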
Thank you, Regards, Dmitry