Math intrinsic performance issue with ifort17

mriedman · 01-26-2018

After migrating my code from ifort15 to ifort17, I observe a substantial slowdown caused by the pow() intrinsic. The runtime profile shows __bwr_pow with about 2.6x the sample count of the previous pow.L (240081 vs. 92752 samples). The SVML version is also notably slower (43250 vs. 31915 samples). The other routines have not changed much.

Build and profiling were done on a Haswell running RH6. Relevant FP compile switches are:

-qopenmp -O3 -fp-model fast=1 -r8 -xAVX -fp-model fast=2 -no-prec-div -no-prec-sqrt -ftz -fast-transcendentals

Profile with ifort15 exe:

samples  %        symbol name
158163   12.9953  __kmp_hyper_barrier_release
118068    9.7009  bicgstab_solv_mp_sparse_matvec_
92752     7.6209  pow.L
88174     7.2447  xschem_mod_mp_xschem_part2_continuity_energy_
80151     6.5855  bicgstab_solv_mp_bicgstab_kahan_dp_
57120     4.6932  intfr_IP_intfr_chan_
39386     3.2361  heat_mp_heat_loop_a_
31915     2.6223  __svml_pow4_l9
31106     2.5558  xschem_mod_mp_xschem_part1_momentum_
29114     2.3921  post3d_

Profile with ifort17 exe:

samples  %        symbol name
240081   17.4939  __bwr_pow
132852    9.6805  _INTERNAL_25_______src_ ... __kmp_hyper_barrier_release
116221    8.4686  bicgstab_solv_mp_sparse_matvec_
80016     5.8305  xschem_mod_mp_xschem_part2_continuity_energy_
78456     5.7168  bicgstab_solv_mp_bicgstab_kahan_dp_
57502     4.1900  intfr_IP_intfr_chan_
43250     3.1515  __svml_pow4_br_e9
39836     2.9027  heat_mp_heat_loop_a_
34180     2.4906  __kmp_yield
32767     2.3876  __bwr_floor
26206     1.9095  xschem_mod_mp_xschem_part1_momentum_
21549     1.5702  post3d_
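
To put numbers on the comparison, the pow-related rows of the two profiles above can be compared with a few lines of Python. This is only a rough sketch: it assumes the rows keep the "samples / % / symbol name" layout shown above, the helper name parse_profile is made up for illustration, and raw sample counts from two separate runs are only approximately comparable.

```python
def parse_profile(text):
    """Map symbol name -> sample count from 'samples % symbol' rows."""
    counts = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not start with a count.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        counts[fields[-1]] = int(fields[0])
    return counts


# pow-related rows copied from the two profiles in this post.
IFORT15 = """
samples  %        symbol name
92752     7.6209  pow.L
31915     2.6223  __svml_pow4_l9
"""

IFORT17 = """
samples  %        symbol name
240081   17.4939  __bwr_pow
43250     3.1515  __svml_pow4_br_e9
32767     2.3876  __bwr_floor
"""

old = parse_profile(IFORT15)
new = parse_profile(IFORT17)

# Ratio of sample counts: new scalar pow vs. old, new SVML pow vs. old.
scalar_ratio = new["__bwr_pow"] / old["pow.L"]
svml_ratio = new["__svml_pow4_br_e9"] / old["__svml_pow4_l9"]

print(f"scalar pow: {scalar_ratio:.2f}x the samples")
print(f"SVML pow:   {svml_ratio:.2f}x the samples")
```

Note that __bwr_floor only shows up in the ifort17 profile, so it has no counterpart to compare against.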

Any idea how this runtime regression can be resolved?

mriedman · 02-01-2018

It turned out that this issue was caused by conflicting compile switches. If the option list contains "-fp-model consistent" and "-fp-model fast=2" appears somewhere after it, the conservative option still overrides the aggressive one. The last option does not win. I am not sure whether this is a bug or a feature.

The conservative option selects different versions of the math intrinsics, in this case __bwr_pow.
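
One way to catch this kind of conflict early is to list every -fp-model value in the option list before building. The Python sketch below does that. The helper name fp_model_settings and the example option list are hypothetical, and it only handles the space-separated "-fp-model value" form used in this thread. It does not model how ifort resolves the options; it only makes multiple settings visible.

```python
import shlex


def fp_model_settings(command_line):
    """Return every value passed to -fp-model, in order of appearance."""
    tokens = shlex.split(command_line)
    values = []
    for i, tok in enumerate(tokens):
        # Only the "-fp-model <value>" form is handled in this sketch.
        if tok == "-fp-model" and i + 1 < len(tokens):
            values.append(tokens[i + 1])
    return values


# Hypothetical option list combining the two settings described above.
flags = "-qopenmp -O3 -fp-model consistent -xAVX -fp-model fast=2"

settings = fp_model_settings(flags)
if len(set(settings)) > 1:
    print("multiple -fp-model settings found:", ", ".join(settings))
```

Running a check like this over the full option list, including anything inherited from makefiles or environment variables, shows whether a conservative setting is hiding in front of the aggressive one.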