Poor FFT mkl performance (2)

Vladimir_Dergachev — Tue, 11 Nov 2014 18:59:38 GMT

(This is a new issue after the issue with mkl_dft_grasp_user_thread() was solved in previous post by using separate fft plans in each thread.)

I am optimizing a new application (written with Xeon Phi in mind) which performs a lot of FFT transforms.

The transforms are done on 512x512 arrays separately in each thread. This works quite well on Xeon host. When running on Xeon Phi in native mode the performance is much slower than expected - at best 50% of the performance of the host.

I attached a screenshot of VTune Amplifier. It shows that most of the time is spent in the __svml_cdiv8_ha_mask() function; I was not able to find the source for it. This is followed by very high CPU usage in the sinl()/cosl() functions, which show no vectorization.

This is highly surprising because I would have expected sin/cos to be called only during plan creation and, most importantly, these are double-precision FFTs, which have no reason to call the long double sinl()/cosl() (in fact I do not use long double anywhere in my program).

Any suggestions?

Thank you very much

Vladimir Dergachev

Vladimir_Dergachev — Wed, 12 Nov 2014 02:08:08 GMT

This turned out to be due to two independent causes:

* the high usage in __svml_cdiv8_ha_mask() went away after upgrading to a newer version of MKL

* the sinl()/cosl() usage was not due to the FFT library, but because elsewhere in the code I used the cexpf() function. Somehow this ends up calling sinl()/cosl()/expl() in Xeon Phi code, which is very strange.

best

Vladimir Dergachev

TimP — Wed, 12 Nov 2014 04:41:28 GMT

You would require icc -fast=2 to engage the fast, limited-range and limited-domain complex math functions. The options for full-range complex arithmetic on MIC are unsatisfactory (perhaps I too might have called them strange if I had read only the ad about supporting full IEEE floating point). The strange point is that with these options there is no way to gain language compliance for parentheses, so it may involve extreme defensive coding, beyond what you would expect for limited range (roughly half the exponent range of the float data type).
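
To illustrate what "limited range" means in general, here is a minimal sketch in plain C. It is not what icc or SVML actually generates, and the names cf32, cdiv_naive and cdiv_smith are hypothetical. The textbook quotient squares the divisor's components, so it breaks down once magnitudes pass roughly the square root of FLT_MAX, which is about half the float exponent range; a scaled form (Smith's algorithm) avoids the squares.

```c
#include <math.h>

/* Plain struct, so the compiler's own complex-division routine is not involved. */
typedef struct { float re, im; } cf32;

/* Textbook ("limited range") quotient: squares the divisor's components. */
static cf32 cdiv_naive(cf32 x, cf32 y)
{
    /* volatile forces the sum to be rounded to float, so the overflow is
       not hidden by wider intermediate precision on some targets */
    volatile float den = y.re * y.re + y.im * y.im;
    cf32 q;
    q.re = (x.re * y.re + x.im * y.im) / den;
    q.im = (x.im * y.re - x.re * y.im) / den;
    return q;
}

/* Smith's algorithm: scales by the ratio of the divisor's components,
   so no component is ever squared. */
static cf32 cdiv_smith(cf32 x, cf32 y)
{
    cf32 q;
    if (fabsf(y.re) >= fabsf(y.im)) {
        float r = y.im / y.re;
        float den = y.re + y.im * r;
        q.re = (x.re + x.im * r) / den;
        q.im = (x.im - x.re * r) / den;
    } else {
        float r = y.re / y.im;
        float den = y.re * r + y.im;
        q.re = (x.re * r + x.im) / den;
        q.im = (x.im * r - x.re) / den;
    }
    return q;
}
```

For x = y = 1e20 + 1e20i the squares in cdiv_naive overflow float and the quotient is no longer 1, while cdiv_smith still returns 1 + 0i. For moderate inputs both forms agree to within rounding.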

Vladimir_Dergachev — Wed, 12 Nov 2014 17:37:41 GMT

Thanks for the suggestion!

It was even stranger than you describe - cexpf() should operate on single-precision floats, but the functions actually called were computing in long double.

The solution was to simply not use the complex functions and just call sin()/cos() directly.