Why does Xeon Phi always get poor efficiency?

Po_Chang_W_ · ‎11-20-2014

I ran a for loop 1,000,000,000 times on a Xeon E5 and on a Xeon Phi and measured the time to compare their efficiency. I was surprised to get the following results:

On E5 (1 Thread): 41.563 Sec
On E5 (24 Threads): 22.788 Sec
Offload on Xeon Phi (240 Threads): 45.649 Sec

Why is the efficiency on the Xeon Phi so poor? I do nothing inside the for loop. Assuming my Xeon Phi coprocessor is not faulty, what kind of work performs well on Xeon Phi? Does it have to be vectorized? If the work cannot be vectorized, can I still use the Xeon Phi's threads for something useful?

TimP · ‎11-20-2014

It's difficult to make a meaningful trivial benchmark, but competitive performance on Intel(r) Xeon Phi(tm) usually requires a combination of effective vectorization and threading. It relies on both the greater vector width (512-bit) and the larger number of supported threads. Without expert hand coding, the best number of threads is unlikely to be as large as 240 (particularly for offload mode); the environment variable KMP_PLACE_THREADS is the easiest way to investigate.

TaylorIoTKidd · ‎11-20-2014

Po Chang,

You can find out whether your code is vectorizing by using the -qopt-report features of the Intel compilers, specifically "-qopt-report-phase=vec". See the ICC or ifort documentation.

Regards
--
Taylor
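As an illustration of the kind of loop the vectorization report is useful for, here is a minimal sketch. The function name saxpy and the file name saxpy.c are hypothetical and not from the thread. Assuming the Intel compiler, something like "icc -c -qopt-report -qopt-report-phase=vec saxpy.c" should make it report, per loop, whether it vectorized.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` promises the compiler that x and y do not overlap, which
 * removes one common obstacle to vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration is independent and works on 32-bit floats, so this is a loop shape a vectorizing compiler can usually handle. Whether it actually does for a given target is exactly what the report tells you.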

McCalpinJohn · ‎11-21-2014

I don't understand what you are measuring here....

If the "for" loop is doing nothing, then all you are measuring is OpenMP and offload overhead. In this case if you are only doing one offload and that single offload is running 1,000,000,000 iterations of a "for" loop, then your measurement is probably just OpenMP loop overhead.

OpenMP overheads for 240 threads on Xeon Phi are going to be larger than OpenMP overheads for Xeon E5 systems. The last time I checked, the EPCC OpenMP benchmarks in C (version 2) reported that the overhead of a PARALLEL FOR was about 8x higher for 240 threads on a Xeon Phi than for 16 threads on a 2-socket Xeon E5-2680 system. I did not try this in offload mode, so I don't know what difference that would make.
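To time something other than loop overhead, the loop body needs real work whose result is used afterwards. The following is only a sketch under that assumption: the function name sum_squares is made up for illustration, and it is not the code from the original post. The OpenMP pragma takes effect only when OpenMP is enabled at compile time.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of i*i for i in [0, n). The result is returned to the caller, so
 * the compiler cannot discard the loop the way it may discard an empty
 * one. */
uint64_t sum_squares(uint64_t n)
{
    uint64_t sum = 0;
    assert(n <= (uint64_t)INT64_MAX);   /* loop counter is a signed 64-bit int */
#ifdef _OPENMP
    /* Each thread accumulates a private partial sum; OpenMP adds them up. */
    #pragma omp parallel for reduction(+:sum)
#endif
    for (int64_t i = 0; i < (int64_t)n; i++) {
        sum += (uint64_t)i * (uint64_t)i;
    }
    return sum;
}
```

Timing a call such as sum_squares(1000000000) with 1 thread and with many threads then compares useful work plus overhead, rather than overhead alone.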

If you are looking for areas in which the Xeon Phi is faster than the Xeon E5, there are plenty of examples around. Xeon Phi is considerably faster on dense matrix arithmetic -- but this certainly requires vectorization. The Xeon Phi can also be considerably faster on bandwidth to/from L1 cache, L2 cache, and memory -- but these require attention to (various combinations of) vectorization, array alignment, software prefetching, etc.

Po_Chang_W_ · ‎11-22-2014

John D. McCalpin,

Thanks for your answer. You made me realize that I was measuring overhead in my code. What I actually want is to get all possible sub-strings of a long string on the Xeon Phi, but I got very bad performance. I know that good performance on Xeon Phi needs both parallelism and vectorization, so is it possible to get good performance in my case? If I am not mistaken, this problem cannot be vectorized. What can I do?

And thanks to everyone for helping me understand my question. Thanks!

jimdempseyatthecove · ‎11-22-2014

>>because of I want to get all possible sub-string from a long string

Define string, and define what constitutes a sub-string as it applies to your circumstance. Are there other considerations in how you perform the sub-string operation? For example, are your sub-strings recursive? That is, when multiple relatively long sub-strings are found, is each subsequently sub-stringed, or is the relatively long sub-string sub-stringed once?

Is this a compression problem? Or is it a classification problem?

Jim Dempsey

McCalpinJohn · ‎11-22-2014

It is not clear what sorts of strings you are using, but it is important to be aware that the SIMD instruction set on Xeon Phi does not support 8-bit or 16-bit packed data types. So most applications operating on 8-bit or 16-bit values will run in scalar mode, which is certainly not optimal for Xeon Phi.

You did not specify which generation of Xeon E5 you used for comparison. The AVX instruction set supported by the first generation and second generation Xeon E5 (Sandy Bridge EP and Ivy Bridge EP) also does not support 8-bit or 16-bit packed integers in the 256-bit AVX registers, but they do support 8-bit and 16-bit packed integers in the 128-bit SSE instruction set. The 3rd generation Xeon E5 (Haswell EP) supports the AVX-2 instruction set, which adds support for 8-bit and 16-bit packed integers in the 256-bit AVX registers.
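For comparison, this is roughly what 8-bit packed operations look like with the 128-bit SSE2 intrinsics available on the Xeon E5 side. It is a sketch only: the function name count_byte is hypothetical, and the code falls back to a plain scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Count how many bytes in buf[0..n) are equal to c. */
size_t count_byte(const unsigned char *buf, size_t n, unsigned char c)
{
    size_t i = 0, count = 0;
    assert(buf != NULL || n == 0);
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)c);   /* c in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        /* Unaligned load of 16 bytes, then 16 byte-wise compares at once. */
        __m128i v  = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(v, needle);
        /* One bit per lane: bit k is set when buf[i + k] == c. */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        while (mask != 0) {          /* clear the lowest set bit each pass */
            mask &= mask - 1;
            count++;
        }
    }
#endif
    for (; i < n; i++) {             /* scalar tail, or everything without SSE2 */
        count += (buf[i] == c);
    }
    return count;
}
```

The byte-wise compare has no direct equivalent in the Xeon Phi SIMD instruction set described above, which is why byte-oriented code tends to stay scalar there unless it is restructured around 32-bit elements.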

jimdempseyatthecove · ‎11-22-2014

It should be self-evident that multi-byte compares can be performed within supported integer values. A vector of 4-byte integers in AVX... registers, combined with a mask, can compare multi-byte strings. A 32-bit integer can compare 1, 2, 3 and 4 bytes.

Assume, as an example, that you are looking for a 12-byte sub-string that can have an arbitrary byte alignment. A permutation of 48 of these sub-strings could be identified using 12 different juxtapositions of data in registers. The algorithm could be very fast... as long as you do not program it as you would a sequential program in standard C.

Jim Dempsey
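The masked 32-bit compare mentioned above can be sketched in scalar C as follows. This is not Jim's algorithm, only an illustration of the single-lane idea; the names load32 and find_short are hypothetical. A vector version would apply the same mask-and-compare to many 32-bit lanes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 4 bytes into a 32-bit integer without alignment or aliasing issues. */
static uint32_t load32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Return the first offset in text[0..n) where pat[0..len) occurs,
 * or n if it does not occur. len must be 1, 2, 3 or 4. */
size_t find_short(const char *text, size_t n, const char *pat, size_t len)
{
    unsigned char pbytes[4] = {0, 0, 0, 0};
    unsigned char mbytes[4] = {0, 0, 0, 0};
    assert(len >= 1 && len <= 4);
    memcpy(pbytes, pat, len);       /* pattern, zero-padded to 4 bytes */
    memset(mbytes, 0xFF, len);      /* 0xFF over the bytes that matter */

    /* Building mask and pattern from bytes keeps this endian-neutral. */
    const uint32_t mask = load32(mbytes);
    const uint32_t want = load32(pbytes);

    size_t i = 0;
    for (; i + 4 <= n; i++) {       /* positions with 4 readable bytes */
        if ((load32(text + i) & mask) == want) {
            return i;               /* one integer compare covers len bytes */
        }
    }
    for (; i + len <= n; i++) {     /* last few positions: plain memcmp */
        if (memcmp(text + i, pat, len) == 0) {
            return i;
        }
    }
    return n;
}
```

The mask zeroes the bytes beyond len, so a single integer equality test stands in for a 1-, 2-, 3- or 4-byte string compare.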