I would like to tune the attached code for Xeon Phi as fast as possible.
As reference I have SSE, which is around 10 times faster.
SSE E5-1650v2 6 core = 0m1.081s
KNC 5110P = 0m11.218s
(KNL is still in progress; I must do it in two halves because of the missing AVX512BW ...).
To simplify, in the attachment I have data with a representative portion of the code in one file.
Please comment/uncomment the lines for the different architectures and fold/collapse the data area to make it smaller and clearer.
The code is just a bubble sort on arrays of vectors; it then sums the elements of each vector and stores the result if it is the minimum (in an omp critical).
There is enough work to fill 240-280 threads; the original code has around 700 MPI tasks with 80 OMP threads each.
most important question
how should I write prefetching?
I've tried pragmas, _mm_prefetch and _mm_clevict, but I didn't manage well (pragmas improved things by ~10%).
I would like to come as close to SSE as possible.
is prefetching dependent on the configuration?
for example, could a different coprocessor, say a 3110, require different prefetch values?
or could changing the RAM in KNL also require different values in the code?
if you give me a solution, could my configuration need slightly different values?
is it possible to use a faster stable sorting algorithm (insertion or counting sort) for vectors?
Searching the internet I saw something like: it is only possible on newer architectures because of the newer intrinsics for tracking conflicts/indexes.
This performance difference is not surprising -- the Xeon E5-1650 v2 can run at up to 3.9 GHz and up to 4 instructions per cycle, while the Xeon Phi runs at 1.05 GHz with up to 2 instructions per cycle.
Your KNC code looks like a very good start for KNL -- most of the intrinsics are the same, and the ones you use are almost all fast. The slowest intrinsics in the KNC code are likely to be the _mm512_reduce_add_epi32 operations. These don't map to a single underlying AVX-512 instruction, so you would have to look at the generated binary to see what the compiler is doing with them. The _mm512_mullo_epi32 has a latency of 7 cycles and can only run in one of the two FP units, but it is fully pipelined so you can start one per cycle.
Unlike KNC, KNL is compatible with all the older instruction sets, so you can run SSE and/or AVX code there directly. Whether this is a good idea or not depends on the specific instructions -- most of the more common instructions run at "full speed", but many of the less common variations are microcoded and run much more slowly (~10x). The _mm_blendv_epi8 intrinsic in your SSE code maps to the VPBLENDVB instruction, which on KNL only executes in one of the two FP units and runs at a throughput of 1 instruction every 8 cycles. The Xeon E5-1650 v2 has a throughput of one instruction every two cycles on the same instruction.
A comprehensive set of instruction timings for Intel/AMD/Via processors that includes KNL is Agner Fog's "instruction tables" -- https://www.agner.org/optimize/instruction_tables.pdf
The VPCONFLICTD instruction is available in KNL and the implementation is quite fast -- 3 cycle latency and one instruction per cycle throughput. I don't know how to use it for your application, but if you have access to a KNL at least you can try that approach.
Ahh, I was looking in the wrong direction for a few days ... thanks for showing me the right way. So, if I change some slow intrinsics and get the time down to 8-9 sec, I can declare myself satisfied. It makes sense: the data is too small to see a big gain from prefetching from RAM, so my second question is moot.
About insertion/counting sort, the discussion was 2-3 years ago when the conflict intrinsics were new; I will try again now.
Thank you again for guiding me. I'm new to the subject; I didn't know about the latency tables (maybe I skipped over them) or where I should look for the architecture details ("two FP units").
Agner Fog's "instruction_tables.pdf" is the most comprehensive single document for latency and throughput, with the added benefit of including AMD (and Via) processors and maintaining all the historical results in mostly the same presentation form.
Agner Fog's "microarchitecture.pdf" (https://www.agner.org/optimize/microarchitecture.pdf) is the most comprehensive single document to describe the microarchitecture of Intel/AMD/Via microprocessors. I have found this to be an excellent tutorial because it starts all the way back with the Intel Pentium (when the processor implementation was much less complex) and allows the reader to learn about the increasing complexity of the processors incrementally as each new generation is introduced.
A resource with a massive and detailed database of instruction throughput and latency data for Intel processors is https://uops.info (does not include KNL).
Intel's "Intel® 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, revision 042b, September 2019) is also an excellent reference. Chapter 2 gives excellent microarchitecture overviews for a subset of Intel processors (not including KNL), and Appendix C provides a fair amount of data on instruction latency and throughput (but not as comprehensive as Agner's tables). Sometimes Intel makes it hard to find their documentation on the web, but they have been doing better lately, with https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html serving as a starting point for access to all of the software developer guides and optimization manuals (but not the model-specific documentation, such as the "specification updates").
Another good source for architecture/microarchitecture information is https://en.wikichip.org/wiki/WikiChip. The coverage is uneven, but when topics are covered, they are generally covered very well.