Intel® oneAPI Math Kernel Library

Need the source code or exe binary.

OpenHero
Beginner

I want to get the source code or binaries for CPU and MIC, as shown at http://software.intel.com/en-us/intel-mkl

Benchmarks --> Intel Xeon Phi Coprocessor.

I want to benchmark them on MIC.

Who can tell me where to get them?

Hans_P_Intel
Employee

I know there were benchmark/source-code packages available via Premier (even prior to the launch). However, the High Performance Linpack (HPL) benchmark can be downloaded from http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download. The Linux archive contains the binaries for Intel Xeon Phi.

Regarding the other benchmarks, you probably refer to http://software.intel.com/en-us/intel-mkl#pid-12768-1295 ("BENCHMARKS" tab, "Intel® Xeon Phi™ Coprocessor" tab) -- here, I think the source code of the benchmarks used to generate the graphs/charts is not available for download.

The aforementioned benchmarks are meant to be straightforward calls into Intel MKL. If you look at the footnote of each chart, you will find information on how to reproduce the numbers, the system setup, etc. It should be fairly easy to reproduce them. I think the latest MPSS makes it even easier because some of the usual adjustments (huge pages, etc.) are intended to happen automatically.

As you already know, a place to look for developer information is http://software.intel.com/mic-developer/. Once you have your benchmark code ready, please have a look at thread affinitization. In particular, KMP_PLACE_THREADS makes your life easier (Compiler 13 Update 2, see also here). Feel free to share your numbers, and ask for help with Intel MKL if you need it.
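As an illustration of the affinity settings, here is a minimal sketch; the specific KMP_PLACE_THREADS value (60 cores, 4 threads per core) and the KMP_AFFINITY setting are only assumed examples for a 60-core coprocessor, not values taken from this thread.

/* Hypothetical run setup on the coprocessor (values are assumptions):
 *   export KMP_PLACE_THREADS=60c,4t
 *   export KMP_AFFINITY=compact,granularity=fine
 * The program below only reports how many OpenMP threads were created. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp master
        printf("running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}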

OpenHero
Beginner


Thank you very much! I will try those ways, and then I will publish the results :)

Hans_P_Intel
Employee

As you probably know, you have multiple options to run code on Intel Xeon Phi. In the case of Intel MKL, you have the Automatic Offload (AO) option in addition to offloading the code explicitly, or running an application natively on the coprocessor. With AO, you use the host system and the coprocessor heterogeneously. In the case of GEMM (and probably soon in other cases as well), one can utilize multiple coprocessors. Anyhow, I guess you are interested in pure Xeon Phi performance, probably even without the data transfer?
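As a hedged illustration of the AO path (not taken from the benchmark sources), a minimal sketch might look like the following; the matrix size of 4096 is an arbitrary assumption, large enough that Automatic Offload may decide to use the coprocessor, and it assumes the AO control function mkl_mic_enable() from mkl.h (setting MKL_MIC_ENABLE=1 in the environment would be the equivalent alternative).

/* Sketch of Intel MKL Automatic Offload (AO): enable AO, then call DGEMM
 * as usual; MKL decides at run time whether to use the host, the
 * coprocessor, or both. n = 4096 is an assumption for illustration. */
#include <mkl.h>
#include <stdio.h>

int main(void)
{
    const int n = 4096;
    double *a = (double*)mkl_malloc(sizeof(double) * n * n, 64); /* 64-byte aligned */
    double *b = (double*)mkl_malloc(sizeof(double) * n * n, 64);
    double *c = (double*)mkl_malloc(sizeof(double) * n * n, 64);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 1.0; c[i] = 0.0; }

    mkl_mic_enable(); /* request Automatic Offload for supported routines */

    double t0 = dsecnd();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double elapsed = dsecnd() - t0;

    printf("DGEMM with AO enabled: %.3f s (%.1f GFLOP/s)\n",
           elapsed, 2.0 * n * n * n / elapsed * 1e-9);

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}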

For example, with GEMM you can offload the GEMM call and measure the time within the offloaded region. Pseudo code:

/* pseudo code: the offloaded function (and anything it calls) runs on the coprocessor */
__attribute__((target(mic)))
void myfunc(..., time, ...)
{
   start = tick();          /* timing starts on the coprocessor */
   DGEMM(...);              /* Intel MKL GEMM call */
   time = tick() - start;   /* elapsed time excludes the data transfer */
}

/* call site on the host */
#pragma offload target(mic)
myfunc(..., time, ...);

As you can see, you can offload an entire call chain of arbitrary code to the coprocessor. In the above case, the timing is done inside the offloaded region; hence you omit the time of the data transfer. Of course, you can also compile the application for "native MIC". Anyhow, the code inside an offload region is no less native than code cross-compiled with "-mmic"; hence "native" is sometimes better called manycore-hosted. As an update, note that the aforementioned Update 2 of the Intel Compiler also made the OpenMP 4.0 pragmas/directives available. Of course, this is in addition to the Language Extensions for Offload ("LEO").
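To make the pseudo code above concrete, here is a minimal hedged sketch in the LEO style; the matrix size, the initialization, the use of omp_get_wtime() for timing, and the "icc -openmp -mkl" compile line are assumptions for illustration. The elapsed time is taken inside the offloaded region, so the host-coprocessor data transfer is excluded.

/* Sketch: offload one DGEMM to the coprocessor and time it inside the
 * offloaded region. Assumed compile line: icc -openmp -mkl example.c
 * n = 4096 is illustrative only. */
#include <mkl.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 4096;
    double *a = (double*)mkl_malloc(sizeof(double) * n * n, 64);
    double *b = (double*)mkl_malloc(sizeof(double) * n * n, 64);
    double *c = (double*)mkl_malloc(sizeof(double) * n * n, 64);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 1.0; c[i] = 0.0; }

    double elapsed = 0.0;
    #pragma offload target(mic) in(a, b : length(n * n)) inout(c : length(n * n))
    {
        double t0 = omp_get_wtime();   /* timing starts on the coprocessor */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        elapsed = omp_get_wtime() - t0;
    }

    printf("DGEMM on the coprocessor: %.3f s\n", elapsed);
    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}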

For the timing, I think thread affinitization and memory alignment are the main factors for performance. Let me know if you have further questions.

OpenHero
Beginner

Thanks, Hans.

SergeyKostrov
Valued Contributor II
>>...I think the thread affinitization and the memory alignment are the main components for performance...

Also:

- C++ compiler optimization (for example, Debug non-optimized code is always slower than Release code compiled with option /O2)
- Instruction set (for example, AVX2/AVX code will be faster than SSE2/SSE code)
- Correct cache management (for example, the loop-blocking optimization technique is better than simple for-loop processing; see the sketch below)
- A highly optimized algorithm (for example, the classic matrix multiplication algorithm is significantly slower than the Strassen matrix multiplication algorithm)
- Multi-threading support (for example, applying OpenMP can easily improve the performance of some processing)

There are many, many different things that can affect the performance of a test or a real application. Take a look at this thread for more details:

Forum topic: A basic question about auto vectorization of 3-level nested loop
Web link: software.intel.com/en-us/forums/topic/370360

It is an example of how a very small modification negatively affected the performance of some processing.
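To illustrate the loop-blocking point, here is a minimal hedged sketch of a blocked matrix multiplication in C; the block size of 64 and the assumption that the output matrix is zero-initialized are arbitrary choices for illustration, and a tuned library such as Intel MKL will of course be far faster.

/* Blocked (tiled) matrix multiplication: the inner loops reuse BS x BS
 * sub-blocks of A and B while they are still in cache. BS = 64 is an
 * assumed block size; the best value depends on the cache sizes.
 * The output matrix c is expected to be zero-initialized by the caller. */
#define BS 64

void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply one block of A with one block of B */
                for (int i = ii; i < ii + BS && i < n; ++i)
                    for (int k = kk; k < kk + BS && k < n; ++k)
                    {
                        const double aik = a[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}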