Solved: Why does GotoBLAS give such low efficiency (what is wrong with my steps)?

Rancho_L_ · ‎04-16-2015

I use GotoBLAS and MPICH to run HPL on a cluster (the CPU is an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz). I compiled GotoBLAS in two ways: (1) make (2) make USE_THREAD=0 TARGET=NEHALEM. The library used in the HPL makefile is libgoto.a. However, both builds lead to low HPL efficiency: only 150 GFLOPS (the theoretical peak is 330 GFLOPS). Did I make a mistake in compiling GotoBLAS? Thanks for your answer.

TimP · ‎04-16-2015

As you don't use Intel Xeon Phi, the subject isn't topical here.

Did you see the advice to set the target to core2 rather than Nehalem if you can't upgrade to OpenBLAS? The OpenBLAS project would seem a better source for advice.

Rancho_L_ · ‎04-16-2015

Thanks for your answer. I know I asked the question in the wrong place, but I don't know where to find experts... I will try your method, thank you.

McCalpinJohn · ‎04-16-2015

A more appropriate forum might be:

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring

In this case the answer is easy -- the Nehalem target generates SSE2/3/4 code (128-bit SIMD vectors), while your Xeon E5-2670 (Sandy Bridge EP) processor requires AVX code (256-bit SIMD vectors) to achieve full speed. So you are getting 150 GFLOPS out of a peak of 165 GFLOPS (using SSE code), which is about 91% of peak.
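The arithmetic behind those figures can be sketched as follows. This is only an illustration, not part of the original measurements: it assumes a dual-socket node (2 x 8 cores, which is what the quoted 330 GFLOPS peak implies at 2.6 GHz), 4 double-precision flops per cycle per core for SSE code, and 8 for AVX code on Sandy Bridge. The function and variable names are hypothetical.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical double-precision peak in GFLOPS."""
    return cores * ghz * flops_per_cycle

# Assumption: 2 sockets x 8 cores per node (not stated in the thread,
# inferred from the quoted 330 GFLOPS peak).
CORES = 16
GHZ = 2.6                 # nominal clock of the Xeon E5-2670
SSE_FLOPS_PER_CYCLE = 4   # 128-bit add + 128-bit multiply per cycle
AVX_FLOPS_PER_CYCLE = 8   # 256-bit add + 256-bit multiply per cycle

sse_peak = peak_gflops(CORES, GHZ, SSE_FLOPS_PER_CYCLE)
avx_peak = peak_gflops(CORES, GHZ, AVX_FLOPS_PER_CYCLE)
measured = 150.0          # HPL result reported in the question

print(f"SSE peak: {sse_peak:.1f} GFLOPS, efficiency {measured / sse_peak:.1%}")
print(f"AVX peak: {avx_peak:.1f} GFLOPS, efficiency {measured / avx_peak:.1%}")
```

Under these assumptions the SSE peak comes out near 166 GFLOPS and the AVX peak near 333 GFLOPS, so the measured 150 GFLOPS is roughly 90% of what SSE code can deliver but only about 45% of the AVX peak.
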

The author of GotoBLAS worked at TACC (http://www.tacc.utexas.edu/) when I started working at TACC in 1999. He left for industry well before we received our first Xeon E5 (Sandy Bridge EP) processors, so it was never optimized for that target. The OpenBLAS project (http://www.openblas.net/) added Sandy Bridge support, and is continuing to add support for Haswell and other newer processors.

Rancho_L_ · ‎04-17-2015

Thank you for your explanation. Now I finally know the reason for the low efficiency. (My teacher will forgive me for such a low efficiency :) ) Thank you very much!