Intel processors and average floating point performance

LRaim · ‎07-18-2016

For development I am currently using a portable workstation with an Intel Core i7-4810MQ 2.80 GHz.
I am also planning to buy a faster desktop for long simulations, regression tests, etc., probably with a Xeon processor.
Though I have spent some time searching performance charts, I have been unable to get a clear idea of the possible speed-up for long simulations.
Does Intel have a reference chart on floating-point processor performance?

TimP · ‎07-18-2016

https://www.spec.org/cgi-bin/osgresults?conf=cpu2006

https://www.spec.org/omp2012/results/

Intel leaves these submissions up to the vendors, and leaves it up to you whether any of those benchmarks relate to your interests.

Steven_L_Intel1 · ‎07-18-2016

There is not one single metric for "floating point performance" that is useful for selecting a processor for a particular application. There are many variables, and there is often a tradeoff among processors for these. You need to understand your application's execution characteristics to have a hope of making an informed choice.

How well does the application parallelize?
How well does the application vectorize?
How much memory traffic within and between threads?
What is the cache behavior?
And most important - how much is total application performance dominated by FP?

I suggest using VTune to understand the application's performance and bottlenecks. Look at cache misses, instruction stalls, etc. Tim, Jim Dempsey and others have a lot more experience doing these sorts of things and perhaps can give you further advice. Vectorization Advisor, part of Intel Advisor XE, can also be helpful.

If ultimate run-time performance is important you should also look at Xeon Phi.

Brooks_Van_Horn · ‎07-18-2016

I have a friend who does performance testing of hardware configurations for Adobe Premiere, and what they have reported in the Adobe Forum is that the 18 core Xeon is poorly handled by Windows. In fact, the 14 core is the better performing version. Note that this is a Microsoft problem and NOT an Intel problem.

jimdempseyatthecove · ‎07-18-2016

Is this a 2x 18 core versus 2x 14 core?

If so, then it may be an issue between the O/S and the affinity pinning in OpenMP.

Windows uses a processor groups method on larger systems. Each group will have .LE. 64 logical processors. A group may have more than one CPU (package) but not more than 64 logical processors.

A 2x 14 core (2x 28 hardware thread, 56 HT) can map to one group, whereas 2x 18 core (72 HT) will map to two processor groups (each having 36 HT).
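The group arithmetic above can be sketched in a few lines. This is only a simplified model, not the actual Windows assignment algorithm: it assumes whole packages are placed into groups greedily and that a package is never split across groups. The function name `assign_groups` and its parameters are hypothetical, chosen for illustration.

```python
GROUP_LIMIT = 64  # maximum logical processors per Windows processor group


def assign_groups(packages, cores_per_package, threads_per_core=2,
                  limit=GROUP_LIMIT):
    """Return the logical-processor count of each group (simplified model).

    Assumption: whole packages are packed greedily into groups and a
    package is never split. Real Windows behavior may differ.
    """
    per_package = cores_per_package * threads_per_core
    if per_package > limit:
        raise ValueError("this simplified model does not split a package")
    groups = []
    for _ in range(packages):
        # Add the package to the current group if it still fits,
        # otherwise open a new group for it.
        if groups and groups[-1] + per_package <= limit:
            groups[-1] += per_package
        else:
            groups.append(per_package)
    return groups


print(assign_groups(2, 14))  # 2x 14 core with HT
print(assign_groups(2, 18))  # 2x 18 core with HT
```

Under these assumptions the 2x 14 core system yields a single group of 56 logical processors, while the 2x 18 core system yields two groups of 36.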

If the OpenMP library is not handling Windows affinity groups, then the behavior might be that the application is running 72 threads bound to 36 hardware threads.

I will venture to guess that the latest versions of the OpenMP runtime for Windows handle multiple affinity groups.

Jim Dempsey

LRaim · ‎07-19-2016

Steve,

I was asking a simple question since I would expect such a table to be available from Intel. What I need is to put together a simple table with two columns: average floating point speed vs price. I.e. if I run a problem which completes in 2 hr, how much should I spend to obtain the same result in 1 hr? The optimization of the application is another problem.

Best regards

jimdempseyatthecove · ‎07-19-2016

Luigi,

The performance of a specific application versus CPU will greatly depend upon the application.

Is it all scalar?
How much of the code is vectorizable?
What is the predominant vector width?
Will extended instructions aid (e.g. FMA)?
What size of L2 cache is most suitable?
What size of L3 cache is most suitable?
What are the effects of memory bandwidth?
To what degree is the code parallelizable?
Can the parallel code take advantage of NUMA configuration?
... other ...

For specific tests you might consider looking at the SPEC benchmarks (floating point results).
Or CPU Benchmarks. See bottom of page for various CPU/system filters (don't rely on prices).

For raw clock cycles Agner Fog has some information.
*** However these numbers are instruction cycles only and do not reflect the latencies from L1, L2, L3 and RAM (and concurrent latency effects of multiple threads).

Jim Dempsey

Steven_L_Intel1 · ‎07-19-2016

My point was that there is no simple answer to the question. Floating point performance has too many variables beyond the CPU. For example, if your application parallelizes well, more cores are better even if they are a bit slower. If not, then fewer, faster cores would generally be better. If it vectorizes well, a processor with the most advanced AVX instruction set would be best, otherwise another one might be a better choice. Beyond raw CPU speed, cache and memory behavior are also critical to actual performance.
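The cores-versus-clock tradeoff can be illustrated with Amdahl's law. The sketch below is a rough model only: it assumes run time scales inversely with clock frequency and ignores cache, memory bandwidth, vectorization and per-generation efficiency, all of which matter in practice. The two processor configurations and the parallel fractions are hypothetical numbers, not real products or measurements.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup over one core for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def relative_throughput(parallel_fraction, cores, clock_ghz):
    """Crude figure of merit: clock rate times Amdahl speedup.

    Assumption: identical work per clock cycle on both processors.
    """
    return clock_ghz * amdahl_speedup(parallel_fraction, cores)


# Hypothetical configurations: few fast cores vs many slower cores.
few_fast = {"cores": 4, "clock_ghz": 3.5}
many_slow = {"cores": 16, "clock_ghz": 2.4}

for p in (0.50, 0.95):
    a = relative_throughput(p, **few_fast)
    b = relative_throughput(p, **many_slow)
    winner = "few fast cores" if a > b else "many slower cores"
    print(f"parallel fraction {p:.2f}: {winner} come out ahead")
```

In this model the 16-core part wins only when most of the run time is parallel, and the preference flips when half of the run time is serial.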

I also mentioned Xeon Phi - this can be excellent when your application vectorizes and parallelizes well, but it's a more complex (and expensive) solution.

Any such table we published would be of little practical use unless your application exactly matches the benchmark program(s) run to determine the rating. Tim mentioned the SPEC benchmarks - they're a good first start at seeing what is possible, but the system vendors tend to submit only top-of-the-line results and this isn't really helpful in deciding among current CPU models.

Lastly, I will note that cycle speed (GHz) is an imperfect measure of performance, as there are differences in efficiency among different processor generations.

There are third-party sites that attempt to do comparisons among different CPUs, but again you need to look at what they are actually testing as it tends not to be FP-focused. An example is https://www.cpubenchmark.net/

LRaim · ‎07-20-2016

Jim and Steve,
though I am (also) a sw developer I am first an engineer who has to solve problems, so the question is from a user point of view, not from a developer perspective. E.g. to estimate the cost of a study a user could need to know how many simulations he can complete in 8 hours = 1 day of work.
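The runs-per-day arithmetic is simple once a run time is known. The sketch below assumes the run time has actually been measured on the machine in question; it does not predict it. The function name `runs_per_day` is hypothetical, and the 2 hr and 1 hr figures are the ones used earlier in this thread.

```python
def runs_per_day(runtime_minutes, workday_minutes=8 * 60):
    """Number of complete back-to-back runs that fit in one working day."""
    if runtime_minutes <= 0:
        raise ValueError("runtime_minutes must be positive")
    # Integer division: a partially finished run does not count.
    return workday_minutes // runtime_minutes


print(runs_per_day(120))  # one run takes 2 hr
print(runs_per_day(60))   # one run takes 1 hr
```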

Thanks for the links supplied.

Steven_L_Intel1 · ‎07-20-2016

The question is not answerable without analysis of the exact application used, and even then there are so many variables that trying to predict run-times is a pointless exercise.

jimdempseyatthecove · ‎07-20-2016

Luigi,

In ca 2005 I had a requirement to do some simulation work. After writing my own simulator I wanted something more robust and fully tested, and used by others. About that time I acquired a simulation package, written in Fortran 77. Approximately 750 source files and over 500,000 lines of code. The code was written without consideration of parallel coding (target platforms were only single core/cpu), and no vectorization (CPUs typically did not support SIMD). And small memory footprint.

Using the original code, the estimated simulation run time for a high fidelity model was years. The solution I took was to first convert from F77 to F95, principally to use allocatables, and then additionally to use pointers. The second phase of enhancement was to adapt to parallelization (OpenMP), then the third phase was to improve array layouts for SIMD vectorization.

On a 4 core with HT system, the coding changes yielded a 40x improvement in performance, and at whatever number of elements I wish to use.

The point I am getting at (and I assume Steve would agree) is that the answer to a "simple question" of which is faster greatly depends on the application.

Which is faster: a) Formula One car, b) Land Rover, c) John Deere tractor?

To answer that question, you need to know the workload.

One mile oval track, hill climb, plowing the field.

Applications have similar behavior.

Jim Dempsey