Solved: Please suggest me an intel CPU for best performance of my code

Jayden_S_ · ‎12-23-2015

My code is quite simple

void foo(int n, double* a, double* b, double *c, double*d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
    {
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
    }
}

I want very good performance. Please suggest an Intel CPU that would give the best performance for this code. Is there any strategy to optimize it with the Intel C++ compiler? Each iteration has 6 floating-point operations. Can you estimate the maximum FLOPS it can reach? Currently I get only about 3 GFLOPS on an i7. Thanks a lot for your suggestions!

McCalpinJohn · ‎12-28-2015

The performance will depend on where the data is located in the memory hierarchy. If "n" is large and the data is coming from memory, then the performance will be limited by memory bandwidth and the compiler optimizations will make relatively little difference. On a Core i7 the number of cores won't make much difference either, since a single core can use all of the memory bandwidth.

The kernel has 7 loads and 1 store for the 6 floating-point operations, so the 3 GFLOPS corresponds to 3*(8/6)=4 billion memory references per second. For doubles, this is 32 GB/s, which is more than the maximum memory bandwidth of the 2-channel Core i7 products. (The maximum memory bandwidth depends on the specific model, but is 25.6 GB/s for most recent models, and up to 29.8 GB/s for some of the 5th and 6th generation Core i7 models.)

This suggests that "n" is small enough for the data to fit entirely in some level of the cache. Parallelization will help for L3-contained data, and should provide essentially perfect scaling with core count for L2-contained data. I have not measured throughput for L3-contained data using multiple cores on the Core i7 products, but it should not be difficult to measure simple cases. For example, one could use the STREAM benchmark and set the array size so that the three arrays use about 60% of the L3 cache. With a Core i7 that has a 6 MiB cache, a STREAM array size of 150,000 elements would be about right. Then compile with something like:

icpc -O3 -xHOST -openmp -ffreestanding -opt-streaming-stores never -DSTREAM_ARRAY_SIZE=150000 stream.c -o stream.L3

This should work fine on Linux systems -- on Windows systems you would need a more accurate timer than the default to get useful results.

TimP · ‎12-23-2015

Why not make it double * __restrict a so as to give the compiler a chance to optimize? Much of the Intel documentation of that qualifier is under the heading of restrict, using compile option /Qrestrict so as to bring that C99 feature in as a C++ extension.

If your arrays are large, memory bandwidth would be expected to be the limiting factor. On a platform with multiple memory controllers, you would want to assure that it is both vectorized and running enough threads to use all controllers (but not more than 1 per core). McCalpin's stream benchmark would be indicative of relative capability of candidate platforms.

#pragma omp parallel for simd

(with /Qopenmp compilation) asks for combined threading and vectorization, and implies __restrict.

If memory bandwidth is not a factor, you would want to assure that the compilation is using AVX2, if you have such a CPU (ICC option -xHost), as well as running 1 thread per core, e.g. with OMP_PLACES=cores, and experiment with unrolling (e.g. /Qunroll4).
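A minimal sketch of how the `__restrict` and `#pragma omp parallel for simd` suggestions might look when applied to the loop from the question. The name `foo_restrict` and the `const` qualifiers on the inputs are assumptions added for illustration, not something from the thread. Without an OpenMP compile option the pragma is simply ignored and the loop runs serially.

```cpp
#include <vector>

// Hypothetical variant of foo(): same arithmetic, with __restrict on every
// pointer so the compiler may assume the output does not overlap the inputs.
void foo_restrict(int n, double* __restrict a, const double* __restrict b,
                  const double* __restrict c, const double* __restrict d,
                  const double* __restrict e, const double* __restrict f,
                  const double* __restrict g)
{
    // Combined threading and vectorization; takes effect only when OpenMP
    // is enabled at compile time.
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
    {
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
    }
}
```

Build with OpenMP enabled (for example /Qopenmp as mentioned above, or -fopenmp with GCC) to get the threaded version. The caller must make sure that `a` does not overlap any of the input arrays.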

Jayden_S_ · ‎12-24-2015

I tried but didn't get any better performance. Here is my full code

#include <iostream>
#include <vector>
#include <ctime>
using namespace std;

void foo(int n, double* a, double* b, double *c, double*d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
    {
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
    }
}

int main()
{
    int m = 1001001;
    vector<double> a(m), b(m), c(m), d(m), f(m);
    
    std::clock_t startcputime = std::clock();
    for (int i = 0; i < 1000; ++i)
        foo(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000] );
    double cpu_duration = (std::clock() - startcputime) / (double)CLOCKS_PER_SEC;
    std::cout << "Finished in " << cpu_duration << " seconds [CPU Clock] " << std::endl;
}

Can you give me a workable example?

jimdempseyatthecove · ‎12-24-2015

void foo(int n, double* __restrict a, double* __restrict b, double * __restrict c,
double* __restrict d, double* __restrict e, double* __restrict f, double* __restrict g)
Jim Dempsey

Jayden_S_ · ‎12-28-2015

In my real problem, it takes about 6 seconds to do 36 GFLOP (single precision) on a single core of a 64-bit machine. Memory loads and stores take about 2 seconds of that, so the actual compute rate is 36/(6-2) = 9 GFLOPS. What is the ideal peak performance for a single core and for multiple cores on a 64-bit i7? Is it possible to get closer to peak performance? I am currently using an i7 2600k. How much speedup could I get with an i7 4790k, i7 5960x, or Xeon E5-xxxx?

jimdempseyatthecove · ‎12-29-2015

i7-4790K 4.4 GHz, 4 cores, LLC 8MB, memory bandwidth of 25.6 GB/s (~$350)

i7-5960X 3.5 GHz, 8 cores, LLC 20MB, memory bandwidth of 68 GB/s (~$1059)

E5-1650 v2 3.2 GHz, 6 cores, LLC 12MB, memory bandwidth of 51.2 GB/s (~$583)

E5-1650 v3 3.5 GHz, 6 cores, LLC 15MB, memory bandwidth of 68 GB/s (~$586)

E5-1680 v3 3.2 GHz, 8 cores, LLC 20MB, memory bandwidth of 68 GB/s (~$1723)

You may want to consider the $/GFLOP. The i7 5960X CPU is almost 2x the cost of the E5-1650V3 (RAM and MOBO prices vary). You will have to determine the optimal number of cores. The simple code you illustrated is insufficient to make this determination. More cores typically mean more total L1 and L2 cache, but a smaller share of the LLC per core. Also, depending on array sizes, it may be optimal to use fewer than the full complement of available threads or cores.

Note, the prices listed are the MSRP. The prices you find may vary significantly from the ones listed. Consider the total cost differential: CPU+Motherboard+RAM

(other components PS, case, video, ... are likely the same)

Jim Dempsey

jimdempseyatthecove · ‎12-29-2015

You could also consider the E5-26nn v3 series (you can choose to use just one processor)

E5-2699 v3, 2.3 GHz, 18 cores, LLC 45MB, memory bandwidth of 68 GB/s (1 processor) (~$4250?)
...
E5-2690 v3, 2.6 GHz, 12 cores, LLC 30MB, memory bandwidth of 68 GB/s (1 processor) (~$2094)
...
E5-2680 v3, 2.5 GHz, 12 cores, LLC 30MB, memory bandwidth of 68 GB/s (1 processor) (~$1749)
...
E5-2670 v3, 2.3 GHz, 12 cores, LLC 30MB, memory bandwidth of 68 GB/s (1 processor) (~$1593)

If on the prior post (#7) you were inclined to choose the E5-1680 v3, it might be better to select the E5-2680 v3: for about the same price you get 50% more cores, and you have the option of using 2 CPUs in the event you require additional processors.

Jim Dempsey