Marcus_K_ · ‎05-28-2015

We would expect that the server CPU E5 1650v3 with 15 MB L3, 6 cores @ 3.5 GHz, 4 populated memory channels (4 x 16GB DDR4/2133) is faster than its 'little brother', the desktop CPU i5 4570S with 6 MB L3, 4 cores @ 2.9 GHz, 2 memory channels (2 x 8 GB DDR3/1600), but surprisingly it is not!

At least for the application that I wrote, the server CPU with 2 more and faster cores is slower than the desktop CPU. The application is multithreaded, looks up strings in an in-memory database of 32 tables totalling 328 MB, and uses AVX2 instructions. With a single thread and with 2 threads the 1650v3 is a fraction slower than the 4570S (which I cannot explain), but with the 3rd and 4th thread the server CPU is 10% and 16% slower, respectively. The sixth thread on the 1650v3 even decreases performance. The application uses scattered data and puts high pressure on the cache and memory subsystems. A test on a 10-core E5 2670v2 shows that the application behaves well there: even with the increasing pressure on the subsystems, performance increases up to the point where the number of threads is 10 (equal to the number of cores). Both the E5 1620v2 and i5 4570S outperform the E5 1650v3. The performance numbers are in this picture:

Only when the hyperthreaded cores are used does the 1650v3 show better performance numbers, and even then it is only about level with the 1620v2, which has 2 fewer cores. The application runs on CentOS 7.1 and is compiled with the gcc 4.8.3 that comes with it. All mentioned systems have the same OS and compiler and use the same dataset. In an attempt to find an explanation, Advanced Power Management was turned off to prevent CPU throttling and hyperthreading was turned off, but the results remained the same. The application has a feature to bind threads to cores, but it made no difference to the performance. The temperature of the cores was also monitored and was never higher than 59 °C.

I read about the AVX base frequency which may be throttled to a lower value than the CPU base frequency but Intel has only a white paper about it and I did not find a specification of the AVX base frequency for individual CPUs.

The questions that I have:

Why does the performance drop on the 1650v3 when the 3rd core is used?
Why is the 1650v3 not significantly faster than the 1620v2 and/or the i5 4570S?

Thanks, Marcus

Patrick_F_Intel1 · ‎05-28-2015

Hello Marcus,

First a few questions.... what is the %idle for each system for the '1 cpu busy' case and the 'use all cpus' case?

Am I correct in my understanding that you have a fixed table size (32 tables of fixed size) and you are seeing what happens when you try to get an increasing number of cpus to access the fixed data? This type of cpu scaling frequently runs into problems (threading problems, contention, locks, etc). So this would be like running N independent (but running simultaneously) copies of your benchmark.

I wonder what would happen if you gave each CPU its own independent 32 tables and then see if you get perfect scaling. That is, if you get about 330K ops/sec for 1 cpu, do you get 660k ops/sec for 2 cpus (with 2x32 tables) etc. This would indicate that the system is able to scale without issues. Of course you might run into some other resource constraint such as not enough memory, or memory bandwidth saturated.
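The "perfect scaling" check described above can be written down as a ratio of measured throughput to the ideal N times single-thread throughput. The sketch below is only an illustration: the function name and the 4-thread figure are made up, and only the 330K and 660K ops/sec values echo the numbers in this post.

```python
def scaling_efficiency(throughput_by_threads):
    """Return {threads: measured / (threads * single-thread throughput)}."""
    base = throughput_by_threads[1]
    return {n: ops / (n * base) for n, ops in throughput_by_threads.items()}

# Hypothetical measurements in ops/sec. The 330K and 660K figures echo the
# post; the 4-thread value is invented purely for illustration.
example = {1: 330_000, 2: 660_000, 4: 1_150_000}

for n, eff in sorted(scaling_efficiency(example).items()):
    print(f"{n} threads: {eff:.2f} of ideal")
```

A value of 1.00 means perfect scaling; values that fall off as threads are added point at a shared resource (memory bandwidth, cache, locks) rather than at the cores themselves.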

You can use Intel PCM to monitor basic stats such as memory bandwidth to see if you are saturating some resource.

Pat

Patrick_F_Intel1 · ‎05-28-2015

I meant to have the sentence ' So this would be like running N independent (but running simultaneously) copies of your benchmark.' after the paragraph starting with 'I wonder...'

McCalpinJohn · ‎05-28-2015

There are a fair number of differences across these platforms that may not be obvious initially. Before looking into that I would want to be sure that the threads are actually running where you think they are -- unless you bind the threads the OS is quite capable of running them in the wrong place(s), and may not even use all the cores when running a multithreaded job requesting them all. It would also help to run the Core i5-4570S with HyperThreading enabled so that a more direct comparison could be made (but only after you are able to bind the threads to specific logical processors -- otherwise more data is not helpful).
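As a rough illustration of binding a thread and then confirming where it is allowed to run, here is a minimal Linux-only sketch. It relies on the Linux behaviour that `sched_setaffinity` with pid 0 applies to the calling thread; the helper name `run_pinned` is hypothetical. A C application like the one discussed would more likely call `pthread_setaffinity_np`, so treat this as a sketch of the idea, not of the application's own binding feature.

```python
import os
import threading

def run_pinned(cpu, work):
    """Run work() in a new thread restricted to one logical CPU (Linux only)."""
    result = {}

    def target():
        # On Linux, pid 0 means "the calling thread" for sched_setaffinity.
        os.sched_setaffinity(0, {cpu})
        # Read the mask back to verify the binding actually took effect.
        result["affinity"] = os.sched_getaffinity(0)
        result["value"] = work()

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return result

# Pick a CPU this process is allowed to use, then pin a worker to it.
allowed = sorted(os.sched_getaffinity(0))
info = run_pinned(allowed[0], lambda: sum(range(1000)))
print(info["affinity"], info["value"])
```

Reading the mask back from inside the worker is the important part: it shows where the thread may run, not where it was asked to run.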

On the hardware side, the memory latency on the Xeon E5-1650 v3 is probably higher than either of the other two systems. This is a consequence of the changes in the uncore that were made to increase the maximum number of cores on the Haswell EP platform and is not unusual.

McCalpinJohn · ‎05-28-2015

Ooops -- I forgot to add that the AVX frequencies for the Xeon E5 v3 systems are listed in Table 3 of the "Intel Xeon Processor E5 v3 Product Family: Processor Specification Update" (Intel document 330785, revision 006, March 2015). Table 1 provides the general specifications, Table 2 provides the maximum Turbo frequencies when not using the 256-bit registers, and Table 3 lists the maximum Turbo frequencies when using the 256-bit registers. Table 3 also includes the minimum AVX frequency, which is 3.2 GHz on the Xeon E5-1650 v3. It is much more common that the chip will run at the maximum all-core AVX Turbo frequency of 3.5 GHz -- the 3.2 GHz tells you how low the frequency may have to drop to meet the chip power limitations when running the hottest possible workloads (typically the LINPACK benchmark).

Marcus_K_ · ‎05-28-2015

Hello Pat and John,

The %idle gradually increases from 0% (1 thread) to 2.5% (6 threads). The in-memory database has 32 tables and is read-only. Some tables are larger than others. There is no table scan, just an index scan. Just to be 100% sure that there is no contention issue, I reran the application without the single rwlock (one rwlock for the whole database) and the performance did not change.

Running the application N times with each instance having 1 thread does not make much sense to me, since the L3 cache misses and the memory bandwidth demand will both increase. Do you expect that the outcome of such a test will help to draw a conclusion about what is going on? Note that on other Intel CPUs the scalability is 'normal'; only on the 1650v3 is the scalability a bit awkward.

The i5 4570S has no hyperthreading so a test with HT is not possible.

I had a hunch that maybe the AVX base frequency had something to do with the strange performance drop but with a minimum AVX frequency of 3.2 GHz this is no longer suspect.
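The arithmetic behind ruling this out can be made explicit. The sketch below assumes, pessimistically, that throughput scales linearly with core frequency (memory-bound code would lose less); the function name is hypothetical, and the 3.5 GHz and 3.2 GHz figures are the ones quoted in this thread.

```python
def worst_case_slowdown(nominal_ghz, min_avx_ghz):
    """Upper bound on slowdown if the clock fell from nominal to the AVX minimum.

    Assumes throughput is proportional to frequency, which overstates the
    effect for code that spends its time waiting on cache or memory.
    """
    return 1.0 - min_avx_ghz / nominal_ghz

bound = worst_case_slowdown(3.5, 3.2)
print(f"at most {bound:.1%} slower from AVX clocking alone")
```

Even at that floor the 1650v3 would still be clocked above the 2.9 GHz base frequency of the i5 4570S, so AVX clocking on its own does not account for the server part falling behind.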

I had not used PCM before; the data for runs with 2 and 4 threads are below. The IPC is significantly better on the 4570S, while L3MISS is (as expected) higher on the 4570S. The L3HIT on the 4570S is acceptable and on the 1650v3 it is good. I do not know how to interpret the L2MISS.

The DDR3/1600 on the 4570S has CL9 while the DDR4/2133 on the 1650v3 has CL15 but because of the high L3HIT the difference in speed seems not very important.

On the i5 4570S with 2 threads:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK |  READ | WRITE |  IO   | TEMP |
   0    0     1.01   0.93   1.08    1.20    1000 K     22 M    0.96    0.61    0.06    0.32     N/A     N/A     N/A     41
   1    0     1.01   0.93   1.08    1.20     981 K     22 M    0.96    0.61    0.06    0.32     N/A     N/A     N/A     45
   2    0     0.04   0.99   0.04    1.13     441 K    739 K    0.40    0.44    0.70    0.10     N/A     N/A     N/A     50
   3    0     0.01   0.37   0.02    1.13     182 K    408 K    0.55    0.19    0.47    0.13     N/A     N/A     N/A     53
-----------------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.52   0.93   0.56    1.20    2605 K     46 M    0.94    0.60    0.07    0.32    0.68    0.06    0.47     N/A

 Instructions retired: 5973 M ; Active cycles: 6433 M ; Time (TSC): 2894 Mticks ; C0 (active,non-halted) core residency: 46.35 %
 C1 core residency: 4.93 %; C3 core residency: 1.23 %; C6 core residency: 0.97 %; C7 core residency: 46.52 %;
 C2 package residency: 1.26 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
 PHYSICAL CORE IPC                 : 0.93 => corresponds to 23.21 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.52 => corresponds to 12.90 % core utilization over time interval

On the i5 4570S with 4 threads:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK |  READ | WRITE |  IO   | TEMP |
   0    0     0.92   0.85   1.08    1.10     867 K     24 M    0.96    0.53    0.05    0.35     N/A     N/A     N/A     42
   1    0     0.92   0.85   1.08    1.10     822 K     24 M    0.97    0.53    0.05    0.35     N/A     N/A     N/A     45
   2    0     0.91   0.85   1.08    1.10     862 K     24 M    0.96    0.53    0.05    0.35     N/A     N/A     N/A     44
   3    0     0.92   0.85   1.08    1.10    1002 K     24 M    0.96    0.53    0.06    0.35     N/A     N/A     N/A     46
-----------------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.92   0.85   1.08    1.10    3555 K     97 M    0.96    0.53    0.05    0.35    0.92    0.09    0.59     N/A

 Instructions retired:   10 G ; Active cycles:   12 G ; Time (TSC): 2894 Mticks ; C0 (active,non-halted) core residency: 98.00 %
 C1 core residency: 2.00 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
 PHYSICAL CORE IPC                 : 0.85 => corresponds to 21.17 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.92 => corresponds to 22.90 % core utilization over time interval

On the E5 1650v3 with 2 threads:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK |  L3OCC | READ | WRITE | TEMP
   0    0     0.83   0.84   0.99    1.00     245 K     19 M    0.99    0.64    0.01    0.24     7080     N/A     N/A     47
   1    0     0.82   0.83   0.99    1.00     282 K     19 M    0.99    0.63    0.01    0.24     7560     N/A     N/A     45
   2    0     0.00   0.25   0.00    1.00    1828     2638      0.31    0.15    0.59    0.07       48     N/A     N/A     53
   3    0     0.01   1.46   0.00    1.00      19 K     24 K    0.20    0.54    0.27    0.02      600     N/A     N/A     52
   4    0     0.00   0.21   0.00    1.00    1169     5381      0.78    0.19    0.12    0.09        0     N/A     N/A     52
   5    0     0.00   0.39   0.00    1.00    6189     7221      0.14    0.07    1.03    0.04       72     N/A     N/A     48
-----------------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.28   0.83   0.33    1.00     555 K     38 M    0.99    0.63    0.01    0.24     N/A     0.15    0.05     N/A

 Instructions retired: 5821 M ; Active cycles: 6979 M ; Time (TSC): 3504 Mticks ; C0 (active,non-halted) core residency: 33.20 %
 C1 core residency: 0.38 %; C3 core residency: 0.00 %; C6 core residency: 66.42 %; C7 core residency: 0.00 %;
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
 PHYSICAL CORE IPC                 : 0.83 => corresponds to 20.85 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.28 => corresponds to 6.92 % core utilization over time interval

On the E5 1650v3 with 4 threads:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK |  L3OCC | READ | WRITE | TEMP
   0    0     0.61   0.62   0.98    1.00     166 K     16 M    0.99    0.58    0.01    0.20     3720     N/A     N/A     43
   1    0     0.59   0.60   0.98    1.00     147 K     15 M    0.99    0.58    0.01    0.19     3672     N/A     N/A     42
   2    0     0.57   0.58   0.98    1.00     154 K     14 M    0.99    0.57    0.01    0.19     3576     N/A     N/A     44
   3    0     0.60   0.61   0.98    1.00     165 K     15 M    0.99    0.58    0.01    0.19     4344     N/A     N/A     44
   4    0     0.00   0.19   0.00    1.00    3442     7503      0.54    0.14    0.31    0.09       48     N/A     N/A     49
   5    0     0.00   0.40   0.00    1.00    6526     7547      0.14    0.07    1.12    0.04       24     N/A     N/A     45
-----------------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.40   0.60   0.66    1.00     644 K     62 M    0.99    0.58    0.01    0.19     N/A     0.08    0.06     N/A

 Instructions retired: 8310 M ; Active cycles:   13 G ; Time (TSC): 3501 Mticks ; C0 (active,non-halted) core residency: 65.55 %
 C1 core residency: 1.25 %; C3 core residency: 0.00 %; C6 core residency: 33.20 %; C7 core residency: 0.00 %;
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
 PHYSICAL CORE IPC                 : 0.60 => corresponds to 15.09 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.40 => corresponds to 9.89 % core utilization over time interval

And something odd happened: on the E5 1650v3 pcm.x stated: "2 memory controllers detected with total number of 5 channels." Is that a bug?
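For readers new to PCM, the summary lines under each table above can be reproduced from the raw counters. The sketch below assumes that the utilization percentages are the IPC divided by a maximum of 4 instructions per cycle, and that the "nominal" figure divides by TSC ticks times the number of cores; both assumptions are inferred from the printed numbers, not taken from the PCM source. Only the 2-thread runs are used, because the 4-thread outputs print rounded values such as "10 G".

```python
MAX_IPC = 4  # assumption: utilization is normalised against 4 instructions/cycle

def core_ipc(instructions, active_cycles):
    """Instructions retired per active (non-halted) core cycle."""
    return instructions / active_cycles

def nominal_ipc(instructions, tsc_ticks, n_cores):
    """Instructions retired per nominal cycle, counted over all cores."""
    return instructions / (tsc_ticks * n_cores)

# (instructions retired, active cycles, TSC ticks, cores) from the 2-thread runs
runs = {
    "i5 4570S, 2 threads": (5973e6, 6433e6, 2894e6, 4),
    "E5 1650v3, 2 threads": (5821e6, 6979e6, 3504e6, 6),
}

for name, (instr, cycles, tsc, cores) in runs.items():
    ipc = core_ipc(instr, cycles)
    nom = nominal_ipc(instr, tsc, cores)
    print(f"{name}: IPC {ipc:.2f} ({ipc / MAX_IPC:.2%}), "
          f"nominal {nom:.2f} ({nom / MAX_IPC:.2%})")
```

Under these assumptions the results agree with the printed PCM summaries (0.93 and 23.21 % on the 4570S, 0.83 and 20.85 % on the 1650v3), which suggests the low "utilization" figures simply mean about one instruction retired per cycle, not idle cores.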

Marcus_K_ · ‎05-29-2015

When I browsed through the generated assembly I noticed that gcc produced inefficient code on the E5 1650v3, and that using the same compiler flags on the 1650v3 and the 4570S does not mean that gcc produces the same code. More specifically, the compiler flags -O3 -march=core-avx2 -mtune=native produce efficient code on the 4570S but not on the 1650v3 for the _mm_set1_epi16 intrinsic function (where it uses memory as an intermediate). When the compiler flags are changed to -O3 -march=core-avx2 -mtune=core-avx2, the same efficient code is produced on both the 4570S and the 1650v3. Apparently, for gcc the flag -mtune=native can mean something different on various implementations of the Haswell architecture.
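One way to see what gcc actually resolves `-mtune=native` to on each machine is to ask it with `-Q --help=target`, which lists the target options in effect. The helper below is a sketch: the function names are hypothetical, and the parser assumes the usual layout of an option name followed by its value in whitespace-separated columns, which may differ between gcc versions.

```python
import shutil
import subprocess

def parse_target_options(text):
    """Parse 'gcc -Q --help=target' style output into {option: value}."""
    options = {}
    for line in text.splitlines():
        parts = line.split()
        # Option rows start with -m...; header and blank lines are skipped.
        if parts and parts[0].startswith("-m"):
            options[parts[0]] = parts[1] if len(parts) > 1 else ""
    return options

def resolved_tuning(flags):
    """Return the -march=/-mtune= values gcc reports, or None if unavailable."""
    if shutil.which("gcc") is None:
        return None
    proc = subprocess.run(["gcc", "-Q", "--help=target", *flags],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # e.g. a non-x86 host that rejects these flags
        return None
    opts = parse_target_options(proc.stdout)
    return {key: opts.get(key) for key in ("-march=", "-mtune=")}

print(resolved_tuning(["-march=core-avx2", "-mtune=native"]))
```

Running this on both machines and comparing the reported `-mtune=` value would show whether "native" resolves to the same tuning target on the Xeon and the desktop part.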

The performance results are in the following picture and are what one might expect: a core-by-core comparison gives higher ops/sec on the 1650v3. I also reran PCM on the 1650v3, and with the recompiled code the IPC for 4 threads is now the same as on the 4570S: 0.84 instead of 0.61. One question remains: how should I interpret the L2MISS value reported by PCM? Are the values OK?

Patrick_F_Intel1 · ‎05-29-2015

Hello Marcus,

I find L2miss pretty hard to interpret as it relates to performance. If we had a case where the data should all fit in L1 or L2 then it would be easier to interpret L2miss... any L2miss would be unexpected. I'd have to dig through the PCM code to see how it is computing L2hit and L2miss. But let's take a simple interpretation. It looks like about half the requests that get to the L2 are satisfied in the L2. And of the requests that get to the L3, about 99% of them hit (find the data) in the L3 (on the E5 v3). So that is good.

So whether the L2miss is good or bad depends on your application. If you expected fewer L2 misses (or a higher L2hit), the PCM stats would alert you to investigate where in your code the L2 misses were occurring.
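The simple interpretation above can be checked against the posted TOTAL rows. The sketch assumes that every L2 miss becomes an L3 request, so the L3 hit ratio is one minus L3MISS divided by L2MISS; that relation is an assumption used as a consistency check, not a statement about how PCM computes L3HIT.

```python
def l3_hit_ratio(l3_misses, l2_misses):
    """Fraction of L2 misses served by L3, assuming every L2 miss goes to L3."""
    return 1.0 - l3_misses / l2_misses

# (L3MISS, L2MISS) from the TOTAL rows of the PCM output earlier in the thread
totals = {
    "i5 4570S, 2 threads": (2605e3, 46e6),
    "i5 4570S, 4 threads": (3555e3, 97e6),
    "E5 1650v3, 2 threads": (555e3, 38e6),
    "E5 1650v3, 4 threads": (644e3, 62e6),
}

for name, (l3miss, l2miss) in totals.items():
    print(f"{name}: L3 hit ratio ~ {l3_hit_ratio(l3miss, l2miss):.2f}")
```

The computed ratios line up with the L3HIT column in the TOTAL rows (0.94 and 0.96 on the 4570S, 0.99 on the 1650v3), which supports reading L2MISS as the traffic that reaches the L3, most of which is then served there.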

Pat

TimP · ‎05-30-2015

The point alluded to by others about pinning threads seems important for the dual CPU platform, including trying the cases of all threads pinned to 1 CPU up to the number of cores, and work divided evenly between cpus with threads sharing cache lines on the same CPU as much as possible. You won't see repeatable results for L2 miss without pinning.

Why is an E5 1650v3 slower than an i5 4570S ??