Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Haswell E5-2697 v3 has bad network latency when there is workload on the CPU

huang__yixuan
Beginner

Hello all, I recently ran into an issue with the Haswell E5-2697 v3.

Compared with Ivy Bridge, Haswell shows much higher latency, especially under high workload (most CPU cores at 100%).

My environment has the following systems.

Node 1: x3650 M4, Ivy Bridge (2x E5-2660 v2, 10 cores each), 32 GB RAM (Turbo enabled, HT enabled, running at 2.6 GHz)

Node 2: x3650 M4, Ivy Bridge (2x E5-2660 v2, 10 cores each), 32 GB RAM (Turbo enabled, HT enabled, running at 2.6 GHz)

Node 3: x3550 M5, Haswell (2x E5-2697 v3, 14 cores each), 64 GB RAM (Turbo enabled, HT enabled, running at 3.0 GHz)

All of them use Solarflare 8522 Plus NICs, and the OS is CentOS 7.3. Cores 2 through the last core are isolated with isolcpus. The kernel cmdline contains these parameters:

nosoftlockup intel_idle.max_cstate=0 idle=poll mce=ignore_ce isolcpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39 rhgb quiet
 

I set up Node 1 (192.168.1.5) as the sfnt-pingpong server and test from the other nodes with 'sfnt-pingpong tcp 192.168.1.5'. The server command is:

"onload -p latency ./sfnt-pingpong"

On Node2 and Node3, test command: taskset -c 1 ./sfnt-pingpong tcp 192.168.1.5 

Node 2 (Ivy Bridge), with high CPU workload:

#       size    mean    min     median  max     %ile    stddev  iter
        1       8560    5474    8344    207867  13039   1507    174000
        2       8549    5502    8336    201190  13063   1536    160000
        4       8542    5554    8320    87953   13105   1468    168000
        8       8536    5464    8315    197683  13072   1544    168000
        16      8541    5553    8316    83730   13076   1473    168000
        32      8597    5564    8384    193852  13011   1511    168000
        64      8650    5593    8445    191432  13047   1492    173000
        128     8744    5754    8535    82277   13243   1423    166000
        256     9172    5959    8959    192138  13788   1543    158000
        512     9335    6222    9120    87053   13953   1474    153000
        1024    9984    6794    9762    202416  14720   1606    150000
        2048    16220   10593   16087   196159  21930   2223    91000
        4096    18932   12976   18847   95825   24935   2265    77000
        8192    28032   16939   28929   229866  36352   4477    54000
        16384   36081   26368   32452   246993  61994   9669    42000
        32768   62912   55610   62802   127273  68710   2357    24000
        65536   95040   75621   94538   248342  122023  4357    16000
 

Node 3 (Haswell), with high CPU workload:

#       size    mean    min     median  max     %ile    stddev  iter
        1       22396   7387    21925   54317   36919   5256    64000
        2       22421   7765    21929   349632  36740   5406    67000
        4       22421   7270    21920   380783  36657   5454    65000
        8       22396   7064    21898   390140  36541   5478    65000
        16      22525   8001    22053   53510   36666   5213    63000
        32      22510   8242    22006   389380  36841   5480    65000
        64      22579   8175    22128   216743  36745   5309    65000
        128     22624   7745    22129   319480  36782   5366    67000
        256     23109   8562    22646   53088   37497   5258    62000
        512     23294   8716    22805   394718  37486   5511    63000
        1024    23725   8141    23192   371761  38276   5439    64000
        2048    30433   13477   29885   156189  46162   5947    48000
        4096    33248   16947   32699   195047  49497   6023    45000
        8192    43065   21124   39860   91991   71218   11168   34000
        16384   64104   31076   66684   97144   82430   10796   23000
        32768   79261   51736   78918   416892  96172   7035    19000
        65536   139587  107606  138703  477958  167153  10244   11000

 

Haswell's latency seems about 3 times higher than Ivy Bridge's.

With no workload on either system, the results are below. (Node 1 runs './sfnt-pingpong' as the server; Node 2 and Node 3 run: taskset -c 1 ./sfnt-pingpong tcp 192.168.1.5)

Node 2 (Ivy Bridge), no workload:

#       size    mean    min     median  max     %ile    stddev  iter
        1       8214    7350    8074    175162  9870    665     182000
        2       8201    7356    8107    170388  9792    588     183000
        4       8178    7384    8084    82777   9750    423     183000
        8       8190    7331    8063    166788  9762    602     183000
        16      8192    7348    8054    81091   9856    505     183000
        32      8213    7410    8122    169686  9757    586     182000
        64      8272    7446    8174    167164  9980    582     181000
        128     8502    7643    8387    80151   10112   478     176000
        256     9114    8181    8972    166018  10762   633     164000
        512     9241    8402    9151    82958   10864   464     162000
        1024    9811    8902    9718    174087  11409   605     153000
        2048    17464   14345   17425   169174  19672   1008    86000
        4096    20115   17909   20081   90427   22178   666     75000
        8192    25873   23347   25679   193017  29002   1356    58000
        16384   34350   31148   34174   93980   38062   1331    44000
        32768   50793   47413   50510   211654  54931   1728    30000
        65536   90710   84953   90618   252523  95190   2205    17000

Node 3 (Haswell), no workload:

#       size    mean    min     median  max     %ile    stddev  iter
        1       9558    8648    9493    32775   10861   376     157000
        2       9549    8641    9480    31600   10833   380     157000
        4       9514    8685    9446    169606  10792   544     158000
        8       9513    8590    9446    33736   10821   371     158000
        16      9504    8632    9436    32550   10809   393     158000
        32      9518    8545    9454    172947  10764   548     158000
        64      9595    8598    9531    32384   10840   364     156000
        128     9835    8878    9773    33719   11085   368     152000
        256     10462   9566    10401   166465  11762   549     143000
        512     10643   9697    10578   33848   11988   377     141000
        1024    11244   10338   11171   31304   12735   393     133000
        2048    17111   15038   17051   175422  19106   839     88000
        4096    20005   18359   19966   43082   21692   552     75000
        8192    25461   22549   25267   45077   28202   900     59000
        16384   34067   31071   33676   216961  37988   1524    44000
        32768   51525   48248   51276   217021  55570   1819    30000
        65536   89586   84829   89306   262936  95165   2394    17000
 

Even without load, Ivy Bridge still has slightly lower latency than Haswell at most message sizes.

 

Even though Haswell runs at a higher frequency, its latency is worse than Ivy Bridge's, and with a high workload running on the CPUs the latency results on Haswell become very bad.

 

Any idea how to improve this? Which parameters should I apply to Haswell when using it in a low-latency environment?

Thanks,

Eugene

 

 


 

huang__yixuan
Beginner

I have also attached the sysjitter results for Ivy Bridge and Haswell.

I used the following test command:

./sysjitter --runtime 10 200 | column -t

Node2: Ivybridge

Node3: Haswell

 

It seems Haswell has higher jitter.
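For reference, the basic idea behind what sysjitter measures can be sketched like this (a simplified illustration of my own, not the real tool's source): pin a thread to one of the isolated cores, spin reading the TSC, and report any gap between consecutive reads above a threshold. The core number and threshold below are arbitrary choices for the sketch.

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    /* Pin to one of the isolcpus-isolated cores (core 2 here, an arbitrary
     * choice matching the cmdline above). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    sched_setaffinity(0, sizeof set, &set);

    /* Threshold in TSC ticks; ~3000 ticks is roughly 1 microsecond if the
     * TSC runs near 3 GHz -- adjust for the actual TSC frequency. */
    const uint64_t threshold = 3000;

    uint64_t prev = __rdtsc();
    for (long i = 0; i < 2000000000L; i++) {
        uint64_t now = __rdtsc();
        if (now - prev > threshold)
            printf("interruption: %llu ticks\n", (unsigned long long)(now - prev));
        prev = now;
    }
    return 0;
}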

McCalpinJohn
Honored Contributor III

The minimum latency differences are not very large, and could perhaps be explained by hardware differences -- the Xeon E5 v2 processors have a single ring with 10 stops, while the 14-core Xeon E5 v3 processor has two rings (one with 8 stops and one with 10 stops) that are bridged together.

The Haswell processor has different behavior when making frequency changes, with all cores changing frequency at the same time. (See https://dl.acm.org/citation.cfm?id=2863697.2864672 for more details.) The stall time associated with frequency transitions varies by processor family, model, and starting and ending frequencies, but 10 microseconds is a "typical" value (e.g., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.398.2413&rep=rep1&type=pdf). In your case, the thread doing the work may be stalled by frequency changes that are triggered by a different core (due to increase or decrease in the max all-core Turbo frequency as other cores are activated or idled).
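As a rough way to watch for this kind of cross-core frequency behavior (a sketch of my own, not something from the references above; it assumes the Linux msr driver is loaded via 'modprobe msr' and root access), you can sample the IA32_MPERF (0xE7) and IA32_APERF (0xE8) MSRs on the core running the latency-critical thread. For a core that stays busy, the APERF/MPERF ratio tracks its running frequency relative to the base frequency, so unexpected movement in the ratio points to frequency transitions triggered elsewhere.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR through the Linux msr driver; the file offset is the MSR number. */
static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    if (pread(fd, &val, sizeof val, reg) != sizeof val)
        perror("pread msr");
    return val;
}

int main(void)
{
    /* Core 1 is the core the pingpong client is pinned to in the tests above. */
    int fd = open("/dev/cpu/1/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/1/msr"); return 1; }

    uint64_t a0 = rdmsr(fd, 0xE8);   /* IA32_APERF */
    uint64_t m0 = rdmsr(fd, 0xE7);   /* IA32_MPERF */
    for (int i = 0; i < 10; i++) {
        usleep(100000);              /* 100 ms sampling interval */
        uint64_t a1 = rdmsr(fd, 0xE8);
        uint64_t m1 = rdmsr(fd, 0xE7);
        /* For a core that stays busy, delta-APERF / delta-MPERF approximates
         * its running frequency divided by the base (non-turbo) frequency. */
        printf("aperf/mperf = %.3f\n", (double)(a1 - a0) / (double)(m1 - m0));
        a0 = a1;
        m0 = m1;
    }
    close(fd);
    return 0;
}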

The Haswell processor (at least in the Xeon E5 v3 platforms I have tested) also has ~10 microsecond processor stalls when the upper 128 bits of the SIMD pipelines are enabled. Some notes are at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/710248#comment-1896717. In that study, I did not attempt to see whether there is also a processor stall when the upper 128 bits of the SIMD pipelines are turned off, nor did I look for cross-core effects. This phenomenon has cropped up repeatedly in my testing, even when I thought I was controlling the code to ensure that I was using 256-bit SIMD instructions either all the time or not at all. The biggest problem is the compiler's use of optimized memcpy() and memset() routines that use 256-bit SIMD instructions. The gcc compiler generates these frequently for zeroing idioms and (if I understand correctly) for initializing structs -- even if the compiler optimization level is set low enough that vectorization is not attempted in general. I had to experiment to find idioms that would confuse the compiler so that it would not perform this substitution.
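To make the zeroing-idiom issue concrete, here is a minimal sketch (my own example; compiler behavior varies by version, target flags, and optimization level). On an AVX-enabled target, gcc will typically turn the first loop into a memset() call or an inlined wide-store sequence using 256-bit ymm registers, while the volatile-qualified variant forces scalar byte stores and keeps wide SIMD out of that path (at a throughput cost). Checking the generated assembly for vmovdqa/vmovdqu with ymm operands, or building the latency-critical files with -mno-avx, are ways to confirm or avoid the compiler's substitution.

#include <stddef.h>

struct msg {
    char payload[256];
};

/* On an AVX-enabled target, gcc typically turns this loop into a call to
 * memset() or an inlined wide-store sequence -- and a library memset() may
 * still use 256-bit stores internally even if this file is built without AVX. */
void clear_wide(struct msg *m)
{
    for (size_t i = 0; i < sizeof(m->payload); i++)
        m->payload[i] = 0;
}

/* volatile accesses cannot be merged, so the compiler emits plain byte
 * stores here; slower, but no 256-bit instructions in this path. */
void clear_scalar(struct msg *m)
{
    volatile char *p = m->payload;
    for (size_t i = 0; i < sizeof(m->payload); i++)
        p[i] = 0;
}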

huang__yixuan
Beginner

Hello McCalpin, 

Thanks for your reply. Comparing with the first sfnt-pingpong results, which were taken under high workload: for example, at size=512, Node 2 (Ivy Bridge) is 9241 while Node 3 (Haswell) is 23294, about a 2.5x difference.

Do you think choosing a 10- or 12-core part would be better than the 14-core one for Haswell? Or should I not enable Turbo, which makes the frequency higher?

Do you have any idea whether that phenomenon is the same on Broadwell / Skylake processors?

For a low-latency environment, which should be preferred?

Thanks,

Eugene

McCalpinJohn
Honored Contributor III

I have never had access to one of the Xeon E5 v3 processors based on the 18-core die, so I can't provide any direct comparisons.   One small difference is that the Broadwell "medium" die has 15 cores (vs 12), so the 14-core Xeon E5 v4 is on a slightly smaller ring than the 14-core Xeon E5 v3.  Probably not a big difference.

As far as I can tell, Haswell EP, Broadwell EP, SKX and CLX all have the same issues with processor stalls related to frequency changes and powering up the upper bits of the SIMD pipelines.   On SKX (and presumably CLX) there is one stall for enabling the upper 128-bits of the 256-bit SIMD pipelines (same as Haswell EP and Broadwell EP), but there is a second stall if you use 512-bit SIMD instructions and have to power up the AVX512 unit(s).   These can be avoided if you avoid using 256-bit or 512-bit SIMD instructions, but in a world full of shared libraries this can be difficult.  (I also see the kernel executing 256-bit SIMD instructions and triggering these stalls.)

With the SKX and CLX processors it is important to avoid using the core C1E state if you want to minimize stalls due to frequency changes. I don't remember whether the kernel boot option "intel_idle.max_cstate=0" prevents core C1 or core C1E states from being used -- it is hard to tell because using that option disables the kernel interfaces that I would use to track these states. :-(

On a system with core C1E in use, an idle core will ramp down to the "maximum efficiency frequency" (typically 1.2 GHz for Haswell and Broadwell, 1.0 GHz for SKX/CLX), so it will require a frequency-change stall when it comes out of idle. On my SKX 8160 system in this configuration, the delay from "wake-up" to the transition to full frequency looks like a random value between 0 and 1000 microseconds -- suggesting that the frequency changes only happen on millisecond intervals. The stall associated with the frequency change is only about 10-15 microseconds, but the processor runs at 1.0 GHz for an average of ~0.5 milliseconds before the frequency change occurs.

On a system without core C1E in use, my AVX512 tests start out at the maximum non-AVX Turbo frequency, but execute the AVX512 instructions at ~1/4 of the expected throughput (since only the bottom 128 bits of the SIMD pipelines are enabled); then there is a stall while the upper 128 bits of the AVX2 pipelines are enabled and the frequency is dropped to the maximum AVX256 Turbo frequency; then execution proceeds at ~1/2 of the expected throughput; then there is a stall while the AVX512 pipelines are enabled and the frequency is dropped to the maximum AVX512 Turbo frequency, after which execution proceeds at the expected throughput. One benefit of SKX/CLX compared to HSW is that 256-bit instructions are divided into "high-power" and "low-power" versions, so if all you do is load and store with 256 bits, it does not trigger a frequency change.
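As a rough way to check the C1E point on a given system (a sketch of my own using the standard Linux cpuidle sysfs layout; shown for cpu0 only, and writing the 'disable' file needs root): list the idle states the driver exposes and disable any state named "C1E". Note that when intel_idle is turned off with intel_idle.max_cstate=0 these entries either disappear or come from a different driver, which is part of why it is hard to tell what that boot option actually permits.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char path[128], name[32];

    /* Walk the cpuidle states exposed for cpu0 and report their names. */
    for (int s = 0; s < 10; s++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", s);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                            /* no more states */
        if (fgets(name, sizeof name, f))
            name[strcspn(name, "\n")] = '\0';
        fclose(f);
        printf("state%d = %s\n", s, name);

        /* Disable a state named "C1E" for cpu0 (needs root; a real script
         * would repeat this for every CPU). */
        if (strcmp(name, "C1E") == 0) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cpuidle/state%d/disable", s);
            FILE *d = fopen(path, "w");
            if (d) {
                fputs("1", d);
                fclose(d);
                printf("disabled C1E on cpu0\n");
            }
        }
    }
    return 0;
}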

Most of the recommendations that I have seen for low-latency system configuration are related to C-states and to (explicit) P-state changes. I don't think I have seen anyone else point out that SIMD width changes result in stalls. (Sometimes SIMD width changes cause frequency changes, but on Xeon E5 v3 I found that SIMD width changes cause stalls even when the frequency does not change -- unless the frequency is already at the "maximum efficiency frequency".)

 
