Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Haswell E5-2697 v3 has bad network latency when there is workload on the CPU

huang__yixuan
Beginner

Hello all, I recently ran into an issue with the Haswell E5-2697 v3.

Compared with Ivy Bridge, Haswell shows higher latency, especially under high workload (most CPU cores at 100%).

My environment has the following systems.

Node1: x3650 M4, Ivy Bridge (2x E5-2660 v2, 10 cores each), 32 GB (Turbo enabled, HT enabled, running at 2.6 GHz)

Node2: x3650 M4, Ivy Bridge (2x E5-2660 v2, 10 cores each), 32 GB (Turbo enabled, HT enabled, running at 2.6 GHz)

Node3: x3550 M5, Haswell (2x E5-2697 v3, 14 cores each), 64 GB (Turbo enabled, HT enabled, running at 3.0 GHz)

All of them use Solarflare 8522 Plus adapters, and the OS is CentOS 7.3. Cores 2 through the last core are isolated; the kernel command line contains these parameters:

nosoftlockup intel_idle.max_cstate=0 idle=poll mce=ignore_ce isolcpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39 rhgb quiet
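
For completeness, a thread can also be pinned to one of the isolated cores from inside the application rather than with taskset. Below is a minimal sketch (not part of my actual setup; core 2 is just an example taken from the isolcpus list above):

/* pin.c: pin the calling thread to one CPU, similar to "taskset -c 2".
 * Build: gcc -O2 -o pin pin.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int core = 2;                        /* example: first core in the isolcpus list */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    /* ... latency-sensitive work would run here ... */
    return 0;
}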
 

I set up Node1 (192.168.1.5) as the sfnt-pingpong server. The server command is:

"onload -p latency ./sfnt-pingpong"

On Node2 and Node3, the test command is: taskset -c 1 ./sfnt-pingpong tcp 192.168.1.5

Node2: (Ivybridge)

#       size    mean    min     median  max     %ile    stddev  iter
        1       8560    5474    8344    207867  13039   1507    174000
        2       8549    5502    8336    201190  13063   1536    160000
        4       8542    5554    8320    87953   13105   1468    168000
        8       8536    5464    8315    197683  13072   1544    168000
        16      8541    5553    8316    83730   13076   1473    168000
        32      8597    5564    8384    193852  13011   1511    168000
        64      8650    5593    8445    191432  13047   1492    173000
        128     8744    5754    8535    82277   13243   1423    166000
        256     9172    5959    8959    192138  13788   1543    158000
        512     9335    6222    9120    87053   13953   1474    153000
        1024    9984    6794    9762    202416  14720   1606    150000
        2048    16220   10593   16087   196159  21930   2223    91000
        4096    18932   12976   18847   95825   24935   2265    77000
        8192    28032   16939   28929   229866  36352   4477    54000
        16384   36081   26368   32452   246993  61994   9669    42000
        32768   62912   55610   62802   127273  68710   2357    24000
        65536   95040   75621   94538   248342  122023  4357    16000
 

Node3: (Haswell)

#       size    mean    min     median  max     %ile    stddev  iter
        1       22396   7387    21925   54317   36919   5256    64000
        2       22421   7765    21929   349632  36740   5406    67000
        4       22421   7270    21920   380783  36657   5454    65000
        8       22396   7064    21898   390140  36541   5478    65000
        16      22525   8001    22053   53510   36666   5213    63000
        32      22510   8242    22006   389380  36841   5480    65000
        64      22579   8175    22128   216743  36745   5309    65000
        128     22624   7745    22129   319480  36782   5366    67000
        256     23109   8562    22646   53088   37497   5258    62000
        512     23294   8716    22805   394718  37486   5511    63000
        1024    23725   8141    23192   371761  38276   5439    64000
        2048    30433   13477   29885   156189  46162   5947    48000
        4096    33248   16947   32699   195047  49497   6023    45000
        8192    43065   21124   39860   91991   71218   11168   34000
        16384   64104   31076   66684   97144   82430   10796   23000
        32768   79261   51736   78918   416892  96172   7035    19000
        65536   139587  107606  138703  477958  167153  10244   11000

 

Haswell's latency appears to be roughly 3 times higher than Ivy Bridge's.

With no workload on either system (Node1 running './sfnt-pingpong' as the server):

Node2 and Node3 running command: taskset -c 1 ./sfnt-pingpong tcp 192.168.1.5

Node2 (Ivybridge):

#       size    mean    min     median  max     %ile    stddev  iter
        1       8214    7350    8074    175162  9870    665     182000
        2       8201    7356    8107    170388  9792    588     183000
        4       8178    7384    8084    82777   9750    423     183000
        8       8190    7331    8063    166788  9762    602     183000
        16      8192    7348    8054    81091   9856    505     183000
        32      8213    7410    8122    169686  9757    586     182000
        64      8272    7446    8174    167164  9980    582     181000
        128     8502    7643    8387    80151   10112   478     176000
        256     9114    8181    8972    166018  10762   633     164000
        512     9241    8402    9151    82958   10864   464     162000
        1024    9811    8902    9718    174087  11409   605     153000
        2048    17464   14345   17425   169174  19672   1008    86000
        4096    20115   17909   20081   90427   22178   666     75000
        8192    25873   23347   25679   193017  29002   1356    58000
        16384   34350   31148   34174   93980   38062   1331    44000
        32768   50793   47413   50510   211654  54931   1728    30000
        65536   90710   84953   90618   252523  95190   2205    17000

Node3: (Haswell):

#       size    mean    min     median  max     %ile    stddev  iter
        1       9558    8648    9493    32775   10861   376     157000
        2       9549    8641    9480    31600   10833   380     157000
        4       9514    8685    9446    169606  10792   544     158000
        8       9513    8590    9446    33736   10821   371     158000
        16      9504    8632    9436    32550   10809   393     158000
        32      9518    8545    9454    172947  10764   548     158000
        64      9595    8598    9531    32384   10840   364     156000
        128     9835    8878    9773    33719   11085   368     152000
        256     10462   9566    10401   166465  11762   549     143000
        512     10643   9697    10578   33848   11988   377     141000
        1024    11244   10338   11171   31304   12735   393     133000
        2048    17111   15038   17051   175422  19106   839     88000
        4096    20005   18359   19966   43082   21692   552     75000
        8192    25461   22549   25267   45077   28202   900     59000
        16384   34067   31071   33676   216961  37988   1524    44000
        32768   51525   48248   51276   217021  55570   1819    30000
        65536   89586   84829   89306   262936  95165   2394    17000
 

Even without load, Ivy Bridge has lower latency than Haswell at most message sizes.

 

Even though Haswell runs at a higher frequency, its latency is worse than Ivy Bridge's, and with a heavy workload running on the CPU the latency on Haswell becomes very bad.

 

Any idea how to improve this? Which parameters should I apply to Haswell when using it in a low-latency environment?

Thanks,

Eugene
huang__yixuan
Beginner

I have also attached sysjitter results for Ivy Bridge and Haswell.

The test command was:

./sysjitter --runtime 10 200 | column -t

Node2: Ivybridge

Node3: Haswell

 

It seems Haswell has higher jitter.

McCalpinJohn
Honored Contributor III

The minimum latency differences are not very large, and could perhaps be explained by hardware differences -- the Xeon E5 v2 processors have a single ring with 10 stops, while the 14-core Xeon E5 v3 processor has two rings (one with 8 stops and one with 10 stops) that are bridged together.

The Haswell processor has different behavior when making frequency changes, with all cores changing frequency at the same time. (See https://dl.acm.org/citation.cfm?id=2863697.2864672 for more details.) The stall time associated with frequency transitions varies by processor family, model, and starting and ending frequencies, but 10 microseconds is a "typical" value (e.g., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.398.2413&rep=rep1&type=pdf). In your case, the thread doing the work may be stalled by frequency changes that are triggered by a different core (due to an increase or decrease in the max all-core Turbo frequency as other cores are activated or idled).
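
One way to check whether stalls like these are hitting the isolated core is to do by hand what sysjitter does: spin reading the TSC and report any gap above a threshold. Here is a minimal sketch, assuming an x86-64 Linux system with an invariant TSC; the 3.0 GHz TSC rate and the 5-microsecond threshold are example values only:

/* tsc_gaps.c: spin on RDTSC and report gaps larger than a threshold.
 * Run pinned to an isolated core, e.g. "taskset -c 2 ./tsc_gaps"; stop with Ctrl-C.
 * Build: gcc -O2 -o tsc_gaps tsc_gaps.c */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    const double tsc_ghz = 3.0;                                     /* example TSC frequency; adjust for your part */
    const uint64_t threshold = (uint64_t)(5.0 * 1000.0 * tsc_ghz);  /* ~5 us expressed in TSC ticks */
    uint64_t prev = __rdtsc();

    for (;;) {
        uint64_t now = __rdtsc();
        if (now - prev > threshold)
            printf("gap: %.1f us\n", (double)(now - prev) / (tsc_ghz * 1000.0));
        prev = now;
    }
}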

The Haswell processor (at least in the Xeon E5 v3 platforms I have tested) also has ~10 microsecond processor stalls when the upper 128 bits of the SIMD pipelines are enabled. Some notes are at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/710248#comment-1896717. In that study, I did not attempt to see whether there is also a processor stall when the upper 128 bits of the SIMD pipelines are turned off, nor did I look for cross-core effects. This phenomenon has cropped up repeatedly in my testing, even when I thought I was controlling the code to make sure that I was using 256-bit SIMD instructions either all the time or not at all. The biggest problem is the compiler's use of optimized memcpy() and memset() routines that use 256-bit SIMD instructions. The gcc compiler generates these frequently for zeroing idioms and (if I understand correctly) for initializing structs -- even if the compiler optimization level is set low enough that vectorization is not attempted in general. I had to experiment to find idioms that would confuse the compiler so that it would not perform this substitution.
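
One simple way to see this effect is to let the upper lanes sit idle and then time a short burst of 256-bit instructions with RDTSC. The sketch below is only illustrative (it assumes gcc or clang with -mavx, and a 10 ms sleep that I am assuming is long enough for the upper lanes to power down again); if the stall is present, the first burst after each sleep should be noticeably slower than the second:

/* avx_warmup.c: time bursts of 256-bit adds after idle periods.
 * If the upper 128 bits of the SIMD pipelines power down while idle,
 * the "cold" burst after each sleep shows the extra warm-up cost.
 * Build: gcc -O2 -mavx -o avx_warmup avx_warmup.c */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <immintrin.h>
#include <x86intrin.h>

static uint64_t burst(void)
{
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 10000; i++)              /* dependent chain of 256-bit adds */
        acc = _mm256_add_ps(acc, one);
    uint64_t t1 = __rdtsc();
    volatile float sink = _mm256_cvtss_f32(acc); /* keep the loop from being optimized away */
    (void)sink;
    return t1 - t0;
}

int main(void)
{
    for (int rep = 0; rep < 5; rep++) {
        usleep(10000);                           /* idle so the upper lanes can power down */
        uint64_t cold = burst();                 /* may include the warm-up stall */
        uint64_t warm = burst();                 /* upper lanes already powered up */
        printf("rep %d: cold %llu ticks, warm %llu ticks\n",
               rep, (unsigned long long)cold, (unsigned long long)warm);
    }
    return 0;
}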

huang__yixuan
Beginner

Hello McCalpin, 

Thanks for your reply. Comparing with the first set of sfnt-pingpong results (under high workload): at size=512, for example, Node2 (Ivy Bridge) shows 9241 while Node3 (Haswell) shows 23294, about a 2.5x difference.

Do you think choosing a 10- or 12-core part would be better than the 14-core part for Haswell? Or should I not enable Turbo, which raises the frequency?

Do you have any idea whether this phenomenon is the same on Broadwell / Skylake processors?

Which would be preferred for a low-latency environment?

Thanks,

Eugene

McCalpinJohn
Honored Contributor III

I have never had access to one of the Xeon E5 v3 processors based on the 18-core die, so I can't provide any direct comparisons. One small difference is that the Broadwell "medium" die has 15 cores (vs 12 for the Haswell medium die), so the 14-core Xeon E5 v4 is on a slightly smaller ring than the 14-core Xeon E5 v3. Probably not a big difference.

As far as I can tell, Haswell EP, Broadwell EP, SKX and CLX all have the same issues with processor stalls related to frequency changes and powering up the upper bits of the SIMD pipelines.   On SKX (and presumably CLX) there is one stall for enabling the upper 128-bits of the 256-bit SIMD pipelines (same as Haswell EP and Broadwell EP), but there is a second stall if you use 512-bit SIMD instructions and have to power up the AVX512 unit(s).   These can be avoided if you avoid using 256-bit or 512-bit SIMD instructions, but in a world full of shared libraries this can be difficult.  (I also see the kernel executing 256-bit SIMD instructions and triggering these stalls.)

With the SKX and CLX processors it is important to avoid using the core C1E state if you want to minimize stalls due to frequency changes. I don't remember whether the kernel boot option "intel_idle.max_cstate=0" prevents core C1 or core C1E states from being used -- it is hard to tell, because using that option disables the kernel interfaces that I would use to track these states. :-(

On a system with core C1E in use, an idle core will ramp down to the "maximum efficiency frequency" (typically 1.2 GHz for Haswell and Broadwell, 1.0 GHz for SKX/CLX), so it will require a frequency-change stall when it comes out of idle. On my SKX 8160 system in this configuration, the delay from "wake-up" to the transition to full frequency looks like a random value between 0 and 1000 microseconds -- suggesting that the frequency changes only happen on millisecond intervals. The stall associated with the frequency change is only about 10-15 microseconds, but the processor runs at 1.0 GHz for an average of ~0.5 milliseconds before the frequency change occurs.

On a system without core C1E in use, my AVX512 tests start out at the maximum non-AVX Turbo frequency but execute the AVX512 instructions at ~1/4 of the expected throughput (since only the bottom 128 bits of the SIMD pipelines are enabled); then there is a stall while the upper 128 bits of the AVX2 pipelines are enabled and the frequency is dropped to the maximum AVX256 Turbo frequency, and execution proceeds at ~1/2 of the expected throughput; then there is a stall while the AVX512 pipelines are enabled and the frequency is dropped to the maximum AVX512 Turbo frequency, after which execution proceeds at the expected throughput. One benefit of SKX/CLX compared to HSW is that 256-bit instructions are divided into "high-power" and "low-power" versions, so if all you do is load and store with 256 bits, it does not trigger a frequency change.
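
If you want to check whether an isolated core is actually sitting at the "maximum efficiency frequency" before the benchmark wakes it up, one option is to sample IA32_MPERF/IA32_APERF for that core over an interval (with idle=poll, as in your command line, the core is always active, so the ratio reflects its running frequency). A minimal sketch, assuming Linux with the msr module loaded and root access; core 2 and the 2.6 GHz nominal clock are example values only:

/* aperf_mperf.c: report the average frequency one core ran at over ~1 second.
 * Needs the msr kernel module (modprobe msr) and root.
 * Build: gcc -O2 -o aperf_mperf aperf_mperf.c */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), reg) != sizeof(val))
        perror("pread");
    return val;
}

int main(void)
{
    const int core = 2;              /* example: one of the isolated cores      */
    const double base_ghz = 2.6;     /* example: the processor's nominal clock  */
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", core);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t m0 = rdmsr(fd, 0xE7), a0 = rdmsr(fd, 0xE8);   /* IA32_MPERF, IA32_APERF */
    sleep(1);
    uint64_t m1 = rdmsr(fd, 0xE7), a1 = rdmsr(fd, 0xE8);

    printf("core %d average frequency: %.2f GHz\n",
           core, base_ghz * (double)(a1 - a0) / (double)(m1 - m0));
    close(fd);
    return 0;
}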

Most of the recommendations that I have seen for low-latency system configuration are related to the C-states and to (explicit) P-state changes. I don't think I have seen anyone else point out that SIMD width changes result in stalls. (Sometimes SIMD width changes cause frequency changes, but on Xeon E5 v3 I found that SIMD width changes cause stalls even when the frequency does not change -- unless the frequency is already at the "maximum efficiency frequency".)

 
