Ok, this makes more sense now

Nick2 · ‎06-18-2015

So, this confused the daylights out of me.

I'm running on an Intel Core i7-3770 @ 3.4 GHz, 4 cores x hyperthreaded = 8 cores.

For case 1, I compiled on ifort 2013 SP1 Update 4, as:

Case 2 is not important, other than the fact that it ran and consumed CPU cycles. It was done using 2015 Update 3.

For Case 3, I ran ifort 2015 Update 4, and I added GenAlternateCodePaths="codeForCommonAVX512"

I ran all 3 cases at the same time on the same machine, and the run times are as follows:

Case 1 239.852 s

Case 2 267.979 s

Case 3 260.182 s

Naturally, I wonder what is wrong with the 2015 Update 4 compiler: the running time went up! So, my first question is: why did the running time with 2015U4 go up?

Then I tried running Case 3 with nothing else running on my computer, and the running time was 198.092 seconds. So, there is a huge improvement using this compiler?

The last test I did was to compile the same code, using 2013SP1U4, but I compiled it to have AVX2 instructions (instead of AVX512), and I also added O3. Then I ran this in standalone, and the running time was 228.417 seconds, or about 5% faster than Case 1.

I would attribute this 5% speed improvement (standalone vs. 3 simultaneous cases) to adding O3 optimization.

However, with 2015U4, if I'm running only 1 case, the running speed is much, much faster, and if I'm running 3 simultaneously (on an 8-core CPU!) the speed is much, much slower.

-I'm seeing the same performance degradation of Case 3 vs Case 1 on Xeon E5-2699v3 and also on E5-1650 v3, but it's almost negligible on Xeon X5690. Why is there this performance degradation when running more than one case at the same time on what seems like any newer CPU?

-Why are the 2015U4 results so much slower when the compiled executable isn't the only thing running?

-Or, am I the only person who has seen this?

I am tempted to use 2015U4 for the running speed I get when I run just one case ... but that means I need to buy a new CPU for each case I want to run.

Steven_L_Intel1 · ‎06-18-2015

The Update 4 compiler is identical to the Update 3 compiler - in this update, the compiler did not change - so that isn't the issue.

You don't have 8 cores. You have 4 cores with 8 threads. There are MANY factors that can cause variable timing, including cache, background activities, virus scans and more. Worse, when you run programs simultaneously, you introduce even more unpredictable behaviors with the schedulers and preemption.

An often recommended method for doing performance testing is to run ONE instance of the program multiple times, throwing out the best and worst cases and repeating until the average deviation becomes small enough. It may take dozens of runs to get there.
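A minimal Python sketch of that procedure, for illustration only. The function names, the 1% tolerance, and the run limits are assumptions chosen for the example, not something recommended in this thread:

```python
import statistics
import time


def trimmed(times):
    """Drop the single best and single worst time."""
    if len(times) < 3:
        return list(times)
    return sorted(times)[1:-1]


def average_deviation(times):
    """Mean absolute deviation from the mean."""
    mean = statistics.fmean(times)
    return sum(abs(t - mean) for t in times) / len(times)


def benchmark(run_once, rel_tol=0.01, min_runs=5, max_runs=50):
    """Call run_once() repeatedly until the trimmed average deviation
    is at most rel_tol of the trimmed mean, or max_runs is reached."""
    times = []
    while len(times) < max_runs:
        start = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - start)
        if len(times) >= min_runs:
            kept = trimmed(times)
            if average_deviation(kept) <= rel_tol * statistics.fmean(kept):
                break
    kept = trimmed(times)
    return statistics.fmean(kept), average_deviation(kept), len(times)
```

To time a real program, pass something like `lambda: subprocess.run(["case3.exe"], check=True)` as `run_once`, where `case3.exe` is a placeholder name for the executable under test.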

TimP · ‎06-18-2015

As Steve hinted, it will be difficult to get consistent performance when running multiple applications with hyperthreading enabled. If you don't want to set them manually to separate cores under Task Manager after starting them, you could build them with OpenMP, run them in separate windows, and put distinct settings for OMP_PROC_BIND on them, all to minimize the time they spend on the same core.

The 56xx CPU has the additional factor that not all cores are equal. In the usual BIOS, the last 2 of the 6 cores on each CPU are those which don't share internal data paths. You would want to set up so that you don't run multiple applications on either of the first 2 pairs of cores. I never enable hyperthreading on CPUs of 4 or more cores if there is an option to disable it, although there are scenarios where HT could be useful.

On a multiple CPU platform you also need to pin each application to one CPU for maximum performance. Remote memory access can easily degrade performance by 30%, just as pairing applications on a single core may do.
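As a rough illustration of keeping independent runs on separate physical cores, the Python sketch below builds Windows `start /affinity` command lines. It assumes that logical CPUs are numbered core by core (0 and 1 on core 0, 2 and 3 on core 1, and so on); that numbering is an assumption and should be checked on the actual machine. The executable names are placeholders.

```python
def core_masks(n_cores, threads_per_core=2):
    """One affinity bitmask per physical core, selecting only its first
    hardware thread.

    ASSUMPTION: logical CPUs are numbered core by core, so core k owns
    logical CPUs k*threads_per_core .. k*threads_per_core + threads_per_core - 1.
    """
    return [1 << (core * threads_per_core) for core in range(n_cores)]


def start_commands(executables, n_cores=4, threads_per_core=2):
    """Build one `start /affinity` command per executable, one core each."""
    masks = core_masks(n_cores, threads_per_core)
    if len(executables) > len(masks):
        raise ValueError("more programs than physical cores")
    # `start /affinity` expects the mask in hexadecimal, without a 0x prefix.
    return [f"start /affinity {mask:X} {exe}"
            for exe, mask in zip(executables, masks)]


for line in start_commands(["case1.exe", "case2.exe", "case3.exe"]):
    print(line)
```

On a multi-socket system this would not be sufficient on its own, since the masks would also have to keep each program on a single CPU package.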

jimdempseyatthecove · ‎06-18-2015

Don't forget about the CPUs that support Turbo mode.

Or to put it a different way, they throttle down when the core gets hot (heavier work load). And then there is the L3 or Last Level Cache contention.

Jim Dempsey

Nick2 · ‎06-18-2015

Ok, this makes more sense now.

I tried to run more controlled tests on the i7-3770 @ 3.4 GHz, and I wasn't able to reproduce the previous discrepancy of 2015U4 being worse than 2013SP1U4. (2015U4 is better by about 12% to 13%.) But I did reproduce the slowdown from running 3 cases together vs. just one at a time: the median running time is 16% to 17% worse, which makes sense given the turbo frequency.

I compared 2013SP1U4AVX2 vs 2015U4AVX512 (both were O3, and all other flags are the same); I also looked into Windows 8.1's CPU scheduler vs setting CPU affinity from Task Manager shortly after the sequence begins running. The CPU affinity setting didn't make much difference for me in this set of tests (though I should probably repeat this on the E5-2699v3).

The "wrong affinity" (where I set two executables to run on the same physical core) gives me a run time that's about 17% faster than running the two cases separately one after another "with turbo", so I still get a bit of benefit from hyper-threading if I have a large number of independent cases.
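The arithmetic behind that comparison can be sketched in Python using the 2015U4AVX512 times from the results in this post. Averaging the two "wrong affinity" runs is a choice made for this example:

```python
alone = 190.447                    # median "1 alone" time, seconds
paired = (323.886 + 323.058) / 2   # two cases sharing one physical core, seconds

back_to_back = 2 * alone           # two cases run one after another
time_saved = 1 - paired / back_to_back
throughput_gain = back_to_back / paired - 1

print(f"back to back: {back_to_back:.3f} s, paired: {paired:.3f} s")
print(f"wall time saved: {time_saved:.1%}, throughput gain: {throughput_gain:.1%}")
```

Read as wall time, the paired run saves roughly 15%; read as throughput (cases finished per unit time), the gain is a little over 17%, which matches the figure quoted above.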

If anyone is interested in my results:

Configuration	Run 1 time (s)	Run 2 time (s)	Run 3 time (s)
2013SP1U4AVX2 3 together	254.029	254.746	254.699
2013SP1U4AVX2 3 together	254.035	254.879	254.317
2013SP1U4AVX2 3 together + affinity	260.951	259.077	259.467
2013SP1U4AVX2 3 together + affinity	253.476	253.68	254.064
2015U4AVX512 3 together	224.446	224.586	223.945
2015U4AVX512 3 together	226.375	226.078	226.203
2015U4AVX512 3 together + affinity	223.34	224.027	223.324
2015U4AVX512 3 together + affinity	220.253	225.738	219.16
2013SP1U4AVX2 1 alone	215.446
2013SP1U4AVX2 1 alone	217.835
2013SP1U4AVX2 1 alone + affinity	219.463
2015U4AVX512 1 alone	190.447
2015U4AVX512 1 alone	188.971
2015U4AVX512 1 alone + affinity	190.661
2015U4AVX512 2 together + wrong affinity	323.886	323.058



2013SP1U4AVX2 3 together median	254.508
2015U4AVX512 3 together median	224.2365
2013SP1U4AVX2 1 alone median	217.835
2015U4AVX512 1 alone median	190.447


2013SP1U4AVX2 3 vs 1 quotient	1.16835219
2015U4AVX512 3 vs 1 quotient	1.17742206
3 together 2015 vs 2013 quotient	0.88105875
1 alone 2015 vs 2013 quotient	0.87427181
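For anyone who wants to check the summary figures, here is a short Python sketch that recomputes them from the run times above. The posted "3 together" medians appear to pool the runs with and without affinity (that pooling reproduces both values), so the lists below do the same; the variable names are just for this example.

```python
from statistics import median

# All "3 together" runs, with and without affinity, in seconds.
together_2013 = [254.029, 254.746, 254.699, 254.035, 254.879, 254.317,
                 260.951, 259.077, 259.467, 253.476, 253.68, 254.064]
together_2015 = [224.446, 224.586, 223.945, 226.375, 226.078, 226.203,
                 223.34, 224.027, 223.324, 220.253, 225.738, 219.16]
# All "1 alone" runs, with and without affinity, in seconds.
alone_2013 = [215.446, 217.835, 219.463]
alone_2015 = [190.447, 188.971, 190.661]

m_t13, m_t15 = median(together_2013), median(together_2015)
m_a13, m_a15 = median(alone_2013), median(alone_2015)

quotients = {
    "2013SP1U4AVX2 3 vs 1": m_t13 / m_a13,
    "2015U4AVX512 3 vs 1": m_t15 / m_a15,
    "3 together 2015 vs 2013": m_t15 / m_t13,
    "1 alone 2015 vs 2013": m_a15 / m_a13,
}

for name, value in quotients.items():
    print(f"{name}: {value:.8f}")
```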

performance when running one compiled app vs. multiple ones simultaneously