Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

performance when running one compiled app vs. multiple ones simultaneously

Nick2
New Contributor I

So, this confused the daylights out of me.

I'm running on an Intel Core i7-3770 @ 3.4 GHz, 4 cores x hyperthreaded = 8 cores.

For case 1, I compiled on ifort 2013 SP1 Update 4, as:

      <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" MultiProcessorCompilation="true" GenAlternateCodePaths="codeForAVX" IntegerKIND="integerKIND8" RealKIND="realKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FloatingPointModel="source" FlushDenormalResultsToZero="true" Traceback="true" />

Case 2 is not important, other than the fact that it ran and consumed CPU cycles.  It was done using 2015 Update 3.

For Case 3, I ran ifort 2015 Update 4, and I added  GenAlternateCodePaths="codeForCommonAVX512"

I ran all 3 cases at the same time on the same machine, and the run times are as follows:

Case 1 239.852 s

Case 2 267.979 s

Case 3 260.182 s

Naturally, I wonder what is wrong with the 2015 Update 4 compiler, since the running time went up. So my first question is: why did the running time with 2015U4 go up?

Then I tried running Case 3 with nothing else running on my computer, and the running time was 198.092 seconds.  So, there is a huge improvement using this compiler?

The last test I did was to compile the same code, using 2013SP1U4, but I compiled it to have AVX2 instructions (instead of AVX512), and I also added O3.  Then I ran this in standalone, and the running time was 228.417 seconds, or about 5% faster than Case 1.

I would attribute this 5% speed improvement (standalone vs. 3 simultaneous cases) to adding O3 optimization.

However, with 2015U4, if I'm running only 1 case, the running speed is much, much faster, and if I'm running 3 simultaneously (on an 8-core CPU!) the speed is much, much slower.

-I'm seeing the same performance degradation of Case 3 vs Case 1 on Xeon E5-2699v3 and also on E5-1650 v3, but it's almost negligible on Xeon X5690.  Why is there this performance degradation when running more than one case at the same time on what seems like any newer CPUs?

-Why are the 2015U4 results so much slower when the compiled executable isn't the only thing running? 

-Or, am I the only person who has seen this?

I am tempted to use 2015U4 for the running speed I get when I run just one case ... but that means I need to buy a new CPU for each case I want to run.

Steven_L_Intel1
Employee

The Update 4 compiler is identical to the Update 3 compiler - in this update, the compiler did not change - so that isn't the issue.

You don't have 8 cores. You have 4 cores with 8 threads. There are MANY factors that can cause variable timing, including cache, background activities, virus scans and more.  Worse, when you run programs simultaneously, you introduce even more unpredictable behaviors with the schedulers and preemption.

An often recommended method for doing performance testing is to run ONE instance of the program multiple times, throwing out the best and worst cases and repeating until the average deviation becomes small enough. It may take dozens of runs to get there.
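The procedure described above could be sketched roughly as follows. This is only an illustration, not Intel's tooling: the function names, the 1% tolerance, and the run limits are assumptions chosen for the example, and "average deviation" is taken here to mean the mean absolute deviation relative to the mean.

```python
import statistics
import subprocess
import time


def time_once(cmd):
    """Wall-clock seconds for one run of cmd (a list of argv strings)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def trimmed(times):
    """Throw out the single best and single worst sample."""
    if len(times) < 3:
        raise ValueError("need at least three samples")
    return sorted(times)[1:-1]


def average_deviation(times):
    """Mean absolute deviation from the mean, as a fraction of the mean."""
    mean = statistics.fmean(times)
    return statistics.fmean([abs(t - mean) for t in times]) / mean


def benchmark(cmd, tolerance=0.01, min_runs=5, max_runs=50, timer=time_once):
    """Run ONE instance repeatedly until the trimmed samples agree.

    tolerance, min_runs and max_runs are illustrative defaults, not
    recommended values. Returns (trimmed mean in seconds, number of runs).
    """
    times = []
    while len(times) < max_runs:
        times.append(timer(cmd))
        if len(times) >= min_runs and average_deviation(trimmed(times)) <= tolerance:
            break
    return statistics.fmean(trimmed(times)), len(times)
```

A call such as `benchmark(["case3.exe"])` (a hypothetical executable name) would then report a trimmed mean and how many runs it took to settle.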

TimP
Honored Contributor III

As Steve hinted, it will be difficult to get consistent performance when running multiple applications with hyperthreading enabled.  If you don't want to set them manually to separate cores in Task Manager after starting them, you could build them with OpenMP, run them in separate windows, and put distinct settings for OMP_PROC_BIND on them, all to minimize the time they spend on the same core.

The 56xx CPU has the additional factor that not all cores are equal.  In the usual BIOS, the last 2 of the 6 cores on each CPU are those which don't share internal data paths.  You would want to set up so that you don't run multiple applications on either of the first 2 pairs of cores.  I never enable hyperthreading on CPUs of 4 or more cores if there is an option to disable it, although there are scenarios where HT could be useful.

On a multiple CPU platform you also need to pin each application to one CPU for maximum performance.  Remote memory access can easily degrade performance by 30%, just as pairing applications on a single core may do.
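As one way to avoid the Task Manager step, each case could be launched with an affinity mask from the start. The sketch below builds one mask per physical core and formats a Windows `start /affinity` command line. It assumes that the two hyperthreads of a core are numbered adjacently (logical processors 0 and 1 on core 0, 2 and 3 on core 1, and so on), which should be checked on the actual machine; the executable names are hypothetical.

```python
def one_thread_per_core_masks(physical_cores, threads_per_core=2):
    """One affinity bitmask per physical core, selecting its first logical processor.

    ASSUMPTION: sibling hyperthreads are numbered adjacently, so core k
    owns logical processors k*threads_per_core .. k*threads_per_core + threads_per_core - 1.
    """
    return [1 << (core * threads_per_core) for core in range(physical_cores)]


def start_command(exe, mask):
    """Format a cmd.exe line; 'start /affinity' takes the mask in hexadecimal."""
    return f'start "" /affinity {mask:X} "{exe}"'


if __name__ == "__main__":
    # Hypothetical executables for the three cases, one per physical core.
    cases = ["case1.exe", "case2.exe", "case3.exe"]
    for exe, mask in zip(cases, one_thread_per_core_masks(4)):
        print(start_command(exe, mask))
```

On a two-socket machine the same idea applies, except that each mask would need to select cores belonging to a single CPU so that memory accesses stay local.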

jimdempseyatthecove
Honored Contributor III

Don't forget about the CPUs that support Turbo mode.

Or to put it a different way, they throttle down when the core gets hot (heavier workload). And then there is the L3 or Last Level Cache contention.

Jim Dempsey

Nick2
New Contributor I

Ok, this makes more sense now.

I tried to run more controlled tests on the i7-3770 @ 3.4 GHz, and I wasn't able to reproduce the previous discrepancy of 2015U4 being worse than 2013SP1U4.  (2015U4 is better by about 12% to 13%.)  But I did reproduce the slowdown: the median running time of 3 cases together is 16% to 17% worse than that of just one at a time, which makes sense given the turbo frequency.

I compared 2013SP1U4AVX2 vs 2015U4AVX512 (both were O3, and all other flags are the same); I also looked into Windows 8.1's CPU scheduler vs setting CPU affinity from task manager shortly after the sequence begins running.  The CPU affinity setting didn't make much difference for me in this set of tests (though I should probably repeat this on the E5-2699v3)

The "wrong affinity" (where I set two executables to run on the same physical core) gives me a run time that's about 17% faster than running the two cases separately one after another "with turbo", so I still get a bit of benefit from hyper-threading if I have a large number of independent cases.
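The arithmetic behind that comparison can be checked with a few lines. This sketch assumes the "one after another" baseline is twice the 2015U4AVX512 "1 alone" median (190.447 s) and uses the first "wrong affinity" timing (323.886 s); both the function name and that choice of baseline are assumptions made for illustration.

```python
def ht_benefit(alone_s, paired_s):
    """Compare two cases sharing one physical core with running them back to back.

    Returns (throughput ratio, fraction of wall-clock time saved).
    """
    sequential_s = 2 * alone_s  # two runs, one after another
    return sequential_s / paired_s, 1 - paired_s / sequential_s


speedup, saved = ht_benefit(alone_s=190.447, paired_s=323.886)
print(f"throughput ratio: {speedup:.3f}, time saved: {saved:.1%}")
```

Expressed as throughput, the paired runs come out about 17.6% ahead; expressed as elapsed time, they finish about 15% sooner than two sequential runs.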

If anyone is interested in my results:

Build            Configuration                    Run 1 (s)   Run 2 (s)   Run 3 (s)
2013SP1U4AVX2    3 together                       254.029     254.746     254.699
2013SP1U4AVX2    3 together                       254.035     254.879     254.317
2013SP1U4AVX2    3 together + affinity            260.951     259.077     259.467
2013SP1U4AVX2    3 together + affinity            253.476     253.68      254.064
2015U4AVX512     3 together                       224.446     224.586     223.945
2015U4AVX512     3 together                       226.375     226.078     226.203
2015U4AVX512     3 together + affinity            223.34      224.027     223.324
2015U4AVX512     3 together + affinity            220.253     225.738     219.16
2013SP1U4AVX2    1 alone                          215.446
2013SP1U4AVX2    1 alone                          217.835
2013SP1U4AVX2    1 alone + affinity               219.463
2015U4AVX512     1 alone                          190.447
2015U4AVX512     1 alone                          188.971
2015U4AVX512     1 alone + affinity               190.661
2015U4AVX512     2 together + wrong affinity      323.886     323.058

Build            Statistic                        Time (s)
2013SP1U4AVX2    3 together median                254.508
2015U4AVX512     3 together median                224.2365
2013SP1U4AVX2    1 alone median                   217.835
2015U4AVX512     1 alone median                   190.447

Comparison                                        Quotient
2013SP1U4AVX2    3 vs 1                           1.16835219
2015U4AVX512     3 vs 1                           1.17742206
3 together       2015 vs 2013                     0.88105875
1 alone          2015 vs 2013                     0.87427181
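
For anyone who wants to re-derive the summary rows, the following sketch recomputes the medians and quotients from the raw timings. It assumes, based on the numbers matching, that each "3 together" median pools all twelve timings for that build (with and without affinity) and that each "1 alone" median pools all three; the variable names are made up for the example.

```python
from statistics import median

# Raw timings in seconds, copied from the results above.
together = {
    "2013SP1U4AVX2": [254.029, 254.746, 254.699, 254.035, 254.879, 254.317,
                      260.951, 259.077, 259.467, 253.476, 253.68, 254.064],
    "2015U4AVX512": [224.446, 224.586, 223.945, 226.375, 226.078, 226.203,
                     223.34, 224.027, 223.324, 220.253, 225.738, 219.16],
}
alone = {
    "2013SP1U4AVX2": [215.446, 217.835, 219.463],
    "2015U4AVX512": [190.447, 188.971, 190.661],
}

med_together = {build: median(t) for build, t in together.items()}
med_alone = {build: median(t) for build, t in alone.items()}

# Slowdown from running 3 cases at once, per build.
slowdown = {build: med_together[build] / med_alone[build] for build in together}

# 2015 vs 2013, for each way of running.
ratio_together = med_together["2015U4AVX512"] / med_together["2013SP1U4AVX2"]
ratio_alone = med_alone["2015U4AVX512"] / med_alone["2013SP1U4AVX2"]

for build in together:
    print(build, med_together[build], med_alone[build], round(slowdown[build], 8))
print("3 together 2015 vs 2013:", round(ratio_together, 8))
print("1 alone 2015 vs 2013:", round(ratio_alone, 8))
```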