Duncan_M_
Beginner
206 Views

Slow math performance on i7-5960X vs i7-3820

 

Hi, I'm running my own C++ developed neural net simulation software.

I have two machines with an i7-5960X and an i7-3820 at similar clock speeds. The same code (single threaded) runs much more slowly on the i7-5960X, taking 120s vs 16s on the i7-3820. It unavoidably makes a lot of use of the exp function. In the same software, a simulation which doesn't use the exp function has similar performance on both machines, so it must be the math function.

I guess I'm missing the right optimisation settings, but I have no idea which. I'm at a pretty basic level with such things.

It's compiling in Visual Studio 2010, with /O2 /Ot.

Any help appreciated.

9 Replies
Patrick_F_Intel1
Employee

Hello Duncan,

Can you profile your exp routine and verify that it is the slowdown? If I were trying to debug a timing issue like this, I would first put a wrapper around the exp function, then count the number of times exp is called and measure the time spent in it. It is best if the timing routine is fast and has very fine granularity (nanoseconds?).

Are you using C++?

Here are some sample timing routines showing how to use QueryPerformanceCounter (QPC) and the TSC. I also show the resolution of each: about 1000 nanoseconds for QPC and about 14 nanoseconds (10-30 clockticks) for the TSC-based timing routines. I haven't tested the variance of the timing routines lately. If each call to the exp code takes significantly longer than ~1000 nanoseconds, you can use the QPC code for timing; if the exp code takes < 1000 nanoseconds, then you end up mostly timing the QPC routine itself and you should use the TSC-based timing routines.

Here is some sample code:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static double qpcFreqInv = 0.0;

__int64 QPC_get(void)
{
    LARGE_INTEGER llint;
    QueryPerformanceCounter(&llint);
    return llint.QuadPart;
}

void QPC_init(void)
{
    LARGE_INTEGER llint;
    if(qpcFreqInv == 0)
    {
        QueryPerformanceFrequency(&llint);
        qpcFreqInv = 1.0/(double)(llint.QuadPart);
    }
}

int main(void)
{
    __int64 tm_beg, tm_end;
    __int64 tsc_beg, tsc_end;
    int i=0, j=0;
    double tm_diff, tsc_freq;

    QPC_init();
    tm_beg = QPC_get();
    Sleep(1000); // do your work here
    tm_end = QPC_get();
    tm_diff = (double)(tm_end-tm_beg) * qpcFreqInv;
    // sleep timers generally wake up every 16 ms on Windows, so this won't wake up at exactly 1000 ms
    printf("Expected to sleep 1000 ms +- 16 ms. Timer shows elapsed time= %f seconds\n", tm_diff);

    // now measure the resolution of the QPC timer
    tm_end = QPC_get();
    while(tm_end == QPC_get())
    {
        // spin until current timer != initial timer
        i++;
    }
    // now timer has changed
    tm_beg = tm_end;
    while(1)
    {
        // spin until current timer != initial timer
        j++;
        tm_end = QPC_get();
        if(tm_end != tm_beg)
            break;
    }
    tm_diff = (double)(tm_end-tm_beg) * qpcFreqInv;
    // QPC has a resolution of about 1 microsecond
    printf("difference in timers= %g nanoseconds (1e-9 secs), qpc freq= %f\n", tm_diff*1.e9, 1.0/qpcFreqInv);
    printf("print out i,j values just so compiler won't optimize loops away. i= %d, j= %d\n", i, j);

    // now get TSC (time stamp counter) frequency
    tm_beg = QPC_get();
    tsc_beg = __rdtsc();
    // spin for 1 second
    while(1)
    {
        tm_end = QPC_get();
        tsc_end = __rdtsc();
        tm_diff = (double)(tm_end-tm_beg) * qpcFreqInv;
        if(tm_diff >= 1.0) // go for 1 second
        {
            break;
        }
    }
    tsc_freq = (double)(tsc_end - tsc_beg)/tm_diff;
    printf("spin for %f seconds. TSC freq= %f MHz\n", tm_diff, tsc_freq * 1e-6);

    // now do a similar timer resolution loop for the TSC
    tsc_beg = __rdtsc();
    while(1)
    {
        // spin until current timer != initial timer
        tsc_end = __rdtsc();
        if(tsc_end != tsc_beg)
            break;
    }
    tsc_beg = tsc_end;
    while(1)
    {
        // spin until current timer != previous timer
        tsc_end = __rdtsc();
        if(tsc_end != tsc_beg)
            break;
    }
    tm_diff = (double)(tsc_end-tsc_beg) / tsc_freq;
    // TSC-based timing has a resolution of about tens of clockticks
    printf("difference in tsc timers= %g nanoseconds (1e-9 secs), tsc freq= %f\n", tm_diff*1.e9, tsc_freq);
    printf("tm_diff %g nanoseconds\n", 1e9*tm_diff);
    printf("Took about %g TSC clockticks to do last loop\n", tsc_freq*tm_diff);
    return 0;
}

All that having been said, is it possible that the cores you are running on are going to sleep, or that your code is getting moved to a core that is asleep? If you have I/O or some system call in your code, then you could be getting moved around to different CPUs. You could pin your code to a CPU using 'start /affinity ...' or with Task Manager. Alternatively, run some lowest-priority code on all the CPUs to keep them from going to sleep, or disable C-states in the BIOS.

Pat

TimP
Black Belt

Another question is whether you are running in 64-bit (x64) mode on both machines, or perhaps on only one of them.

Patrick_F_Intel1
Employee

And you might check whether the Haswell-based system has the Windows power options set to 'High performance'.

Bernard
Black Belt

I would also advise running a VTune analysis of your project on both machines. Look at the Back-End Pipeline Stalls events, which will show you where in the CPU pipeline execution stalled. At the beginning of testing, though, I think Pat's approach should be used before starting a full-scale analysis with VTune.

Bernard
Black Belt

@Pat

IIRC the exp function can be very fast; I think it executes in less than a single tick of the HPET timer (on which QueryPerformanceCounter/Frequency is based). I suppose that in the case of exp it could be less than 100 nanoseconds. Looking at the Intel VML library's exp function, the execution speed seems to be around ~13 clocks per element, as stated at the following link: https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vml/functions/_performance...

 

Bernard
Black Belt

As a follow-up to my last post, I would like to add that the exact implementation of the exp function is not known at this point. So I would advise putting a breakpoint on the call to the library exp function and investigating its implementation. You may also look at this IDZ post, which is partially related to your problem:

https://software.intel.com/fr-fr/forums/topic/392924

David_H_7
Beginner

In case this thread is still of interest, I've recently had a similar problem. An i7-5960X installed on an Asus X99 Deluxe motherboard took 41 secs to perform a parallel-processing calculation that took 6 secs on my previous MSI GT70 laptop with an i7-3610QM processor. After recovering from a sinking feeling in my gut (having spent many $s on this new kit) and quite a bit of googling, I discovered that the motherboard needs tuning (as in overclocking) even to get the standard 3 GHz clock speed from the processor. Whilst the Asus X99 is an enthusiast's board, I had not anticipated that from start-up the processor would be running at a very modest speed (about 800 MHz); after tuning using the Asus AI Suite, the calculation took under 3 secs with the processor running at about 4 GHz.

TimP
Black Belt

I haven't definitely identified what made the difference (I suspect BIOS update), but my Haswell laptop in original configuration would down-clock out of Turbo mode if all logical processors were in use, and would not recover for several seconds.  As there is no BIOS option to restrict it to one logical processor per core, this meant taking measures at run time to set processor affinity and limit the total number of threads.  If your application depends on cache locality, avoiding the normal Windows action of rotating across logical processors gains additional importance. 

Microsoft appears to have no application-controlled means of setting affinity, so I always run OpenMP applications against the Intel OpenMP library, even when using Microsoft OpenMP compilation. Microsoft once "committed" to implementing OpenMP affinity but then apparently dismantled the team which was to support that small move toward the current OpenMP standard. Among other things, the Microsoft proposal was contingent on the equivalent of Intel's KMP_BLOCKTIME=0 (no timing out of OpenMP threads). It may be that improvement in Microsoft OpenMP can occur only with pressure, and a contribution of talent, from Intel.
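For reference, pinning with the Intel OpenMP runtime is controlled through environment variables, along these lines (the exact KMP_AFFINITY values depend on your topology; consult the Intel OpenMP documentation before copying these):

```bat
set KMP_AFFINITY=granularity=fine,compact,1,0
set KMP_BLOCKTIME=0
set OMP_NUM_THREADS=8
```

With granularity=fine each thread is bound to a single logical processor, and KMP_BLOCKTIME=0 makes idle OpenMP threads yield immediately instead of spin-waiting.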

Under Linux, on an early Haswell server, we observed a marked improvement in performance by using taskset to restrict the group of logical processors used by an application. Without that, performance would drop significantly after the application was idled and then resumed. At that time, the BIOS option to disable HyperThreading was broken.

I no longer observe a marked performance drop on my laptop when using all logical processors, although throughput is still maximized by keeping one idle.  I have checked this issue with both Windows 8.1 and 10, and there is no difference. Windows 10 blocked access to the BIOS setup menu, but upon returning to 8.1 there still is no option to control Hyperthreading or Turbo in the BIOS setup.

I haven't observed any ability of Intel OpenMP library to capture Cilk(tm) Plus workers and pin them to cores, such as has been demonstrated for applications written to pthreads on linux (which depends on the Intel OpenMP for linux based on pthreads).  gnu OpenMP on Windows (which uses the pthreads layer over Microsoft threads) is markedly less capable of supporting affinity than the same library on linux or the Intel one.

So I would guess this Asus "tuning" involves some scheme for improving thread locality or tinkering with the schedule for cutting back clock speed under various patterns of logical processor use.  I couldn't find definite information on that.  Running a single thread at maximum turbo speedup might depend on moving it across cores quickly enough to avoid local heat buildup.

Bernard
Black Belt

@David H

I had a similar issue with throttled CPU frequency. I solved it by tweaking a setting in the ASUS software which was responsible for lowering the CPU operating frequency.
