I have a strange issue, which I think is pointing at a design problem in my software. But I'm not sure if I'm correct here.
I have made a program for audio processing that I normally test on my laptop. It's highly optimized using SSE2 code and Intel IPP. I've now bought a new PC with a CPU that's almost twice as fast (CPUBenchmark.com: 5800 for the old 2.0 GHz i7-2630QM with 6 MB cache vs. 9600 for the new 3.1 GHz i7-4770S with 16 MB cache). So I tried to run my program on the new PC and found out that the performance is nearly identical (less than 5% faster).
I ran a benchmark, and found that the memory was slower than on my old PC, because only one memory channel was used (benchmark score: old PC 1600, new 1100). So I placed a second memory module in it for parallel access, and now it reaches 2000.
I ran my program again, and it does run faster - but only about 5% faster than before. I've run a full benchmark suite and every test in it is about twice as fast as on my laptop, but my software is only about 10% faster.
The only thing in the benchmark that's also only about 10% faster on the new system is memory latency. Which leads me to suspect (without further proof at this moment) that I'm doing something in my code that causes it to be limited by the memory latency.
One other clue: I can run my software on a single core or on 2 cores. If I run it on 2 separate physical cores, the throughput improves a lot: the CPU load on the first core drops only slightly less than the load the extra core adds. But if I split it over 2 logical cores that share a physical core (via Hyperthreading), the first core's load drops almost nothing while the second one's rises a lot (a 5% drop vs. a 60% rise).
Am I right to suspect caching problems? Probably because I would be using cache lines that are aliased? (My code uses blocks of 16 kB memory, which can easily cause aliasing issues, but I'm allocating them with some data in between to avoid that. Or so I think.) Any help is appreciated...
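To illustrate the kind of padded layout I mean (a sketch with illustrative sizes and a hypothetical helper, not my actual code): consecutive 16 kB blocks are separated by one cache line of padding, so they don't all start at the same offset within a cache set.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: hand out N processing blocks of 16 KiB each from one arena,
// inserting a 64-byte pad between blocks so consecutive blocks do not
// start at identical set offsets (which would alias in the cache).
constexpr std::size_t kBlockSize = 16 * 1024; // 16 KiB blocks, as described
constexpr std::size_t kPad       = 64;        // one cache line of padding

std::vector<std::uint8_t*> allocate_padded_blocks(std::uint8_t* arena, int n) {
    std::vector<std::uint8_t*> blocks;
    for (int i = 0; i < n; ++i)
        blocks.push_back(arena + i * (kBlockSize + kPad));
    return blocks;
}
```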
One thing to note is that the SSE2 instruction set only provides access to a fraction of the peak performance of either of your processors.
For integer arithmetic, SSE2 provides most of the performance capability of the Intel Core i7-2630QM (Sandy Bridge), but for floating-point arithmetic you can double the peak performance by switching to the AVX instruction set. (The first generation AVX instruction set supports packed 32-bit and 64-bit floating-point arithmetic in the 256-bit registers, but support for packed integer arithmetic in the 256-bit registers was deferred until AVX2 -- initially supported in the Haswell core.)
Your newer Intel Core i7-4770S (Haswell) system provides at least three additional improvements, but all of them require using the AVX and/or AVX2 instruction set extensions.
- The AVX2 instruction set provides a large set of packed integer instructions for the 256-bit AVX registers. This can provide a factor of two improvement in peak integer signal processing performance relative to your previous system.
- For floating-point arithmetic, Haswell's implementation of AVX2 supports two Fused Multiply-Add functional units, rather than one adder and one multiplier, providing another doubling of peak performance for some workloads -- but only if the new instructions are used.
- The Haswell core also supports twice the L1 and L2 cache bandwidth of your earlier system, but you have to use the 256-bit AVX or AVX2 instructions to get the increased L1 bandwidth.
As iliyapolak recommended, you should start with profiling the execution of the code to find out where it is actually spending its time. Once the distribution of execution time is known, you will need to do some analysis to determine whether the most heavily used functions are taking the expected amount of time. If they are slower than they should be, then you will need to look into whether the code is generated properly and whether there are unexpected rates of cache misses or other microarchitectural hazards.
Looking over your results, I would say that it is good news that adding the extra memory module only provided a 5% performance improvement -- that suggests that the performance is compute-bound, so the extra performance of the newer processor should be helpful. (It is possible that the performance is memory-latency-bound, but that is not common in audio signal processing applications unless there is a serious cache conflict problem.)
The performance drop when running two threads on one core could easily be due to simply sharing the cache. If your data blocks are 16 KiB, then you have plenty of extra L1 data cache when running 1 thread, but with two threads you would completely fill the L1 data cache with one block per thread -- leaving no extra room for any other data for either thread. This typically leads to greatly increased L1 data cache miss rates.
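The arithmetic above can be written down directly (a sketch; both Sandy Bridge and Haswell have a 32 KiB L1 data cache per physical core, shared by the two HT threads):

```cpp
#include <cstddef>

// Back-of-envelope for the L1 data cache pressure described above:
// two HT threads, one 16 KiB block each, on a 32 KiB L1D.
constexpr std::size_t kL1D   = 32 * 1024;  // L1 data cache per physical core
constexpr std::size_t kBlock = 16 * 1024;  // one data block per thread
static_assert(2 * kBlock == kL1D,
              "two HT threads' blocks fill the entire L1D, leaving no room");
```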
I recommend that you start the performance analysis on a single thread, then look at threads on independent cores. HyperThreading is useful for improving overall system throughput, but it makes performance analysis much more difficult.
>>>I ran my program again, and it does run faster - but only about 5% faster than before. I've run a full benchmark suite and every test in it is about twice as fast as on my laptop, but my software is only about 10% faster>>>
Does your audio processing software operate on FP values? If it does, you will see some improvement by porting your code to 256-bit wide AVX instructions and vectors. For example, if you have code that consists of mul and add instructions (like a Horner-scheme polynomial evaluation), it can be executed on Port 0 and Port 1 by FMA instructions (AVX2) with a latency of 5 cycles, which is quite a large improvement compared to Sandy Bridge, where those mul and add instructions had a combined latency of 8 cycles.
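As a scalar sketch of the Horner scheme mentioned above (illustrative coefficients): each step is one multiply-add, which the compiler can contract into a single fused multiply-add (on Haswell, an FMA instruction).

```cpp
#include <cmath>

// Horner evaluation of c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1].
// std::fma computes r*x + c[i] as one fused multiply-add, mirroring
// the mul+add fusion discussed for AVX2/FMA hardware.
double horner(const double* c, int n, double x) {
    double r = c[0];
    for (int i = 1; i < n; ++i)
        r = std::fma(r, x, c[i]);  // r = r*x + c[i], one FMA per step
    return r;
}
```

In the vectorized version the same recurrence runs on 4 (AVX) or 8 (AVX-512) samples per register, which is where the throughput gain comes from.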
Ok, thanks. I fear I'm going to have to install VTune on both systems to do comparisons and see where the performance improved and where it didn't. I have run it through VTune many times in the past, but at the time I only looked for hotspots and expensive memory reads and optimized those.
To make things weirder, I've added another system to the test, with a 4770K processor (which should be slightly faster than the 4770S). On this system my software runs a lot better (about 30% less CPU load), but the benchmark suite shows worse performance (the memory is a bit slower, among other things, and certain CPU performance tests show much worse results). The memory latency on this system - according to the benchmark - is another 10% lower than that of the 4770S systems.
Since my code spends about 40% of its time inside IPP's FFT functions, I've run a small program that only performs FFTs (on the same block of memory every time), and it runs at nearly identical speed on the 4770S and 4770K - 1.7 times as fast as on my laptop - which matches the CPU benchmarks (well, I had actually expected a bit more from the 4770K). So the CPU isn't the problem; something else is blocking it.
If I find out what's the problem I'll post it here - might be interesting for others as well.
Ok. Performance on the 4770K and 4770S machines turns out to be nearly identical for most of my software. What confused me for a while was that on the 4770S system the hyperthreaded core pairs are numbered 0/1, 2/3, 4/5, 6/7, but on the 4770K they're 0/4, 1/5, 2/6, 3/7. Which led me to believe that it handled Hyperthreading much better...
Now, if I run a computation-intensive piece of code and use Hyperthreading, it gives nearly no performance improvement at all. On all 3 systems that I tried. And that makes sense.
I still see a smaller improvement than expected (the 4770S and 4770K are now both about 1.5 times faster than my laptop; according to the benchmarks it should be roughly a factor of 2). That still doesn't explain why part of the software runs at almost the same speed on these much faster systems, but at least the other weird effects are gone now. I will start running VTune on it tomorrow.
It is getting pretty hard to track your results without specific, quantifiable results. It would be very helpful if you could quantify the performance with some metric like 'work_done per clocktick' where work_done is something pertinent to your workload (like audio_packets_decoded or something like that).
This will help us compare performance across different cpus.
If you are on Windows, you can use the call below to get a high-resolution cycle time for your process.
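The specific call did not survive in this copy of the thread; based on the description that follows (user plus kernel time, reference cycles) it is most likely `QueryProcessCycleTime` - that name is my assumption, not something stated in the thread. A minimal sketch, with a portable placeholder so it compiles off Windows:

```cpp
#include <cstdint>
#include <ctime>
#ifdef _WIN32
#include <windows.h>
#endif

// Cycle count for the current process. On Windows this uses
// QueryProcessCycleTime (an assumption about the call meant above),
// which sums user and kernel cycles over all threads of the process.
// The non-Windows branch is only a placeholder so the sketch builds.
std::uint64_t process_cycles() {
#ifdef _WIN32
    ULONG64 cycles = 0;
    QueryProcessCycleTime(GetCurrentProcess(), &cycles);
    return cycles;
#else
    return static_cast<std::uint64_t>(std::clock());  // CPU-time placeholder
#endif
}
```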
The above call includes both user and kernel time but the cycles are 'reference cycles' so the cycles run at the same frequency as the TSC. Work gets done at core frequency... not at the TSC frequency. If you are trying to compare different processors, it would greatly simplify things if you set the frequency to be the same as the TSC.... or at least disable turbo mode.
I can't tell if everything is running at the same frequency or if other processes are interfering with your process, or if there is lots of idle time or if you are accurately seeing which HT threads are sharing a cpu, or if you are pinning the threads so you are sure that the threads are staying on the cores.
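For the pinning part, something like the following is the usual approach (a sketch: `SetThreadAffinityMask` on Windows, with the POSIX equivalent included so it is self-contained; CPU numbering follows whatever topology the OS reports, which, as noted above, differs between systems):

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#include <sched.h>
#endif

// Pin the calling thread to one logical CPU, so you know exactly which
// core (and which HyperThread sibling) it runs on while measuring.
bool pin_current_thread(int cpu) {
#ifdef _WIN32
    // One bit per logical CPU in the affinity mask.
    return SetThreadAffinityMask(GetCurrentThread(),
                                 DWORD_PTR(1) << cpu) != 0;
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#endif
}
```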
When using HT, some resources like the registers are duplicated (each HT thread has its own registers) and other resources are shared. The FP unit is shared (unless they've added a FP unit... which I doubt) so... HT would speed up your code if all you are doing is integer operations and then block while each HT thread uses the FP unit. If I understand correctly (looking at http://www.anandtech.com/show/6355/intels-haswell-architecture/8 first figure), with the Haswell FMA instruction then it is possible to get a simultaneous FMA on each HT thread.
I'm not seeing gains for HT even with Haswell FMA in floating point benchmarks.
The strange effect I see with HT is that allowing parallel sections compiled with Intel Windows compilers to run on all hyperthreads degrades the performance of subsequent serial (single thread) code sections, while pinning threads to distinct cores has relatively little effect on parallel performance but avoids degrading the serial sections.
Running the same benchmarks compiled with gnu compilers doesn't exhibit the degradation of serial sections, regardless of parallel settings.
>>>Now, if I run a computation-intensive piece of code and use Hyperthreading, it gives nearly no performance improvement at all. On all 3 systems that I tried. And that makes sense.>>>
I think that it depends more on thread-level parallelism. For example, numerical integration can be parallelized by dividing the domain, with multiple threads each performing Simpson's rule integration on their own part.
>>>If I understand correctly (looking at http://www.anandtech.com/show/6355/intels-haswell-architecture/8 first figure), with the Haswell FMA instruction then it is possible to get a simultaneous FMA on each HT thread.>>>
IIRC all execution ports are shared by the HT threads; only the architectural registers and the APIC are duplicated. So from a theoretical point of view, on a Haswell physical core two threads can use both of the FMA units at the same time. I think this is possible because the front end can track uops per thread.
Well I've figured out a few things.
First of all, something seems to be wrong with my new test system. I'm getting BSOD's... Oddly, at the moment I *am* seeing a performance gain when I use Hyperthreading, and in some cases the gain is even close to what I get when I use completely separate cores. But much seems to depend on whether I run my software as a plugin in other software or as a stand alone program (this makes no sense because the threads that are affected by Hyperthreading are doing the exact same thing in both cases). And the version that currently gives me a performance gain didn't do that a few days ago, with the same settings.
I should probably get rid of the cause of the BSOD's first, before I draw any conclusions (since measurements seem to vary wildly).
There are several - but all in the same file, so I'm suspecting a driver issue. A few days ago I had some weird behavior where core 1 was constantly used at 100% - kernel time. Turned out to be DPC calls. I then rebooted and started in safe mode; after booting back to normal mode the Hyperthreading behavior appears to have changed, but since that moment I'm also seeing the BSOD's. This doesn't make any sense to me though. Fortunately I have a few more test systems (I have 3 identical ones) so I'm going to try what happens on those. And probably reinstall Windows on the system with the BSOD's.
>>>A few days ago I had some weird behavior where core 1 was constantly used at 100% - kernel time. Turned out to be DPC calls.>>>
You should try to identify which driver(s) spend prolonged time servicing DPC routines. You can do that with the Xperf tool.
Regarding the BSOD's, I can look at them if you upload the minidump files.
iliyapolak: Hm, good idea to run that tool anyway (I need to know if I can reliably reach low latencies for my software). But the DPC timing issue disappeared after the restart. I'm running a measurement now (1 minute); the highest DPC latency encountered so far is 79 us. The highest measured interrupt-to-process latency is 2504 us, which is too high (at least for what I want to do).
BSOD errors are 7a (KERNEL_DATA_INPAGE_ERROR, 3 times), f4 (CRITICAL_OBJECT_TERMINATION, 3 times) and 3b (SYSTEM_SERVICE_EXCEPTION, once). They started around the time of the reboot - but I just realized that that same day I also put in an extra stick of memory. So, I'll pull that out first and see if it's stable again.
>>>Highest measured interrupt to process latency is 2504 us, which is too high (at least for what I want to do).>>>
Still, that is below the Microsoft recommended threshold for a DPC (~100 ms) and should not trigger a DPC_WATCHDOG_VIOLATION BSOD.
CRITICAL_OBJECT_TERMINATION - I do not think that BugCheck is directly related to the new RAM addition. Of course, nobody can know without debugging what chain of events led to that BSOD.
>>>Hm good idea to run that tool anyway (I need to know if I can reliably reach low latencies for my software)>>>
I usually profile my software with VTune and with Xperf. It is a good idea to combine both of those tools.