Cache speed problems.

Kasutaja_A_
Beginner
1,195 Views

I have 64-bit Windows 7 on a Dell N5110 laptop with an Intel i5-2450M. While writing a program that makes heavy use of memory and cache, I discovered that my cache speeds are very slow.

Here is a benchmark (run with a tool called pmbw) showing the results I usually get (the first page has the bandwidth):

http://www.docdroid.net/wuyx/plots-andres-pc.pdf.html

I think it should be closer to 100 GB/s than to 25 GB/s in the L1 region (<32 KB). Even with the <2 KB block size benchmark I get less than 25 GB/s read bandwidth. This is the single-threaded test; with multithreaded tests I get twice the bandwidth.

What is weird is that sometimes I do get around 100/50 GB/s read/write. For example, this morning after turning on my laptop the first thing I ran was the test and here are the results (I ran only read this time): 

http://www.docdroid.net/wx4a/plots-andres-pc-fast-read.pdf.html

Then I ran it again with both read and write tests (which takes a couple of minutes) and got 25/50 GB/s read/write. The run after that gave 25/12 GB/s read/write, and it stayed like that. The second run still got 50 GB/s write speed because it ran the write tests first and the read tests second, and by the time it finished the write tests my cache speeds had decayed.

I have also tested my memory with the SiSoftware Sandra cache bandwidth test and with my own memory read loop, and the results are similar. I ran the same pmbw benchmark in my school's computer lab on an i5-4200M, which consistently gave 160/80 GB/s read/write for small block sizes. So the problem doesn't seem to be in the testing software.

I get the higher speeds only rarely, and they quickly disappear. What could be causing this? I checked that I have the latest BIOS, and my Windows 7 power settings are on High Performance. I didn't have any background programs running while testing. The CPU is not overheating while the tests are running (only around 50 °C).

Is there some other benchmark, test, or diagnostic I could run to pinpoint the problem? Is my CPU failing, or is this some sort of BIOS/driver/Windows/software issue?

0 Kudos
1 Solution
McCalpinJohn
Honored Contributor III
1,195 Views

Just as a sanity check, the Intel Core i5-2450M is a dual-core Sandy Bridge mobile processor, with a base frequency of 2.5 GHz, a maximum Turbo frequency of 2.8 GHz using both cores, and a maximum Turbo frequency of 3.1 GHz using 1 core.

  • Using either SSE or AVX code, the Sandy Bridge core is capable of loading 2x16 Bytes per cycle.  At 3.1 GHz this puts the peak L1 DCache Read BW at 3.1*32=99.2 GB/s using 1 core. 
    • The best observed cache read bandwidth values of  ~91 GB/s are more than 90% of the peak bandwidth, which is both reasonable and expected.
  • Using both cores, the maximum Turbo frequency drops to 2.8 GHz, which gives a peak combined Read BW of 2.8*64=179.2 GB/s.
  • Bandwidths for writes (in isolation) are 1/2 of the bandwidth for reads in each case.
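
The peak figures above are just frequency times bytes per cycle times core count. A minimal sketch of that arithmetic follows; the function name and constants are illustrative, and the 16 bytes per cycle used for stores is only the "half the read rate" statement from the list above expressed as a number.

```python
def peak_l1_bandwidth_gbs(freq_ghz, bytes_per_cycle=32, cores=1):
    """Peak L1 data-cache bandwidth in GB/s: GHz x bytes per cycle x cores."""
    return freq_ghz * bytes_per_cycle * cores


LOAD_BYTES_PER_CYCLE = 32   # two 16-byte loads per cycle (Sandy Bridge, per the post)
STORE_BYTES_PER_CYCLE = 16  # writes in isolation: half the read rate (per the post)

single_core_read = peak_l1_bandwidth_gbs(3.1, LOAD_BYTES_PER_CYCLE)
dual_core_read = peak_l1_bandwidth_gbs(2.8, LOAD_BYTES_PER_CYCLE, cores=2)
single_core_write = peak_l1_bandwidth_gbs(3.1, STORE_BYTES_PER_CYCLE)

print(f"1 core read,  3.1 GHz: {single_core_read:.1f} GB/s")
print(f"2 cores read, 2.8 GHz: {dual_core_read:.1f} GB/s")
print(f"1 core write, 3.1 GHz: {single_core_write:.1f} GB/s")
```

These are upper bounds only; measured values around 90% of them are what the benchmark can be expected to reach.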

These are "best case" values, and in the Sandy Bridge core there are quite a few mechanisms that can reduce the observed values by relatively large factors.  Some of the more common problems can be ruled out fairly quickly:

  • The pmbw "ScanRead256PtrUnrollLoop" test is coded directly in assembly language, so we can rule out incorrect instruction selection. 
  • The pmbw "ScanRead256PtrUnrollLoop" test only reads one array, so there should be no associativity conflicts.
  • The pmbw "ScanRead256PtrUnrollLoop" test performs no writes, so there should be no read/write ordering conflicts (false or real) to prevent the reads from being executed efficiently out-of-order.

Other potential problems are harder to diagnose without additional information, and I don't know how to investigate any of these on Windows.

  • Processor frequency can drop as low as 1.2 GHz due to standard power-saving mechanisms. (I don't know if the mobile processors have a lower minimum frequency than the server processors that I usually work with.)  Sometimes frequency is controlled by the BIOS, sometimes by the OS.  I have run across cases where BIOS-controlled frequencies adapted to load much more slowly than CPU-controlled frequencies, leading to erratic performance for the first few seconds of various benchmarks.  It seems plausible that a laptop might be more likely to be configured to save energy by responding more slowly to changes in load.
    • Even at 1.2 GHz the peak read BW is 1.2*32=38.4 GB/s, or about 34.6 GB/s at 90% efficiency.  The ~23 GB/s measured values for the "slow" case are well below this -- but do correspond almost exactly to 90% efficiency at 800 MHz operation (0.8*32=25.6 GB/s peak, 23.0 GB/s at 90%).
  • L1 and L2 caches are always synchronous in these processors, while the L3 cache may or may not be running at the same frequency.  (It runs at the same speed in the server parts -- I am not sure about the mobile processors).
    • Your "good" results show L1/L2/L3 read bandwidths of approximately 90/41/29, giving ratios of 1.00/0.46/0.32.
    • Your "slow" results show L1/L2/L3 read bandwidths of approximately 23/10.5/7.5, giving ratios of 1.00/0.46/0.33.
    • The fixed ratios suggest (but do not prove) that the problem is due to the processor running at the wrong frequency.
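
The frequency argument above can be run in reverse: divide a measured L1 read bandwidth by 32 bytes per cycle and an assumed 90% efficiency to get the clock the core was probably running at. The sketch below does that and also recomputes the L1/L2/L3 ratios. The function names are illustrative, the 90% efficiency is an assumption taken from the "good" case, and the inputs are the rounded figures quoted in this post, so the last digit of a ratio may differ slightly from the quoted one.

```python
def implied_frequency_ghz(measured_gbs, bytes_per_cycle=32, efficiency=0.90):
    """Core frequency (GHz) that would produce the measured L1 read bandwidth."""
    return measured_gbs / (bytes_per_cycle * efficiency)


def ratios_to_l1(l1, l2, l3):
    """Bandwidth of each cache level relative to L1, rounded to two decimals."""
    return tuple(round(level / l1, 2) for level in (l1, l2, l3))


good = (90.0, 41.0, 29.0)  # approximate "good" L1/L2/L3 read GB/s from the post
slow = (23.0, 10.5, 7.5)   # approximate "slow" L1/L2/L3 read GB/s from the post

for name, (l1, l2, l3) in (("good", good), ("slow", slow)):
    freq = implied_frequency_ghz(l1)
    print(f"{name}: implied frequency ~{freq:.2f} GHz, "
          f"ratios {ratios_to_l1(l1, l2, l3)}")
```

If the ratios match between the two runs while the implied frequency differs by roughly 4x, a clock-speed problem is a more likely explanation than a cache-specific one.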

With this suggestion of a frequency setting problem, I would proceed by using the performance counters to count actual CPU cycles for each loop length in the benchmark.  Unfortunately I have absolutely no idea how one does this in the Windows operating system.

0 Kudos
7 Replies
Bernard
Valued Contributor I
1,195 Views

Without in-depth profiling it is very hard to tell what is causing these fluctuations in the results. You can download the trial version of VTune and run the analysis while the benchmark is running.

0 Kudos
TimP
Honored Contributor III
1,195 Views

If you are running 4 threads migrating randomly over the 4 logical processors and 2 physical cores, variations in apparent cache performance are to be expected.  If the instructions for the benchmark don't cover how to run on hyperthreads, and you have no BIOS option to disable them, you will need to study how to optimize the number of threads and their placement (probably 1 thread pinned to each core).
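
A minimal sketch of the "1 thread pinned to each core" placement, expressed as affinity bitmasks. It assumes the two hyperthreads of a core are numbered adjacently (logical CPUs 0 and 1 on core 0, 2 and 3 on core 1); that numbering is an assumption and should be checked on the actual machine. The function names are illustrative.

```python
def per_core_masks(physical_cores, threads_per_core=2):
    """One affinity mask per thread, each selecting the first logical CPU of a core.

    Assumes hyperthread siblings are numbered adjacently: (0,1), (2,3), ...
    """
    return [1 << (core * threads_per_core) for core in range(physical_cores)]


def combined_mask(physical_cores, threads_per_core=2):
    """Process-wide mask allowing only the first logical CPU of each core."""
    mask = 0
    for thread_mask in per_core_masks(physical_cores, threads_per_core):
        mask |= thread_mask
    return mask


# Dual-core CPU with Hyper-Threading, as in the i5-2450M discussed here.
masks = per_core_masks(2)
print([bin(m) for m in masks])   # one mask per benchmark thread
print(hex(combined_mask(2)))     # mask for the whole process
```

On Windows the process-wide mask can be given in hexadecimal to `start /affinity`, and the per-thread masks are the kind of value the Win32 `SetThreadAffinityMask` call takes; whether the benchmark itself offers a pinning option is not something this sketch assumes.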

Is your own program using pthreads?  pthreads implementations for Windows aren't particularly popular, as they don't have a reputation for being optimized for that OS. I don't think it's feasible to guess which pthreads implementation you use. You do of course need Windows 7 SP1 for satisfactory hyperthread behavior.

0 Kudos
Bernard
Valued Contributor I
1,195 Views

Before testing, run Process Explorer and measure the CPU load. Pay attention to DPCs and interrupts, and to system threads, meaning those running inside the System process. Leave Process Explorer running and start your benchmark. During the benchmark, observe CPU activity and watch the same items.

0 Kudos
Kasutaja_A_
Beginner
1,195 Views

It turned out to be a frequency issue. My processor was running at 800 MHz. I downloaded a tool called ThrottleStop to disable a signal called BD PROCHOT, which the motherboard sends to the CPU to slow it down when some sensor reports too high a temperature. After disabling it, my processor runs fine: it runs at the proper frequency and the cache speeds are all right (around 90 GB/s read and 45 GB/s write for the L1 cache).

Weirdly, my CPU runs at 2.9 GHz continuously under load (I tried for a couple of minutes), but the base frequency should be 2.5 GHz. I don't know whether Turbo normally lasts that long. I haven't overclocked it, but it is a used laptop.

What I have noticed is that now that my processor can run freely, the temperatures increase, but my fan doesn't seem to turn on. What might have happened is that my fan stopped working and some sensor on my motherboard burned out and started sending false positives. I will check whether something is stuck in the fan or whether it needs to be replaced.
 
Thanks for the help.
0 Kudos
Kasutaja_A_
Beginner
1,195 Views

Took my laptop apart and pulled out some loose metal and plastic parts. Also discovered that the plastic is broken where the screws for one of the display hinges attach, ouch. Luckily there is also some sort of socket there which seems strong, so hopefully it won't be falling apart soon.

With the loose parts out, the fan started working and the temperatures stay in range under load. Something was probably lodged in there (some loose parts fell out by themselves, not sure where they were, but it had to be the fan because it suddenly started working). Case closed for now I think.

0 Kudos
McCalpinJohn
Honored Contributor III
1,195 Views

Glad you found the problem.  Recent Intel processors do a very good job of protecting themselves against overheating, but a dead fan in a laptop is likely to result in the death of some piece of hardware fairly soon.

0 Kudos