Re: 4 threads are slower than 2 on quadcore

ilia_rassadzin · ‎12-14-2007

Hello all!

I have a Windows application that can spawn a configurable number of threads, each of which puts a heavy load on the CPU (vector and matrix operations). I tested the app on 2 boxes: one has a quad-core Xeon and the other has 2 dual-core Xeons.
On both boxes:
There is some speedup (about 30-40%) for 2 threads over 1 thread, and CPU utilization is pretty high (above 90%) on the CPUs where the app is executing. When spawning 4 threads, the application runs about 5% slower than with 2 threads, and CPU utilization jumps from the high 60s to 99%.
Why does that happen?

I tried the trial version of Thread Profiler, but it crashes while displaying results after the application has finished, with a message like "unknown error or not enough memory". I have 4 GB of RAM, so it is more likely an unknown error.

Thanks in advance.

TimP · ‎12-14-2007

If the active cores for 2 threads aren't getting higher usage, it looks like there is already a problem there. Are these tests under Intel OpenMP or -Qparallel with KMP_AFFINITY set?
If you are running 32-bit Windows, I agree that you would not benefit from more than 4GB RAM, but you might already be short of memory, depending on your application.

ilia_rassadzin · ‎12-14-2007

My point is that while running 2 threads, the active cores get the "right" usage in the high nineties (about 97%), but when running 4 threads the cores are only 70-80% used. How can that be explained?

It is kind of a legacy application; an older compiler is used. It does not support OpenMP.

Threads are started with CreateThread().

My application ran for 5 minutes with a small number of counters sampled every second. I watched memory usage, and there was enough memory.

jimdempseyatthecove · ‎12-14-2007

The 90% utilization for each of 2 threads does not mean that each thread is performing useful work for the duration of its 90% of each core. This is especially true in light of the fact that the two thread test only produced ~35% improvement as opposed to a theoretical maximum of 100% improvement. Or to put it another way, about 65% worth of one of two cores (spread across both cores) is wasting time. You did not mention what percentage of utilization you observe when running with single thread.
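The arithmetic behind that estimate can be sketched as follows. This is only an illustration, assuming the reported 30-40% means the 2-thread run finishes about 1.35 times faster than the 1-thread run; the function names are made up for this example.

```cpp
#include <cassert>

// Speedup = single-thread run time divided by multi-thread run time.
double speedup(double t_single, double t_multi) {
    assert(t_single > 0.0 && t_multi > 0.0);
    return t_single / t_multi;
}

// Fraction of the ideal (linear) speedup that was actually achieved.
double parallel_efficiency(double s, int threads) {
    assert(threads > 0);
    return s / threads;
}

// Cores' worth of capacity that did not turn into useful work.
double wasted_cores(double s, int threads) {
    return threads - s;
}
```

With a speedup of 1.35 on 2 threads, parallel_efficiency gives 0.675 and wasted_cores gives 0.65, which is the "about 65% worth of one of two cores" figure.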

A profiler may show you where the time is wasted. There is overhead in starting and stopping threads, so if the parallel work loops are short-lived you could decrease performance. Another area that can cause wasted time is synchronization (barrier, critical section, atomic, InterlockedComparexxx, etc...). Again, a profiler may tell you where the bottleneck is located.

If you are having problems with VTune, try using CodeAnalyst (from AMD.com). You can perform timer based profiling on Intel Processors. This should be good enough to locate the hot spots.

Jim Dempsey

jimdempseyatthecove · ‎12-14-2007

>>
It is kind of a legacy application; an older compiler is used. It does not support OpenMP.

Threads are started with CreateThread().
<<

I would venture to guess that when you run the profiler you will find a great deal of time is spent in inter thread communication sections and in particular within a SpinLock call. SpinLocks are typically used to reduce latency at the expense of increasing overhead.

If you have the source code and the old compilers then with a little bit of work you might be able to significantly improve the performance of your application.

A poorly written SpinLock can eat up an inordinate amount of time performing useless work that interferes with other threads. Microsoft's MSDN as well as some of the Intel .PDF files can instruct you on how to write better synchronization code.

My suspicion is your system is tied up performing the Bart Simpson "Are we there yet, are we there yet, are we there yet, ..."
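To make the "are we there yet" point concrete, here is a minimal sketch of a less wasteful spin lock. It assumes a C++11 compiler, which the original poster's older compiler may not be; the class and helper names are hypothetical and this is not the application's actual locking code. The waiting loop only reads the flag and yields, instead of hammering it with atomic writes. With an older Win32 toolchain the same pattern could be built from InterlockedExchange and SwitchToThread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set spin lock (illustration only).
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Try to take the lock; exchange returns the previous value.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // While it is held, wait with plain loads so the cache line is
            // not bounced between cores by repeated atomic writes.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();  // let other threads run
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Demo helper: every thread increments a shared counter under the lock.
long count_with_lock(int threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();  // all workers finish before we return
    return counter;
}
```

Even a well-behaved lock like this serializes the threads, so the less often the hot loop takes it, the better.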

If you do not have the skills to make the changes you can always contract it out. There are forum members here, like myself, that perform contract work.

Jim Dempsey

ilia_rassadzin · ‎12-14-2007

Thank you, Jim.

I observed about 100% CPU usage with a single thread, never less than 98%.
This application normally runs about 3-10 hours depending on the complexity of input data, so threads are not short lived.
I have an aggregation part which uses synchronization, but for this test I commented it out, i.e. the application does nothing useful, just some floating-point math. Synchronization takes about 3% of total time.
The slowdown on 4 cores (compared with 2 cores) occurs even when there is no synchronization at all.

ilia_rassadzin · ‎12-14-2007

Not the case: it slows down even when synchronization is off and no output is produced. I just measure execution time.
I do have enough skills and ability to read documentation.

jimdempseyatthecove · ‎12-14-2007

If explicit synchronization is not causing the interference then the slowdown is most likely due to adverse cache interaction.

As a simple test on your quad-core system, set the application up to run with 2 threads. While it is running, open Task Manager, go to Processes, right-click your application and pick Set Affinity. I think on your system processors 0 and 2 will be on separate dies and will therefore have separate caches, so restrict the application to those two and compare the run time.

What is not known is whether the O/S will randomly context switch the application threads between the two processors. If that happens, the cache on each processor holds data belonging to the thread that was running on the other processor and must be reloaded.

A second cause of adverse cache interaction is a poorly written application (with regard to cache access).

Suppose, for example, you have an array of 3D vectors where the Xs are stored in A(1), A(4), A(7), ..., the Ys in A(2), A(5), A(8), ..., and the Zs in A(3), A(6), A(9), ...

It would be bad form for one thread to work on all the Xs, another on all the Ys and a 3rd on all the Zs. Instead, the route to choose would be for each thread to work on the Xs, Ys and Zs within its own zone of the array. This way a write to an X will not evict a Y and Z on the other threads. If you want one thread to work on all the Xs, another on all the Ys and a 3rd on all the Zs, then use separate arrays: one for X, one for Y and one for Z.
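The zone approach can be sketched like this. It is only an illustration, assuming a C++11 compiler and 0-based indexing (the A(1), A(2), ... numbering above is 1-based); the function name and the scaling operation are made up and stand in for whatever math the application really does. Each thread gets one contiguous block of whole vectors, so two threads can touch the same cache line only at a zone boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Scale every component of an interleaved x,y,z array in place.
// Each thread owns one contiguous zone of whole 3D vectors.
void scale_vectors(std::vector<double>& a, double factor, unsigned threads) {
    assert(threads > 0 && a.size() % 3 == 0);
    const std::size_t count = a.size() / 3;                     // number of 3D vectors
    const std::size_t chunk = (count + threads - 1) / threads;  // vectors per zone
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t first = std::min(count, t * chunk);
        const std::size_t last  = std::min(count, first + chunk);
        pool.emplace_back([&a, factor, first, last] {
            for (std::size_t v = first; v < last; ++v) {
                a[3 * v + 0] *= factor;  // X
                a[3 * v + 1] *= factor;  // Y
                a[3 * v + 2] *= factor;  // Z
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

The bad form described above would instead have thread 0 write a[0], a[3], a[6], ..., thread 1 write a[1], a[4], a[7], ..., and so on, so every cache line would be written by all of the threads.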

There are other similar bad programming practices that may have worked well way back when a cache line was 32 bytes wide. But now that lines are 64 bytes or more, those practices no longer work well.

Jim Dempsey

levicki · ‎12-14-2007

Is your application perhaps memory bound? You should use VTune to check for high bus utilization and cache misses/evictions in your code.

Another thing to try is to manually lock each thread to one of the cores so that Windows cannot juggle them around.
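A minimal sketch of locking a thread to one core, assuming the Win32 SetThreadAffinityMask call is available to the compiler in use. The helper names are hypothetical, and the non-Windows branch is just a stub so the snippet compiles elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Affinity mask with only the bit for logical processor `cpu` set.
std::uint64_t single_cpu_mask(unsigned cpu) {
    assert(cpu < 64);
    return std::uint64_t{1} << cpu;
}

// Pin the calling thread to one logical processor.
// Returns true on success. On non-Windows builds this stub does nothing
// and returns false.
bool pin_current_thread(unsigned cpu) {
#ifdef _WIN32
    // On 32-bit Windows DWORD_PTR is 32 bits wide, so cpu must be below 32;
    // otherwise the mask becomes 0 and the call fails.
    DWORD_PTR mask = static_cast<DWORD_PTR>(single_cpu_mask(cpu));
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)cpu;
    return false;
#endif
}
```

Each worker would call pin_current_thread with its own processor index at the top of its thread function. Which indices share a die or a cache is system-specific, so it is worth timing a few different assignments.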