Cost of synchronization on X5355

zoolii · ‎03-23-2010

Hi ,

I am writing a C++ application which process text in multiple threads. On my laptop which is Core2 duo
T8300 machine , the application is giving 30% to 50% increase in TPS over single threaded version(when executed in 2 threads).But when I testedsame applicationon a 8-core machine (2 x X5355 quad core) machine , it is not at all scaling . Actually TPS is going down, if the umber of threads is more than I have few questions here

1. X5355 is more than three years old, does the technology has something to do with this behavior? . Because core2 duo is newer thanQuad core X5355and the application is scaling there. Will it make any difference ifI test the applicationon Intel's latest quad core machines ?

2. Does OS has some impact on this ?. The Xeon X5355 machine is a Windows 2003 server and the Laptop is Windows XP professional one.

3. The text processing functionality completes within seconds. Is it inherently suitable for parallel processing ?. If not, why it is giving scalabilityon core to duo machine.

4. The application synchronizes between threadsduring the start up and towards the end using wait and notify operations. Isthis more expensive on quad core machine than dual core ?.

Thank you
Zoolii

Dmitry_Vyukov · ‎03-23-2010

1. X5355 is more than three years old, does the technology has something to do with this behavior? . Because core2 duo is newer thanQuad core X5355and the application is scaling there. Will it make any difference ifI test the applicationon Intel's latest quad core machines ?

Well, probably, yes. The cost of synchronization on older and bigger processor is probably higher.
However, the main factor is likely "the size" of the processor. So if you test on a newest 4 processor x 6 cores machine, you likely get poor results (if the problem is related to synchronization of course).

There are possible scenarios.
First, you have a problem that has a limited parallelization potential (does not scale beyond 2 threads).
Second, you have some problem in implementation (poor implementation that does not scale beyond 2 threads).

Dmitry_Vyukov · ‎03-23-2010

2. Does OS has some impact on this ?. The Xeon X5355 machine is a Windows 2003 server and the Laptop is Windows XP professional one.

Theoretically, Yes. But most likely No. Anyway XP and Server 2003 are roughly the same generation. The are major changes wrt multicore since only Windows 7 (some changes in Vista).
Good program scales linearly on any OS.
Poor program may somehow benefit from improvements in an OS, but I would not rely on that.

Dmitry_Vyukov · ‎03-23-2010

3. The text processing functionality completes within seconds. Is it inherently suitable for parallel processing ?. If not, why it is giving scalabilityon core to duo machine.

Seconds is enough for parallelization. It's problem/hardware/technology dependent, but in general seconds can be parallelized to hundreds of cores.
Thread startup/shutdown takes roughly ~1ms. So if run-time is at least ~10ms, you can speedup it with multithreading. And note that you can startup/shutdown threads in a tree manner (worker threads startup other worker threads) rather than linearly (main thread startups all the workers). This gives you logN starup time (N - number of threads).

Dmitry_Vyukov · ‎03-23-2010

4. The application synchronizes between threads during the start up and towards the end using wait and notify operations. Is this more expensive on quad core machine than dual core ?.

Most likely, Yes. However I would recommend to implement a program, so that cost of synchronization does not affect running time (while it's in reasonable limits of course).

The description you gave (synchronization only during startup and shutdown, and during main part threads work independently from each other) sounds pretty good. If running time is about seconds, the program must scale linearly on 2/4/8/16 cores.

I think you have some hidden problem in your code (implicit synchronization). Quite likely it's false sharing. Are your data-structures padded to prevent false-sharing?
Read this:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/

zoolii · ‎03-24-2010

Hi Dmitriy,

Thank youfor your answers and link. I will have re look of my code. I want to clarify few more points.

The application use a thread pool. So there is not cost of starting and stopping of threads only wait and notify cost.

I got 800 TPS (for 2 threads)on dual core so time taken is roughly 1.25ms. Is it suitable for more than 2 threads ?.

If the program has an inherent problem why it is scaling on dual core ?

Thank you
Zoolii

Dmitry_Vyukov · ‎03-24-2010

It is not scaling on dual core. If it scale you get 100% speedup, but you get only 30%-50%. And scalability problems tend to show more on machines with higher number of cores.

1.25ms per workitem is far enough for parallelization. ~5 microseconds is the lower bound (highly dependent on technology used).

If the runtime starts threads for you, it does not mean that it becomes costless. It actually does not matter who exactly starts threads.

Ok, I see, you are using Win32 Threadpool API. AFAIK, before Win7 Threadpool used single centralized workqueue, which may become the bottleneck. Also Threadpool is a kernel API and it uses kernel non-paged memory, so the cost of enqueue/dequeue operation is quite high, thus workitems must be larger to be worth parallelization. Consider switching to user-mode lightweight scheduling library like TBB.

zoolii · ‎03-24-2010

Thank you forthe reply.

I am not using win32 thread pool api. I am using boost threads . It offers condition variables. The pools is my own one. One thread is adding work into the queue and send notification. another thread wait on that the condition variable get the notification and fetch the work to process. Towards the end, the main thread fetch the local result of each thread context. There also the main thread wait for the completion notification from other threads

I analyzed it with VTune and it says , the scalability issueis because of the frequent synchronization due towait and notify towards theend. But without that sort of synchronization the whole parallel aspects goes in vain. Do yousuggest any alternative to notify and wait ?. Ithink, to get something useful , some synchronization is needed at the end.

Any suggestions ?

Thank you Zoolii

zoolii · ‎03-24-2010

Thank you forthe reply.

I am not using win32 thread pool api. I am using boost threads . It offers condition variables. The pools is my own one. One thread is adding work into the queue and send notification. another thread wait on that the condition variable get the notification and fetch the work to process. Towards the end, the main thread fetch the local result of each thread context. There also the main thread wait for the completion notification from other threads

I analyzed it with VTune and it says , the scalability issueis because of the frequent synchronization due towait and notify towards theend. But without that sort of synchronization the whole parallel aspects goes in vain. Do yousuggest any alternative to notify and wait ?. Ithink, to get something useful , some synchronization is needed at the end.

Any suggestions ?

Thank you Zoolii

Dmitry_Vyukov · ‎03-24-2010

Difficult to say without seeing your implementation. Post a digest of your implementation.

Grant_H_Intel · ‎03-24-2010

Another possibility is that your application is memory bound. If it does relatively few operations on each item, then it doesn't matter how many threads you use. You will ultimately be limited by the bandwidth of the machine. If you max out at 30-50% speedup on two threads, then going to 8 threads may mean you are just adding synchronization, but the memory system is still not able to process the reads/writes any faster. That could easily result in significant slowdown.Streaming applications that do a similar number of operations to fetches nearly never scale well unless you havehuge amountsof memory bandwidth. Switching to a newer architecture like Core i7 or the newer Xeons may help quite a bit with memory bandwidth, but even then, you probably will not see linear scaling up to more than say 4-8 cores if your application is memory bound.

Hope this helps,

jimdempseyatthecove · ‎03-24-2010

Zoolii,

Can you code your text processing such that it can be pipelined parallelized as opposed to a fork and join type of parallelization?

A dual Xeon 5570 (Dell R710, Windows Server x64, RAID10) performing a relatively trivial text editing (upcase words) has near linear scaling through 16 threads (using QuickThread). There is some zig-zag effect every ohter thread due to this being a HyperThreaded system. Other than that the chart is near linear

1 thread ~200MB/sec
2 threads ~225MB/sec(+HT)
3 threads ~425MB/sec
4 threads ~450MB/sec (+HT)
5 threads ~625MB/sec
6 threads ~650MB/sec (+HT)
7 threads ~850MB/sec
8 threads ~875MB/sec (+HT)
9 threads ~1050MB/sec
10 threads ~1125MB/sec (+HT)
11 threads ~1285MB/sec
12 threads ~1310MB/sec (+HT)
13 threads ~1450MB/sec
14 threads ~1510MB/sec (+HT)
15 threads ~1600MB/sec
16 threads ~1625MB/sec (+HT)

Jim Dempsey

zoolii · ‎04-08-2010

Hi Grant,

Thank you for reply. I partially solved the issue. One thread was doing too much work and other threads were waitingfor. Now I am getting almost 90% scaleup in dual core machine. But the problem in 8 core machine still continues. For 2 threads it is giving 90% scaleup. But after that it is not at all scaling up. I am using a big array to store some intermediate result. if I do not touch the array ( for reading/writing ). It is scaling (though not linearly). This array is per thread basis. But the intermediate results of each thread is again processed by a dedicated thread to get the actual output while other threads create new array and produce intermediate result. It is here the problem happens. Any suggestions?. Any idea what is the cache line size for X5355 ?.I tried with aligning 32 and 64. Still no luck.The bios settings -adjacent cache line prefetcher/ Hardware prefetcher hassome role ?

Thank you
Zooli

TimP · ‎04-09-2010

Cache lines are the same for your dual core and dual quad core machines (64B, with adjacent line prefetch effectively doubling it for reads). On the dual core, each core can access lines brought into L2 by the other, but your 8 core machine has 4 separate L2 cache, making it more susceptible to false sharing. The total memory bandwidth of your 8 core machine with 1066 FSB may be little more than your 2 core machine with 1366 FSB, achieved only when the work is balanced between sockets.