Multi-thread slow in HT?

kenchen3k · ‎11-07-2005

Hi,

I am making a game on a dual-core P4 2.8 and wrote code to run an AI thread and a render thread in parallel. However, I found that in multi-threaded mode the AI part runs much slower than in single-threaded mode.

For example, if the render thread code in one frame is like:

sleep(100);

then the AI part, running in a separate AI thread, finishes its work in 0.033 seconds.

But if I change the render thread code to

for (int i=0;i<100000;i++)

a++;

Then the AI part will finish work in 0.045 seconds.

It seems that HT and dual core on the P4 are not fully parallel, and that the workload on one core affects the other core even when it does very simple work.

Am I correct? How can I make use of HT technology on the P4? Thanks!

By the way, I assign the threads to different cores with SetThreadAffinityMask.
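The comparison described above can be reproduced with a small harness along these lines. It is only a sketch: `ai_work`, `Sample`, and `time_ai` are hypothetical names, the AI workload is a stand-in loop, and the sleeping variant uses a 1 ms sleep rather than the 100 ms in the post so that the harness shuts down quickly. The counter is declared `volatile` so that an optimizing compiler keeps the increments.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the AI workload: a fixed amount of arithmetic.
inline std::uint64_t ai_work(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) sum += i * i;
    return sum;
}

struct Sample {
    double seconds;        // wall-clock time of the AI work
    std::uint64_t result;  // checksum, so the work is observable
};

// Time ai_work on its own thread while a "render" thread sleeps or spins.
inline Sample time_ai(bool render_spins, std::uint64_t n) {
    assert(n > 0);
    std::atomic<bool> done{false};

    std::thread render([&] {
        volatile int a = 0;  // volatile: keep the increments under optimization
        while (!done.load(std::memory_order_acquire)) {
            if (render_spins) {
                for (int i = 0; i < 100000; ++i) a = a + 1;
            } else {
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
    });

    Sample s{};
    std::thread ai([&] {
        auto t0 = std::chrono::steady_clock::now();
        s.result = ai_work(n);
        auto t1 = std::chrono::steady_clock::now();
        s.seconds = std::chrono::duration<double>(t1 - t0).count();
    });

    ai.join();  // join makes the writes to s visible here
    done.store(true, std::memory_order_release);
    render.join();
    return s;
}
```

Calling `time_ai(false, n)` and `time_ai(true, n)` several times and comparing the `seconds` fields gives the two cases from the post. A single run is noisy, so averaging over many frames is advisable.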

TimP · ‎11-07-2005

The behavior you report is likely to be OS-dependent if your affinity setting is not taking effect. On an OS that is not multi-core aware (Windows, or Linux over 4 months old), performance may be worse with HT enabled than with HT disabled in the BIOS setup.
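One way to see whether an affinity request was accepted is to check the return value of SetThreadAffinityMask, which is the previous mask on success and 0 on failure. The sketch below assumes that check is sufficient for a first diagnosis. `affinity_mask_for` and `pin_current_thread` are hypothetical helper names, and on non-Windows platforms the sketch makes no pinning attempt at all. Note that the mask selects logical processors, and which logical processors share a physical core is not something the mask itself tells you.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#endif

// Bit mask selecting one logical processor (bit 0 = CPU 0, bit 1 = CPU 1, ...).
inline std::uint64_t affinity_mask_for(unsigned cpu) {
    assert(cpu < 64);
    return std::uint64_t{1} << cpu;
}

// Hypothetical helper: pin the calling thread to one logical processor.
// Returns true only if the OS accepted the request.
inline bool pin_current_thread(unsigned cpu) {
#ifdef _WIN32
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    DWORD_PTR previous = SetThreadAffinityMask(
        GetCurrentThread(), static_cast<DWORD_PTR>(affinity_mask_for(cpu)));
    return previous != 0;
#else
    (void)cpu;  // this sketch does not attempt pinning on other platforms
    return false;
#endif
}
```

Each thread would call `pin_current_thread` at the top of its entry function with a different index and log a failure instead of silently continuing.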

kenchen3k · ‎11-08-2005

I am using Windows XP SP2. Do you mean it doesn't support Intel multi-core?

In Task Manager, though, I can see that both cores are fully utilized.

Thanks!

TimP · ‎11-08-2005

XP SP2 may not know the difference between cores and hyperthread logical processors. The difference would be most evident if you run 2 threads on a Pentium D EE with HT enabled and both threads run on the same core a significant part of the time. I have seen a 15% boost in 2-thread performance from shutting off HT in the BIOS. I have no idea how typical that may be. You do have the option of setting thread affinity in your application, if you consider that practical, so you couldn't say that multi-core is not supported at all. When you run 4 threads on that CPU, you expect the work to be spread out fully; whether that gives you the best performance will depend on the application.
I haven't seen XP SP2 tested on a Paxville platform with HT enabled, so I won't comment on how well that would work.

chum · ‎12-05-2005

Hi, I had a similar problem with a second thread unexpectedly slowing down on an HT-enabled machine. The problem went away after I applied stack padding as suggested here:

http://cache-www.intel.com/cd/00/00/05/15/51533_chapter_5_memory_management02.pdf

Apparently the slowdown was specific to a shared-cache issue.
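For readers who cannot open the link, the following is a guess at what such stack padding looks like, not a copy of the linked chapter. It assumes the technique is a per-thread stack offset: each thread allocates a different amount of stack space on entry so that the threads' hot stack frames do not land on addresses that alias in a cache shared by the two logical processors. The 1 KB step, the 16 slots, and the names `stack_offset_for` and `thread_entry` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

#if defined(_WIN32)
#include <malloc.h>
#define STACK_ALLOC _alloca
#else
#include <alloca.h>
#define STACK_ALLOC alloca
#endif

// Hypothetical policy: stagger stacks in 1 KB steps, cycling through 16 slots.
inline std::size_t stack_offset_for(unsigned thread_index) {
    return static_cast<std::size_t>(thread_index % 16) * 1024;
}

// Thread entry wrapper: shift this thread's stack, then run the real work.
template <class Work>
void thread_entry(unsigned thread_index, Work work) {
    std::size_t offset = stack_offset_for(thread_index);
    assert(offset < 64 * 1024);  // stay well below a 64 KB multiple
    if (offset != 0) {
        // alloca memory lives until thread_entry returns, not just this block.
        volatile char* pad = static_cast<volatile char*>(STACK_ALLOC(offset));
        pad[0] = 0;  // touch the padding so the allocation is not optimized away
    }
    work(thread_index);  // frames created from here on start at a shifted address
}
```

A thread would then be started as `std::thread t([i] { thread_entry(i, real_work); });`, where `real_work` is whatever function the thread originally ran.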

jim_dempsey · ‎12-09-2005

I also had some similar experience with HT whereby my F90 application using OpenMP ran ~10% slower. However, after reworking the code for problems other than HT issues I re-ran a performance test of the OpenMP F90 application and now find a 23% improvement in the runtimes.

Since then I've migrated from a P4 530 single core with HT to a dual processor dual core/processor system (4 cores) no HT. Running much better now.

Jim Dempsey

SHIH_K_Intel · ‎04-04-2006

Here are some observations/hypotheses that may help in diagnosing your slowdown symptoms.

1. When your rendering thread does something like
for (int i=0;i<100000;i++)
a++;

It's doing a lot of read-modify-write.
An important rule in multi-threaded design is to watch for, manage, and minimize resources contended by two (or more) threads, including data/variables shared between threads. Even writes to a cached variable in one thread may interact with operations on that variable (or on the cache line containing it) in another thread, due to cache coherency requirements.

If the variable "a" in your render thread is also referenced in your AI thread, the slowdown you observed is more likely due to an inefficient design in the application's threading implementation. For example, reads of "a" in your AI thread will slow down because of writes to "a" in your render thread: cache coherency requires the copy of "a" in the cache that the AI thread accesses to be coherent with the modified copy written by the render thread. The more you write to "a" in your render thread, the greater the slowdown for the AI thread if it references "a".
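As a rough illustration of keeping the two threads off each other's cache lines, the sketch below gives each thread's counter its own aligned slot and has the render loop count in a local variable, publishing the total once per frame. The 128-byte alignment is an assumption, chosen because the Pentium 4 L2 cache works in 128-byte sectors; 64 bytes is the usual line size on many other CPUs. `PaddedCounter`, `FrameState`, and `render_frame` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed separation between data written by different threads.
constexpr std::size_t kSeparation = 128;

// A counter that occupies a full aligned slot of its own.
struct alignas(kSeparation) PaddedCounter {
    std::atomic<long> value{0};
};

struct FrameState {
    PaddedCounter render_count;  // written only by the render thread
    PaddedCounter ai_steps;      // written only by the AI thread
};

// Render-side loop: count privately, then publish once per frame.
inline void render_frame(FrameState& s, int iterations) {
    assert(iterations >= 0);
    long local = 0;
    for (int i = 0; i < iterations; ++i) ++local;  // no shared writes here
    s.render_count.value.fetch_add(local, std::memory_order_relaxed);
}
```

With this layout the AI thread can read `render_count` occasionally, but the render thread causes at most one coherency event per frame instead of one per increment.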

Some techniques for managing data sharing/placement can be found in the references below.

Here are a few papers and references that you may be interested in:

http://www.intel.com/cd/ids/developer/asmo-na/eng/columns/performance/53797.htm

http://cache-www.intel.com/cd/00/00/20/45/204562_204562.pdf

http://developer.intel.com/design/Pentium4/manuals/248966.htm