Beginner

[Beginner's Question] My simple program runs 3-4 times slower in the multithreaded version - Please Help

I am trying to port my current terrain engine to take advantage of HT and dual-core technology via multithreading. I am new to these topics and have run into a beginner's problem that you could probably help me with in a very short time; I definitely need your suggestions. First, I want to confirm that using multithreading appropriately really does give better performance on such platforms, so I wrote a very simple test that does nothing except generate a random color for each pixel. The result is unacceptable to me: the single-threaded code is 3-4 times faster on both HT and dual-core machines. Here is the code snippet.

-----------------------------------------------------------------

Main Thread

-----------------------------------------------------------------

g_iFullWay = iWidth * iHeight;
g_iHalfWay = g_iFullWay / 2;

DWORD dwThreadID;
HANDLE hThread = CreateThread(0, 0, DataThread, 0, 0, &dwThreadID);

// Main thread fills the first half of the buffer.
for(int i = 0; i < g_iHalfWay; i++)
{
    for(int j = 0; j < 2000; j++) // This inner loop only slows things down (for a fairer comparison)
        g_pDataBuffer[i] = rand();
}

// g_hCompleted is an event created earlier with CreateEvent().
WaitForSingleObject(g_hCompleted, INFINITE);
CloseHandle(hThread);

-----------------------------------------------------------------------

Data Decomposition - Thread

-----------------------------------------------------------------------

// Worker thread fills the second half of the buffer.
for(int i = g_iHalfWay; i < g_iFullWay; i++)
{
    for(int j = 0; j < 2000; j++)
        g_pDataBuffer[i] = rand();
}

SetEvent(g_hCompleted);

-----------------------------------------------------------------------

Your comments and suggestions would help me understand multithreaded programming better. I would like to take advantage of it, but I have no experience in this area. Your words will be an inspiration to me. Thanks in advance.

4 Replies

Black Belt

I suppose you're using the default serial pseudo-random generator, which likely contains a critical section. If so, you've reproduced the standard demonstration of how slow it is to call into a critical section from multiple threads.
Beginner

Thanks. I have tried using i*j*i*j instead of rand(). The multithreaded program runs a bit faster than the single-threaded version, but what I'd like to see is the performance doubling.

However, the rand() version shows 100% CPU utilization. If a critical section were being called from multiple threads, shouldn't CPU utilization stay below the maximum? Am I missing something?

Someone told me there is a race condition on g_iDataBuffer; I am thinking about it. In any case, what I want is to double the performance with data decomposition. How can I achieve this? Thanks in advance.

Black Belt

Perhaps I jumped to the conclusion that you have set up g_iDataBuffer[] as a shared array. If so, it may limit performance in several ways. If the threads are working on the same element, of course there is a race condition. If they work on different elements, those elements must be far enough apart in memory to avoid false sharing. Even then, normal hardware-prefetch settings might trigger memory stalls.
If you do have a race, Intel Thread Checker should be able to diagnose it.
Now that OpenMP is supported by a wide range of compilers, it would seem a more satisfactory way to begin on such problems. Not that it avoids the problems you have posed, but it provides a clearer and more portable way to express them.

Avoiding data-cache collisions in a multithreaded program can often be achieved relatively easily. In the example with two threads processing a common array, and if the processing order is non-temporal (the elements do not have to be processed sequentially), then one of the easiest ways to reduce the effect of data-cache collisions is to run one loop ascending and the other descending.

// 1st thread
for(int i = 0; i < g_iHalfWay; i++)
{
...
}

------------

// 2nd thread
for(int i = g_iFullWay - 1; i >= g_iHalfWay; i--)
{
...
}

Notes.

Most data-cache systems have the potential for collisions at a given address modulus, e.g. 64KB, where data at address 0x12340000 and data at address 0x43210000 would occupy the same cache location (i.e., collide). A data-cache line is generally wider than most data types (e.g. 16 bytes wide). If your data array is sufficiently large, it is likely that multiple addresses fall at the same modulus. In that case, processing both loops forward gives one loop a chance to catch up to the cache modulus of the other loop; when this happens, both loops suffer in performance. By running one loop forward and the other loop backward, collisions are infrequent. For type float (4 bytes) with a 16-byte cache line and a 64KB address modulus, you get approximately 4 collisions per 16,000 iterations.

Jim Dempsey
