Performance loss with Hyperthreading

tobix · ‎08-14-2003

Hi,
we are developing an imaging application that uses one thread per image to correlate the image against a reference image. Each image is about 1 MB of data.
The correlation takes 100 seconds without Hyper-Threading and 120 seconds with Hyper-Threading enabled!
The application is written with MFC, runs under Windows XP, and uses native Win32 threads.
When I assign all threads to a single CPU (using SetThreadAffinityMask), the correlation again takes 100 seconds.
When I assign the threads alternately to both CPUs, it goes back up to 120 seconds.

BTW: When running on one CPU, Task Manager shows 100% usage for the first CPU and 0% for the second.
When running on both CPUs, Task Manager shows 100% on both CPUs for 120 seconds!

And before someone asks: the correlation code is not stuffed with synchronisation objects and there are no while (do-nothing) loops.

Any idea where I can start tracking down this problem?

Thanks for all hints
Tobias

Henry_G_Intel · ‎08-14-2003

Hi Tobias,
This could be a cache conflict between thread stacks. Chapter 5 of "Developing Multithreaded Applications: A Platform Consistent Approach" describes Hyper-Threading cache conflicts and how to fix them. Here's the abstract from section 5.3, "Offset Thread Stacks to Avoid Cache Conflicts on Intel Processors with Hyper-Threading Technology":

Hyper-Threading enabled processors share the first level data cache on a cache line basis among the logical processors. Frequent accesses to the virtual addresses on cache lines modulo 64 KB apart can cause alias conflicts that negatively impact performance. Since thread stacks are generally created on modulo 64 KB boundaries, accesses to the stack often conflict. By adjusting the start of the stack, the conflicts can be reduced and result in significant performance gains. Note that the 64 KB alias conflict is processor implementation dependent. Future processors may adjust the modulo boundary or eliminate this conflict altogether.
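
To make the aliasing condition in the abstract concrete, here is a minimal sketch in portable C. It only does address arithmetic and does not touch any real cache. The 64 KB modulo comes from the abstract; the 64-byte cache line, the helper names, and the example stack addresses are assumptions for illustration and are processor dependent.

```c
#include <assert.h>
#include <stdint.h>

/* From the abstract: addresses modulo 64 KB apart can conflict. */
#define ALIAS_MODULO (64u * 1024u)
/* Assumption: 64-byte cache line; this is implementation dependent. */
#define CACHE_LINE   64u

/* Returns 1 if both addresses map to the same cache-line position
   within a 64 KB window, which is the conflict the abstract describes. */
int addresses_alias(uintptr_t a, uintptr_t b)
{
    return (a % ALIAS_MODULO) / CACHE_LINE == (b % ALIAS_MODULO) / CACHE_LINE;
}

/* Hypothetical helper: where a thread's stack top ends up after it is
   shifted down by thread_num * offset bytes (stacks grow downward). */
uintptr_t shifted_stack_top(uintptr_t stack_top, unsigned thread_num, unsigned offset)
{
    assert(offset % CACHE_LINE == 0); /* shift by whole cache lines */
    return stack_top - (uintptr_t)thread_num * offset;
}
```

With two made-up stack tops on 64 KB boundaries, 0x00130000 and 0x00230000, addresses_alias reports a conflict. After shifting them by 1 * 1024 and 2 * 1024 bytes respectively, it no longer does.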

I've encountered a couple of threaded codes (one using Pthreads and one using Windows threads) that slowed down when Hyper-Threading was enabled. After offsetting the thread stacks both codes showed a speedup from Hyper-Threading. Please let us know if it solves your problem.

Best regards,
Henry


tobix · ‎08-15-2003

Hi Henry,
thanks for your reply.
Indeed this helped. I went from 100 secs without HT down to 78 secs with HT and a stack offset of 1024. Increasing the offset to 4096 gained me another 2 seconds.

Thanks again!
Tobias

Henry_G_Intel · ‎08-15-2003

Hi Tobias,
I'm glad it helped. The alloca trick can often turn a Hyper-Threading performance loss into a performance gain. A 1.28x speedup from Hyper-Threading (100 secs / 78 secs) is pretty good.

Best regards,
Henry

Intel_C_Intel · ‎09-26-2003

Hi, I have a similar problem with HT technology, but I have a question about the advice that comes from "Developing Multithreaded Applications: ...".

My thread parameters are encapsulated in a class.
In fact, I pass the "this" pointer as the argument parameter of the _beginthreadex function:

thread_wk=_beginthreadex(NULL, 0, &TranscodeurWatermarking, this, 0, &adresse_thread);

So I added a FunctionBlk struct to the object and used the trick:

s_ThreadCallData.iThreadNum = 2;
s_ThreadCallData.pThreadFuncPtr = transcodeurWatermarking;
s_ThreadCallData.pThreadParameters = (void*)(this);
thread_wk=_beginthreadex(NULL, 0, &HT_SpecificThread, &(this->s_ThreadCallData), 0, &adresse_thread);

But the padding described in the struct ParameterBlk is not present in my class, and it seems to me that I cannot apply this padding, because the data members of an object do not have a fixed location.

Is that right, or must I add "char padding ..." at the end of the class declaration (interface)?

Thanks, all advice is welcome.

Henry_G_Intel · ‎10-02-2003

Hi,
Sorry this took so long but I had to consult with a C++ expert. Here is his response:

If your issue is with memory alias conflicts on the stack, padding your class data structure will not work. The padding has to be applied to each thread stack in the first function. All threads must be provided with a starting function (or function address) where the thread will begin execution. This is where you can adjust the stack offset by using the _alloca call and using the parameter block data structure to provide a unique thread number. You must then call the "real" thread function from the starting function. Doing the "real" work in the starting function will not benefit from the stack offset. It's best to use a function pointer to ensure that the "real" function does not get inlined during the compilation/linking optimization phase.

Based on your data structure ....

s_ThreadCallData.iThreadNum = 2;
s_ThreadCallData.pThreadFuncPtr = transcodeurWatermarking;
s_ThreadCallData.pThreadParameters = (void*)(this);
thread_wk=_beginthreadex(NULL, 0, &HT_SpecificThread, &(this->s_ThreadCallData), 0, &adresse_thread);

And using your example ....

#include <windows.h>  // Sleep
#include <process.h>  // _beginthreadex
#include <malloc.h>   // _alloca
#include <stdio.h>    // printf
 
struct ThreadData {
	unsigned int iThreadNum;
	unsigned (*pThreadFuncPtr)(void*);
	void* pThreadParameters;
}; 
 
// default 1K but you may change to optimize performance for your case
#define THREAD_OFFSET 1024
 
unsigned HelloWorld (void* callData)
{
	printf ("%s\n", (char*) callData);
	return 0;
}
 
unsigned __stdcall HT_SpecificThread (void* callData)
{
	struct ThreadData* data;
	data = (struct ThreadData*) callData;
 
	// alloca some space on the thread stack
	_alloca ( data->iThreadNum * THREAD_OFFSET);   
 
	// This is where the real function that does the work is called from
	return (*data->pThreadFuncPtr)(data->pThreadParameters);
}
 
int main(int argc, char* argv[])
{
	struct ThreadData myThreads[2];
	unsigned adresse_thread;
 
	myThreads[0].iThreadNum = 1;
	myThreads[0].pThreadFuncPtr = (unsigned (*)(void *)) &HelloWorld;
	myThreads[0].pThreadParameters = (void*) "HelloWorld 0";
	myThreads[1].iThreadNum = 2;
	myThreads[1].pThreadFuncPtr = (unsigned (*)(void *)) &HelloWorld;
	myThreads[1].pThreadParameters = (void*) "HelloWorld 1";
 
	_beginthreadex(NULL, 0, &HT_SpecificThread, &(myThreads[0]), 0, &adresse_thread);
	_beginthreadex(NULL, 0, &HT_SpecificThread, &(myThreads[1]), 0, &adresse_thread);
 
	Sleep (100);  // Hack just to let threads finish since we have no synchronization
 
	printf("Hello World! Finished\n");
	return 0;
}

Note that you have to increment iThreadNum for each thread.
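
Since one of the codes mentioned earlier in this thread used Pthreads, here is a rough POSIX counterpart of the example above. It is only a sketch under assumptions: alloca from <alloca.h> stands in for _alloca, the names offset_start, record_stack_address, run_offset_threads, and MAX_THREADS are hypothetical, and the program may need to be built with -pthread. It joins the threads instead of sleeping.

```c
#include <alloca.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define THREAD_OFFSET 1024 /* same default as above; tune for your case */
#define MAX_THREADS   8    /* hypothetical limit for this sketch */

struct thread_data {
    unsigned thread_num;    /* unique per thread, starting at 1 */
    void *(*func)(void *);  /* the "real" thread function */
    void *params;
};

/* Starting function: shift the stack, then call the real function
   through a pointer so that it runs on top of the offset. */
void *offset_start(void *arg)
{
    struct thread_data *data = arg;
    volatile char *pad = alloca((size_t)data->thread_num * THREAD_OFFSET);
    void *result = data->func(data->params);
    pad[0] = 0; /* touch the pad after the call so it stays allocated */
    return result;
}

/* Hypothetical demo work: store the address of a local variable. */
void *record_stack_address(void *out)
{
    int local = 0;
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Starts up to MAX_THREADS threads and returns how many were joined. */
int run_offset_threads(unsigned n, uintptr_t addrs[])
{
    pthread_t tids[MAX_THREADS];
    struct thread_data td[MAX_THREADS];
    unsigned started = 0;
    int joined = 0;

    if (n > MAX_THREADS)
        n = MAX_THREADS;
    for (unsigned i = 0; i < n; i++) {
        td[i].thread_num = i + 1; /* incremented for each thread */
        td[i].func = record_stack_address;
        td[i].params = &addrs[i];
        if (pthread_create(&tids[i], NULL, offset_start, &td[i]) != 0)
            break;
        started++;
    }
    for (unsigned i = 0; i < started; i++)
        if (pthread_join(tids[i], NULL) == 0)
            joined++;
    return joined;
}
```

Calling run_offset_threads(2, addrs) with a two-element array fills in one stack address per thread, which can then be compared modulo 64 KB to see the effect of the offset.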

Aaron_C_Intel · ‎10-13-2003

I should add one more thing that I've found. With this approach, offsetting by odd multiples of 4096 has given me the best performance. The 4096 comes from the page size used by the OS. Why odd multiples work better I have never been able to prove; they just seem to (this really only applies once you scale beyond one physical processor with HT, say to two or more physical processors with HT).
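
For illustration, one way to compute such offsets is sketched below in C. This reads "odd multiples" as each thread's total offset being an odd multiple of the page size, which is my interpretation and not something I can confirm. The name odd_page_offset and the 256 KB cap are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES  4096u            /* OS page size stated above */
#define MAX_OFFSET_BYTES (256u * 1024u)   /* hypothetical cap so the pad
                                             stays well below the stack size */

/* Returns the stack offset for a 0-based thread index:
   1 * 4096, 3 * 4096, 5 * 4096, ... */
size_t odd_page_offset(unsigned thread_index)
{
    size_t offset = (size_t)(2u * thread_index + 1u) * PAGE_SIZE_BYTES;
    assert(offset <= MAX_OFFSET_BYTES); /* refuse offsets that eat too much stack */
    return offset;
}
```

The result would be passed to _alloca in the starting function in place of iThreadNum * THREAD_OFFSET. Note that these offsets repeat modulo 64 KB after eight threads (17 * 4096 is 4096 modulo 64 KB), so the scheme only separates a limited number of stacks.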

Whether you pass parameters through a class or struct or simply use the this pointer, the class itself should be padded and aligned to the L2 cache line size. This will prevent any possible false sharing. Note that even if you use the this pointer approach and then redirect to a method for your thread function, you will still want to make sure the class is padded and aligned.
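
A minimal sketch of that padding and alignment in C11 follows. The 64-byte line size is an assumption (check the value for your processor), and struct thread_params and share_cache_line are hypothetical names. Aligning the first member forces the whole struct to that alignment, and the compiler then pads its size up to a multiple of it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64 /* assumption: adjust to your processor */

/* Per-thread parameter block, aligned and padded to one cache line. */
struct thread_params {
    alignas(CACHE_LINE_BYTES) unsigned thread_num;
    void *parameters;
    unsigned long result;
};

/* Compile-time checks: no two array elements can share a line. */
static_assert(alignof(struct thread_params) == CACHE_LINE_BYTES,
              "block must start on a cache line boundary");
static_assert(sizeof(struct thread_params) % CACHE_LINE_BYTES == 0,
              "block must occupy whole cache lines");

/* Returns 1 if two addresses fall on the same cache line. */
int share_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE_BYTES == (uintptr_t)b / CACHE_LINE_BYTES;
}
```

With an array of two such blocks, share_cache_line returns 0 for the two elements, so threads writing their own block do not invalidate each other's line.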

my two cents