hyperthreading performance

emihaly · 09-02-2003

hi,

I have some questions about Hyper-Threading (icc + Linux):

1. Why OpenMP? I create threads directly on Linux in C++ with pthread_create. Does icc optimize such code the same way it does with OpenMP directives, or not?

2. How can I tune my app for Hyper-Threading with respect to the thread stack problem? I read a hint about changing the offset, but what must I really do? Change the size via ulimit -s?

3. I have a lot of mutex_lock/mutex_unlock calls around operations on int variables such as ++ or x = x + 10. Does icc have specific atomic operations for these, to avoid the overhead of the mutex lock/unlock?

4. Is optimizing for Hyper-Threading on by default in icc, or must I enable it via some switch?

TimP · 09-02-2003

1. OpenMP certainly isn't directly competitive with hand threading. Where it is appropriate, it offers a higher level, more portable way of threading. Also, it lends itself to debugging tools which check for consistency with the single thread case. It allows the application to adapt automatically to the number of threads supported by each system.

2. OpenMP would be expected to avoid thread stack cache conflicts. By the same token, you might ask whether any version of libpthread begins to take care of this by default. It's not a question of stack size, it's a question of making sure that the active parts of the stacks don't evict each other from cache, by choosing favorable offsets for the stack bases. You should find plenty of advice on this for the Windows case.

3. Other than #pragma omp atomic, I believe not.

4. From the earlier questions, it seemed that you were aware that you would need hand threading or OpenMP to take advantage of hyper-threading within a single application. There is also an auto-parallelization switch, which might get you some of the advantage of OpenMP, without requiring the directives or pthread calls.
Several of the performance libraries enable threading within their own functions, consistent with OpenMP, with the environment variables controlling the number of threads permitted.

Henry_G_Intel · 09-02-2003

Hello Eduard,

> 1. Why OpenMP? I create threads directly on Linux in C++ with pthread_create. Does icc optimize such code the same way it does with OpenMP directives, or not?

In general, OpenMP is best for expressing data parallelism while thread libraries like Pthreads are best for expressing functional decomposition. If your problem scales with the amount of data, it's probably data parallel. If your problem scales with the number of independent tasks, a functional decomposition is probably best. This question is answered in greater detail in Developing Multithreaded Applications: A Platform Consistent Approach. See section 3.1, "Choosing An Appropriate Threading Method: OpenMP Versus Explicit Threading."
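As a minimal sketch of the data-parallel case described above, the loop below sums an array with an OpenMP reduction. The function name sum_array is hypothetical and does not come from the cited article. If the code is built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical helper (not from the cited article): sums n doubles.
   Each thread accumulates a private partial sum; OpenMP combines the
   partial sums when the loop ends. */
double sum_array(const double *a, size_t n)
{
    double total = 0.0;
    long i;   /* signed loop index, accepted by all OpenMP versions */

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++)
        total += a[i];

    return total;
}
```

The work here scales with the amount of data, so the runtime can split the iterations across however many threads the system supports. No thread creation or join code appears in the source.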

> 2. How can I tune my app for Hyper-Threading with respect to the thread stack problem? I read a hint about changing the offset, but what must I really do? Change the size via ulimit -s?

Offsetting thread stacks has nothing to do with the ulimit command. See section 5.3 "Offset Thread Stacks to Avoid Cache Conflicts on Intel Processors with Hyper-Threading Technology" of Developing Multithreaded Applications: A Platform Consistent Approach for a description of Hyper-Threading cache conflicts and how to fix them by offsetting thread stacks. This article contains code examples showing how to offset thread stacks.

The following code snippet illustrates how to offset thread stacks:

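#include <alloca.h>
#include <pthread.h>

#define OFFSET    (3 * 1024)   /* per-thread stack offset in bytes */
#define N_THREADS 2            /* illustrative thread count */

void* ThreadFunc (void* arg)
{
   /* Shift this thread's stack by a multiple of OFFSET so the active
      stack regions of different threads do not evict each other. */
   volatile char* pad = (volatile char *) alloca ((*(int *)arg) * OFFSET);
   pad[0] = 0;   /* touch the block so the allocation is not optimized away */

   /* Do some work */

   return NULL;
}

int main ()
{
   pthread_t tid[N_THREADS];
   int       myID[N_THREADS];   /* one ID per thread: no race on a shared variable */
   int       j;

   for (j = 0; j < N_THREADS; j++)
   {
      myID[j] = j + 1;
      pthread_create (&tid[j], NULL, ThreadFunc, (void *)&myID[j]);
   }

   for (j = 0; j < N_THREADS; j++)
      pthread_join (tid[j], NULL);   /* wait so myID[] stays valid */

   return 0;
}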

> 3. I have a lot of mutex_lock/mutex_unlock calls around operations on int variables such as ++ or x = x + 10. Does icc have specific atomic operations for these, to avoid the overhead of the mutex lock/unlock?

The Intel compiler for Itanium has intrinsics for fast atomic operations. They're defined in the ia64intrin.h header. See "Lock and Atomic Operation Related Intrinsics" in the Intel C++ Compiler User's Guide for descriptions of these intrinsics.

I couldn't find equivalent functions for IA-32. On IA-32 you could try replacing your Pthreads mutex lock/unlock operations with OpenMP locks or atomic pragmas. That might improve performance but I cannot guarantee it. Use the -openmp switch to enable OpenMP in the Intel compilers.
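A minimal sketch of the atomic pragma approach, assuming a shared int counter like the one in the question. The names counter and add_many are hypothetical. Whether this beats a Pthreads mutex in a given application would have to be measured.

```c
/* Shared counter updated by many threads (hypothetical example data). */
static int counter = 0;

/* Hypothetical helper: adds delta to the shared counter `iterations`
   times and returns the final value. The atomic pragma protects only
   the single update statement that follows it, so no mutex is needed. */
int add_many(int iterations, int delta)
{
    int i;

    counter = 0;

    #pragma omp parallel for
    for (i = 0; i < iterations; i++) {
        #pragma omp atomic
        counter += delta;
    }

    return counter;
}
```

With the Intel compilers this needs the -openmp switch mentioned above; GCC uses -fopenmp. Without the switch the pragmas are ignored and the loop runs serially, still giving the correct total.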

> 4. Is optimizing for Hyper-Threading on by default in icc, or must I enable it via some switch?

There's no compiler switch to optimize for Hyper-Threading. Hyper-Threading benefit is largely determined by your threaded code. As a general rule, a multithreaded application that shows good parallel speedup on an SMP system should benefit from Hyper-Threading Technology.

Best regards,
Henry


Intel_C_Intel · 09-02-2003

> I couldn't find equivalent functions for IA-32.

The IA-32 instruction-set has atomic ops:

xchg
xadd
cmpxchg

and the lock prefix.

Here is an implementation of some IA-32 atomic ops:

IA-32/x86 Atomic Ops
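The linked implementation is not reproduced here. As a hedged sketch of the same idea, the code below uses the C11 <stdatomic.h> interface, which postdates this thread and is an assumption about the toolchain. On x86, compilers typically lower these calls to the lock-prefixed xadd and cmpxchg instructions listed above. The names hits, counter_add, and counter_cas are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared counter; static storage means it starts at zero. */
static atomic_int hits;

/* Atomically adds delta and returns the previous value
   (typically a lock xadd on x86). */
int counter_add(int delta)
{
    return atomic_fetch_add(&hits, delta);
}

/* Atomically replaces `expected` with `desired`; returns true only if
   the counter held `expected` (typically a lock cmpxchg on x86). */
bool counter_cas(int expected, int desired)
{
    return atomic_compare_exchange_strong(&hits, &expected, desired);
}
```

An update such as x = x + 10 becomes counter_add(10), with no mutex_lock/mutex_unlock pair around it.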

Intel_C_Intel · 09-02-2003

> xchg
> xadd
> cmpxchg
>
> and the lock prefix.

Whoa!

Can't forget the all-important cmpxchg8b opcode!!!

=P

ClayB · 09-15-2003

This may be too late to benefit the original poster, but I thought I could add a few pennies' worth of advice.

Whether you are programming with explicit threads or OpenMP for Hyper-Threading, don't try to divide up your computations so that different threads perform different types of computations. That is, unless the logic of the algorithms parallelizes that way naturally.

I've talked with several engineers who had the mistaken impression that source code modifications could be used to directly affect execution performance on Hyper-Threaded platforms. Granted, there are better ways to program, and ways to give the compiler hints about what computations are being done, that might provide better performance. However, once the code has passed through the compiler's optimizations and transformations, and given out-of-order scheduling and the inability to deterministically control how threads execute relative to each other, I think it highly unlikely that source code changes will be able to consistently take full advantage of the extra computation resources that Hyper-Threading makes available.

I have yet to see an application that could assign all the floating-point computations to one thread and all the integer computations to another in such a way that no additional synchronization overhead is introduced to wipe out any threading or Hyper-Threading benefits. Codes just aren't built that way. (If you have a counterexample, though, I'd love to hear about it.)

So, the best general advice I can think of is to program as usual and rely on the compiler and the scheduler to get the best out of Hyper-Threading. Of course, don't do anything that would hamper execution, like using spin-waits, and keep the thread stacks from lining up in ways that cause cache conflicts.
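On the spin-wait point, a minimal sketch of the alternative: the waiting thread blocks on a Pthreads condition variable instead of looping on a flag, so it does not occupy execution resources shared with the other logical processor. All names here (producer, wait_for_value, ready, payload) are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cv = PTHREAD_COND_INITIALIZER;
static int ready   = 0;   /* set to 1 once payload is valid */
static int payload = 0;

/* Producer: publishes a value and wakes the waiting thread. */
static void *producer(void *arg)
{
    pthread_mutex_lock(&lock);
    payload = *(int *)arg;
    ready = 1;
    pthread_cond_signal(&ready_cv);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hypothetical helper: starts a producer thread, then sleeps (rather
   than spinning) until the value is published, and returns it. */
int wait_for_value(int value)
{
    pthread_t tid;
    int result;

    ready = 0;   /* reset before the producer thread exists */
    pthread_create(&tid, NULL, producer, &value);

    pthread_mutex_lock(&lock);
    while (!ready)                            /* re-check after every wakeup */
        pthread_cond_wait(&ready_cv, &lock);  /* releases lock while asleep */
    result = payload;
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return result;
}
```

The while loop looks like a spin, but the thread is descheduled inside pthread_cond_wait and only re-checks the flag after it has been woken.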

-- clay

Henry_G_Intel · 09-22-2003

> I have yet to see an application that could assign all
> the floating-point computations to one thread and all
> the integer computations to another in such a way that
> there is no additional synchronization overhead
> introduced to wipe out any threading or Hyper-Threading
> benefits.

I agree with Clay. Even if it were possible to decompose a real application into floating-point and integer calculations, load balance would be nearly impossible to achieve.