Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Number of threads = number of processors?

ekeom
Novice
4,382 Views
Dear All,

I have a basic question. Using the Fortran OpenMP tools on a Core Duo processor, how many threads can I define? On the other hand, is the maximum number of threads equal to the number of available processors?

Best regards,

Didace
0 Kudos
10 Replies
Tudor
New Contributor I
4,382 Views
Quoting - ekeom
Dear All,

I have a basic question. Using the Fortran OpenMP tools on a Core Duo processor, how many threads can I define? On the other hand, is the maximum number of threads equal to the number of available processors?

Best regards,

Didace

There are two types of processors that we must work with when doing parallel programming: physical processors and logical processors. The number of logical processors (processors that the operating system and applications can work with) is usually greater than or equal to the number of physical processors (actual processors on the motherboard). For example, a hyperthreaded processor with 4 physical processors will have 8 logical processors. That means the operating system can schedule up to 8 threads at the same time, even though on the motherboard you only have 4 processors.
Therefore, the maximum number of threads you can create is equal to the number of logical processors your operating system sees. Core Duo and Core 2 Duo are not hyperthreaded, hence the maximum number of threads you can create is 2, since you have 2 physical processors and 2 logical processors.
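As a quick check, here is a minimal Fortran sketch (assuming any OpenMP-enabled compiler, e.g. ifort with -qopenmp) that reports how many logical processors the runtime sees:

    program count_procs
        use omp_lib          ! OpenMP runtime routines
        implicit none
        ! Logical processors visible to the OpenMP runtime (2 on a Core Duo)
        print *, 'Logical processors: ', omp_get_num_procs()
    end program count_procs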
0 Kudos
ekeom
Novice
4,382 Views
Quoting - Tudor

There are two types of processors that we can encounter while doing parallel programming: physical processors and logical processors. The number of logical processors (processors that the operating system and applications can work with) is usually greater than or equal to the number of physical processors (actual processors on the motherboard). For example, a hyperthreaded processor with 4 physical processors will have 8 logical processors. That means the operating system can schedule up to 8 threads at the same time, even though on the motherboard you only have 4 processors.
Therefore, the maximum number of threads you can create is equal to the number of logical processors your operating system sees. Core Duo and Core 2 Duo are not hyperthreaded, hence the maximum number of threads you can create is 2, since you have 2 physical processors and 2 logical processors.
Thank you, Tudor, for your answer.

Best regards,

Didace
0 Kudos
TimP
Honored Contributor III
4,382 Views
There's no definite limit here. The usual number of OpenMP threads would be the number of cores or, possibly, with HyperThreading support (not present on Core Duo), the number of logical processors. OpenMP will default to a number of threads equal to the number of detected cores, but you can override it by setting OMP_NUM_THREADS, unless you wrote limits into your program.
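For illustration, a small sketch of both behaviors (the default team size and an explicit override; omp_set_num_threads here plays the same role as the OMP_NUM_THREADS environment variable):

    program team_size
        use omp_lib
        implicit none
        !$omp parallel
        !$omp master
        print *, 'Default team size: ', omp_get_num_threads()   ! usually = cores
        !$omp end master
        !$omp end parallel

        call omp_set_num_threads(4)     ! override, e.g. to oversubscribe 2 cores
        !$omp parallel
        !$omp master
        print *, 'Overridden team size: ', omp_get_num_threads()
        !$omp end master
        !$omp end parallel
    end program team_size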
0 Kudos
Tudor
New Contributor I
4,382 Views
Quoting - tim18
There's no definite limit here. The usual number of OpenMP threads would be the number of cores or, possibly, with HyperThreading support (not present on Core Duo), the number of logical processors. OpenMP will default to a number of threads equal to the number of detected cores, but you can override it by setting OMP_NUM_THREADS, unless you wrote limits into your program.

You are right, a mistake on my part: I meant that the maximum number of threads that can run at the same time is 2. There is no actual limit to how many you can create manually.
0 Kudos
jimdempseyatthecove
Honored Contributor III
4,382 Views
You can run more threads than you have logical processors. However, unless those threads are performing I/O, using more threads than you have logical processors will generally be less effective than using as many threads as you have logical processors.

The time required to save the state of one thread and then restore the state of a different thread is significant. The trend now is to set the number of threads equal to the number of logical processors, then use a programming paradigm that creates many more tasks (as opposed to threads). Switching from task to task (usually tasks run to completion) requires very little context, usually a couple of registers, as opposed to all registers (integer, FPU, SSE, page table, etc.).
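For instance, a hedged Fortran sketch of that paradigm (requires OpenMP 3.0 task support): a fixed team of threads, one per logical processor by default, executes many lightweight run-to-completion tasks:

    program task_demo
        use omp_lib
        implicit none
        integer, parameter :: n = 1000
        integer :: i
        real :: results(n)
        !$omp parallel               ! fixed thread pool
        !$omp single                 ! one thread generates the tasks...
        do i = 1, n
            !$omp task firstprivate(i) shared(results)
            results(i) = sqrt(real(i))   ! ...all threads execute them to completion
            !$omp end task
        end do
        !$omp end single             ! implicit barrier: remaining tasks drain here
        !$omp end parallel
        print *, 'last result: ', results(n)
    end program task_demo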

When threads (or tasks) perform I/O, it is advisable to have a few more threads. But each application will have its own set of requirements.

Jim Dempsey

0 Kudos
DweeberlyLoom
Beginner
4,382 Views
Quoting - jimdempseyatthecove
You can run more threads than you have logical processors. However, unless those threads are performing I/O, using more threads than you have logical processors will generally be less effective than using as many threads as you have logical processors. [...]

It is interesting how we all moved to threads because the context switch time to move from process to process was too great. Now the context switch for threads is becoming too great. Didn't the context switch for threads start out much lighter than it is now? I wonder if this is a possible area for hardware optimization, where there would be some "uber-fast" way to dump/restore a core's context (perhaps to a small specialized cache); of course there would still be a processor cache penalty for swapping the process flow.

To come back to the thread: it's very difficult to know what the optimal number of threads should be, and it greatly depends on what you are doing. For example, I've noticed that on multi-core machines my Windows GUI almost never hangs due to something sucking up all the CPU cycles. As more applications are designed to be multi-core, that will change. If I had a 4+ core machine, I might be glad to give up one core just to keep my GUI happy. If I'm running a server, I might not care too much about my GUI.

The unfortunate answer to "how many threads do I need?" is: you need as many as you do, and preferably no more. I'm consistently amazed at how dog-gone fast these processors are. If I want to code a parallel for loop, I've found I really have to do a lot of work within the loop, and a lot of iterations, before I can overcome the cost of the threading. The hardware, the software, and our understanding will almost certainly get better as time goes by, but for the moment it's still often difficult to estimate the costs and benefits of a multi-threaded approach in most apps.
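For example, a minimal sketch of that trade-off: the reduction below only pays off when the per-iteration work and the iteration count are large enough to amortize the fork/join and scheduling costs (the break-even point is machine-dependent):

    program pardo
        use omp_lib
        implicit none
        integer, parameter :: n = 10000000   ! with few iterations, serial wins
        integer :: i
        real(8) :: s
        s = 0.0d0
        !$omp parallel do reduction(+:s)
        do i = 1, n
            s = s + sin(real(i, 8))          ! enough work per iteration to matter
        end do
        !$omp end parallel do
        print *, s
    end program pardo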
0 Kudos
jimdempseyatthecove
Honored Contributor III
4,382 Views

>>It is interesting how we all moved to threads because the context switch time to move from process to process was too great.

Inter-process communication overhead was too great (without the use of memory-mapped files or other partially shared memory constructs). And it was cumbersome, as the app needed to know what was shared and what was not. The cumbersomeness was removed by extending the extra thread's address space to encompass the complete address space of the main thread - ta da - multi-threading within a process.

>>Didn't the context switch for threads start out much lighter than it is now?

When you only had 8 registers, it was.
When you had 8 registers and the FPU stack, it still was.
When MMX and then SSE registers were added, it became significant again.

>>I wonder if this is a possible area for hardware optimization, where there would be some "uber-fast" way to dump/restore a cores context

On other processor architectures there is. All the registers, except one, are in RAM (cached when possible). The one exception is a register base register, which holds the base address of the memory location of a struct of registers. The instruction set, when accessing register 3 (with some mnemonic), references an offset into that struct of registers. A context switch is then essentially just changing the base address of the register location.

This technique is fast for context switching; however, general register referencing tends to be slower than with dedicated register storage (i.e., cache resolution slows down register access).

>>If I want to code a parallel for loop, I've found I really have to do a lot of work withing the loop and a lot of iterations before I can overcome the cost of the threading.

If you live long enough, this will change.

Today the cores (hardware threads with HT) are in a general pool, subject to scheduling to any process by the O/S.
Tomorrow, when virtually all apps are multi-threaded, and when many more hardware threads are crammed onto a die, it will be advantageous to provide an environment where threads are grouped up in bunches. The bunch is scheduled to the process (or multiple bunches per process for more demanding processes).

You could do this today to a limited extent. On systems with HT, it might be advantageous to schedule the HT siblings as a pair to a process. The current design of HT has siblings sharing L1. With proper coding, it can be advantageous to have two threads within the same process sharing an L1 cache. Under almost all circumstances it is counterproductive to have two processes sharing the same L1 cache. Therefore, there could (or would) be an incentive to provide this capability within the O/S.
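You can approximate that pairing from user space today; for example, with Intel's OpenMP runtime (a hedged illustration, since exact placement depends on the runtime version), setting

    KMP_AFFINITY=compact

packs consecutive OpenMP threads onto free hardware contexts as close together as possible, so the HT siblings of one core are filled first and two cooperating threads of the same process end up sharing an L1.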

Later, should this be adopted, more of the resources of the HT siblings could be shared, on a case-by-case basis, when shown to be beneficial. Having HT threads dedicated to a main thread would then permit an architectural enhancement: an extremely low-cost thread start/stop (fork/join). This approach isn't new. Systems used to have one x87 FPU coprocessor, then one internal implementation; the same happened with MMX and SSE, and then, with multi-core, multiple floating-point execution units. HT regressed a bit, with a single floating-point unit per core. This may flip back to multiple floating-point units per HT thread, but where HT now means the processing bunch referenced above.

Jim Dempsey
0 Kudos
DweeberlyLoom
Beginner
4,381 Views

Quoting - jimdempseyatthecove
[...] Tomorrow, when virtually all apps are multi-threaded, and when many more hardware threads are crammed onto a die, it will be advantageous to provide an environment where threads are grouped up in bunches. The bunch is scheduled to the process (or multiple bunches per process for more demanding processes). [...]

Well, I suppose I'm taking the original thread "down a rathole," but I find this interesting. I can imagine, which is easy for me because I don't have to build it :-D, a multi-core die with n fixed locations, each m bits wide, that could step in as the register set. Each core would only need a p-bit "routing" register (not saved) which would point it to one of the n locations. Perhaps zeroing the routing register would turn the core off to save power. Assuming the n x m bits were located in a dedicated (fixed) high-speed memory cache, an actual context switch could be a 1-cycle operation without a significant impact on the time required to address core registers.

Of course this still leaves a lot (most?) of the overhead in refilling the pipeline and in the housekeeping of the code managing the threads. Your comment about "bunching" hardware threads would certainly reduce that: if the controlling code could depend on "owning" a hardware thread, it could schedule larger time windows and minimize pipeline restarts. Sharing L1 cache is a bit "hard-core" (not sure if the pun is intended :-D) for me, but I can see it being advantageous.

As you suggest, these bunches of cores could be bundled (packaged) such that they contain x cores sharing cache space. However, sharing L1 cache seems like it might be a bit too complex for general-purpose programming (or at least for compilers/OSes to be generally effective with).

Caches seem kind of weird when it comes to multi-processing. Take, for example, a simple Sieve of Eratosthenes algorithm. If I had a chip with 16 cores and 32 MB of L3 cache, I could write multi-threaded code to implement this in many ways, but let's just look at two. If I could ensure that all the cores would start at roughly the same time, I might block candidates into 32 MB chunks, then give all 16 cores a different starting value (2, 3, 5, 7, 11, ...) and have each start marking values off. Or maybe it would be better to have 8 cores marking off in ascending order and 8 cores marking off in descending order. I imagine the second approach is better, especially since it might be hard to ensure that all the cores start at the same time. In the second approach, by the time the "middle" is reached, the cache is full.
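As a concrete (and purely hypothetical) variant of the blocked idea, here is a Fortran sketch using a third decomposition: for each prime, it parallelizes the marking over cache-sized blocks, so no two cores ever touch the same block and the start-synchronization problem disappears (for brevity it re-enters a parallel region per prime; a real version would hoist that):

    program sieve_blocks
        use omp_lib
        implicit none
        integer, parameter :: n = 1000000, blk = 32768  ! block sized to fit cache
        logical, allocatable :: composite(:)
        integer :: p, lo, hi, first, k
        allocate (composite(2:n))
        composite = .false.
        do p = 2, int(sqrt(real(n)))
            if (composite(p)) cycle
            !$omp parallel do private(hi, first, k)
            do lo = p*p, n, blk                      ! each thread owns its blocks
                hi = min(lo + blk - 1, n)
                first = lo + mod(p - mod(lo, p), p)  ! first multiple of p in block
                do k = first, hi, p
                    composite(k) = .true.
                end do
            end do
            !$omp end parallel do
        end do
        print *, 'primes up to', n, ':', count(.not. composite)
    end program sieve_blocks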

No matter what, I'm sure I'll be quite amazed by the future.
0 Kudos
jose-jesus-ambriz-me
4,381 Views
There was a post above about the overhead of switching context information between threads. I think that is an important reason not to use more threads than cores.

Best regards
0 Kudos
jimdempseyatthecove
Honored Contributor III
4,381 Views

Quoting - DweeberlyLoom
a multi-core die with n fixed locations, each m bits wide, that could step in as the register set. Each core would only need a p-bit "routing" register (not saved) which would point it to one of the n locations. Perhaps zeroing the routing register would turn the core off to save power. Assuming the n x m bits were located in a dedicated (fixed) high-speed memory cache, an actual context switch could be a 1-cycle operation without a significant impact on the time required to address core registers

Predictably, any such design may become moot. When a Larrabee type of architecture becomes available, it would not be unreasonable for a 2nd- or 3rd-gen Larrabee to place 128 or more processors into the system. At that point, a software design change in the O/S, potentially augmented with a simple processor change, could facilitate writing processes that are always guaranteed to have n HW threads (n being fixed). This is similar to what SIMD offers to vectorization now. Adding n-1 sets of general-purpose registers to a context switch might not be all that bad. The architecture then becomes MIMD w/SIMD. Note, the new n HW threads can be an extension of the HT concept.

Jim Dempsey
0 Kudos
Reply