Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How to run code on a specific CPU

mguptamel
Beginner

Hi all,

I have a multi-core processor (Core 2 Duo) which has 4 logical processors. I would like to switch the processor my code runs on. For example:

;Assuming code is running on CPU-0

mov eax, ecx

; I want to switch to CPU-1 (don't know how to go about it)

mov ebx, eax

; Switch back to CPU-0 here

mov ecx, eax

Note: The above is just sample code. My intention is to learn how to switch between CPUs and how to set the affinity of code to a specific CPU.

Help appreciated

Regards

Gupta

1 Solution
jimdempseyatthecove
Honored Contributor III
In your sample code you have one execution stream (software thread),
and you want it to migrate from one CPU (hardware thread) to a different one.

On most operating systems this is done by making a request to the O/S.

In Windows, the function is SetThreadAffinityMask.
In Linux, the function is pthread_setaffinity_np (the "_np" suffix indicates that it is non-portable and may or may not be supported on your operating system).

The affinity mask is a bit mask in which each bit position corresponds to a logical processor number (hardware thread number) on your system. There are additional function calls with somewhat different behaviors (search the documentation for "affinity").

The two function calls above restrict the specified software thread to run only on the logical processors whose bits are set in the mask, provided the process (program) has permission to run on those processors (implicitly, the list of logical processors available to the system, or a subset thereof).
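As a rough illustration of the mask arithmetic described above, here is a minimal sketch in C. The helper names mask_for_cpu and mask_is_subset are hypothetical, and the 64-bit mask width is an assumption (on Windows the mask is pointer-sized).

```c
#include <stdint.h>

/* Hypothetical helper: affinity mask with only logical processor `cpu` set.
 * Returns 0 if `cpu` does not fit in the assumed 64-bit mask. */
uint64_t mask_for_cpu(unsigned cpu)
{
    return (cpu < 64) ? (UINT64_C(1) << cpu) : 0;
}

/* Hypothetical helper: nonzero if `wanted` is non-empty and every processor
 * in it is also present in `allowed` (for example, the process mask). */
int mask_is_subset(uint64_t wanted, uint64_t allowed)
{
    return wanted != 0 && (wanted & ~allowed) == 0;
}
```

On a system with four logical processors the full mask would be 0xF, so mask_for_cpu(1) (0x2) is a subset of it, while mask_for_cpu(4) (0x10) is not.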

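mov eax, ecx
; Migrate this thread to the logical processor(s) selected by bitMask
push eax                    ; save eax (the call overwrites it with its return value)
push ecx                    ; save ecx (not preserved across stdcall calls)
; push any other registers that need saving
mov eax, [bitMask]
push eax                    ; 2nd argument: dwThreadAffinityMask (stdcall pushes right to left)
mov eax, [hThread]
push eax                    ; 1st argument: hThread
call SetThreadAffinityMask  ; callee removes its arguments; returns previous mask, 0 on failure
; pop any other registers that were saved
pop ecx
pop eax                     ; restores eax, discarding the return value
mov ebx, eax
; Migrate again, this time to the logical processor(s) selected by otherbitMask
push eax
push ecx
; push any other registers that need saving
mov eax, [otherbitMask]
push eax                    ; 2nd argument: dwThreadAffinityMask
mov eax, [hThread]
push eax                    ; 1st argument: hThread
call SetThreadAffinityMask
; pop any other registers that were saved
pop ecx
pop eax
mov ecx, eax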

A call to SetThreadAffinityMask can cost thousands of clock ticks, so migrating a thread has to be worth that cost.
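For the Linux side, the same idea can be sketched in C. This is only a sketch under stated assumptions: it uses sched_setaffinity (the glibc wrapper for the Linux system call, where pid 0 means the calling thread) rather than pthread_setaffinity_np, and the helper names run_on_cpu and first_allowed_cpu are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc: exposes cpu_set_t, CPU_SET and sched_setaffinity */
#endif
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to one logical CPU.
 * Returns 0 on success or an errno value (EINVAL if the CPU number is out
 * of range or outside the set the process is allowed to use). */
int run_on_cpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return EINVAL;

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    /* pid 0 selects the calling thread. */
    return sched_setaffinity(0, sizeof(mask), &mask) == 0 ? 0 : errno;
}

/* Hypothetical helper: lowest-numbered logical CPU the calling thread is
 * currently allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}
```

A caller would write run_on_cpu(1) before the code that must run on CPU-1 and run_on_cpu(0) afterwards, checking the return value each time, since the request fails if that CPU is not available to the process.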

Jim Dempsey

9 Replies
carlomaria
Beginner
Hi,
one way could be to use MPI, in which each process is assigned a rank. An example could be:
[bash]#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI::Init(argc, argv);
    size = MPI::COMM_WORLD.Get_size();  // total number of MPI processes
    rank = MPI::COMM_WORLD.Get_rank();  // this process's rank, 0 .. size-1

    if (rank == 0) {
        // Code that only rank 0 should run goes here.
    }

    MPI::Finalize();

    return 0;
}[/bash]
You can choose which rank runs the code by comparing rank against a counter.
Regards,
Carlo Maria
mguptamel
Beginner
Thanks, Jim, for clarifying what I was after. However, I have another question that you or somebody else may be able to help with.

In the code example above, you pushed some registers onto the stack on CPU-0. Assume I have 4 logical CPUs and wish to switch to CPU-1. Will CPU-0 and CPU-1 have the same stack segment? If they are different, how does the OS handle the CPU switch for a thread? Does it copy the thread's stack frame from CPU-0 to CPU-1?

If SS is the same for CPU-1 and CPU-0, couldn't the stack contents get corrupted, since CPU-1 and CPU-0 could overwrite each other's shared stack? In my view, this must be handled by providing a different stack for each logical CPU. Please correct me if I am wrong.
raffaellotamagnini
Hi, is there a way to select the CPU on which a program will run from within that program?
I'm looking for a way to run some tests with the CPUID instruction on every logical CPU in the system, but the CPU that runs my program is selected "randomly" by the scheduler. Please help.
jimdempseyatthecove
Honored Contributor III
A software thread consists of a context held in registers (eax, edx, ..., eip, esp, xmm0, ...) and a context managed by the O/S (possibly hThread->context). The thread may be the only software thread, or one of many, bound to a process (program/application). On an SMP system such as yours, the process context has one virtual memory mapping, which all threads of that process share.

In an SMP system, a software thread can run on any logical processor, but may be restricted to a selected subset of logical processors (SetThreadAffinityMask). In other words, the O/S is free to migrate the thread from logical processor to logical processor as it sees fit, subject to the affinity mask restrictions. In the sample code I posted earlier, the application changed its own affinity mask restrictions (to move from one logical processor to another).
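To visit every logical processor in turn (for example, to run a per-CPU probe such as CPUID on each), the affinity mask can be narrowed to one processor at a time. The following is a minimal Linux sketch in C, assuming sched_getaffinity/sched_setaffinity are available; for_each_allowed_cpu and record_cpu are hypothetical names, and the visitor here only records the CPU number instead of executing CPUID.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc: exposes cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Callback invoked while the calling thread is pinned to `cpu`. */
typedef void (*cpu_visitor)(int cpu, void *ctx);

/* Hypothetical helper: pin the calling thread to each allowed logical CPU in
 * turn, call `visit` there, then restore the original affinity mask.
 * Returns the number of CPUs visited, or -1 if the mask cannot be read. */
int for_each_allowed_cpu(cpu_visitor visit, void *ctx)
{
    cpu_set_t original;
    if (sched_getaffinity(0, sizeof(original), &original) != 0)
        return -1;

    int visited = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &original))
            continue;

        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof(one), &one) != 0)
            continue;            /* could not move there; skip this CPU */

        visit(cpu, ctx);         /* a per-CPU probe would go here */
        ++visited;
    }

    /* Give the scheduler back its original freedom. */
    sched_setaffinity(0, sizeof(original), &original);
    return visited;
}

/* Example visitor: remember which CPUs were visited in a cpu_set_t. */
void record_cpu(int cpu, void *ctx)
{
    CPU_SET(cpu, (cpu_set_t *)ctx);
}
```

Only the CPUs already present in the thread's mask are visited, which matches the restriction described above: a thread cannot move itself to a processor the process is not permitted to use.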

When required, SetThreadAffinityMask will eventually issue an interrupt and push the thread state (registers). It then either issues an inter-processor interrupt to the target processor or waits for the scheduler to perform the context switch lazily, and the first processor switches to another available thread. The second processor, upon receiving the inter-processor interrupt, saves the context of whatever thread it was running, enters the scheduler, and then resumes your thread.

Whether this occurs immediately or is deferred depends on the O/S scheduler.

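When your thread (eventually) resumes on the alternate logical processor, it resumes with a copy of the same stack pointer, and pops the saved context off the stack. The stack itself lives in the process's virtual memory, which every logical processor sees through the same mapping, so no stack contents are copied.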
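A user-mode sketch in C can make this concrete: locals written while the thread runs on one logical processor are read back unchanged after it has been moved to another and back. This assumes Linux with sched_setaffinity; pin_to and switch_and_sum are hypothetical names, and on a system with a single allowed CPU both "processors" are the same one.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc: exposes cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one logical CPU. */
static int pin_to(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* Hypothetical helper: write locals on one CPU, modify one of them on
 * another CPU, then read them back on the first. Returns a + b + c, or -1
 * on error. */
int switch_and_sum(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    /* Lowest and highest allowed CPUs; identical on a one-CPU system. */
    int first = -1, last = -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            if (first < 0)
                first = cpu;
            last = cpu;
        }
    }
    if (first < 0)
        return -1;

    int result = -1;
    /* volatile keeps the locals in stack memory rather than registers. */
    volatile int a, b, c;

    if (pin_to(first) == 0) {
        a = 100;
        b = 200;
        c = 300;                 /* written while running on `first` */
        if (pin_to(last) == 0) {
            c = 100;             /* overwritten while running on `last` */
            if (pin_to(first) == 0)
                result = a + b + c;   /* read back on `first` */
        }
    }

    /* Restore the original affinity mask. */
    sched_setaffinity(0, sizeof(allowed), &allowed);
    return result;
}
```

The function returns 100 + 200 + 100 = 400 whichever processors are involved, because the thread carries its stack pointer with it and the memory it points to is the same everywhere.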

While there are circumstances in which you will want to migrate a single thread from logical processor to logical processor, it is often more advantageous to use multiple threads within the same application. You may need to add code to coordinate the threads in order to avoid conflicting memory accesses.

I suggest you begin by looking at multi-threading using OpenMP or Cilk++, then, as needs arise, look at Threading Building Blocks (TBB) or QuickThread. Start with simple concepts, then work up to more complicated capabilities. There are several basic concepts you will need to become aware of, and starting with the simple capabilities will help you get accustomed to them.

Jim Dempsey
mguptamel
Beginner

Let's assume the following code is running in kernel mode (ring 0):

void MySwitchFunction()
{
    int a, b, c;

    a = 100;
    b = 200;
    c = 300;

    // At this point the code is running on CPU-0's stack and a/b/c are placed on the stack

    SwitchCPU(1); // This call saves context (I don't know what other context besides thread registers such as eax, ebx, etc.)

    // c must be on CPU-1's stack (how does the OS handle this?)
    c = 100; // for example, mov 4[ebp], 100

    // Switch back to CPU-0
    SwitchCPU(0);

    OutputContents(a, b, c); // What should c be at this point?
}

When a thread is switched from CPU-0 to CPU-1, does the OS save/restore the whole stack frame for that thread?

Any source code example of context switching would help.

jimdempseyatthecove
Honored Contributor III
Ring-0 context switching and CPU switching are generally beyond the scope of this forum.
I suggest you consult:

IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide

for information relating to this subject.

Jim Dempsey
mguptamel
Beginner
Thanks Jim.

I have, however, figured out how the OS (Linux) handles context switching between different threads. Each task (thread) is allocated a 4 KB/8 KB kernel (ring-0) stack at creation time. Switching between threads is simply a matter of changing the TSS (task state segment) ESP0 field to the top of the incoming thread's kernel stack. There is more to task switching, but the above outlines the main concept behind Linux context switching.

While going through the Intel documentation, I found that the Pentium 4 and later provide a task-switch capability implemented in hardware. Because the hardware task switch doesn't save all CPU registers, OSes (Linux, Windows) rely on software logic for task switching.

I sort of got confused as to which way to proceed, software or hardware. One downside of going the hardware way (please correct me if I am wrong) is the limit of 8192 entries in the GDT, which means at most 8192 TSS segment descriptors can be defined (a TSS descriptor must reside in the GDT as per the Intel documentation). If I have more tasks than that, the OS will surely have to keep changing the GDT to support them. I don't know why the TSS descriptor has to be in the GDT. The Intel chip designers must surely have had a very strong reason.

Well, I thank Jim and everybody in the group for clarifying some of my doubts.

Regards
Gupta
jimdempseyatthecove
Honored Contributor III
Gupta,

"Task" is a relative term.

To the O/S, "task" could mean "process", and that process may contain multiple software threads (process threads, each running a "task" within the process).

Alternatively, the O/S need not use the hardware "task" system (or need not use it frequently) to manage processes and/or software tasks.

Therefore the required number of TSS descriptors in the GDT could potentially be reduced to a working set (approximately) equal to the number of hardware threads supported by the CPU (typically 2, 4, 6, 8, 12, 16). The O/S also has the possibility of using a separate GDT per CPU (processor chip).
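For scale, the arithmetic behind the size of a GDT can be sketched in C. This assumes 32-bit protected mode with 8-byte descriptors; the names decode_selector and descriptors_in_table are hypothetical. A segment selector is 16 bits wide with a 13-bit index, and the table limit is a 16-bit value holding the table size in bytes minus one.

```c
#include <stdint.h>

/* Fields of a 16-bit x86 segment selector. */
struct selector_fields {
    unsigned index;  /* bits 15..3: descriptor index within the table */
    unsigned ti;     /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    unsigned rpl;    /* bits 1..0: requested privilege level */
};

/* Hypothetical helper: split a selector into its three fields. */
struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    return f;
}

/* Hypothetical helper: number of 8-byte descriptors that fit in a table
 * whose limit field is `limit` (limit = size in bytes - 1). */
unsigned descriptors_in_table(uint16_t limit)
{
    return ((unsigned)limit + 1u) / 8u;
}
```

With the largest possible limit, 0xFFFF, descriptors_in_table gives 65536 / 8 = 8192 slots (slot 0 being the null descriptor), which is far more than the handful of TSS descriptors needed when there is one per hardware thread.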

Jim Dempsey

