Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Performance of multi-core processors for single-threaded applications

mukishere
Beginner
1,470 Views
Hi,

I am running some performance tests on the Windows 2003 server which has Intel Xeon "quad-core" processors.
Please find my questions embedded in the observations.

I am running a single-threaded executable (please find the exe attached in the test exe folder) from a Cygwin window. It takes 14.5 s, and the Task Manager Performance tab shows 25% CPU usage.

Q1) Is it right to say that this thread runs on only 1 of the 4 cores, i.e. the other 3 cores do not share this task, when I run a single instance?

Next I opened two Cygwin windows and ran the same executable in both simultaneously (the time taken to switch between windows is negligible). This time each exe took around 17.5 s and the CPU usage showed 50%.

Q2) Is it correct that now we are utilising 2 cores out of 4 and that each cygwin instance is running on different cores?

Q3) Why does the exe take more time to run in this case (around 3 s more)? If the tasks are running on different cores, shouldn't each ideally take the same time as a single instance?

Then I opened three Cygwin windows and ran the same executable in all of them simultaneously (the time taken to switch between windows is negligible). This time each exe took around 21 s and the CPU usage showed 75%.

Finally I opened four Cygwin windows and ran the same executable in all of them simultaneously (the time taken to switch between windows is negligible). This time each exe took around 27.5 s and the CPU usage showed 100%.

Q4) Is it correct that now we are utilising all 4 cores and that each cygwin instance is running on different cores?

Q5) Why does the exe take more time to run in this case (around 13 s more than a single instance)? If the tasks are running on different cores, shouldn't each ideally take the same time as a single instance?
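The experiment described above (start N copies of the same program at once and time each copy) could be scripted along the following lines. This is only a sketch: the function names are made up for illustration, and the small Python loop used as the workload is a placeholder standing in for the real test exe, which is not reproduced here.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def _timed_run(command):
    """Run one copy of `command` and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def time_concurrent_runs(command, copies):
    """Launch `copies` instances of `command` together and time each one."""
    # One helper thread per child process, so every copy starts at about
    # the same moment and is timed independently of the others.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(_timed_run, [command] * copies))


if __name__ == "__main__":
    # Placeholder CPU-bound workload (an assumption, not the original exe).
    workload = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]
    for copies in (1, 2, 3, 4):
        times = time_concurrent_runs(workload, copies)
        print(copies, "copies:", ", ".join(f"{t:.2f} s" for t in times))
```

Replacing `workload` with the path to the real executable gives the same 1, 2, 3, 4 instance comparison without switching between windows by hand.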


Any explanations to clearly validate the observations seen will be highly appreciated. Probably this will give me a better picture of how multi-core processors are useful for single-threaded applications over single core processors.


Thanks..

1 Solution
jimdempseyatthecove
Honored Contributor III
1,470 Views

A1) Your application will tend to run on one core. But unless it is pinned to one core, activity from other applications (e.g. AV, Internet Update, ...) may cause the application to jump cores.

A2) Your applications will tend to run on two cores. But... (see A1)

A3) If your application is entirely cache bound, and performs no writes you might see (closer to) the same run time. However, when your application accesses memory that is not in cache it will compete with the memory access of all other cores (hardware threads). Also, if the application is performing writes to memory, and depending on the locations, it may cause a cache eviction for the other core(s) (hardware threads). As you increase the number of copies of the application you will increase the demands on the memory subsystem. There is also another consideration where the O/S is managing another instance of the application including managing more windows on the desktop.

A4) Yes. But see A1, A2, A3 for additional details.

A5) Going from 1 instance to 2, the memory interaction cost 20.7% (3/14.5); amortized over two cores, that is a 10.3% interaction expense. Going from 1 to 3 cost 44.8% (6.5/14.5), or 14.9% amortized over 3 cores. Going from 1 to 4 cost 89.7% (13/14.5), or 22.4% amortized over 4 cores. The interaction expense is about n*(5.8% + n*0.4%), where n is the number of hardware threads. Assuming no change in cache size, at this rate you would max out (saturate) at 11 cores.

To improve the capabilities you will need larger caches and/or more of them (usually by adding sockets). Also consider faster and/or more memory ports.
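As a rough check of the arithmetic in A5, the sketch below recomputes the slowdown percentages from the timings reported in the thread and evaluates the n*(5.8% + n*0.4%) estimate next to them, so the fit can be judged directly. The function and variable names are illustrative only; the timings and the formula coefficients are the ones quoted in this thread.

```python
BASELINE_S = 14.5                          # one instance, from the original post
MEASURED_S = {2: 17.5, 3: 21.0, 4: 27.5}   # seconds per exe with n instances


def slowdown(seconds, baseline=BASELINE_S):
    """Extra run time relative to a single instance, as a fraction."""
    return (seconds - baseline) / baseline


def amortized(seconds, n, baseline=BASELINE_S):
    """Slowdown divided by the number of instances (cores in use)."""
    return slowdown(seconds, baseline) / n


def formula_overhead(n):
    """Interaction expense estimate n*(5.8% + n*0.4%), as a fraction."""
    return n * (0.058 + 0.004 * n)


def first_saturated_n(limit=1.0):
    """Smallest n at which the formula reaches `limit` (100% by default)."""
    n = 1
    while formula_overhead(n) < limit:
        n += 1
    return n


if __name__ == "__main__":
    for n, seconds in MEASURED_S.items():
        print(f"n={n}: total {slowdown(seconds):.1%}, "
              f"amortized {amortized(seconds, n):.1%}, "
              f"formula {formula_overhead(n):.1%}")
    print("formula reaches 100% at n =", first_saturated_n())
```

The formula gives 98% at n = 10 and about 112% at n = 11, which is where the "saturate at 11 cores" figure comes from.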

Jim Dempsey

9 Replies
rreis
New Contributor I
1,470 Views
25 % - 1 core
50% - 2 core
75% - 3 core
100% - 4 core

I don't know your program or its memory needs, but it may be that the cores are competing for memory access. Another thing (someone who knows better can confirm) is that maybe the processes are not each bound to a specific core; binding them could help performance by making them stay on the same core. As I said, it also depends on your program, which I don't know about...
fraggy
Beginner
1,470 Views
Quoting - mukishere
Q5) Why is the exe taking more time to run in this case (i.e around 13s more compared to single instance) ? If the tasks are running on different processors , then ideally shouldnt it take the same time as running a single instance?
Any explanations to clearly validate the observations seen will be highly appreciated. Probably this will give me a better picture of how multi-core processors are useful for single-threaded applications over single core processors.
It's clearly strange behavior...
What exactly does your exe do? Does it open a file? Does it use the network intensively? A memory race is unlikely but possible; it depends on what your program is doing...
jimdempseyatthecove
Honored Contributor III
1,470 Views

I forgot to mention that the overhead formula n*(5.8% + n*0.4%) will exhibit a stair step when/as you transition from totally within cache to outside cache (twice if you have an L3 cache system).

Jim Dempsey
mukishere
Beginner
1,470 Views
Hi,

A1) Your application will tend to run on one core. But unless it is pinned to one core other activity by other applications (e.g. AV, Internet Update, ...) may cause the application to jump cores.

Q1a) How do I pin an activity to one core only to prevent some other applications from causing it to jump cores?

Q1b) Is there any tool to verify on how many and on which particular cores a particular activity is running for a multi-core processor?

A3) If your application is entirely cache bound, and performs no writes you might see (closer to) the same run time. However, when your application accesses memory that is not in cache it will compete with the memory access of all other cores (hardware threads). Also, if the application is performing writes to memory, and depending on the locations, it may cause a cache eviction for the other core(s) (hardware threads). As you increase the number of copies of the application you will increase the demands on the memory subsystem. There is also another consideration where the O/S is managing another instance of the application including managing more windows on the desktop.

Q3a) I want to study the processor architecture of the Dell PowerEdge 2850 server, including how memory is shared among cores, the caches available, the memory ports, and how to optimize hardware and software for it. It has quad-core Intel Xeon 3 GHz processors (CPU Type 0, CPU Family F, Model 4, Stepping 3, CPU Revision 5) and runs Microsoft Windows Server 2003, Enterprise Edition. Where can I get this information?

Q3b) I would also need the information described in Q3a for the HP Proliant DL 360 G5 server, which has Intel Xeon 2 GHz, 8-core processors (CPU Type 0, CPU Family 6, Model 17, Stepping A, CPU Revision A07) and runs Microsoft Windows Server 2003, Standard x64 Edition.

A5) Going from 1 instance to 2, the memory interaction cost 20.7% (3/14.5); amortized over two cores, that is a 10.3% interaction expense. Going from 1 to 3 cost 44.8% (6.5/14.5), or 14.9% amortized over 3 cores. Going from 1 to 4 cost 89.7% (13/14.5), or 22.4% amortized over 4 cores. The interaction expense is about n*(5.8% + n*0.4%), where n is the number of hardware threads. Assuming no change in cache size, at this rate you would max out (saturate) at 11 cores.

To improve the capabilities you will need larger caches and/or more of them (usually by adding sockets). Also consider faster and/or more memory ports.

Q5a) OK, if I take the number of hardware threads as one less than the number of cores, then the formula almost matches the observations. Could you point me to any article or paper that explains this interaction expense formula in detail? I'm curious to know how this formula is derived.


Q5b) Suppose I run the test exe (single-threaded) on a dual-core 2 GHz Intel processor and it takes 13 s. Could I then estimate, with reasonable accuracy, how long the same exe will take on a quad-core 3 GHz Intel processor? Will it run faster on the 3 GHz processor? Detailed explanations about this would be highly appreciated.

Thanks for all the help.
mukishere
Beginner
1,470 Views
Hi Jim,

Could you please reply? Your inputs will be invaluable.

Thanks,
Mukul
TimP
Honored Contributor III
1,470 Views
You seem to be asking Jim to publish a textbook here.
As you must be aware, Windows Task Manager provides for pinning a process manually, once it is running. You may be able to assess the value of pinning this way. A more convenient method is to thread with OpenMP, use a run-time library with pinning facility such as the Intel one, and set the corresponding environment variable (KMP_AFFINITY=compact, if using all logical processors for one application). PGI OpenMP apparently defaults to a scheme similar to Intel KMP_AFFINITY=compact. Microsoft OpenMP doesn't provide for affinity, but the Intel "compatibility" library works with Microsoft compiled OpenMP. Intel MPI also provides for optimizing affinity.
Pinning is most likely to show an advantage on a multiple socket machine, and least likely to be needed on a single socket unified last level cache, such as Core 2 Duo or Core i7. In part, pinning has to struggle against Windows, until you adopt Windows 7. In view of the differences among CPUs and OS versions, and the possibility that you may run more than one application at a time, pinning to specific cores inside the application may be more trouble than benefit.
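For readers who want to try pinning from inside a process rather than through Task Manager, here is a minimal sketch. It assumes either a Linux-style `os.sched_setaffinity` or the Win32 `SetProcessAffinityMask` call reached through `ctypes`; the helper names `affinity_mask` and `pin_current_process` are made up for this example, and nothing here has been tested on Windows Server 2003.

```python
import os
import sys


def affinity_mask(cores):
    """Build a bit mask with one bit set per core index (core 0 is bit 0)."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return mask


def pin_current_process(cores):
    """Try to restrict this process to `cores`; return True on success."""
    if hasattr(os, "sched_setaffinity"):
        # Available on Linux; pid 0 means the calling process.
        os.sched_setaffinity(0, set(cores))
        return True
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = ctypes.c_void_p
        kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p,
                                                    ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = ctypes.c_int
        handle = kernel32.GetCurrentProcess()
        return bool(kernel32.SetProcessAffinityMask(handle,
                                                    affinity_mask(cores)))
    # Other platforms: no portable standard-library call, so report failure.
    return False
```

Calling `pin_current_process([0])` at program start would keep the process on the first core where the call is supported. As noted above, whether that actually helps depends on the CPU, the OS version, and what else is running.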
mukishere
Beginner
1,470 Views
Hi tim18,

You are right. I did realise that I was asking Jim a lot of things that I could have worked out with some effort :-)
Thanks for the explanation about pinning.


mukishere
Beginner
1,470 Views

Quoting - jimdempseyatthecove
I forgot to mention that the overhead formula n*(5.8% + n*0.4%) will exhibit a stair step when/as you transition from totally within cache to outside cache (twice if you have an L3 cache system).

Jim Dempsey

Jim,

Could you please point me to some resources which talk about this formula in detail?
