Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How to choose number of processes to start in multi-core programming?

dpeterc
Beginner
1,592 Views

My program does not actually use multithreading, it just divides the calculation among several processes, and the parent process collect the result in shared memory, once they finish. All the code is CPU bound, no I/O, and relatively small footprint (20 MB per process).
I started with this approach when the first multithread Pentiums4 came out, but it did not improve the speed at all.

With advent of true multicore processors (Pentium D), I got almost linear speedup, according to the number of cores.
So my logic was pretty straightforward, start as many processes as there are cores, and ignore the threads. I tested the program on Atom, N270, it has one core and two threads, and program was about 30% faster, when I ran it with two processes. Did not know what to think of it.

But recently I tested my program on 4 core 8 threads processor (i7-3610QM), and I got surprising results. My best speedup was when I started 6 processes, not 4 or 8. These are the times:
Processes   time(seconds)
1               4,07
2               2,09
3               1,44
4               1,09
5               0,87
6               0,76
7               0,99
8               1,01

Since I can not test my software on all processor configurations, I need some reliable formula to calculate optimal number of processes for best speedup.
I could also make a benchmark and set the optimal value during program installation, but I dislike such zero knowledge approach.

I no longer have a clear mental model of what a core vs thread is on the CPU. Note that I am not talking about programming models, but about hardware capabilities of the CPU. So thread is like half of extra core, but not quite ;-)

0 Kudos
1 Solution
jimdempseyatthecove
Honored Contributor III
1,592 Views
This (new) forum somewhat trashed your post, but I get the general idea. I have a couple of suggestions (assuming you want additional performance) 1) Consider replacing fork/exit(0)/waitpid with a thread pool model paradigm such as: OpenMP, Cilk Plus, TBB any of which will reduce the thread create/destroy overhead (assuming more than one frame generated) 2) With thread pool implimented, then consider tiling the work into cach favorable sized (i.e. may use more tiles than threads) 3) Many display drivers permit multiple display buffers (displaying one), your code sample was not clear enough to determine if you are using double buffering or not. If not, consider double buffering: building into one while displaying the other the swapping pointers Jim Dempsey

View solution in original post

0 Kudos
8 Replies
SergeyKostrov
Valued Contributor II
1,592 Views
dpeterc wrote:

...Since I can not test my software on all processor configurations, I need some reliable formula to calculate optimal number of processes for best speedup.
I could also make a benchmark and set the optimal value during program installation, but I dislike such zero knowledge approach.

I no longer have a clear mental model of what a core vs thread is on the CPU. Note that I am not talking about programming models, but about hardware capabilities of the CPU. So thread is like half of extra core, but not quite ;-)

[SergeyK] Did you try to verify your real results with Amdahl's Law for Parallel Computing? ... en.wikipedia.org/wiki/Amdahl's_law
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,592 Views
I think the issue you are observing (7 and 8 processes slower) has to do with the process startup overhead. Experiment with adjusting your test program such that a single process version runs about 60 seconds (either more data or loop over the same data). Then run the test using 1, 2, 3, 4, 5, 6, 7, 8 processes. On a single processor, or SMP system, you should find running one process with multiple threads more efficient. An exception for this is if your system is 32-bit with large memory and where each slice nears virtual memory capacity for a 32-bit process (~2GB or ~3GB). Jim Dempsey
0 Kudos
dpeterc
Beginner
1,592 Views
Thank you for your comments, Sergey and Jim. I am familiar with Amdahl's law, but I think in my case it does not kick in so fast. I have changed the workload per process by increasing it three times. What I measure is a full screen refrash for a complex simulation. In previous measurement, I have set the image buffer height at 350 pixels, which the processes then divide among themselves and fill in the results. I have increased the buffer to 1050 pixels, so with 8 processes, each one calculates 1050/8=131 pixels. Each porocess uses about 20 MB or RAM and will never fill a GB. Fork on Linux is very fast. The results below are for the overall full screen refresh, so they do not increase. Processes time(seconds) 1 4,17 2 2,20 3 1,44 4 1,09 5 0,87 6 0,75 7 0,90 8 1,01 The same pattern emerges, with best result with 6 parellel processes. Times are not very different, so fork/starutp time is not importnat, in my case. So I think the speedups are related to the amount of parallelism a processor can actually acheive. Four cores give me true paralellism, while double threads in each core (8) do improve the time, but are not able to give me 100% paralellism. My hunch is that the nomenclature in processor description http://ark.intel.com/products/64899 the number of cores translates to number of processes that can be executed in parallel, while number of threads does not translate to actual software threads or processes, but to some level of parallelism in execution in some cases, but not always. Did anyone get close to linear 8 times speedup on a 4 core 8 thread intel CPU, on an actual complex software? Maybe I need to fork the money for a new computer with 6 or 8 core processor, and then I will be able to measure the application and test my hypothesis ;-(
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,592 Views
What you may be seeing is a symptom of a choke point. >>What I measure is a full screen refrash for a complex simulation If your test program is performing a screen refresh as opposed to shunt output to NULL (or faster discard output), then the choke point is your screen refresh. In situations like this, consider using a parallel pipeline where the final merge to output device (display driver) is performed by a thread not present in the compute thread pool. IOW not main thread, not OpenMP thread, not TBB thread. Consider using an ancilliary thread either of higher priority than compute pool threads or undersubscribe compute pool threads by 1 thread. Or you could look at using QuickThread parallel_pipeline (www.quickthreadprogramming.com) which uses two classes of thread pools (compute class and higher priority I/O class). When using ancilliary thread, be sure to reduce its work to as little as possible and assure it does not burn up CPU time in useless/unnecessary spin wait. Jim Dempsey
0 Kudos
dpeterc
Beginner
1,592 Views
Hello Jim, Thans for sharing your suggestion. I have tested it by skipping the actual screen redraw, but there is virtually no change in performance. My software uses shared memory to communicate with the graphics system, I render directly in the buffer, so sceen display is a matter of bitblit, actually memcpy between main memory and graphics card. The X window test tool x11perf -shmput500 gives me 12000 reps @ 0.4742 msec ( 2110.0/sec): ShmPutImage 500x500 square So I can draw 2100 500x500 blocks per second, more than enough. My way of collecting results is very simple, each process gets its share of image to calculate and they all draw in the same shared memory segment (but each in its own part). Main process waits for the chidren to die and draws the image to the screen. Here is the actual code, to dispel any doubts: static void forkFabricCalc(int srcxpos, int srcypos, int dstypos, int wid, int hei, int totaly) { switch (fork()) { case -1: perror("The fork failed!"); break; case 0: // child renderFabricImg((unsigned char *) bufXIM->data+bufXIM->bytes_per_line*dstypos, bufXIM->bytes_per_line, bufXIM->bits_per_pixel/8, srcxpos, srcypos, wid, hei); _exit(0); break; } return; /* parent only returns control to caller */ } ... for (ypos=0; ypos5000) && ((height-ypos)>64)); if ((numCPUcores>1) && multiCoreBand) { hei1 = hei0/numCPUcores; hei0 -= (numCPUcores-1)*hei1; } else hei1 = 0; /* make the warnings go away */ if ((numCPUcores>1) && multiCoreBand) for (i=1; idata, bufXIM->bytes_per_line, bufXIM->bits_per_pixel/8, srcx+xpos, srcy+ypos, min(BlockX, width-xpos), hei0); if ((numCPUcores>1) && multiCoreBand) for (i=1; i
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,593 Views
This (new) forum somewhat trashed your post, but I get the general idea. I have a couple of suggestions (assuming you want additional performance) 1) Consider replacing fork/exit(0)/waitpid with a thread pool model paradigm such as: OpenMP, Cilk Plus, TBB any of which will reduce the thread create/destroy overhead (assuming more than one frame generated) 2) With thread pool implimented, then consider tiling the work into cach favorable sized (i.e. may use more tiles than threads) 3) Many display drivers permit multiple display buffers (displaying one), your code sample was not clear enough to determine if you are using double buffering or not. If not, consider double buffering: building into one while displaying the other the swapping pointers Jim Dempsey
0 Kudos
dpeterc
Beginner
1,592 Views
Hello Jim, Thanks for your comments. I will try to implement them; time permitting. Actually current performance is not bad, so this is not a primary concern; a more advanced thread model in not an absolute necessity. Luckily, my rendering code is what they call "embarrasingly parallel". http://en.wikipedia.org/wiki/Embarrassingly_parallel I was just wondering how to choose the right number of processes, based on hardware characteristics of the processor (number of cores and threads). I will post more results once I buy a computer wiht 6 or 8 core processor.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,592 Views
With a memory mapped display buffer you may think you have nothing to worry about. However, you might not have asthetically an appealing display should you update to the active display buffer while it is being transferred to the display. You generally correct for this by either: a) pushing the data into the display buffer between paints. This used to be called the vertical scan blanking interval for analog displays, but digital displays have something similar. This may offer a narrow window for you to complete you work when using only one display buffer b) double buffering the display such the you have the full time bewteen swap buffer (just before start of paint to display) and next swap buffer. This provides you with more time to render the next frame. You can extend this to addtional buffers if necessary, but this is trivial to do once you have double buffering in place. Jim Dempsey
0 Kudos
Reply