jwendin
Beginner

Multi-process Engine - yay or nay?

Hi all!

First of all, please excuse any language errors - as English is not my native language.

I'm currently in a rather nice position to be in - being able to really take the time to design our next game engine from scratch, with a (for the moment) very narrow planned user base: Windows Vista+, DirectX 10+, and most likely multi-core processors under the hood - with no planned support for downscaling to lesser cards or sidestepping into operating systems other than later incarnations of Windows.

Thus, in order to fully exploit the potential of the system at hand, I will have to make the best possible use of the available cores.

Now, the initial idea was to start threading the stuff that could run in parallel - but I started wondering whether another approach might lead to better parallelism and scale better with potential future many-core processors.

The other approach basically consists of a host "kernel process" and separate worker processes (not threads, but actual processes) that each have predefined tasks. The worker processes are to be buffered - they use shared memory (through memory-mapped files) and some sort of compare-and-swap to handle messaging between the processes. This implies a slight latency between what is rendered and the current world state - but hopefully I can arrange for that latency to hit certain tasks harder than others (keeping input and the local actor highly up to date, while allowing greater latency for less important actors).

Now, it might seem such an approach is overly complicating things - but it would have a few nice perks to go along with it. First of all, it would force us to think about parallelism at all times, since we can never be sure in what order things will occur without a sync-lock from the kernel process. We'd automatically have fewer locks on data, thanks to the separate address spaces and the buffered nature of the system, and it gives us a rather nice way of handling the update rate of specific tasks. We'd also potentially be able to detect crashes of individual subsystems from the kernel and try to handle them gracefully.

Now - before I get too involved in the design of the second approach, I have to ask: am I shooting myself in the foot here? Would performance plunge by using separate processes with shared memory to handle the interchange of data?

I'm fully aware that buffering the data implies a higher memory cost - but setting that aside, and assuming I could pull it off, would it at least come close to a standard multi-threading solution, performance-wise?
4 Replies
Dmitry_Vyukov
Valued Contributor I

jwendin:

Now - before I get too involved in the design of the second approach, I have to ask: am I shooting myself in the foot here? Would performance plunge by using separate processes with shared memory to handle the interchange of data?


If you initially construct messages in shared memory, i.e. you don't need to serialize/deserialize messages, then I don't see any reason why a multi-process solution would be slower than a multi-threaded one.
If you construct messages in normal memory and then serialize/deserialize every message, that's an obvious reason for performance degradation. How much slower will it be? It depends on the size of the messages, the complexity of the messages, and the frequency of enqueueing/dequeueing. For example, if the size of a message is 1 MB, then multi-threaded message passing will issue only a few machine instructions to send/receive the message, while a multi-process solution will additionally have to copy 1 MB of memory.
Other than that, I don't see any reason for a difference in performance between multi-threaded and multi-process solutions.

jimdempseyatthecove
Black Belt

Jwendin,

Before you duplicate the work of others, you should go to www.gpgpu.org and look at programming models such as Brook+ and CUDA. This site addresses the issues of using graphics cards for processing. The memory model and process interaction with the GPU are very similar to what you have described.

In a typical system, you have your PC/Mac, which may be multi-core, and one or more high-end graphics cards installed. The data interface between the application and the GPU is shared memory and/or a DMA pipe. Depending on the motherboard, you may have between 1 and 4 of these GPUs attached to the system (nVidia has an external box option too). Each of these GPUs can currently have hundreds of processing elements and GBs of on-board RAM.

Also, if you monitor the buzz around Intel, you might see some mention of an in-development product that may attach to the system in a similar manner but be more compatible with the host (PC/Mac) instruction set architecture.

By selecting an appropriate video card now, and using the appropriate software tools (e.g. Brook+, CUDA and others), you can experiment with various configurations now. Then, as the hardware develops, your software may only need a port as opposed to a redesign and reimplementation.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I

JimDempseyAtTheCove:

Before you duplicate the work of others, you should go to www.gpgpu.org and look at programming models such as Brook+ and CUDA. This site addresses the issues of using graphics cards for processing. The memory model and process interaction with the GPU are very similar to what you have described.




In this context I also have to mention the RapidMind platform:
http://www.rapidmind.net/
The main distinguishing feature of RapidMind is that it can work transparently on a multicore CPU or on a GPU. Users don't have to worry about this.

There is also OpenCL:
http://en.wikipedia.org/wiki/OpenCL
which likewise works on the CPU or GPU.

And Stream Computing:
http://ati.amd.com/technology/streamcomputing/


jimdempseyatthecove
Black Belt

Dmitriy,

I am currently adapting a Fortran-based finite element analysis program to use ATI's (AMD's) stream computing language, Brook+. Other than some small test apps, I've just begun the integration. Seeing as the tools are all geared towards C++, the adaptation is a little less straightforward than I would like (more wrappers).

My biggest problem in using Brook+ is the lack of detailed documentation and examples geared towards general computation (i.e. outside the area of video processing). I think the biggest problem is that the documentation is written from a mindset that relatively simple operations do not warrant examples or explanation (but what would anyone expect from free software?).

One of the earliest problems I had to resolve was the concept of persistent streams. This is useful for reducing the number of streamReads and streamWrites - why copy data into the GPU if it is already there? I think I have that working now, but I won't know for sure until I get enough of the code integrated to perform the calculations both ways (CPU and GPU) and compare results. Not having an update (read/modify/write) stream means more RAM on the GPU will be required for multiple buffers (and flags indicating which is current). For now this is not a problem. Later I may have to pool the persistent objects to conserve RAM on the GPU.

The second technique I want to develop is reducing the number of copy constructors invoked when transitioning from the user C++ code, to the Brook+-generated C++ code, to the C++ portion of the kernel interface code. In most cases the stream could be passed by reference (pointer), but instead the code currently requires a copy constructor. There is a significant amount of computation in performing these copies. Reducing the unnecessary overhead will mean the GPU will be effective on smaller data sets. I will defer this optimization until after I get the current coding technique working.

Jim Dempsey
