The physicist who wrote the module doesn't want to share its source code for some strange reason, so I don't have it, just a binary blob.
The whole package looks like this: the front-end (a GUI) talks to the modules, which talk to the DLL; the DLL sends configuration data to WinDriver, which passes it on to the FPGA. The configuration differs between modules. The FPGA card has a PLX PCI controller, and Jungo WinDriver is the driver for the card.
Since the physicist involved in the project doesn't want to bother with the accelerator card, he uses a dummy DLL that just returns an OK message.
When my client tested the software with the dummy DLL, everything worked all right: the bulk of the module, running on two quad-core processors, utilised 100% of four cores and about 30% of each of the remaining four.
But when they used the real DLL that calls WinDriver and the card, only one core was used, and at 100%. The call to WinDriver somehow messed with the OpenMP parallelization.
So I'm wondering whether the Windows XP scheduler is trying to squeeze everything that calls WinDriver onto the same processor for some reason, or whether there's some other explanation.
I know that there are API calls to set the processor affinity in Windows but I'm not familiar with them.
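For reference, a minimal sketch of the Win32 affinity calls in question. `SetProcessAffinityMask` and `SetThreadAffinityMask` are the real Win32 APIs; the helper names and mask values here are illustrative only, and the non-Windows stubs exist just so the sketch compiles anywhere:

```c
/* Sketch of the Win32 processor-affinity APIs (SetProcessAffinityMask,
 * SetThreadAffinityMask). The wrapper names and example masks are
 * illustrative, not from the thread above. On non-Windows builds the
 * wrappers are stubs so the sketch still compiles. */
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Restrict the whole process to the logical CPUs set in `mask`
   (bit 0 = CPU 0). Returns 0 on success, -1 on failure. */
int set_process_affinity(size_t mask)
{
#ifdef _WIN32
    return SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)mask) ? 0 : -1;
#else
    (void)mask;   /* stub off Windows */
    return 0;
#endif
}

/* Pin only the calling thread, e.g. the one that talks to WinDriver. */
int set_thread_affinity(size_t mask)
{
#ifdef _WIN32
    return SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)mask) ? 0 : -1;
#else
    (void)mask;
    return 0;
#endif
}
```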
1. Front-end
2. Module1, Module2, Module3
3. DLL
4. WinDriver
5. Card with PLX controller and Xilinx FPGA
Platform:
Windows XP 64 Pro with Visual Studio 2005
Two Intel Xeon quad-core processors
The Fortran module is built on the same platform with the Intel Fortran Compiler 10.1.021 and the Intel OpenMP math libraries.
Any help would be greatly appreciated.
/Lars Malmqvist
Check tim18's suggestion first (use a critical section if calling from within a parallel region).
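A minimal C sketch of that critical-section approach, assuming the driver entry point is not thread-safe. `fake_driver_call` is a stand-in for the real WinDriver call, which is not shown in this thread; compile with `/openmp` (MSVC) or `-fopenmp`, and note that without OpenMP the pragmas are simply ignored and the loop runs serially:

```c
/* Sketch: serializing driver calls with an OpenMP critical section.
 * `fake_driver_call` stands in for the real (non-thread-safe) WinDriver
 * entry point. */
static int g_driver_calls = 0;   /* driver-side state, not thread-safe */

static void fake_driver_call(void)
{
    g_driver_calls++;            /* unprotected increment: must be serialized */
}

/* Each iteration does CPU work in parallel but enters the driver one
   thread at a time. Returns the number of driver calls made. */
int process_blocks(int nblocks)
{
    int i;
    g_driver_calls = 0;
    #pragma omp parallel for
    for (i = 0; i < nblocks; i++) {
        /* ... parallel CPU work for block i ... */
        #pragma omp critical(windriver)
        {
            fake_driver_call();  /* only one thread inside the driver at a time */
        }
    }
    return g_driver_calls;
}
```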
An alternative is to structure the code such that only one thread (usually the master) makes the call to the hardware driver (see !$OMP MASTER).
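A C sketch of that master-only pattern (the thread's module is Fortran, where the directive is `!$OMP MASTER`; `fake_fpga_configure` is a stand-in for the real DLL call). One detail worth noting: MASTER has no implied barrier, so the other threads need an explicit barrier before they rely on the driver's results:

```c
/* Sketch: only the master thread calls the driver; the others wait at a
 * barrier and then use the result. `fake_fpga_configure` stands in for
 * the real DLL/WinDriver call. */
static int g_fpga_ready = 0;

static void fake_fpga_configure(void)
{
    g_fpga_ready = 1;            /* pretend the card was configured */
}

/* Returns 1 if every thread observed the configured card. */
int run_with_master_driver_call(void)
{
    int ok = 1;
    #pragma omp parallel
    {
        #pragma omp master
        {
            fake_fpga_configure();   /* master thread only */
        }
        /* MASTER has no implied barrier, so wait here before the other
           threads touch the driver's results. */
        #pragma omp barrier
        if (!g_fpga_ready)
            ok = 0;
    }
    return ok;
}
```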
If your hardware has an initialization section where working memory buffers (or context/completion data) are specified, and if that data is local to the stack of the thread (usually the master) that performed the initialization, then, and only then, calls from multiple threads will result in threads other than the master using invalid buffer addresses (addresses not in their own stacks).
You might also find success with a structure like:
initialize FPGA
sequential processing
begin parallel region
CPU processing
end parallel region
FPGA processing
begin parallel region
CPU processing
end parallel region
...
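The phase outline above could be sketched in C as follows. The phase and item counts, the arrays of work, and `fake_fpga_step` (standing in for the driver call) are all illustrative; the point is simply that every FPGA call sits outside any parallel region:

```c
/* Sketch of the phased structure: FPGA work in sequential code, CPU work
 * in parallel regions. `fake_fpga_step` stands in for the driver call;
 * the counts are illustrative. */
#define NPHASES 4
#define NITEMS  1000

static int fpga_steps_done = 0;

static void fake_fpga_step(void)
{
    fpga_steps_done++;           /* called sequentially: no locking needed */
}

long run_phased(void)
{
    long total = 0;
    int phase, i;
    fpga_steps_done = 0;
    for (phase = 0; phase < NPHASES; phase++) {
        fake_fpga_step();        /* FPGA processing, outside any parallel region */

        #pragma omp parallel for reduction(+:total)
        for (i = 0; i < NITEMS; i++) {
            total += 1;          /* stand-in for the real CPU processing */
        }
    }
    return total;                /* NPHASES * NITEMS, with or without OpenMP */
}
```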
If you need (or want) to perform FPGA processing within parallel regions, and critical sections do not resolve the problem, then consider reworking your parallel control loops so that the master thread performs all the FPGA calls. This will be a bit more complicated programming, but it should be well within the skill level of a competent programmer.
Also, in lieu of performing the FPGA code on the OpenMP master thread, your application could spawn a non-OpenMP thread that performs all the FPGA processing by receiving requests (messages) from the OpenMP threads.
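A sketch of that dedicated-driver-thread idea, written here with POSIX threads for portability (on Windows XP you would use `_beginthreadex` plus an event or condition variable instead). The one-slot queue, `post_request`, and the summing "driver" are all hypothetical stand-ins; the real point is that exactly one non-OpenMP thread ever touches the driver:

```c
/* Sketch: OpenMP workers post requests to a queue; a single dedicated
 * (non-OpenMP) thread owns every call into the driver. POSIX threads
 * used as a portable stand-in for Win32 threads. */
#include <pthread.h>

#define QUEUE_STOP -1

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static int q_item = 0;           /* one-slot queue: 0 means empty */
static int driver_sum = 0;       /* result accumulated by the driver thread */

/* The only thread that ever touches the (simulated) driver. */
static void *driver_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_item == 0)
            pthread_cond_wait(&q_cond, &q_lock);
        int item = q_item;
        q_item = 0;
        pthread_cond_signal(&q_cond);    /* slot is free again */
        pthread_mutex_unlock(&q_lock);

        if (item == QUEUE_STOP)
            break;
        driver_sum += item;      /* stand-in for the real WinDriver call */
    }
    return NULL;
}

/* Called by any worker thread to hand a request to the driver thread. */
void post_request(int item)
{
    pthread_mutex_lock(&q_lock);
    while (q_item != 0)
        pthread_cond_wait(&q_cond, &q_lock);
    q_item = item;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* Push requests 1..n through the driver thread; return the sum it saw. */
int run_driver_demo(int n)
{
    pthread_t tid;
    driver_sum = 0;
    pthread_create(&tid, NULL, driver_thread, NULL);
    for (int i = 1; i <= n; i++)
        post_request(i);
    post_request(QUEUE_STOP);
    pthread_join(tid, NULL);
    return driver_sum;
}
```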
Jim Dempsey
The reason this wasn't caught earlier is that the software originally ran on AMD processors, where for some reason the API calls work perfectly.
So again, many thanks!
/Lars Malmqvist