Hello,
I don't have much experience with the Xeon Phi yet. I am trying to convert a program to run on a Xeon Phi. The program processes lots of data, file by file. Originally it was parallelized over the files with OpenMP like this:
#pragma omp parallel for schedule(dynamic)
for (int fileid = 0; fileid < numfiles; ++fileid)
{
    // get filename, open file and load data
    // do stuff
}
The files contain chunks of data, and it is not possible to hold all the data in host memory at the same time. Since I cannot access files directly from the device, I think there are only two possible solutions:
1. Load the data on the host, copy it to the device, and run the computation in a thread there (a rough sketch follows this list)
2. Offload the OpenMP loop to the device and, whenever a file has been processed, call back to the host to load the next one (if that is even possible)
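For illustration, option 1 could look something like this (untested sketch; buf, n, result, m and process are placeholders, and process would need to be compiled for the coprocessor):

__attribute__((target(mic))) void process(float *buf, int n, float *result, int m);

#pragma omp parallel for schedule(dynamic)
for (int fileid = 0; fileid < numfiles; ++fileid)
{
    // get filename, open file and load n elements into buf on the host
    #pragma offload target(mic:0) in(buf : length(n)) out(result : length(m))
    process(buf, n, result, m);  // blocks this host thread until the offload returns
}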
The problem is: I would like the conversion to be as non-intrusive as possible, meaning the presence of a co-processor must not be required. This makes it difficult to dynamically spawn a thread on the co-processor, because I would need to keep track of the number of cores currently in use (which, at the very least, requires runtime information about the number of cores available).
So I have two questions:
1. Since OpenMP makes it easy to adapt to platforms with a varying number of cores: is it possible to solve this problem using a combination of offload and OpenMP pragmas, or do I have to do the scheduling myself?
2. Is it possible for code running on the device to request data from the host?
Thank you very much
How much host RAM do you have?
The Xeon Phi has 8GB (or 16GB) of RAM. So the memory limitations you see on the host may exist on the Xeon Phi as well.
The "do stuff" code... does that have parallel regions?
Do you have one or more Xeon Phis?
Are you modifying the input file data and writing it (all) back out?
After processing the input data, what size is the output?
Possible answers to your questions:
1) While offload is capable of falling back to the host (if there is no MIC, run on the host), given the stated memory limitations (and depending on the size of the result data), your best approach is likely to survey what resources you have and have your main file picker dispatch work intelligently, such that neither the host nor the Phi(s) run out of memory. When scheduling to the host, pass the input buffer pointer and reserve that buffer until processing is complete. When scheduling to a Phi, use in(buffer, ...) with a signal clause and no code body, in other words a pure data transfer; once the data is on the Phi, recycle the buffer on the host, then issue the offload for the code that uses the previously copied data (see the sketch after item 2). While this is more complex, the approach can also work using MPI, in other words the work done for the Phi can be exploited with MPI.
2) Normally the host pushes/pulls data to/from the Phi. Consider having the host push tokens to the Phi; when the Phi wants data, it completes the token (i.e. the token returns to the host), and the host then takes action, in this case supplying additional data to the Phi.
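A sketch of the transfer-then-compute pattern from item 1, assuming the Intel Language Extensions for Offload (buf, n and process_on_phi are placeholders; alloc_if/free_if keep the device copy alive between the two pragmas):

#include <offload.h>

__attribute__((target(mic))) void process_on_phi(float *buf, int n);

char transfer_done;  // address serves only as a signal tag

// 1) Pure data transfer, no code body; the host thread returns immediately.
#pragma offload_transfer target(mic:0) \
        in(buf : length(n) alloc_if(1) free_if(0)) signal(&transfer_done)

// 2) Once the transfer is known complete (e.g. by querying
//    _Offload_signaled(0, &transfer_done)), the host buffer can be
//    recycled for the next file.

// 3) Compute offload that waits for the transfer and reuses the device copy.
#pragma offload target(mic:0) \
        nocopy(buf : length(n) alloc_if(0) free_if(1)) wait(&transfer_done)
process_on_phi(buf, n);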
If you do not have it already, get "Intel Xeon Phi Coprocessor High-Performance Programming" by Jim Jeffers and James Reinders. This book will give you a good overview of different approaches to your problem.
Jim Dempsey
Thanks for your suggestions. I will definitely have a look at that book you recommended.
Just for clarification:
1. The // do stuff area does not contain any further parallel regions. The parallelization takes place entirely over the files.
2. I have two Xeon Phi cards installed
3. The input data file is never modified. The size of the output data is relatively small (a few MBs per input file).
4. Host RAM is 128GB. The total amount of data to be processed ranges between 350GB and 500GB. As long as the data is processed in chunks everything is fine.
Currently I have "solved" the problem in the following way: I still use OpenMP on the host to load the files and push the data to the Phi. In order to keep all cores busy, I have manually set the number of threads. Every thread on the Phi now has a corresponding thread on the host which initiates the offload and then blocks until the computation is done (a sketch follows below).
I don't really like this "solution" because the user has to specify the number of cards and the number of threads per card so that I can spawn the right number of threads to fully utilize the Phis. And 400+ threads on the host just to push data around seems really wasteful (although the performance impact is negligible, since most of them are in a waiting state).
Technically I could live with the way things are working at the moment, but I was wondering if there is a better way.
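Roughly, the pattern I described looks like this (a sketch; cards and threads_per_card are the user-supplied values, and the offload body is the same in/out offload as in the sketch from my first post):

int hostThreads = cards * threads_per_card;   // e.g. 2 * 240, both user-specified
omp_set_num_threads(hostThreads);

#pragma omp parallel for schedule(dynamic)
for (int fileid = 0; fileid < numfiles; ++fileid)
{
    int card = omp_get_thread_num() % cards;  // pin this host thread to one card
    // load the file into a per-thread host buffer, then offload as before:
    #pragma offload target(mic: card) in(buf : length(n)) out(result : length(m))
    process(buf, n, result, m);               // host thread blocks until the Phi finishes
}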
I am still rather new at using Xeon Phi myself. I have two installed on a 1P system (Xeon E5-2620v2). So my suggestions come more from intuition than from experience (my intuition is very good).
I am fairly certain that multiple OpenMP threads can have concurrent offloads running on the Phi (an assumption on my part).
I know a single host thread can have an asynchronous offload pending.
I do not know, but it should be simple enough for you to test, whether a single host thread can have multiple asynchronous offloads running asynchronously on the Xeon Phi, meaning that asynchronous offloads issued by one host thread run independently on the Phi, as opposed to being enqueued (a quick test sketch follows).
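Something like the following could test that (a sketch; busy_work is a hypothetical kernel with a known run time, and the signal variables are used only as address tags):

__attribute__((target(mic))) void busy_work();  // hypothetical kernel, known duration

char sig1, sig2;  // addresses used as signal tags

#pragma offload target(mic:0) signal(&sig1)
busy_work();
#pragma offload target(mic:0) signal(&sig2)
busy_work();

#pragma offload_wait target(mic:0) wait(&sig1)
#pragma offload_wait target(mic:0) wait(&sig2)
// If total wall-clock time is ~1x the kernel duration, the offloads overlapped;
// ~2x suggests they were enqueued and ran one after the other.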
In the situation where asynchronous offloads issued from one host thread run synchronously (batched) on the Phi, oversubscribe the threads on the host and use synchronous offloads instead.
In the asynchronous/asynchronous situation, the main file-fetcher thread can directly post the Xeon Phi tasks as outlined in my earlier message.
Assume ~10MB is required per file context (input, scratch, output, code). 240 threads per Phi would then consume 2.4GB of RAM on each Phi, so memory capacity would not be an issue, assuming your file picker doesn't assume an infinite number of threads/buffers. You could code this somewhat like a pipeline with a limited number of buffers (the total number of concurrent tasks running on both Phis and the host). Determining the amount of resources is relatively easy:
int numberOfTasks = omp_get_max_threads() + _Offload_number_of_devices() * 240;
There is likely a function to return the number of threads/cores on the offload device.
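If not, one way to find out (an assumption on my part: calling omp_get_max_threads() inside an offload region should report the coprocessor's value):

#include <omp.h>

int devThreads = 0;
#pragma offload target(mic:0) out(devThreads)
devThreads = omp_get_max_threads();  // runs on the coprocessor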
Jim Dempsey
