Software Archive
Read-only legacy content

Python and FFMPEG on PHI??

Bob_A_
Beginner

I am new to parallel computing and wondered whether running my Python app, which calls FFMPEG to crunch and edit videos, would be a good fit for the 61 cores on Phi?

My developer friend built the app to use all existing cores on a Windows machine, which it seems to do very well.  It increases the app's throughput roughly in proportion to the number of cores.  But he said I had better find a Windows computer with more cores if I want more performance, because he has not implemented message passing to make it truly parallel.

The app builds a video of the words in a sentence word by word, so separate words can be rendered in parallel and all the video pieces stitched together at the end in a single thread.  The threads all read the same read-only source data, so there are no collisions from working in parallel, and the individual video pieces are very small until they are all put together.  Roughly, the workflow looks like the sketch below.
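
(A simplified sketch of that pattern, not the actual app: the word list, file names, and FFMPEG arguments here are made-up placeholders.)

# Simplified sketch: render each word's clip in parallel, then stitch serially.
import subprocess
from multiprocessing import Pool

WORDS = ["hello", "world"]  # the real app derives these from the source data

def render_word(item):
    index, word = item
    out = f"clip_{index:05d}.mp4"
    # Placeholder FFMPEG call; each worker reads only shared, read-only inputs.
    subprocess.run(["ffmpeg", "-y", "-loop", "1", "-i", f"{word}.png",
                    "-t", "1", out], check=True)
    return out

if __name__ == "__main__":
    with Pool() as pool:                       # one worker per core by default
        clips = pool.map(render_word, list(enumerate(WORDS)))
    with open("clips.txt", "w") as f:          # stitch the pieces in one thread
        f.writelines(f"file '{c}'\n" for c in clips)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-i", "clips.txt",
                    "-c", "copy", "final.mp4"], check=True)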

Does what I describe sound like an opportunity for Phi?  I hope so, because the job currently takes 5 days to run, and with a Phi multiplier I anticipate the same job running in about 4 hours using the extra cores.

My questions are ...

1)  Will the Python code as I describe it have to be altered much, or will it run as-is and make use of the 61 extra helping hands?

2) Will memory be an issue?

3) Is there a way I can test the code on a Phi before I buy, or run it on a cloud somewhere?

Thanks!

Bob

 

Bob_A_
Beginner

I guess from the silence... I should assume that my application is not a good fit for Xeon Phi?

Maybe because of memory??

Frances_R_Intel
Employee

Sorry for the silence.

The answers to your questions, as best as I can determine (others may know better than I):

1) It isn't really possible to tell without looking at your code, but one of the issues is that, as far as I have been able to determine, no one has ported FFMPEG to the Intel® Xeon Phi™ coprocessor yet. Whether this code will run well depends on how well the FFMPEG code vectorizes. As I understand image processing software, vectorizing the code is not always straightforward. In just blindly compiling the code myself, I found that about 10% of the loops vectorized. I didn't determine whether these are major loops or just small loops that don't take much time. I did not try to run the code. If you can't find an existing port of the code, you would probably want to spend some time seeing if there is some way to get the more important loops to vectorize.

As to running 61 threads, each processing a single image - each core on the coprocessor can process up to 4 threads at a time. In order to keep the cores busy, you would want to run at least two threads on each core (minus a core set aside for the operating system and your Python to run). This could cause some issues. (See the answer to the next question.) You might instead want to run fewer images at one time but thread the work on each image, as in the sketch below. I noticed that the FFMPEG code contains calls to the pThreads library. I don't know how well it threads and didn't find any information in my search of the Internet.
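
(A sketch of that idea, assuming the stock ffmpeg command-line tool and its -threads option; the file names, counts, and encode arguments are placeholders.)

# Sketch: fewer simultaneous FFMPEG processes, each threading its own work.
import subprocess
from multiprocessing import Pool

THREADS_PER_JOB = 4      # e.g. let each FFMPEG instance use one core's 4 threads
SIMULTANEOUS_JOBS = 30   # placeholder count of images processed at once

def encode(item):
    src, dst = item      # placeholder input/output names
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-threads", str(THREADS_PER_JOB), dst], check=True)

if __name__ == "__main__":
    jobs = [(f"frame_{i}.png", f"clip_{i}.mp4") for i in range(120)]
    with Pool(SIMULTANEOUS_JOBS) as pool:
        pool.map(encode, jobs)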

As to using your Python code as is - if it were me, I would use the coprocessor to process the individual images, then send them to the host system to be combined into the final video. This could be done in parallel with the image processing (collect a bunch of images and make part of the video, collect a bunch more and add them to the video, and so on). Since the offload model of programming is not supported in Python at this time, this would mean using MPI or writing your own code to handle the communication.
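
(For what it's worth, the division of labor could look roughly like the following mpi4py sketch. It is a sketch only: render_clip, stitch, the word list, and the rank/tag layout are placeholder assumptions, and you would launch the ranks across the host and the coprocessor with mpirun.)

# Sketch: rank 0 on the host collects finished clips and stitches them;
# the remaining ranks (on the coprocessor) render clips and send them back.
from mpi4py import MPI

def render_clip(word):        # placeholder for the per-word FFMPEG step
    return word.encode()

def stitch(clips):            # placeholder for the final assembly on the host
    pass

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
WORK = ["hello", "world", "again"]          # placeholder work list

if rank == 0:
    # Host side: receive clips as they finish and fold them into the video.
    received = {}
    for _ in WORK:
        index, data = comm.recv(source=MPI.ANY_SOURCE, tag=1)
        received[index] = data
    stitch([received[i] for i in sorted(received)])
else:
    # Coprocessor side: each worker rank takes a strided share of the work.
    for index in range(rank - 1, len(WORK), size - 1):
        comm.send((index, render_clip(WORK[index])), dest=0, tag=1)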

2) Memory might be an issue if you run 120+ separate images at one time. The coprocessor does not have any directly attached disks and is not, by default, configured for swapping. The default file system is set up as a RAM disk. So you would need RAM disk space for the system files and libraries, your input file, and the completed images. Then you would need the working memory for each image as it is processed. You could take the pressure off memory by threading your code (using the pThread calls?) so that you process fewer images at a time, and by using network-mounted disks (NFS, Lustre, and the like).

3) I believe there are some sites that will sell time on their systems, or grant you some time if you are doing interesting, publishable research. I don't have any names at this time, but you can always use the "send author a message" link if you want to talk about this some more.

Bob_A_
Beginner

Thanks Frances!  Because I am new to HPC, your answer is probably more than I can handle... but much appreciated!

I have one Python app that farms out work by starting other Python apps in parallel... each chewing on a well-defined part of the overall workload.

 

From what you describe... I guess the main CPU could run the delegator Python app, and it could in turn start the slave Python apps on each of the 60 cores.  Each of those in turn calls FFMPEG at points to do the needed tasks in the slave process.  That does not fully utilize the 4 threads per core unless FFMPEG uses the other 3 on the individual core(s).

In this scenario, as best I understand from your description... I would not need to use MPI??  Each slave runs independently of the calling Python app and does not need to communicate with the other slave jobs.

As you described it, though, the issue is memory: RAM and storage.  They all need access to read-only source data of about 1 TB.

1) Would the slave apps be able to read the host computer's hard drive directly?

2) Are all of the slave apps running from the same shared RAM, or is there dedicated RAM per core?  How much?  Can the default RAM disk be on my SSD, shared with the OS, or does it need to be a separate drive?

Thanks so much! 

Bob

Charles_C_Intel1
Employee

Bob:

 

     The Intel(r) Xeon Phi(tm) coprocessor is a standalone Linux system with no attached storage and no ability to swap virtual pages to the host.  So the physical memory on the coprocessor is used for everything: running Linux, running your program, and containing the RAM disk.  Anything that tries to allocate more memory than is physically on the coprocessor will crash (remember, although Linux gives us virtual memory, it all must be physically backed by RAM since there is no pagefile on the coprocessor).  Usually it is the program being run that does this and crashes.  :-)

 

     So, if you have more data than fits within the coprocessor, you will need to get it from the host system (and any storage it is attached to) into the coprocessor rather than storing it in the RAM disk on the coprocessor.  That can be done using an offload model, using MPI as you suggest, or by mounting a remote networked filesystem (NFS, LFS) on the coprocessor so a program running on the coprocessor can read/write to remote file systems.  Since you have a lot of frames to process, you want the next frames to work on to be streaming in (using multiple threads or processes) while you are processing the current set of frames (on another set of threads or processes) and independently writing back the previous set of processed frames (more threads or processes).  You want to keep as many threads as busy as possible at all times.  This is already quite different from how your program works now, as I understand it.  And once again, all this needs to fit in the memory of your coprocessor.  The compute model you will need for your problem on a coprocessor is often called a streaming compute model; a very rough sketch follows.
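
     (Very roughly, the shape of that pipeline in Python. A sketch only: the reader/worker/writer counts, queue depths, and the process_frame and save steps are placeholders for the real work.)

# Sketch of a streaming pipeline: a reader stages the next frames while workers
# process the current ones and a writer drains finished results.
import threading, queue

in_q  = queue.Queue(maxsize=64)    # frames staged in from the host / remote FS
out_q = queue.Queue(maxsize=64)    # processed frames waiting to be written back

def process_frame(frame):          # placeholder for the real per-frame work
    return frame

def save(result):                  # placeholder for the write-back step
    pass

def reader(paths):
    for p in paths:
        in_q.put(open(p, "rb").read())   # stream the next chunk of work in
    in_q.put(None)                       # sentinel: no more work

def worker():
    while True:
        frame = in_q.get()
        if frame is None:
            in_q.put(None)               # let the other workers see the sentinel
            out_q.put(None)
            break
        out_q.put(process_frame(frame))

def writer(n_workers):
    finished = 0
    while finished < n_workers:
        result = out_q.get()
        if result is None:
            finished += 1
        else:
            save(result)

if __name__ == "__main__":
    paths = []                           # placeholder list of input frames
    workers = [threading.Thread(target=worker) for _ in range(4)]
    others  = [threading.Thread(target=reader, args=(paths,)),
               threading.Thread(target=writer, args=(len(workers),))]
    for t in workers + others:
        t.start()
    for t in workers + others:
        t.join()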

 

     Personally, unless the application your code is running is extremely well optimized, it's not going to get everything possible out of a coprocessor core (Frances' observation of about 10% vectorization suggests it is not that well optimized yet).  To use the coprocessor effectively, it needs to be >90% vectorized, >90% parallel, and using memory effectively.  Each MPEG processing task may run better on more than one core, rather than just one, since each core may not be running optimally.  You'll also want to keep process creation/shutdown and memory allocation to a minimum if the time to do those things is on the order of the time to run the MPEG code on a frame, so it's unclear if you will want to run Python on the coprocessor (I forget if anyone has ported Python to the Intel(r) Xeon Phi(tm) coprocessor).  If there is a coprocessor Python implementation, and they implemented threads well, it may "just work," or you may need to have a C program doing the work on the coprocessor.

 

     As to whether what you want to do is a good fit for Intel(r) Xeon Phi(tm), it's hard to say, given that I haven't heard of people trying to do MPEG encoding using the coprocessor and I haven't tried it myself.  It's likely that it can be done, but like so many things in computing, it's not clear it can be done well, and doing it well can sometimes require a lot of effort.  Complications in this case are that you need to do a lot of copying of data into and out of the coprocessor, and the number of compute operations per memory access or I/O may be low.  Our coprocessor really shines when it can sit and crunch on a bit of contiguous memory for a while.  MPEG needs to look at neighboring pixels in two dimensions as well as adjoining frames (so not great memory locality).  That already strikes me as non-trivial unless the single-process program you have on Windows will run effectively as a single process across all cores on the coprocessor.  Again, it's hard to say whether your current implementation would be the best way to solve your problem on the coprocessor, even though it seems to be serviceable on your host system.

 

    Sorry to conclude with a firm "maybe, but..."  :-)

 

    Charles

Bob_A_
Beginner

Many thanks Charles and Frances!

It looks like my hope of doing minimal re-coding would not pan out in re-hosting my current app on Phi.  I understand the reasons why much better now from both of your posts.  I look forward to next year, when Phi will morph into a sure-enough CPU-based processor with free access to system RAM, running Windows. :-)

Thanks!

Bob

Charles_C_Intel1
Employee

Bob:

Even more reason to find an algorithm with better parallelization, memory utilization, and vectorization.  Optimize the life out of your code using the desktop/workstation/server system with the most cores you can find.   If it doesn't run really well on that, it doesn't have much chance of running well on the machine to which you allude.

BTW, the coprocessor is already a full-blown computer.  It's just trapped in the card: "huge cosmic powers, itty-bitty living space."  :-)

Charles
