Altera_Forum
Honored Contributor I
1,769 Views

MPI + OpenCL Altera

Hi everybody, 

 

I am a master's student at the University of São Paulo (USP), Brazil. Here we have a Stratix V development kit from BittWare (model s5phq_a7). I'm using OpenCL for a research project: a co-design of a legacy piece of software. The board comes with a custom platform, so I don't need to create one, and all communication is done through PCIe. The software uses MPI, which has been causing me trouble with Altera OpenCL. 

 

OpenCL requires five structures to work properly (context, program, device, kernel, and queue). If only one process does all the work, I have no problem during execution. But if I try to start more than one process, a serious problem happens: the entire system freezes, because the two (or more) processes are trying to send the AOCX to the same board (I have only one board, and I need to use more than one process per board). I already tried creating the program structure only once (only the master creating it) and having the slaves use it, but when I do that, the slaves get a segmentation fault because the program structure was never created in their address space. 

 

This happens when I set up the program structure, which is responsible for reconfiguring the board through the AOCX file. I would like to know whether it's possible to create this structure only once and use it for all processes, or whether Altera OpenCL allows me to query the hardware already configured on the board and get a pointer to it. I know MPI does not allow sharing variables or structures between processes, so sharing the program structure directly is impossible. 

 

Below is part of my source code, showing how I declare these structures: 

 

bool init_opencl(int *n, bool *success) {
  cl_int status;

  int ARRAY_SIZE = *n;
  int nnz = ((ARRAY_SIZE * (ARRAY_SIZE - 1) - 1) / 2);
  *success = false;

  if (!setCwdToExeDir()) {
    return false;
  }

  platform = findPlatform("Intel(R) FPGA SDK for OpenCL(TM)");
  if (platform == NULL) {
    printf("ERROR: Unable to find Intel(R) FPGA OpenCL platform.\n");
    return false;
  }

  device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));

  context = clCreateContext(NULL, num_devices, device, &oclContextCallback, NULL, &status);
  checkError(status, "Failed to create context");

  std::string binary_file = getBoardBinaryFile(PROGRAM_FILE, device[0]);
  if (*id_process == 0) { // I tried to make only the master create the program structure
    program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);
    status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
    checkError(status, "Failed to build program");
  }

  kernel.reset(1);
  kernel[0] = clCreateKernel(program, KERNEL_FUNC, &status);
  checkError(status, "Failed to create kernel");

  ...

 

 

 

 

I hope I have been clear about my problem. :)

 

Carlos.
4 Replies
Altera_Forum
Honored Contributor I
86 Views

I think I saw a reference somewhere (probably in Altera's OpenCL documents) saying that it is possible to program the FPGA with your aocx file offline (rather than at runtime using the host code) and then set an environment variable when running the host code to prevent runtime reconfiguration. If you put all the kernels needed by all of your processes in the same cl file and compile them into one aocx file, it might work. But of course there are a million things that could still go wrong in the process: e.g. Altera's runtime might not allow two different processes to access the same FPGA board simultaneously at all. Apart from that, if this actually works, it will probably only work if your processes each access different kernels; two processes accessing the same kernel will more likely than not fail. 

 

Needless to say, there is no logical reason to "parallelize" accesses to an accelerator (be it a GPU, an FPGA, or anything else) using MPI on one machine; you can simply create multiple queues under one process and run as many kernels as you want in parallel in the same context. My recommendation is to rewrite your original code to do this instead of using MPI. But of course, if you want to scale your code over a network of FPGAs on different machines, then the original MPI-based approach is the correct solution.
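The multi-queue pattern described above can be sketched in OpenCL host code roughly like this (a sketch only, not your actual code: it assumes `context`, `device`, `program`, `KERNEL_FUNC`, and `checkError` are already set up as in your init_opencl(), and `NUM_QUEUES` is an arbitrary count):

```c
#define NUM_QUEUES 4

cl_command_queue queues[NUM_QUEUES];
cl_kernel kernels[NUM_QUEUES];
cl_int status;

for (int i = 0; i < NUM_QUEUES; ++i) {
    /* All queues share the same context and device, so the FPGA is
     * configured only once, by the single program object. */
    queues[i] = clCreateCommandQueue(context, device[0], 0, &status);
    checkError(status, "Failed to create command queue");

    /* One kernel object per queue: kernel arguments belong to the
     * kernel object, so each queue can launch with its own arguments. */
    kernels[i] = clCreateKernel(program, KERNEL_FUNC, &status);
    checkError(status, "Failed to create kernel");
}

/* Each queue can then enqueue work independently, e.g.:
 *   clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL,
 *                          &global_size, NULL, 0, NULL, NULL);
 * and clFinish(queues[i]) waits for that one queue only. */
```

This keeps exactly one program object (so only one reconfiguration of the board), while the per-queue parallelism replaces what the MPI processes were trying to do.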
Altera_Forum
Honored Contributor I
86 Views

"I think I saw a reference somewhere (probably Altera's OpenCL documents) that said it is possible to program the FPGA with your aocx file offline (rather than at runtime using the host code) and then set an environmental variable when running the host code to prevent runtime reconfiguration.", Could you tell which document is this? I have read all the documents (getting started, best practices and programming guide), and I dind't find anything related to that. I have found a way to change harware image, but I think that's not the right way (as I far as I understood flashing a new image should be provided the manufacturer), since this is used to update the DMA and PCIe hardware. 

 

"Apart from that, if this actually works, it will probably only work if your processes are each accessing different kernels; two processes accessing the same kernel will more likely than not fail." 

Could you explain further why this may not be possible? 

 

" My recommendation is to rewrite your original code to do this instead of using MPI", Rewrite the code is not an option, this software is about 400 thousand lines of code (and very old). So, removing MPI is not an option. In order to use correctly, should I share program and context? 

 

 

Thanks for helping me :)
Altera_Forum
Honored Contributor I
86 Views

 

--- Quote Start ---  

Could you tell me which document this is? I have read all the documents (getting started, best practices, and programming guide), and I didn't find anything related to that. I have found a way to change the hardware image, but I think that's not the right way (as far as I understood, flashing a new image should be provided by the manufacturer), since this is used to update the DMA and PCIe hardware. 

--- Quote End ---  

 

 

It took a bit of searching, but I found it; it is in this document (https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/opencl-sdk/ug_aocl_s5_...). 

 

Check section 2.11, Table 2. If you program the FPGA offline using "aocl program *file*.aocx" and then set the "CL_CONTEXT_COMPILER_MODE_ALTERA" environment variable to 3, this should avoid runtime reconfiguration. 
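As a sketch of that workflow (the device name "acl0" and the file names here are assumptions, not something from the document; "aocl diagnose" lists the actual device names on your system):

```shell
# Program the FPGA offline, before launching any host processes.
# "acl0" is the board's device name as reported by `aocl diagnose`.
aocl program acl0 my_kernels.aocx

# Tell the Altera OpenCL runtime not to reconfigure the device at runtime.
export CL_CONTEXT_COMPILER_MODE_ALTERA=3

# Now launch the host application, e.g. under MPI.
mpirun -np 4 ./host_app
```

With the device already configured and reconfiguration disabled, none of the MPI processes should try to send the AOCX to the board themselves, which is what was freezing your system.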

 

 

--- Quote Start ---  

Could you further explain why this may be not possible? 

--- Quote End ---  

 

The fine-grained scheduling that you get on a GPU is not available on an FPGA. Each kernel has its own specific circuit on the FPGA, and two processes cannot "share" this circuit; in the best-case scenario, one process will access the circuit, finish execution, and then the second one will start (no parallelism). In the worst-case scenario, this will result in some runtime error. On the other hand, if the processes access different kernels, then since each kernel has its own circuit, all processes can run in parallel and the only shared resource will be the DDR memory. If these cases work at all, the latter is probably much more likely to work, but you'll never know until you try. If you absolutely need the processes to run the same kernel, you can probably create multiple copies of the same kernel in your cl file under different names, and call each copy from a different process. 
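The kernel-duplication idea might look like this in the cl file (a sketch only; the kernel names and the trivial body are placeholders, not the original poster's actual kernel):

```c
// Two identical copies of one kernel under different names. The offline
// compiler generates a separate circuit for each copy, so each MPI
// process can target its own circuit and they can run in parallel.
__kernel void vec_scale_p0(__global float *restrict data, float factor) {
    int i = get_global_id(0);
    data[i] *= factor;
}

__kernel void vec_scale_p1(__global float *restrict data, float factor) {
    int i = get_global_id(0);
    data[i] *= factor;
}
```

The host code in process 0 would then do clCreateKernel(program, "vec_scale_p0", ...), process 1 would use "vec_scale_p1", and so on, all from the same aocx file.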

 

 

--- Quote Start ---  

Rewriting the code is not an option; this software is about 400 thousand lines of code (and very old). So removing MPI is not an option. In order to use it correctly, should I share the program and context? 

--- Quote End ---  

 

I am not really sure; you could try creating new MPI types for the context, queue, program, etc., creating everything on the root process, and sending the handles from the root process to the other processes, but there is no telling what would happen at run time.
Altera_Forum
Honored Contributor I
86 Views

Thank you very much for all the information. It was really helpful. :)
