Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, FPGA AI Suite, Software Stack, and Reference Designs

Intel FPGA AI Suite Inference Engine

RubenPadial
New Contributor I
2,665 Views

Is there any official documentation on the DLA runtime or inference engine for managing the DLA from the ARM side? I need to develop a custom application for running inference, but so far I have only found the dla_benchmark (main.cpp) and streaming_inference_app.cpp example files. There should be some documentation covering the SDK. The only related documentation I have found is the Intel FPGA AI Suite PCIe-based design example: https://www.intel.com/content/www/us/en/docs/programmable/768977/2024-3/fpga-runtime-plugin.html

From what I understand, the general inference workflow involves the following steps (a minimal sketch of this flow follows the list):

  1. Identify the hardware architecture
  2. Deploy the model
  3. Prepare the input data
  4. Send inference requests to the DLA
  5. Retrieve the output data
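Something like the minimal sketch below is what I have in mind, using the OpenVINO 2.0 C++ API. This is only an untested outline: the model file name is a placeholder, it assumes a single FP32 input and output, and the HETERO:FPGA,CPU device string is taken from the dla_benchmark examples and may differ on other setups.

#include <algorithm>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    // 1. Identify the hardware / load the runtime
    ov::Core core;

    // 2. Deploy the model (file name and device string are placeholders)
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");
    ov::CompiledModel compiled = core.compile_model(model, "HETERO:FPGA,CPU");

    // 3. Prepare the input data (assumes a single FP32 input)
    ov::InferRequest request = compiled.create_infer_request();
    ov::Tensor input = request.get_input_tensor();
    float* data = input.data<float>();
    std::fill(data, data + input.get_size(), 0.0f);  // replace with real preprocessing

    // 4. Send the inference request to the DLA
    request.infer();  // or request.start_async(); followed by request.wait();

    // 5. Retrieve the output data
    ov::Tensor output = request.get_output_tensor();
    const float* result = output.data<float>();
    std::cout << "First output value: " << result[0] << std::endl;
    return 0;
}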
37 Replies
JohnT_Intel
Employee
895 Views

Hi Ruben,


From the output log file, I noticed that it differs from your code: certain printouts are missing.


I do not see the printout from the section below.


      // Flip Vertically
      flipHorizontally(processed_output);

      std::cout << "Flipped Output Array:" << std::endl;
      for (const auto& row : processed_output) {
        std::cout << "[ ";
        for (const auto& value : row) {
          std::cout << value << " ";
        }
        std::cout << "]" << std::endl;
      }

      // Group the data into 9 slaves
      std::vector<std::vector<int>> grouped_output = groupData(processed_output);

      std::cout << "\nGrouped Output Array:" << std::endl;
      for (size_t i = 0; i < grouped_output.size(); ++i) {
        std::cout << "Group " << i + 1 << ": [ ";
        for (const auto& value : grouped_output[i]) {
          std::cout << value << " ";
        }
        std::cout << "]" << std::endl;
      }

      std::cout << "RIS Resolution: " << output_shape[3] << "-bit." << std::endl;
      std::vector<int> flattened_output;
      for (const auto& group : grouped_output) {
        flattened_output.insert(flattened_output.end(), group.begin(), group.end());
      }

      // std::vector<uint64_t> groups = prepareData(flattened_output, output_shape[3]);
      std::vector<uint64_t> groups = prepareData(grouped_output, output_shape[3]);
      std::cout << "Prepared Data for SPI:" << std::endl;
      for (size_t i = 0; i < groups.size(); ++i) {
        std::cout << "Group " << i << ": 0x"
                  << std::hex << groups[i]
                  << std::dec << std::endl;
      }

      const std::string throughput_file_name = "throughput_report.txt";
      std::ofstream throughput_file;
      throughput_file.open(throughput_file_name);
      throughput_file << "Throughput : " << totalFps << " fps" << std::endl;
      throughput_file << "Batch Size : " << batchSize << std::endl;
      throughput_file << "Graph number : " << exeNetworks.size() << std::endl;
      throughput_file << "Num Batches : " << num_batches << std::endl;
      throughput_file.close();

      // Output Debug Network Info if COREDLA_TEST_DEBUG_NETWORK is set
      ReadDebugNetworkInfo(ie);
      if (return_code) return return_code;


Thanks.

John Tio


RubenPadial
New Contributor I
878 Views

Hello @JohnT_Intel ,

I only commented out the // Flip Vertically and // Group the data into 9 slaves sections to avoid making the log excessively long, as they only modify the already retrieved output. You can run the full code if you’d like. Do you have any questions about that?

The raw output is printed in the previous section, as you can see in the log file.

Again, do you have an example or pseudocode for properly handling the inference requests?

JohnT_Intel
Employee
878 Views

Hi Ruben,


Unfortunately, I do not have the setup to test this, and I am working with engineering to see how we can implement the flow that you requested.


I am sorry that I am not able to expedite this support; I am trying my best to help you resolve the issue.


RubenPadial
New Contributor I
844 Views

Hello @JohnT_Intel,

Sorry. As you suggested reusing the inference request instead of creating a new one for each inference, I thought the solution was trivial and that the problem was in my implementation or concept.

I look forward to a solution.

I believe this is the correct way to use the DLA in a real application: deploy the accelerator and configure it with the graph, then keep it configured and continuously feed it new data for inference. Isn't that right? Of course, each new inference must wait for the previous one to finish. Is this correct, or have I misunderstood something about the working principle of the accelerator?

JohnT_Intel
Employee
841 Views

Hi,


Yes, your understanding of the accelerator is correct: you should be able to continuously feed it new data for inference.


RubenPadial
New Contributor I
771 Views

Hello @JohnT_Intel ,

Is there any news about this topic?

I'm using the S2M design, in case that helps in finding an alternative solution based on the streaming app.

JohnT_Intel
Employee
713 Views

Hi Ruben,


Sorry for the delay. If you are using the benchmark source code, then you will need to include “wait_all” so that the inference is completed before you proceed with new input.


You might want to refer to OpenVINO’s classes instead: https://docs.openvino.ai/2024/openvino-workflow/running-inference/integrate-openvino-with-your-application/inference-request.html


RubenPadial
New Contributor I
702 Views

Hello @JohnT_Intel ,

 

The following statement is present in the code I shared with you:

std::cout << "#Debug: 10. waitAll.\n";
// wait the latest inference executions
for (auto& inferRequestsQueue : inferRequestsQueues)
inferRequestsQueue->waitAll();

Is this what you are referring to? It doesn't work; maybe I am not using it correctly. Do you have a pseudocode example?

RubenPadial
New Contributor I
695 Views

Hello @JohnT_Intel,

dla_benchmark is implemented in C++.

The API documentation you shared in the previous comment is for Python. The example that uses wait_all is implemented in Python. There is also an example in C++, but it doesn't use wait_all, waitAll, or any similar function.

In addition, that OpenVINO documentation is for a newer release; the OpenVINO version required by the latest FPGA AI Suite (2024.3) is 2023.3.

JohnT_Intel
Employee
694 Views
RubenPadial
New Contributor I
674 Views

Hello @JohnT_Intel ,


The same. It has a C++ example, but no "wait_all" or similar function is used in it; only in the Python example.

It uses:

for (ov::InferRequest& ireq : ireqs) {
    ireq.wait();
}

Similar to the code I shared with you.

JohnT_Intel
Employee
673 Views

Hi Ruben,


I think that in C++ it uses the code below, which calls wait():

for (ov::InferRequest& ireq : ireqs) {
    ireq.wait();
}



RubenPadial
New Contributor I
660 Views

Hello @JohnT_Intel ,

As I said, it is also included in dla_benchmark as well as in the application I shared with you, and it doesn't work. Find the extracted code below:

 

for (size_t iireq = 0; iireq < nireq; iireq++) {
    auto inferRequest = inferRequestsQueues.at(net_id)->getIdleRequest();
    if (!inferRequest) {
        THROW_IE_EXCEPTION << "No idle Infer Requests!";
    }

    if (niter != 0LL) {
        std::cout << "#Debug: 10. Set output blob.\n";
        for (auto& item : outputInfos.at(net_id)) {
            std::string currOutputName = item.first;
            auto currOutputBlob = ioBlobs.at(net_id).second[iterations.at(net_id)][currOutputName];
            inferRequest->SetBlob(currOutputName, currOutputBlob);
        }
        std::cout << "#Debug: 10. Set input blob.\n";

        for (auto& item : inputInfos.at(net_id)) {
            std::string currInputName = item.first;
            auto currInputBlob = ioBlobs.at(net_id).first[iterations.at(net_id)][currInputName];
            inferRequest->SetBlob(currInputName, currInputBlob);
        }
    }

    // Execute one request/batch
    if (FLAGS_api == "sync") {
        inferRequest->infer();
    } else {
        // As the inference request is currently idle, the wait() adds no additional overhead
        // (and should return immediately). The primary reason for calling the method is
        // exception checking/re-throwing. The callback that governs the actual execution can
        // handle errors as well, but as it uses just error codes it has no details like the
        // 'what()' method of std::exception, so we recheck for any exceptions here.
        inferRequest->wait();
        inferRequest->startAsync();
    }

    iterations.at(net_id)++;
    if (net_id == exeNetworks.size() - 1) {
        execTime = std::chrono::duration_cast<ns>(Time::now() - startTime).count();
        if (niter > 0) {
            progressBar.addProgress(1);
        } else {
            // Calculate how many progress intervals are covered by the current iteration.
            // This depends on the current iteration time and the time of each progress interval.
            // Previously covered progress intervals must be skipped.
            auto progressIntervalTime = duration_nanoseconds / progressBarTotalCount;
            size_t newProgress = execTime / progressIntervalTime - progressCnt;
            progressBar.addProgress(newProgress);
            progressCnt += newProgress;
        }
    }
}

 

JohnT_Intel
Employee
566 Views

Hi Ruben,


I think you might need to only provide new input data and not change the blob; otherwise the runtime will think that this is a new inference setting.


During the first run you should perform all of the setup, and from the second run onwards you should just provide the input data.
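
As a rough, untested sketch of that idea in the legacy InferenceEngine API your application uses (exeNetwork, inputName, and outputName are placeholders for your own objects; this is not taken from the AI Suite sources):

#include <inference_engine.hpp>

// Create the request and look up its input blob once, outside the loop.
InferenceEngine::InferRequest request = exeNetwork.CreateInferRequest();
InferenceEngine::MemoryBlob::Ptr input =
    InferenceEngine::as<InferenceEngine::MemoryBlob>(request.GetBlob(inputName));

while (true) {
    {
        // Refill the same blob in place with the next set of input values.
        auto locked = input->wmap();
        float* data = locked.as<float*>();
        // ... copy the new input data into 'data' here ...
    }

    // Run one inference and wait for it to complete before reading the output
    // or submitting the next input.
    request.StartAsync();
    request.Wait(InferenceEngine::InferRequest::WaitMode::RESULT_READY);

    InferenceEngine::MemoryBlob::CPtr output =
        InferenceEngine::as<InferenceEngine::MemoryBlob>(request.GetBlob(outputName));
    auto outLocked = output->rmap();
    const float* result = outLocked.as<const float*>();
    // ... post-process 'result' here ...
}

The key point is that the request and its blobs are created only once; each iteration only rewrites the input data and calls StartAsync()/Wait().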


RubenPadial
New Contributor I
394 Views

Hello @JohnT_Intel,

Same behaviour.

I changed the code to create the blobs before the loop and to only fill them inside the loop:


 

        // Create blobs only once before the loop
        using Blob_t = std::vector<std::map<std::string, Blob::Ptr>>;
        std::vector<std::pair<Blob_t, Blob_t>> ioBlobs = vectorMapWithIndex<std::pair<Blob_t, Blob_t>>(
            exeNetworks, [&](ExecutableNetwork* const& exeNetwork, uint32_t index) mutable {
                Blob_t inputBlobs;
                Blob_t outputBlobs;
                ConstInputsDataMap inputInfo = exeNetwork->GetInputsInfo();
                ConstOutputsDataMap outputInfo = exeNetwork->GetOutputsInfo();
                
                for (uint32_t batch = 0; batch < num_batches; batch++) {
                    std::map<std::string, Blob::Ptr> outputBlobsMap;
                    for (auto& item : outputInfo) {
                        auto& precision = item.second->getTensorDesc().getPrecision();
                        if (precision != Precision::FP32) {
                            THROW_IE_EXCEPTION << "Output blob creation only supports FP32 precision. Instead got: " + precision;
                        }
                        auto outputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                        outputBlob->allocate();
                        outputBlobsMap[item.first] = (outputBlob);
                    }

                    std::map<std::string, Blob::Ptr> inputBlobsMap;
                    for (auto& item : inputInfo) {
                        Blob::Ptr inputBlob = nullptr;
                        auto& precision = item.second->getTensorDesc().getPrecision();
                        if (precision == Precision::FP32) {
                            inputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                        } else if (precision == Precision::U8) {
                            inputBlob = make_shared_blob<PrecisionTrait<Precision::U8>::value_type>(item.second->getTensorDesc());
                        } else {
                            THROW_IE_EXCEPTION << "Input blob creation only supports FP32 and U8 precision. Instead got: " + precision;
                        }
                        inputBlob->allocate();
                        inputBlobsMap[item.first] = (inputBlob);
                    }

                    inputBlobs.push_back(inputBlobsMap);
                    outputBlobs.push_back(outputBlobsMap);
                }
                
                return std::make_pair(inputBlobs, outputBlobs);
            }
        );

        std::cout << "Blobs initialized once before the loop.\n";

        while (1) {
        ...
          // Fill blobs with new input values (DO NOT re-create them)
          for (size_t i = 0; i < exeNetworks.size(); i++) {
                slog::info << "Filling input blobs for network ( " << topology_names[i] << " )" << slog::endl;
                fillBlobs(inputs, ioBlobs[i].first);  // Only fill the existing blobs
           }
       ...
        }

 Error: dlia_infer_request.cpp:53 Number of inference requests exceed the maximum number of inference requests supported per instance 

JohnT_Intel
Employee
352 Views

Hi Ruben,


I think you might need to try an OpenVINO example design or another runtime example design to see if it works on your side (e.g., classification_sample_async or object_detection_demo).

