Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, FPGA AI Suite, Software Stack, and Reference Designs

Intel FPGA AI Suite Inference Engine

RubenPadial
New Contributor I
2,665 Views

Is there any official documentation on the DLA runtime or inference engine for managing the DLA from the ARM side? I need to develop a custom application for running inference, but so far I have only found the dla_benchmark (main.cpp) and streaming_inference_app.cpp example files. There should be some documentation covering the SDK. The only related documentation I have found is the Intel FPGA AI Suite PCIe-based design example: https://www.intel.com/content/www/us/en/docs/programmable/768977/2024-3/fpga-runtime-plugin.html

From what I understand, the general inference workflow involves the following steps (a minimal sketch of this flow follows the list):

  1. Identify the hardware architecture
  2. Deploy the model
  3. Prepare the input data
  4. Send inference requests to the DLA
  5. Retrieve the output data
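Something like the minimal sketch below is what I have in mind, using the OpenVINO 2.0 C++ API. This is only an untested outline: the model file name is a placeholder, it assumes a single FP32 input and output, and the HETERO:FPGA,CPU device string is taken from the dla_benchmark examples and may differ on other setups.

#include <algorithm>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    // 1. Identify the hardware / load the runtime
    ov::Core core;

    // 2. Deploy the model (file name and device string are placeholders)
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");
    ov::CompiledModel compiled = core.compile_model(model, "HETERO:FPGA,CPU");

    // 3. Prepare the input data (assumes a single FP32 input)
    ov::InferRequest request = compiled.create_infer_request();
    ov::Tensor input = request.get_input_tensor();
    float* data = input.data<float>();
    std::fill(data, data + input.get_size(), 0.0f);  // replace with real preprocessing

    // 4. Send the inference request to the DLA
    request.infer();  // or request.start_async(); followed by request.wait();

    // 5. Retrieve the output data
    ov::Tensor output = request.get_output_tensor();
    const float* result = output.data<float>();
    std::cout << "First output value: " << result[0] << std::endl;
    return 0;
}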
37 Replies
JohnT_Intel
Employee
895 Views

Hi Ruben,


From the output log file, I noticed that it differs from your code: certain printouts are missing.


I do not see the printout from the section below.


      // Flip Vertically
      flipHorizontally(processed_output);

      std::cout << "Flipped Output Array:" << std::endl;
      for (const auto& row : processed_output) {
        std::cout << "[ ";
        for (const auto& value : row) {
          std::cout << value << " ";
        }
        std::cout << "]" << std::endl;
      }

      // Group the data into 9 slaves
      std::vector<std::vector<int>> grouped_output = groupData(processed_output);

      std::cout << "\nGrouped Output Array:" << std::endl;
      for (size_t i = 0; i < grouped_output.size(); ++i) {
        std::cout << "Group " << i + 1 << ": [ ";
        for (const auto& value : grouped_output[i]) {
          std::cout << value << " ";
        }
        std::cout << "]" << std::endl;
      }

      std::cout << "RIS Resolution: " << output_shape[3] << "-bit." << std::endl;
      std::vector<int> flattened_output;
      for (const auto& group : grouped_output) {
        flattened_output.insert(flattened_output.end(), group.begin(), group.end());
      }

      // std::vector<uint64_t> groups = prepareData(flattened_output, output_shape[3]);
      std::vector<uint64_t> groups = prepareData(grouped_output, output_shape[3]);
      std::cout << "Prepared Data for SPI:" << std::endl;
      for (size_t i = 0; i < groups.size(); ++i) {
        std::cout << "Group " << i << ": 0x"
                  << std::hex << groups[i]
                  << std::dec << std::endl;
      }

      const std::string throughput_file_name = "throughput_report.txt";
      std::ofstream throughput_file;
      throughput_file.open(throughput_file_name);
      throughput_file << "Throughput : " << totalFps << " fps" << std::endl;
      throughput_file << "Batch Size : " << batchSize << std::endl;
      throughput_file << "Graph number : " << exeNetworks.size() << std::endl;
      throughput_file << "Num Batches : " << num_batches << std::endl;
      throughput_file.close();

      // Output Debug Network Info if COREDLA_TEST_DEBUG_NETWORK is set
      ReadDebugNetworkInfo(ie);
      if (return_code) return return_code;


Thanks.

John Tio


RubenPadial
New Contributor I
878 Views

Hello @JohnT_Intel ,

I only commented out the // Flip Vertically and // Group the data into 9 slaves sections to avoid making the log excessively long, as they only modify the already retrieved output. You can run the full code if you’d like. Do you have any questions about that?

The raw output is printed in the previous section, as you can see in the log file.

Again, do you have an example or pseudocode for properly handling the inference requests?

JohnT_Intel
Employee
878 Views

Hi Ruben,


Unfortunately, I do not have the setup to test this, and I am working with engineering to see how we can implement the flow that you requested.


I am sorry that I am not able to expedite this support; I am trying my best to help you resolve the issue.


RubenPadial
New Contributor I
844 Views

Hello @JohnT_Intel,

Sorry. As you suggested reusing the inference request instead of creating a new one for each inference, I thought the solution was trivial and that the problem was in my implementation or concept.

I look forward to a solution.

I believe this is the correct way to use the DLA in a real application: deploy the accelerator and configure it with the graph, then keep it configured and continuously feed it new data for inference. Isn't that right? Of course, each new inference must wait for the previous one to finish. Is this correct, or have I misunderstood something about the working principle of the accelerator?

JohnT_Intel
Employee
841 Views

Hi,


Yes, your understanding of the accelerator is correct: you should be able to continuously feed it new data for inference.


RubenPadial
New Contributor I
771 Views

Hello @JohnT_Intel ,

Is there any news about this topic?

I'm using the S2M design, in case that helps in finding an alternative solution based on the streaming app.

JohnT_Intel
Employee
713 Views

Hi Ruben,


Sorry for the delay. If you are using the benchmark source code, then you will need to include “wait_all” so that the inference is completed before you proceed with new input.


You might want to refer to OpenVINO’s classes instead: https://docs.openvino.ai/2024/openvino-workflow/running-inference/integrate-openvino-with-your-application/inference-request.html


RubenPadial
New Contributor I
702 Views

Hello @JohnT_Intel ,

 

The following statement is present in the code I shared with you:

std::cout << "#Debug: 10. waitAll.\n";
// wait the latest inference executions
for (auto& inferRequestsQueue : inferRequestsQueues)
inferRequestsQueue->waitAll();

Is this what you are referring to? It doesn't work; maybe I am not using it correctly. Do you have a pseudocode example?

RubenPadial
New Contributor I
695 Views

Hello @JohnT_Intel,

dla_benchmark is implemented in C++.

The API documentation you shared in the previous comment is for Python. The example that uses wait_all is implemented in Python. There is also an example in C++, but it doesn't use wait_all, waitAll, or any similar function.

In addition, that OpenVINO documentation is for a newer release; the OpenVINO version required by the latest FPGA AI Suite (2024.3) is 2023.3.

JohnT_Intel
Employee
694 Views
RubenPadial
New Contributor I
674 Views

Hello @JohnT_Intel ,


The same. It has a C++ example, but no "wait_all" or similar function is used in it; only in the Python example.

It uses:

for (ov::InferRequest& ireq : ireqs) {
    ireq.wait();
}

Similar to the code I shared with you.

JohnT_Intel
Employee
673 Views

Hi Ruben,


I think that in C++ it uses the code below, which calls wait():

for (ov::InferRequest& ireq : ireqs) {
    ireq.wait();
}



RubenPadial
New Contributor I
660 Views

Hello @JohnT_Intel ,

As I said, it is also included in dla_benchmark as well as in the application I shared with you, and it doesn't work. Find the extracted code below:

 

for (size_t iireq = 0; iireq < nireq; iireq++) {
    auto inferRequest = inferRequestsQueues.at(net_id)->getIdleRequest();
    if (!inferRequest) {
        THROW_IE_EXCEPTION << "No idle Infer Requests!";
    }

    if (niter != 0LL) {
        std::cout << "#Debug: 10. Set output blob.\n";
        for (auto& item : outputInfos.at(net_id)) {
            std::string currOutputName = item.first;
            auto currOutputBlob = ioBlobs.at(net_id).second[iterations.at(net_id)][currOutputName];
            inferRequest->SetBlob(currOutputName, currOutputBlob);
        }
        std::cout << "#Debug: 10. Set input blob.\n";

        for (auto& item : inputInfos.at(net_id)) {
            std::string currInputName = item.first;
            auto currInputBlob = ioBlobs.at(net_id).first[iterations.at(net_id)][currInputName];
            inferRequest->SetBlob(currInputName, currInputBlob);
        }
    }

    // Execute one request/batch
    if (FLAGS_api == "sync") {
        inferRequest->infer();
    } else {
        // As the inference request is currently idle, the wait() adds no additional overhead
        // (and should return immediately). The primary reason for calling the method is
        // exception checking/re-throwing. The callback that governs the actual execution can
        // handle errors as well, but as it uses just error codes it has no details like the
        // 'what()' method of std::exception, so we recheck for any exceptions here.
        inferRequest->wait();
        inferRequest->startAsync();
    }

    iterations.at(net_id)++;
    if (net_id == exeNetworks.size() - 1) {
        execTime = std::chrono::duration_cast<ns>(Time::now() - startTime).count();
        if (niter > 0) {
            progressBar.addProgress(1);
        } else {
            // Calculate how many progress intervals are covered by the current iteration.
            // This depends on the current iteration time and the time of each progress interval.
            // Previously covered progress intervals must be skipped.
            auto progressIntervalTime = duration_nanoseconds / progressBarTotalCount;
            size_t newProgress = execTime / progressIntervalTime - progressCnt;
            progressBar.addProgress(newProgress);
            progressCnt += newProgress;
        }
    }
}

 

JohnT_Intel
Employee
566 Views

Hi Ruben,


I think you might need to only provide new input data and not change the blob; otherwise the runtime will think that this is a new inference setting.


During the first run you should perform all of the setup, and from the second run onwards you should just provide the input data.
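
As a rough, untested sketch of that idea in the legacy InferenceEngine API your application uses (exeNetwork, inputName, and outputName are placeholders for your own objects; this is not taken from the AI Suite sources):

#include <inference_engine.hpp>

// Create the request and look up its input blob once, outside the loop.
InferenceEngine::InferRequest request = exeNetwork.CreateInferRequest();
InferenceEngine::MemoryBlob::Ptr input =
    InferenceEngine::as<InferenceEngine::MemoryBlob>(request.GetBlob(inputName));

while (true) {
    {
        // Refill the same blob in place with the next set of input values.
        auto locked = input->wmap();
        float* data = locked.as<float*>();
        // ... copy the new input data into 'data' here ...
    }

    // Run one inference and wait for it to complete before reading the output
    // or submitting the next input.
    request.StartAsync();
    request.Wait(InferenceEngine::InferRequest::WaitMode::RESULT_READY);

    InferenceEngine::MemoryBlob::CPtr output =
        InferenceEngine::as<InferenceEngine::MemoryBlob>(request.GetBlob(outputName));
    auto outLocked = output->rmap();
    const float* result = outLocked.as<const float*>();
    // ... post-process 'result' here ...
}

The key point is that the request and its blobs are created only once; each iteration only rewrites the input data and calls StartAsync()/Wait().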


RubenPadial
New Contributor I
394 Views

Hello @JohnT_Intel,

Same behaviour.

I changed the code to create the blobs before the loop and to only fill them inside the loop:


 

        // Create blobs only once before the loop
        using Blob_t = std::vector<std::map<std::string, Blob::Ptr>>;
        std::vector<std::pair<Blob_t, Blob_t>> ioBlobs = vectorMapWithIndex<std::pair<Blob_t, Blob_t>>(
            exeNetworks, [&](ExecutableNetwork* const& exeNetwork, uint32_t index) mutable {
                Blob_t inputBlobs;
                Blob_t outputBlobs;
                ConstInputsDataMap inputInfo = exeNetwork->GetInputsInfo();
                ConstOutputsDataMap outputInfo = exeNetwork->GetOutputsInfo();
                
                for (uint32_t batch = 0; batch < num_batches; batch++) {
                    std::map<std::string, Blob::Ptr> outputBlobsMap;
                    for (auto& item : outputInfo) {
                        auto& precision = item.second->getTensorDesc().getPrecision();
                        if (precision != Precision::FP32) {
                            THROW_IE_EXCEPTION << "Output blob creation only supports FP32 precision. Instead got: " + precision;
                        }
                        auto outputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                        outputBlob->allocate();
                        outputBlobsMap[item.first] = (outputBlob);
                    }

                    std::map<std::string, Blob::Ptr> inputBlobsMap;
                    for (auto& item : inputInfo) {
                        Blob::Ptr inputBlob = nullptr;
                        auto& precision = item.second->getTensorDesc().getPrecision();
                        if (precision == Precision::FP32) {
                            inputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                        } else if (precision == Precision::U8) {
                            inputBlob = make_shared_blob<PrecisionTrait<Precision::U8>::value_type>(item.second->getTensorDesc());
                        } else {
                            THROW_IE_EXCEPTION << "Input blob creation only supports FP32 and U8 precision. Instead got: " + precision;
                        }
                        inputBlob->allocate();
                        inputBlobsMap[item.first] = (inputBlob);
                    }

                    inputBlobs.push_back(inputBlobsMap);
                    outputBlobs.push_back(outputBlobsMap);
                }
                
                return std::make_pair(inputBlobs, outputBlobs);
            }
        );

        std::cout << "Blobs initialized once before the loop.\n";

        while (1) {
        ...
          // Fill blobs with new input values (DO NOT re-create them)
          for (size_t i = 0; i < exeNetworks.size(); i++) {
                slog::info << "Filling input blobs for network ( " << topology_names[i] << " )" << slog::endl;
                fillBlobs(inputs, ioBlobs[i].first);  // Only fill the existing blobs
           }
       ...
        }

 Error: dlia_infer_request.cpp:53 Number of inference requests exceed the maximum number of inference requests supported per instance 

JohnT_Intel
Employee
352 Views

Hi Ruben,


I think you might need to try an OpenVINO example design or another runtime example design to see if it works on your side (e.g., classification_sample_async or object_detection_demo).

