Re: About running FPGA AI Suite Demo on SoC

STATEABC · 09-04-2025

Hello, I'm trying to run object_detection_demo and segmentation_demo from the FPGA AI Suite runtime, using the SoC design examples described in https://www.intel.com/content/www/us/en/docs/programmable/848957/2025-1/design-examples-user-guide.html, but I'm running into a problem.

First, I set up the environment according to the design example user guide and was able to run the M2M Mode Demonstration Application and the S2M Mode Demonstration Application on the SoC FPGA.

Then, I modified the runtime's create_hps_image.sh and CMakeLists files, adding the object_detection_demo and segmentation_demo build steps to the function that builds the HPS DLA runtime. Finally, I generated the ed4/root/home/app file.

On the SoC FPGA, I pass -plugins, -d, -arch_file, and -m person-vehicle-bike-detection-2000.xml. Both object_detection_demo and segmentation_demo run with -d CPU. However, when I switch to -d HETERO:FPGA,CPU, I get the following error:

src/inference/src/ie_common.cpp:71 [ GENERAL ERROR ] Exception from /opt/intel/fpga_ai_suite_2024.3/dla/util/hetero_plugin/compiled_model.cpp:62:
Standard exception from compilation library:
src/inference/src/ie_common.cpp:71 [ GENERAL ERROR ] DLIAPlugin::Engine::query_model() not supported for AOT-only builds.

My understanding is that HETERO calls the plugin's query_model() function at runtime to decide which layers run on which device, but in AOT-only builds the plugin doesn't implement query_model(), so person-vehicle-bike-detection-2000.xml can't be used directly. However, compiling person-vehicle-bike-detection-2000.xml with dla_compiler only generates a .bin file, which the demo can't load.

How can I fix this?
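
As a rough illustration of the failure described in this post, here is a toy Python model of device partitioning. It is not OpenVINO or FPGA AI Suite code: every class, function, and layer name is made up for the example, and the real HETERO plugin is more involved. The idea it tries to show is that the partitioner asks each device plugin which layers it can run, in priority order, so a plugin that cannot answer that query at runtime makes the whole step fail.

```python
# Toy model of HETERO-style partitioning. NOT OpenVINO or FPGA AI Suite code;
# all names below are hypothetical and exist only for this illustration.


class JitPlugin:
    """Stands in for a plugin that can inspect a graph at runtime."""

    def __init__(self, supported_ops):
        self.supported_ops = set(supported_ops)

    def query_model(self, layers):
        # Report the names of the layers this device can execute.
        return {name for name, op in layers if op in self.supported_ops}


class AotOnlyPlugin:
    """Stands in for a plugin built without runtime graph analysis."""

    def query_model(self, layers):
        raise RuntimeError("query_model() not supported for AOT-only builds")


def hetero_partition(layers, devices):
    """Give each layer to the first device, in priority order, that claims it."""
    assignment = {}
    for device_name, plugin in devices:
        claimed = plugin.query_model(layers)
        for name, _op in layers:
            if name in claimed and name not in assignment:
                assignment[name] = device_name
    return assignment


layers = [("conv1", "Convolution"), ("relu1", "ReLU"), ("detect", "DetectionOutput")]
cpu = JitPlugin({"Convolution", "ReLU", "DetectionOutput"})
fpga = JitPlugin({"Convolution", "ReLU"})

# Both plugins can answer the query, so the graph is split across devices.
print(hetero_partition(layers, [("FPGA", fpga), ("CPU", cpu)]))

# The FPGA plugin cannot answer the query, so partitioning fails outright.
try:
    hetero_partition(layers, [("FPGA", AotOnlyPlugin()), ("CPU", cpu)])
except RuntimeError as err:
    print("partitioning failed:", err)
```

In this model the CPU fallback never gets a chance to help, because the failure happens while the first device in the priority list is being queried.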

Zulkifli_Intel · 09-04-2025

Hi STATEABC,

Thank you for reaching out.

It seems like you compiled person-vehicle-bike-detection-2000.xml without CPU fallback enabled. The generated blob is pure FPGA-only, so when you run -d HETERO:FPGA,CPU, the runtime still calls query_model() and hits the AOT limitation. According to the user guide, the runtime inference on the SoC FPGA device utilizes the OpenVINO Arm CPU plugin. To enable fallback to the OpenVINO Arm CPU plugin for graph layers that are not supported on the FPGA, the device flag must be set to HETERO:FPGA,CPU during the AOT compile step.

Can you share the output when running this command:

dla_compiler -h

Regards,

Zul

STATEABC · 09-04-2025

Thanks! That makes sense. I had compiled the IR without CPU fallback and then tried to run -d HETERO:FPGA,CPU at runtime, which triggered query_model() on the AOT-only plugin.

Here is the output when I run dla_compiler --help:

dla_compiler --help
Allowed options:
  --help                                produce help message
  --network-file arg                    Inference engine formatted network
                                        definition file to load
  --batch-size arg (=1)                 Optional batch size override (default =
                                        1)
  --o arg                               Output file name(s) for exported
                                        file(s). Required if using
                                        foutput-format=open_vino_hetero
  --dumpdir arg                         Specified path(s) to folder for export
                                        and debug file outputs
  --plugins-file arg (=/opt/intel/fpga_ai_suite_2024.3/dla/bin/plugins.xml)
                                        Path to plugins.xml (warning - expert
                                        use only)
  --fno-transform-tables                Do not dump input/output transform
                                        tables.
  --foutput-format arg                  Select the format of the exported
                                        compiled graph. Valid
                                        options:open_vino_hetero,
                                        dla_compiled_result
  --fplugin arg (=HETERO:FPGA,CPU)      Select between different OpenVino
                                        plugins. Valid options:HETERO:FPGA,
                                        HETERO:CPU, HETERO:FPGA,CPU
  --ffolding-option arg (=1)            Controls how folding of the first
                                        convolution is performed
                                        0: No folding.
                                        1: Width and height folding performed
                                        externally of the DLA IP.
                                        2: Width and height folding performed
                                        externally of the DLA IP, additional
                                        folding performed by the DLA IP if
                                        possible.
                                        3: All folding performed by the DLA IP
                                        if possible.

  --ffold-preprocessing                 Folds the mean subtraction and scaling
                                        required for preprocessing into the
                                        first convolution of the network. Note:
                                        only supported in designs with hardware
                                        layout transforms.
  --march arg (=Generic.arch)           Architecture to use when compiling the
                                        graph.
  --mmax-resources arg                  Number of available ALMs, M20Ks and
                                        DSPs on the board
  --mmax-resources-alm-util arg (=100)  Sets the utilization of ALMs as a
                                        percentage of the chip total
  --network-weightings arg              The weighting used for each network in
                                        determining the overall throughput
  --fmin-subgraph-layers arg (=2)       Minimum number of layers allowed in a
                                        subgraph that runs on FPGA. Subgraph
                                        with fewer layers than this value will
                                        run on CPU in Hetero plugin. Must be >=
                                        1
  --fanalyze-performance                Estimates the performance of all
                                        provided graphs on the specified
                                        architecture
  --fdump-performance-report [=arg(=performance-report.txt)]
                                        Optional output file for the
                                        performance estimate
  --fanalyze-area                       Estimates the area of the specified
                                        architecture
  --fdump-area-report [=arg(=area-report.txt)]
                                        Optional output file for the area
                                        estimate
  --fdump-ptc-report [=arg(=power-parameters.ptc)]
                                        Optional output file for the PTC report
                                        file
  --fassumed-fmax-core arg (=0)         Sets the estimated fmax of the IP Core
                                        (default is family-dependent)
  --fdisplay-device                     Display the device for each subgraph
                                        for all networks
  --gen-arch                            Performs architecture generation
  --gen-arch-file arg (=./generated_arch.arch)
                                        Location to write the generated arch to
  --gen-min-sb arg (=2048)              Sets the min stream buffer size
  --mtarget-fps arg (=max)              Sets the target fps when generating
                                        archs, fps value can be a decimal or
                                        MAX(default).
                                        If value is set to a decimal number,
                                        architecture generator will target that
                                        fps and minimize resources, subject to
                                        -mmax-resources.
                                        If value is set to "MAX", architecture
                                        generator will maximize fps, subject to
                                        -mmax-resources.
  --interleave-search                   Enable interleave search to find best
                                        interleave for the device and graph.
  --max-archsetM-percentage arg (=50)   Sets the maximum percentage of archset
                                        M of total testing architectures
                                        (exhaustive search mode only)
  --arch-limit arg (=10000)             Sets the maximum number of architecture
                                        for best architecture searching
                                        (exhaustive search mode only, default
                                        10000)
  --time-limit arg (=14400)             Sets the time limit in seconds for best
                                        architecture searching (exhaustive
                                        search mode only, default 4 hours)
  --bin-data                            Read input as binary data regardless of
                                        shape.
  --enable-early-access                 Enable early access (EA) features of
                                        FPGA AI Suite. These features are still
                                        underdevelopment and have
                                        flaws/limitations.
                                        Consult the FPGA AI Suite documentation
                                        for more information on these features.
  --version                             Returns build version and os.

EXAMPLES

These examples assume that $COREDLA_WORK/demo/models/public/resnet-50-tf/FP32/model.yml has been created using OpenVINO

Compile a graph into an AOT file and estimate its performance, assuming a model.yml has been created:

dla_compiler --march ./example_architecture/A10_Generic.arch \
--network-file $COREDLA_WORK/demo/models/public/resnet-50-tf/FP32/model.yml --batch 2 \
--foutput-format open_vino_hetero \
--o ./resnet50A10Perf.bin --dumpdir ./performance-dump

Estimate the performance of a graph on an architecture:

dla_compiler --fanalyze-performance \
--march ./example_architecture/A10_Generic.arch \
--network-file $COREDLA_WORK/demo/models/public/resnet-50-tf/FP32/model.yml \
--foutput-format open_vino_hetero --o ./resnet50A10Perf.bin \
--fdump-performance-report performance-report.txt

Estimate the area of an architecture:

dla_compiler --fanalyze-area \
--march ./example_architecture/A10_Generic.arch \
--fdump-area-report ./area-report.txt

Generate an architecture for highest performance (highest throughput in frames per second (fps)):

dla_compiler --gen-arch --mmax-resources=427200,2713,1518 --gen-min-sb=2048 \
--network-file $COREDLA_WORK/demo/models/public/resnet-50-tf/FP32/model.yml \
--march=./example_architecture/A10_Performance.arch \
--mmax-resources-alm-util=75

STATEABC · 09-04-2025

Furthermore, if I run `dla_compiler --march $COREDLA_ROOT/example_architectures/AGX7_Performance.arch --network-file ./person-vehicle-bike-detection-2000.xml --plugins-file $COREDLA_ROOT/bin/plugins.xml --fplugin HETERO:FPGA,CPU --foutput-format open_vino_hetero --o ./result/test.bin --dumpdir ./result`, I get the following output:
network-weightings not specified. Auto assigning network weights to 1.0
Architecture set to /opt/intel/fpga_ai_suite_2024.3/dla/example_architectures/AGX7_Performance.arch
Network path set to ./person-vehicle-bike-detection-2000.xml
Read graph for torch-jit-export complete with network weight: 1.
Generating unsupported layer chains graph (./result/unsupported_layer_chains.dot)
Starting compilation. CoreDLA compiler logs and visualizations will be exported in: ./result/
Finished compilation in 2347 ms
Generating Model Analyzer report (./result/model_analyzer_report.txt)

INFO: Arch file enables the sigmoid activation module, however, sigmoid activations are not used in the model.
To avoid allocating unnecessary resources, disable the sigmoid activation module in the arch file.

INFO: Arch file enables the PReLU activation module, however, PReLU activations are not used in the model.
To avoid allocating unnecessary resources, disable the PReLU activation module in the arch file.

INFO: Arch file enables the pool module, however, pool is not used in the model.
To avoid allocating unnecessary resources, disable the pool module in the arch file.

INFO: Arch file enables the softmax module, However, softmax is not used in the model.
To avoid allocating unnecessary resources, disable the softmax module in the arch file.

[ INFO ] The input graph is split into two subgraphs, one for CPU and one for FPGA.

Exporting input transform to file
Exporting output transform to file

However, the generated .bin file cannot be used with the -m parameter of object_detection_demo.

Zulkifli_Intel · 09-06-2025

Hi STATEABC,

Thank you for sharing the output.

Can you run the dla_compiler with these options:

dla_compiler \
--network-file person-vehicle-bike-detection-2000.xml \
--march <arch.json> \
--foutput-format open_vino_hetero \
--fplugin HETERO:FPGA,CPU \
--dumpdir path_to_export

Then run the demo:

./object_detection_demo \
-m export_path/person-vehicle-bike-detection-2000.xml \
-d HETERO:FPGA,CPU \
-arch_file <arch.json> \
-plugins /opt/intel/fpga_ai_suite_2024.3/dla/bin/plugins.xml

Regards,

Zul

STATEABC · 09-07-2025

Hello, perhaps you didn't notice the output I posted. When I run the command `dla_compiler --march $COREDLA_ROOT/example_architectures/AGX7_Performance.arch --network-file ./person-vehicle-bike-detection-2000.xml --plugins-file $COREDLA_ROOT/bin/plugins.xml --fplugin HETERO:FPGA,CPU --foutput-format open_vino_hetero --o ./result/test.bin --dumpdir ./result`, it generates the files shown in the figure. It does not generate the .xml file I need, and it prints the following:

network-weightings not specified. Auto assigning network weights to 1.0
Architecture set to /opt/intel/fpga_ai_suite_2024.3/dla/example_architectures/AGX7_Performance.arch
Network path set to ./person-vehicle-bike-detection-2000.xml
Read graph for torch-jit-export complete with network weight: 1.
Generating unsupported layer chains graph (./result/unsupported_layer_chains.dot)
Starting compilation. CoreDLA compiler logs and visualizations will be exported in: ./result/
Finished compilation in 2347 ms
Generating Model Analyzer report (./result/model_analyzer_report.txt)

INFO: Arch file enables the sigmoid activation module, however, sigmoid activations are not used in the model.
To avoid allocating unnecessary resources, disable the sigmoid activation module in the arch file.

INFO: Arch file enables the PReLU activation module, however, PReLU activations are not used in the model.
To avoid allocating unnecessary resources, disable the PReLU activation module in the arch file.

INFO: Arch file enables the pool module, however, pool is not used in the model.
To avoid allocating unnecessary resources, disable the pool module in the arch file.

INFO: Arch file enables the softmax module, However, softmax is not used in the model.
To avoid allocating unnecessary resources, disable the softmax module in the arch file.

[ INFO ] The input graph is split into two subgraphs, one for CPU and one for FPGA.

Exporting input transform to file
Exporting output transform to file

In addition, if I run the command:

./object_detection_demo \
-d HETERO:FPGA,CPU \
-i ./car-detection.png \
-m ./person-vehicle-bike-detection-2000.xml \
-at ssd \
-arch_file ./AGX7_Performance.arch \
-plugins ./plugins.xml \
-labels ./voc_20cl_bkgr.txt

I get the original error message again:

src/inference/src/ie_common.cpp:71 [ GENERAL ERROR ] Exception from /opt/intel/fpga_ai_suite_2024.3/dla/util/hetero_plugin/compiled_model.cpp:62:
Standard exception from compilation library:
src/inference/src/ie_common.cpp:71 [ GENERAL ERROR ] DLIAPlugin::Engine::query_model() not supported for AOT-only builds.

Zulkifli_Intel · 09-08-2025

Hi STATEABC,

I apologize for the oversight. I'll check with the development team on this particular issue and let you know once I receive feedback.

Regards,

Zul

JohnT_Intel · 09-17-2025

Hi,

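It looks like the application you are running is written for the JIT (just-in-time) flow, where the graph is compiled at runtime. Because you are running the SoC design, only AOT (ahead-of-time) compiled graphs are supported. You will need to modify the demo code so that it loads an AOT-compiled graph instead.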

Thanks
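
For readers wondering what such a modification might look like, here is a minimal sketch. It assumes the demo's model-loading step can be switched from compiling the IR at runtime to importing the dla_compiler output. The three methods it calls (read_model, compile_model, import_model) exist on the OpenVINO Core object in the Python API, and the C++ ov::Core used by the demos offers methods with the same names. Everything else is an assumption for illustration: the helper name load_compiled_model, the RecordingCore stand-in (there only so the sketch runs without OpenVINO installed), and above all the idea that the SoC build of the FPGA plugin accepts the .bin file through import_model(), which this thread does not confirm.

```python
import os
import tempfile


def load_compiled_model(core, model_path, device, aot):
    """Return a compiled model via the AOT path or the JIT path.

    `core` is expected to behave like openvino.Core (an assumption).
    """
    if aot:
        # AOT: hand the precompiled blob to the plugin. The graph was already
        # partitioned at compile time, so no runtime query is expected.
        with open(model_path, "rb") as blob:
            return core.import_model(blob.read(), device)
    # JIT: parse the IR, then let the device plugin compile it at runtime.
    model = core.read_model(model_path)
    return core.compile_model(model, device)


class RecordingCore:
    """Hypothetical stand-in that only records which calls were made."""

    def __init__(self):
        self.calls = []

    def import_model(self, model_stream, device_name):
        self.calls.append("import_model")
        return "aot-compiled"

    def read_model(self, model):
        self.calls.append("read_model")
        return "ir"

    def compile_model(self, model, device_name):
        self.calls.append("compile_model")
        return "jit-compiled"


# Dry run with a placeholder blob standing in for the dla_compiler output.
with tempfile.TemporaryDirectory() as tmp:
    blob_path = os.path.join(tmp, "test.bin")
    with open(blob_path, "wb") as f:
        f.write(b"placeholder")
    core = RecordingCore()
    result = load_compiled_model(core, blob_path, "HETERO:FPGA,CPU", aot=True)
    print(result, core.calls)
```

With a real Core object, the AOT branch would receive the .bin produced with --foutput-format open_vino_hetero, while the JIT branch is roughly what a demo does when it is handed an .xml file through -m.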