@openvino @Peh_Intel @Hari_B_Intel
I converted a custom GRU model (trained on the IMDB dataset) to OpenVINO IR (.xml + .bin) and ran benchmark_app on CPU, GPU, and HETERO:CPU,GPU. The CPU shows much higher throughput than the GPU. Is this expected, or is there something wrong with my model conversion/design or benchmark_app settings? I've attached screenshots of the results.
What I did
Trained a custom GRU model on the IMDB dataset (PyTorch).
Converted the model to OpenVINO IR (.xml + .bin) using the Model Optimizer.
Verified performance with OpenVINO benchmark_app on:
CPU
GPU
HETERO:CPU,GPU
Observed significantly higher throughput on CPU compared to GPU. I also tried different device combinations, but the behavior persists.
Environment
OpenVINO version: 2024.6
OS: Windows 11
CPU: 12th Gen Intel® Core™ i7-12700
GPU: Intel® UHD Graphics 770
Model framework & conversion: PyTorch -> ONNX -> OVC
IR files: .xml and .bin generated by Model Optimizer.
- benchmark_app command used:
- !benchmark_app -m D:\openvino\gru.xml -d GPU -b 1 -i D:\openvino\inputs_bs1 --api async
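For context, the conversion flow described above can be reproduced with a short script along these lines; this is a sketch only, and the model file names, vocabulary size, and sequence length are placeholders, not the exact values used here.

import torch
import openvino as ov

# Placeholder: load the trained PyTorch GRU model (file name is hypothetical).
model = torch.load("gru_imdb.pt", weights_only=False)
model.eval()

# Export to ONNX with a dummy batch of token IDs (vocab size 10000 and
# sequence length 200 are assumptions, not the actual training configuration).
dummy_input = torch.randint(0, 10000, (1, 200), dtype=torch.int64)
torch.onnx.export(model, dummy_input, "gru.onnx")

# Convert the ONNX model to OpenVINO IR; this mirrors the `ovc gru.onnx` CLI.
ov_model = ov.convert_model("gru.onnx")
ov.save_model(ov_model, "gru.xml")  # writes gru.xml and gru.bin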
Hi BCPRAVEEN1234,
Please try adding the following benchmark_app parameters:
-nireq 4 -inference_only True
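For reference, a minimal Python sketch of what -nireq controls, using an AsyncInferQueue that keeps four inference requests in flight; the model path and input shape are placeholders.

import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("gru.xml", "GPU")

# An AsyncInferQueue with 4 jobs mirrors benchmark_app's -nireq 4: up to four
# inference requests run concurrently instead of one at a time.
queue = ov.AsyncInferQueue(compiled, 4)

dummy = np.zeros((1, 200), dtype=np.int64)  # placeholder token-ID input
for _ in range(100):
    queue.start_async({0: dummy})
queue.wait_all()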
Regards,
Peh
Hello sir,
I tried the same command, but there is still no change in the throughput. I ran this command:
!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d GPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async -nireq 4 -inference_only True
and I am getting a similar value, with no change from the result of this command:
!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d GPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async
Hi BCPRAVEEN1234,
How about the results on CPU with the same benchmark_app parameters?
Anyhow, could you share your model and input images as well?
Regards,
Peh
I am using text data, sir; my model is a custom GRU model trained on the IMDB dataset.
This is the command I am using for CPU; here I am giving a batch size of 2:
!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d CPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async
Here are the CPU results:
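As an aside, for a text model benchmark_app can consume pre-tokenized inputs saved as .npy files in the directory passed to -i; a hedged sketch, where the shape, vocabulary size, and file name are assumptions:

import numpy as np

# Hypothetical input preparation: a batch of 2 tokenized IMDB reviews with
# sequence length 200, saved where benchmark_app's -i flag can find it.
token_ids = np.random.randint(0, 10000, size=(2, 200)).astype(np.int64)
np.save(r"D:\praveen\new_gru\inputs_bs2\input.npy", token_ids)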
Hi BCPRAVEEN1234,
From your screenshot, there are 18 inference requests assigned when inferencing on CPU, while only 4 inference requests are assigned for GPU. Please make sure both devices have the same number of inference requests for the testing.
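For reference, the default request count each device gets can be queried directly; a short sketch, with the model path as a placeholder:

import openvino as ov

core = ov.Core()
for device in ("CPU", "GPU"):
    compiled = core.compile_model("gru.xml", device)
    # benchmark_app uses this value when -nireq is not specified, which is
    # why CPU and GPU defaulted to different request counts (e.g., 18 vs. 4).
    nireq = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    print(device, nireq)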
Could you compress your model (.xml and .bin) and your text data into a zip file and share it with me for further troubleshooting?
Regards,
Peh
I will share my model's .xml and .bin files for FP32 as well as FP16, and the input data for the model. The FP32 .bin file itself is 51 MB, which is not supported here; what should I do? I will share the remaining files here. Please check once, sir; I have shared the files.
Thank you for sharing the detailed information and benchmark results. Based on your logs, this behavior is expected for several reasons, especially when using Intel® UHD Graphics 770 (integrated GPU) with a GRU model.
1. CPU vs. GPU performance is expected behavior - GRU/RNN-type networks are sequential in nature and generally not optimized for GPU execution, especially on integrated GPUs like UHD 770. The Intel CPU plugin (especially with the latest oneDNN optimizations) can process these operations more efficiently — hence the much higher throughput on CPU. GPUs shine for parallelizable workloads (e.g., CNNs, Transformers, image models), but RNNs have less parallelism to exploit.
2. GPU performance - The ~35 FPS on GPU and ~2000 FPS on CPU in your results are consistent with what we see for similar RNN workloads. UHD Graphics 770 is designed more for light AI workloads and visualization — so limited performance on deep learning inference is expected.
Some suggestions that might help:
Try enabling the performance hint in benchmark_app (valid values are throughput and latency) for better auto-tuning:
benchmark_app -m C:\openvino\gru.xml -d GPU -b 1 -i C:\openvino\inputs_bs1 --api async -hint throughput
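The same hint can also be set through the Python API when compiling the model; a minimal sketch, with the model path as a placeholder:

import openvino as ov

core = ov.Core()
# String-form config equivalent to benchmark_app's -hint throughput: the GPU
# plugin then auto-selects the number of streams and infer requests.
compiled = core.compile_model("gru.xml", "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"})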
For best performance on RNN/GRU/LSTM, we generally recommend CPU execution or using a discrete GPU (Arc or Xe MAX) for more significant gains.
Hope this information helps.
Thank you
So based on this conversation, Transformer models will give better performance on a discrete GPU, right? Can you clarify whether there are any results for NLP tasks comparing this iGPU's performance to the CPU?
Hi BCPRAVEEN1234,
For your information, we publish benchmark results for the Intel® Distribution of OpenVINO™ toolkit and OpenVINO Model Server, covering a representative selection of public neural networks and Intel® devices.
You can refer to the Performance Benchmarks page.
Regards,
Peh
Hi BCPRAVEEN1234,
This thread will no longer be monitored since we have provided answers. If you need any additional information from Intel, please submit a new question.
Regards,
Peh