Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Getting low throughput on GPU compared to CPU

BCPRAVEEN1234
Beginner

@openvino  @Peh_Intel  @Hari_B_Intel 

I converted a custom GRU model (trained on the IMDB dataset) to OpenVINO IR (.xml + .bin) and ran benchmark_app on CPU, GPU, and HETERO:CPU,GPU. The CPU shows much higher throughput than the GPU. Is this expected or is there something wrong with my model conversion/design or benchmark_app settings? I’ve attached screenshots of the results.

What I did

  1. Trained a custom GRU model on the IMDB dataset (PyTorch).

  2. Converted the model to OpenVINO IR (.xml + .bin) using the Model Optimizer.

  3. Verified performance with OpenVINO benchmark_app on:

    • CPU

    • GPU

    • HETERO:CPU,GPU

  4. Observed significantly higher throughput on CPU compared to GPU. I also tried different device combinations, but the behavior persists.

    Environment 

    • OpenVINO version: 2024.6

    • OS: Windows 11

    • CPU: 12th Gen Intel® Core™ i7-12700

    • GPU: Intel® UHD Graphics 770

    • Model framework & conversion: PyTorch -> ONNX -> OVC

    • IR files: .xml and .bin generated by Model Optimizer.

    • benchmark_app command used:
    • !benchmark_app -m D:\openvino\gru.xml -d GPU -b 1 -i D:\openvino\inputs_bs1 --api async

BCPRAVEEN1234_0-1760942108340.png
BCPRAVEEN1234_1-1760942142270.png
BCPRAVEEN1234_2-1760942172985.png
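For reference, the conversion path looked roughly like this (a simplified sketch with a placeholder model and file names, not my exact training code):

import torch
import openvino as ov

# Placeholder GRU standing in for the actual IMDB classifier.
model = torch.nn.GRU(input_size=32, hidden_size=64, batch_first=True)
model.eval()
example = torch.randn(1, 100, 32)  # batch 1, 100 tokens, 32 features

# Step 1: PyTorch -> ONNX
torch.onnx.export(model, example, "gru.onnx")

# Step 2: ONNX -> OpenVINO IR (.xml + .bin); the ovc CLI does the same.
ov_model = ov.convert_model("gru.onnx")
ov.save_model(ov_model, "gru.xml")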

       

       

       
Peh_Intel
Moderator

Hi BCPRAVEEN1234,


Please try to add the following Benchmark parameters:

-nireq 4 -inference_only True



Regards,

Peh


BCPRAVEEN1234
Beginner

Hello sir,

I tried adding those parameters, but there is still no change in throughput. This is the command I ran:

!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d GPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async -nireq 4 -inference_only True

BCPRAVEEN1234_0-1761021419449.png

For this command I am getting a similar value, with no change from the result of:

!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d GPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async

BCPRAVEEN1234_1-1761021504009.png

 

Peh_Intel
Moderator

Hi BCPRAVEEN1234,


How about the results on CPU with the same Benchmark parameters?


Anyhow, could you share your model and input images as well?



Regards,

Peh


BCPRAVEEN1234
Beginner

I am using text data, sir; my model is a customized GRU model trained on the IMDB dataset.

This is the command I am using for CPU; here I am giving a batch size of 2:

!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d CPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async 

Here are the CPU results:

BCPRAVEEN1234_0-1761022353377.png

 

Peh_Intel
Moderator

Hi BCPRAVEEN1234,


From your screenshots, 18 inference requests are assigned when inferencing on CPU, while only 4 inference requests are assigned for GPU. Please make sure both devices use the same number of inference requests for the testing.
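For reference, a minimal sketch with the OpenVINO Python API to see the request count each plugin selects by default (the model path is a placeholder):

import openvino as ov

core = ov.Core()
model = core.read_model("gru.xml")  # placeholder path to your IR

# Each plugin reports its own optimal number of parallel infer requests;
# benchmark_app uses this value when -nireq is not set.
for device in ("CPU", "GPU"):
    compiled = core.compile_model(model, device)
    n = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    print(f"{device}: {n} infer requests")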


Could you compress your model (.xml and .bin) and also your text data into a zip file and share it with me for further troubleshooting?



Regards,

Peh


BCPRAVEEN1234
Beginner

I will share my model's .xml and .bin files for FP32 as well as FP16, along with the input data for the model. The FP32 .bin file itself is 51 MB and the upload here is not accepting it, sir; what should I do now? I will share the remaining files here. Please check once, sir; I have shared the files.

Hari_B_Intel
Moderator

Hi @BCPRAVEEN1234 

 

Thank you for sharing the detailed information and benchmark results. Based on your logs, this behavior is expected for several reasons, especially when using Intel® UHD Graphics 770 (integrated GPU) with a GRU model.

 

1. CPU vs. GPU performance is expected behavior - GRU/RNN-type networks are sequential in nature and generally not optimized for GPU execution, especially on integrated GPUs like the UHD 770. The Intel CPU plugin (especially with the latest oneDNN optimizations) can process these operations more efficiently, hence the much higher throughput on CPU. GPUs shine for parallelizable workloads (e.g., CNNs, Transformers, image models), but RNNs have less parallelism to exploit; see the short sketch after point 2 below.

 

2. GPU performance - The ~35 FPS on GPU and ~2000 FPS on CPU in your results are consistent with what we see for similar RNN workloads. UHD Graphics 770 is designed more for light AI workloads and visualization, so limited performance on deep learning inference is expected.
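To make the sequential point concrete, here is a schematic PyTorch sketch (not your actual model; the sizes are made up): each time step consumes the hidden state produced by the previous one, so the time loop cannot run in parallel.

import torch
import torch.nn as nn

# Schematic only: one GRU cell stepped over a 100-token sequence.
cell = nn.GRUCell(input_size=32, hidden_size=64)
x = torch.randn(100, 1, 32)  # 100 time steps, batch 1
h = torch.zeros(1, 64)

# Step t needs the hidden state from step t-1, so the 100 steps
# must run one after another; a GPU gains little here.
for t in range(100):
    h = cell(x[t], h)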

 

Some suggestions that might help:

Try enabling the throughput performance hint in benchmark_app for better auto-tuning:

benchmark_app -m C:\openvino\gru.xml -d GPU -b 1 -i C:\openvino\inputs_bs1 --api async -hint throughput
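The same hint can also be set from the OpenVINO Python API when compiling the model (a minimal sketch; the model path is a placeholder):

import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("gru.xml")  # placeholder path

# Let the GPU plugin auto-tune stream and request counts for throughput.
compiled = core.compile_model(
    model, "GPU",
    {hints.performance_mode: hints.PerformanceMode.THROUGHPUT},
)
print(compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS"))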

 

For best performance on RNN/GRU/LSTM, we generally recommend CPU execution or using a discrete GPU (Arc or Xe MAX) for more significant gains.

 

Hope this information helps.

 

Thank you 

 

BCPRAVEEN1234
Beginner

So, based on your explanation, transformer models will give better performance on a discrete GPU, right? Can you clarify whether there are any published results for NLP tasks comparing this iGPU's performance against the CPU?

Peh_Intel
Moderator

Hi BCPRAVEEN1234,

 

For your information, we have published selected benchmark results for the Intel® Distribution of OpenVINO™ toolkit and OpenVINO Model Server, covering a representative selection of public neural networks and Intel® devices.

 

You can refer to these Performance Benchmarks.

 

 

Regards,

Peh

 

Peh_Intel
Moderator

Hi BCPRAVEEN1234,

 

This thread will no longer be monitored since we have provided answers. If you need any additional information from Intel, please submit a new question. 

 

 

Regards,

Peh

