Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Getting low throughput on GPU compared to CPU

BCPRAVEEN1234
Beginner

@openvino  @Peh_Intel  @Hari_B_Intel 

I converted a custom GRU model (trained on the IMDB dataset) to OpenVINO IR (.xml + .bin) and ran benchmark_app on CPU, GPU, and HETERO:CPU,GPU. The CPU shows much higher throughput than the GPU. Is this expected or is there something wrong with my model conversion/design or benchmark_app settings? I’ve attached screenshots of the results.

What I did

  1. Trained a custom GRU model on the IMDB dataset (PyTorch).

  2. Converted the model to OpenVINO IR (.xml + .bin) using the Model Optimizer.

  3. Verified performance with OpenVINO benchmark_app on:

    • CPU

    • GPU

    • HETERO:CPU,GPU

  4. Observed significantly higher throughput on CPU compared to GPU. I also tried different device combinations, but the behavior persists.

    Environment 

    • OpenVINO version: 2024.6

    • OS: Windows 11

    • CPU: 12th Gen Intel® Core™ i7-12700

    • GPU: Intel® UHD Graphics 770

    • Model framework & conversion: PyTorch -> ONNX -> OVC

    • IR files: .xml and .bin generated by Model Optimizer.

    • benchmark_app command used:
    • !benchmark_app -m D:\openvino\gru.xml -d GPU -b 1 -i D:\openvino\inputs_bs1 --api async

BCPRAVEEN1234_0-1760942108340.png
BCPRAVEEN1234_1-1760942142270.png
BCPRAVEEN1234_2-1760942172985.png
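For reference, the conversion path looked roughly like this (a simplified sketch with a placeholder model and file names, not my exact training code):

import torch
import openvino as ov

# Placeholder GRU standing in for the actual IMDB classifier.
model = torch.nn.GRU(input_size=32, hidden_size=64, batch_first=True)
model.eval()
example = torch.randn(1, 100, 32)  # batch 1, 100 tokens, 32 features

# Step 1: PyTorch -> ONNX
torch.onnx.export(model, example, "gru.onnx")

# Step 2: ONNX -> OpenVINO IR (.xml + .bin); the ovc CLI does the same.
ov_model = ov.convert_model("gru.onnx")
ov.save_model(ov_model, "gru.xml")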

       

       

       
Peh_Intel
Moderator

Hi BCPRAVEEN1234,


Please try to add the following Benchmark parameters:

-nireq 4 -inference_only True



Regards,

Peh


BCPRAVEEN1234
Beginner

Hello sir,

I tried adding those parameters, but there is still no change in throughput. This is the command I ran:

!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d GPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async -nireq 4 -inference_only True

BCPRAVEEN1234_0-1761021419449.png

For this command I am getting a similar value, with no change from the result of:

!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d GPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async

BCPRAVEEN1234_1-1761021504009.png

 

Peh_Intel
Moderator

Hi BCPRAVEEN1234,


How about the results on CPU with the same Benchmark parameters?


Anyhow, could you share your model and input images as well?



Regards,

Peh


BCPRAVEEN1234
Beginner

I am using text data, sir; my model is a customized GRU model trained on the IMDB dataset.

This is the command I am using for CPU; here I am giving a batch size of 2:

!benchmark_app -m "D:\praveen\new_gru\gru_wokr_model_32.xml" -d CPU -b 2 -i D:\praveen\new_gru\inputs_bs2 --api async 

Here are the CPU results:

BCPRAVEEN1234_0-1761022353377.png

 

Peh_Intel
Moderator

Hi BCPRAVEEN1234,


From your screenshots, 18 inference requests are assigned when inferencing on CPU, while only 4 inference requests are assigned for GPU. Please make sure both devices use the same number of inference requests for the testing.
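For reference, a minimal sketch with the OpenVINO Python API to see the request count each plugin selects by default (the model path is a placeholder):

import openvino as ov

core = ov.Core()
model = core.read_model("gru.xml")  # placeholder path to your IR

# Each plugin reports its own optimal number of parallel infer requests;
# benchmark_app uses this value when -nireq is not set.
for device in ("CPU", "GPU"):
    compiled = core.compile_model(model, device)
    n = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    print(f"{device}: {n} infer requests")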


Could you compress your model (.xml and .bin) and also your text data into a zip file and share it with me for further troubleshooting?



Regards,

Peh


BCPRAVEEN1234
Beginner

I will share my model's .xml and .bin files for FP32 as well as FP16, along with the input data for the model. The FP32 .bin file itself is 51 MB and the upload here is not accepting it, sir; what should I do now? I will share the remaining files here. Please check once, sir; I have shared the files.

Hari_B_Intel
Moderator

Hi @BCPRAVEEN1234 

 

Thank you for sharing the detailed information and benchmark results. Based on your logs, this behavior is expected for several reasons, especially when using Intel® UHD Graphics 770 (integrated GPU) with a GRU model.

 

1. CPU vs. GPU performance is expected behavior - GRU/RNN-type networks are sequential in nature and generally not optimized for GPU execution, especially on integrated GPUs like the UHD 770. The Intel CPU plugin (especially with the latest oneDNN optimizations) can process these operations more efficiently, hence the much higher throughput on CPU. GPUs shine for parallelizable workloads (e.g., CNNs, Transformers, image models), but RNNs have less parallelism to exploit; see the short sketch after point 2 below.

 

2. GPU performance - The ~35 FPS on GPU and ~2000 FPS on CPU in your results are consistent with what we see for similar RNN workloads. UHD Graphics 770 is designed more for light AI workloads and visualization, so limited performance on deep learning inference is expected.
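To make the sequential point concrete, here is a schematic PyTorch sketch (not your actual model; the sizes are made up): each time step consumes the hidden state produced by the previous one, so the time loop cannot run in parallel.

import torch
import torch.nn as nn

# Schematic only: one GRU cell stepped over a 100-token sequence.
cell = nn.GRUCell(input_size=32, hidden_size=64)
x = torch.randn(100, 1, 32)  # 100 time steps, batch 1
h = torch.zeros(1, 64)

# Step t needs the hidden state from step t-1, so the 100 steps
# must run one after another; a GPU gains little here.
for t in range(100):
    h = cell(x[t], h)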

 

Some suggestions that might help:

Try enabling the throughput performance hint in benchmark_app for better auto-tuning:

benchmark_app -m C:\openvino\gru.xml -d GPU -b 1 -i C:\openvino\inputs_bs1 --api async -hint throughput
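The same hint can also be set from the OpenVINO Python API when compiling the model (a minimal sketch; the model path is a placeholder):

import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("gru.xml")  # placeholder path

# Let the GPU plugin auto-tune stream and request counts for throughput.
compiled = core.compile_model(
    model, "GPU",
    {hints.performance_mode: hints.PerformanceMode.THROUGHPUT},
)
print(compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS"))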

 

For best performance on RNN/GRU/LSTM, we generally recommend CPU execution or using a discrete GPU (Arc or Xe MAX) for more significant gains.

 

Hope this information helps.

 

Thank you 

 

BCPRAVEEN1234
Beginner

So, based on your explanation, transformer models will give better performance on a discrete GPU, right? Can you clarify whether there are any published results for NLP tasks comparing this iGPU's performance against the CPU?

Peh_Intel
Moderator

Hi BCPRAVEEN1234,

 

For your information, we have published selected benchmark results for the Intel® Distribution of OpenVINO™ toolkit and OpenVINO Model Server, covering a representative selection of public neural networks and Intel® devices.

 

You can refer to these Performance Benchmarks.

 

 

Regards,

Peh

 

Peh_Intel
Moderator

Hi BCPRAVEEN1234,

 

This thread will no longer be monitored since we have provided answers. If you need any additional information from Intel, please submit a new question. 

 

 

Regards,

Peh

