Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Memory usage during compilation of different devices

ayf7
Novice

I am currently compiling a model on CPU, GPU, and NPU (Llama3). When I monitor my memory usage, I notice that there are wildly different memory usages between these devices.

On CPU, my memory usage can go up to 35 GB, and during inference it also stays at this level.

For GPU, my memory usage does not go up at all.

For my NPU, compilation unfortunately exceeds my available memory (>96 GB).

 

I suspect that compiling on the GPU is utilizing some sort of cache - however, I am not configuring anything specifically. I am using the most basic setup of core.read_model -> core.compile_model.

Can I get some elaboration on how the GPU is able to use so little memory / what is happening under the hood, and whether this is possible for compilation on CPU and NPU? My ultimate objective is to be able to compile my model on the NPU.
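For anyone reproducing this comparison: one way to quantify per-device compile-time memory is to record the process's peak resident set size (RSS) around the compile call. The sketch below is a minimal standalone illustration using only the Python standard library; the body of `compile_for` is a placeholder allocation, with the real OpenVINO calls (assuming OpenVINO is installed) shown as comments. Note that peak RSS is monotonic within a process, so a real comparison should run each device in a fresh process.

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of this process in MB.

    On Linux, ru_maxrss is reported in kilobytes (on macOS it is bytes).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def compile_for(device: str):
    # In the real script this would be something like:
    #   import openvino as ov
    #   core = ov.Core()
    #   model = core.read_model("llama3.xml")  # hypothetical IR path
    #   return core.compile_model(model, device)
    # Placeholder ~50 MB allocation so this sketch runs standalone:
    return bytearray(50 * 1024 * 1024)

for device in ("CPU", "GPU", "NPU"):
    before = peak_rss_mb()
    _compiled = compile_for(device)
    after = peak_rss_mb()
    # Because peak RSS never decreases, only the first iteration shows
    # meaningful growth here; run each device in its own process for
    # real measurements.
    print(f"{device}: peak RSS grew by ~{after - before:.0f} MB")
```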

 

Thank you!

ayf7

 

Details

  • Intel Meteor Lake chipset
  • OpenVINO 2024.2
  • Ubuntu 22.04 LTS
  • Linux kernel 6.9.3
Peh_Intel
Moderator

Hi ayf7,


For your information, as of now, Large Language Models (LLMs) are not supported by the Intel NPU plugin.


Looking at the Intel® Distribution of OpenVINO™ Toolkit Release Notes 2024.2, in the GPU Device Plugin section, the memory usage of LLMs with the GPU plugin has been reduced, along with improvements to both first-token and average-token latency.



Regards,

Peh


ayf7
Novice

Hi Peh,

 

Thanks for the response. Just for clarification, I'm working on a project towards supporting LLMs on the Intel NPU plugin via my own implementation of the KV-cache in PyTorch, converted into OpenVINO IR. I've created an OpenVINO IR with completely static tensor shapes, which should therefore work on the NPU - however, I'm running into memory usage bottlenecks.

I've tested compiling my model with varying numbers of transformer layers. The memory usage at compile time is linear w.r.t. the number of operations in my model for both CPU and NPU; however, my GPU memory usage stays constant even if I double the number of layers. I'm wondering why this might be the case?

 

Best,

ayf

Peh_Intel
Moderator

Hi ayf7,


As such, could you share your OpenVINO IR files with us for further investigation?



Regards,

Peh


Peh_Intel
Moderator

Hi ayf7,


Thank you for your question. If you need any additional information from Intel, please submit a new question as this thread is no longer being monitored.



Regards,

Peh

