I am currently compiling a model on CPU, GPU, and NPU (Llama3). When I monitor my memory usage, I notice that there are wildly different memory usages between these devices.
On CPU, my memory usage can go up to 35 GB, and during inference it also stays at this level.
For GPU, my memory usage does not go up at all.
For my NPU, unfortunately, compilation exceeds my available memory (>96 GB).
I suspect that compiling on the GPU is utilizing some sort of cache - however, I am not configuring anything specifically. I am using the most basic setup of core.read_model -> core.compile_model.
Can I get some elaboration on how the GPU is able to use so little memory / what is happening under the hood, and whether this is possible for compilation on CPU and NPU? My ultimate objective is to be able to compile my model on the NPU.
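For reference, the basic flow I'm using looks roughly like this; the model path is a placeholder, and CACHE_DIR (a documented OpenVINO property) is the explicit knob for compiled-model caching, in case anyone wants to rule that out as a factor:

```python
import openvino as ov

core = ov.Core()
model = core.read_model("llama3.xml")  # placeholder path to the IR

# Most basic setup -- no caching configured anywhere:
compiled = core.compile_model(model, "GPU")

# To make compiled-model caching explicit (and observable on disk),
# set CACHE_DIR; subsequent compiles of the same model can then load
# the cached blob instead of recompiling from scratch:
compiled_cached = core.compile_model(model, "GPU", {"CACHE_DIR": "./ov_cache"})
```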
Thank you!
ayf7
Details
- Intel Meteor Lake chipset
- OpenVINO 2024.2
- Ubuntu 22.04 LTS
- Linux kernel 6.9.3
- Tags:
- compilation
Hi ayf7,
For your information, as of now, Large Language Models (LLMs) are not supported by the Intel NPU plugin.
According to the Intel® Distribution of OpenVINO™ Toolkit Release Notes 2024.2, in the GPU Device Plugin section, the memory usage of LLMs with the GPU plugin has been reduced, and both first-token and average-token latency have been improved.
Regards,
Peh
Hi Peh,
Thanks for the response. Just for clarification, I'm working on a project towards supporting LLMs on the Intel NPU plugin via my own implementation of the KV-cache in PyTorch, converting it into OpenVINO IR. I've created an OpenVINO IR that uses completely static tensor shapes and therefore should work on the NPU - however, I'm running into memory usage bottlenecks.
I've tested compiling my model with varying numbers of transformer layers - the compile-time memory usage on CPU and NPU is linear w.r.t. the number of operations in my model, however my GPU memory usage is constant and does not change, even if I double the number of layers being used. I'm wondering why this may be the case?
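For anyone wanting to reproduce the per-device comparison, this is roughly how I track compile-time memory: measure the growth in peak resident set size around the compile call. The sketch below is stdlib-only (Linux/Unix); `fake_compile` is a hypothetical stand-in for the real `core.read_model` + `core.compile_model` sequence.

```python
import resource


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB.

    On Linux, ru_maxrss is reported in kilobytes."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def measure_compile(compile_on_device, device: str) -> float:
    """Return the growth in peak RSS (MB) caused by one compile call."""
    before = peak_rss_mb()
    compile_on_device(device)
    return peak_rss_mb() - before


if __name__ == "__main__":
    # Dummy workload standing in for compile_model: touches ~64 MB.
    blobs = []

    def fake_compile(device):
        blobs.append(bytearray(64 * 1024 * 1024))

    growth = measure_compile(fake_compile, "CPU")
    print(f"peak RSS grew by about {growth:.0f} MB")
```

Running this once per device (and per layer count) gives the linear-vs-constant scaling numbers I described above.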
Best,
ayf
Hi ayf7,
Could you share your OpenVINO IR files with us for further investigation?
Regards,
Peh
Hi ayf7,
Thank you for your question. If you need any additional information from Intel, please submit a new question as this thread is no longer being monitored.
Regards,
Peh