I am currently compiling a model on CPU, GPU, and NPU (Llama3). When I monitor my memory usage, I notice that there are wildly different memory usages between these devices.
On CPU, my memory usage can go up to 35 GB, and during inference it also stays at this level.
For GPU, my memory usage does not go up at all.
For my NPU, unfortunately, compilation exceeds my available memory (>96 GB).
I suspect that compiling on the GPU is utilizing some sort of cache - however, I am not configuring anything specifically. I am using the most basic setup of core.read_model -> core.compile_model.
Can I get some elaboration on how the GPU is able to use so little memory / what is happening under the hood, and whether this is possible for compilation on CPU and NPU? My ultimate objective is to be able to compile my model on the NPU.
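For reference, the basic flow I'm using looks roughly like this; the model path is a placeholder, and CACHE_DIR (a documented OpenVINO property) is the explicit knob for compiled-model caching, in case anyone wants to rule that out as a factor:

```python
import openvino as ov

core = ov.Core()
model = core.read_model("llama3.xml")  # placeholder path to the IR

# Most basic setup -- no caching configured anywhere:
compiled = core.compile_model(model, "GPU")

# To make compiled-model caching explicit (and observable on disk),
# set CACHE_DIR; subsequent compiles of the same model can then load
# the cached blob instead of recompiling from scratch:
compiled_cached = core.compile_model(model, "GPU", {"CACHE_DIR": "./ov_cache"})
```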
Thank you!
ayf7
Details
- Intel Meteor Lake chipset
- OpenVINO 2024.2
- Ubuntu 22.04 LTS
- Linux kernel 6.9.3
- Tags:
- compilation
Hi ayf7,
For your information, as of now, Large Language Models (LLMs) are not supported by the Intel NPU plugin.
According to the Intel® Distribution of OpenVINO™ Toolkit Release Notes 2024.2, in the GPU Device Plugin section, the memory usage of LLMs with the GPU plugin has been reduced, and both first-token and average-token latency have been improved.
Regards,
Peh
Hi Peh,
Thanks for the response. Just for clarification, I'm working on a project towards supporting LLMs on the Intel NPU plugin via my own implementation of the KV-cache in PyTorch, converting it into OpenVINO IR. I've created an OpenVINO IR that uses completely static tensor shapes and therefore should work on the NPU - however, I'm running into memory usage bottlenecks.
I've tested compiling my model with varying numbers of transformer layers - the compile-time memory usage on CPU and NPU is linear w.r.t. the number of operations in my model, however my GPU memory usage is constant and does not change, even if I double the number of layers being used. I'm wondering why this may be the case?
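For anyone wanting to reproduce the per-device comparison, this is roughly how I track compile-time memory: measure the growth in peak resident set size around the compile call. The sketch below is stdlib-only (Linux/Unix); `fake_compile` is a hypothetical stand-in for the real `core.read_model` + `core.compile_model` sequence.

```python
import resource


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB.

    On Linux, ru_maxrss is reported in kilobytes."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def measure_compile(compile_on_device, device: str) -> float:
    """Return the growth in peak RSS (MB) caused by one compile call."""
    before = peak_rss_mb()
    compile_on_device(device)
    return peak_rss_mb() - before


if __name__ == "__main__":
    # Dummy workload standing in for compile_model: touches ~64 MB.
    blobs = []

    def fake_compile(device):
        blobs.append(bytearray(64 * 1024 * 1024))

    growth = measure_compile(fake_compile, "CPU")
    print(f"peak RSS grew by about {growth:.0f} MB")
```

Running this once per device (and per layer count) gives the linear-vs-constant scaling numbers I described above.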
Best,
ayf
Hi ayf7,
Could you share your OpenVINO IR files with us for further investigation?
Regards,
Peh
Hi ayf7,
Thank you for your question. If you need any additional information from Intel, please submit a new question as this thread is no longer being monitored.
Regards,
Peh