I am running the sample code for optimization/quantization of Dolly V2, to better understand OpenVINO and to determine if it is applicable to our projects with various models such as FLAN-T5, FLAN-UL2, and MPT. Here are my questions:
First, in the following line
ov_model = OVModelForCausalLM.from_pretrained(model_id, device=current_device, export=True)
the device parameter specifies the target device for inference optimization, per the documentation. For example, device can be "CPU" or "GPU". Where can I find a list or documentation of the devices that may be specified besides these two? For instance, how do I specify optimization for ARM or an integrated GPU (iGPU)?
Second, the sample code does not address quantization to FP16 or INT8. How can this be explicitly specified via the Optimum/OpenVINO API? Would I have to modify the sample code as follows?
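Put differently, is something like the following the intended way to target a specific GPU, or to let OpenVINO pick the device automatically? (The "GPU.0" and "AUTO" strings below are only my guesses based on OpenVINO device naming; please correct me if this is not how Optimum expects them.)
ov_model = OVModelForCausalLM.from_pretrained(model_id, device="GPU.0", export=True)  # guessing "GPU.0" addresses the iGPU
ov_model = OVModelForCausalLM.from_pretrained(model_id, device="AUTO", export=True)   # or let OpenVINO choose the device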
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, device=current_device, load_in_8bit=True, export=True)
Third, the source code for the OVModel class sets self.device = torch.device("cpu"). I interpret this to mean that an OVModelForCausalLM instance runs on the CPU only. Is it possible to speed up execution of the sample code on a GPU? It takes an extremely long time on a commodity CPU.
I apologize for asking basic questions, but I have searched through the OpenVINO and Optimum Hugging Face documentation and code without much success or clarity. I am trying to formulate a plan for our team projects on the usability/applicability of the OpenVINO toolkit.
Hi Berdy,
Thanks for reaching out.
Currently, OpenVINO only supports static shapes when running inference on Intel GPUs. Static shapes can be enabled, and inference sped up, by providing the desired input shapes. You may refer to the Optimum Inference with OpenVINO documentation.
# Fix the batch size to 1 and the sequence length to 9
model.reshape(1, 9)
# Enable FP16 precision
model.half()
model.to("gpu")
# Compile the model before the first inference
model.compile()
For the devices supported by OpenVINO, you may check the Supported Devices page. Model optimization currently generates INT8 precision only. You may find the compression methods in the Optimization documentation.
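For reference, INT8 post-training quantization follows the pattern below, modeled on the example in that Optimization documentation (this is only a minimal sketch using the small classification model from the documentation as an illustration; please check the documentation for the exact API of your installed optimum-intel version):
from functools import partial
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding=True, truncation=True, max_length=128)

# Build a small calibration dataset and apply post-training INT8 quantization,
# saving the resulting OpenVINO IR model to the given directory
quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="int8_model")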
Regards,
Aznie
Hi Aznie,
Thanks for the links provided in your reply. However, I am experiencing an issue with processing capacity. I am simply attempting to run the sample code from https://docs.openvino.ai/2023.0/notebooks/240-dolly-2-instruction-following-with-output.html
The following line ran for more than 72 hours, consuming all available resources, until the OS kernel killed the process.
ov_model = OVModelForCausalLM.from_pretrained('databricks/dolly-v2-3b', device='CPU', export=True)
The process ran on a Dell Precision 7540 (Xeon E-2286M x16, 64 GB memory, 1.3 TB available disk capacity).
Are there any compute capacity guidelines available to determine minimum requirements? Please note that this is a small 3B model, while the goal is to eventually run models of up to 20B parameters.
Obviously, I am unable to adopt the guidelines in your reply until I get past this point in the process.
Thanks,
Berdy
Hi Berdy,
I encountered the same Jupyter kernel crash when running the notebook.
I have raised this issue and the storage requirement with the relevant team. It might take some time to rectify. I will provide you with the latest updates once they are available from their end.
Regards,
Aznie
Aznie,
Thanks for the reply; standing by for further updates on this matter.
Best,
Berdy
Hi Berdy,
Thank you for your patience.
Our developer has confirmed that the issue is resolved in the latest pull request. Since the issue has been resolved, this thread will no longer be monitored. If you need any additional information from Intel, please submit a new question.
Regards,
Aznie