Speeding up OVModelForCausalLM

Berdy · 06-30-2023

I am running the sample code for optimization/quantization of Dolly V2, to better understand OpenVINO and to determine if it is applicable to our projects with various models such as FLAN-T5, FLAN-UL2, and MPT. Here are my questions:

First, in the following line:

ov_model = OVModelForCausalLM.from_pretrained(model_id, device=current_device, export=True)

the device parameter specifies the target device for inference, per the documentation. For example, device can be "CPU" or "GPU". Where can I find a list of, or documentation for, the devices that may be specified besides these two? For instance, how do I specify optimizing for ARM or an iGPU?
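For reference, the OpenVINO runtime can report the device names it sees on the current machine, and those strings are the kind of value the device parameter takes. The following is a minimal sketch, not part of the sample code. The helper names list_openvino_devices and pick_device are hypothetical, and the exact names returned (for example "GPU.0" and "GPU.1" on a machine with two GPUs) depend on the hardware and drivers installed. As far as I know, an integrated GPU is exposed under "GPU" and an ARM processor under "CPU", but please check that against the supported-devices documentation.

```python
def list_openvino_devices():
    """Return the device names the OpenVINO runtime reports, or [] if it is not installed."""
    try:
        from openvino import Core  # newer OpenVINO releases
    except ImportError:
        try:
            from openvino.runtime import Core  # older releases
        except ImportError:
            return []
    return list(Core().available_devices)


def pick_device(available, preferred=("GPU", "CPU")):
    """Hypothetical helper: return the first available device matching the preference order.

    Matches exact names ("GPU") as well as indexed names ("GPU.0").
    Falls back to "CPU" when nothing matches.
    """
    for wanted in preferred:
        for name in available:
            if name == wanted or name.startswith(wanted + "."):
                return name
    return "CPU"


devices = list_openvino_devices()
print("Devices reported by OpenVINO:", devices)
print("Device to pass to from_pretrained:", pick_device(devices))
```

The string returned by pick_device could then be passed as current_device in the from_pretrained call above.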

Second, the sample code does not address quantization to FP16 or INT8. How can this be specified explicitly via the Optimum/OpenVINO API? Do I perhaps have to modify the sample code as follows:

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=current_device, load_in_8bit=True, export=True)

Third, the source code for class OVModel sets self.device = torch.device("cpu"). I interpret this to mean that an OVModelForCausalLM instance runs on CPU only. Is it possible to speed up execution of the sample code by running it on a GPU? It takes an extremely long time on a commodity CPU.
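For what it is worth, Optimum Intel models also expose a to() method that takes an OpenVINO device string and a compile() method, so the torch.device("cpu") attribute does not by itself appear to pin inference to the CPU. Below is a minimal sketch of retargeting a loaded model, assuming optimum[openvino] is installed and a GPU is visible to OpenVINO. The helper names retarget and load_for_device are hypothetical, and whether this actually speeds up Dolly V2 on a given GPU is untested here.

```python
def retarget(ov_model, device="GPU"):
    """Point an already-loaded OVModel at another OpenVINO device and recompile it."""
    ov_model.to(device)   # takes an OpenVINO device string, not a torch.device
    ov_model.compile()    # compile now so the cost is not paid on the first generate()
    return ov_model


def load_for_device(model_id, device="GPU"):
    """Hypothetical helper: export a Hugging Face model to OpenVINO and compile it for `device`."""
    # Imported lazily so this module can be loaded without optimum installed.
    from optimum.intel.openvino import OVModelForCausalLM

    # compile=False defers compilation until the target device has been set.
    model = OVModelForCausalLM.from_pretrained(model_id, export=True, compile=False)
    return retarget(model, device)
```

Usage would be along the lines of ov_model = load_for_device(model_id, "GPU"), followed by the same generate() calls as in the sample code.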

I apologize for asking basic questions, but I have searched through the OpenVINO and Hugging Face Optimum documentation and code without much success or clarity. I am trying to formulate a plan for our team's projects based on the usability and applicability of the OpenVINO toolkit.

Berdy · 06-30-2023

Please ignore this post and refer instead to the post with the subject "Using OVModelForCausalLM".

Aznie_Intel · 07-02-2023

Hi Berdy,

As requested, I will close this case and we will continue providing support on your second post.

Regards,

Aznie