I am running the sample code for optimization/quantization of Dolly V2, to better understand OpenVINO and to determine if it is applicable to our projects with various models such as FLAN-T5, FLAN-UL2, and MPT. Here are my questions:
First, in the following line
ov_model = OVModelForCausalLM.from_pretrained(model_id, device=current_device, export=True)
the device parameter specifies the target device for inference optimization, per the documentation. For example, device can be "CPU" or "GPU". Where can I find a list or documentation of the devices that may be specified besides these two? For instance, how do I specify optimization for ARM or an integrated GPU (iGPU)?
Second, the sample code does not address quantization to FP16 or INT8. How can this be explicitly specified via the Optimum/OpenVINO API? Would I have to modify the sample code as follows?
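Put differently, is something like the following the intended way to target a specific GPU, or to let OpenVINO pick the device automatically? (The "GPU.0" and "AUTO" strings below are only my guesses based on OpenVINO device naming; please correct me if this is not how Optimum expects them.)
ov_model = OVModelForCausalLM.from_pretrained(model_id, device="GPU.0", export=True)  # guessing "GPU.0" addresses the iGPU
ov_model = OVModelForCausalLM.from_pretrained(model_id, device="AUTO", export=True)   # or let OpenVINO choose the device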
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, device=current_device, load_in_8bit=True, export=True)
Third, the source code for the OVModel class sets self.device = torch.device("cpu"). I interpret this to mean that an OVModelForCausalLM instance runs on the CPU only. Is it possible to speed up execution of the sample code on a GPU? It takes an extremely long time on a commodity CPU.
I apologize for asking basic questions, but I have searched through the OpenVINO and Optimum Hugging Face documentation and code without much success or clarity. I am trying to formulate a plan for our team projects on the usability/applicability of the OpenVINO toolkit.
Hi Berdy,
Thanks for reaching out.
Currently, OpenVINO only supports static shapes when running inference on Intel GPUs. Static shapes can be enabled, and inference sped up, by providing the desired input shapes. You may refer to the Optimum Inference with OpenVINO documentation.
# Fix the batch size to 1 and the sequence length to 9
model.reshape(1, 9)
# Enable FP16 precision
model.half()
model.to("gpu")
# Compile the model before the first inference
model.compile()
For the devices supported by OpenVINO, you may check the Supported Devices page. Model optimization currently generates INT8 precision only. You may find the compression methods in the Optimization documentation.
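For reference, INT8 post-training quantization follows the pattern below, modeled on the example in that Optimization documentation (this is only a minimal sketch using the small classification model from the documentation as an illustration; please check the documentation for the exact API of your installed optimum-intel version):
from functools import partial
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding=True, truncation=True, max_length=128)

# Build a small calibration dataset and apply post-training INT8 quantization,
# saving the resulting OpenVINO IR model to the given directory
quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="int8_model")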
Regards,
Aznie
Hi Aznie,
Thanks for the links provided in your reply. However, I am experiencing an issue with processing capacity. I am simply attempting to run the sample code from https://docs.openvino.ai/2023.0/notebooks/240-dolly-2-instruction-following-with-output.html
The following line ran for more than 72 hours, consuming all available resources, until the OS kernel killed the process.
ov_model = OVModelForCausalLM.from_pretrained('databricks/dolly-v2-3b', device='CPU', export=True)
The process ran on a Dell Precision 7540 (Xeon E-2286M x16, 64 GB memory, 1.3 TB available disk capacity).
Are there any compute capacity guidelines available to determine minimum requirements? Please note that this is a small 3B model, while the goal is to eventually run models of up to 20B parameters.
Obviously, I am unable to adopt the guidelines in your reply until I get past this point in the process.
Thanks,
Berdy
Hi Berdy,
I encountered the same Jupyter kernel crash when running the notebook.
I have raised this issue and the storage requirement with the relevant team. It might take some time to rectify. I will provide you with the latest updates once they are available from their end.
Regards,
Aznie
Aznie,
Thanks for the reply; standing by for further updates on this matter.
Best,
Berdy
Hi Berdy,
Thank you for your patience.
Our developer has confirmed that the issue is resolved in the latest pull request. Since the issue has been resolved, this thread will no longer be monitored. If you need any additional information from Intel, please submit a new question.
Regards,
Aznie