Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Quantized LLM Models on AMX-supported Sapphire Rapids CPUs

cphoward
Beginner
1,383 Views

I am interested in deploying the int4 quantized models generated with https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/llm/runtime/graph#how-to-use-python-script on OpenVINO Model Server (https://github.com/openvinotoolkit/model_server).
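
My working assumption is that OVMS serves models in OpenVINO IR format, so the first step would be an export along these lines (just a sketch; the model ID and output directory are placeholders, and this by itself does not give int4):

# Sketch: export LLaMA2 to OpenVINO IR so it can be placed in an OVMS model repository.
# Assumes optimum-intel is installed (pip install optimum[openvino]) and that you have
# access to the gated meta-llama weights; the model ID and paths are placeholders.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the IR (.xml/.bin) plus tokenizer files for serving
model.save_pretrained("llama-2-7b-chat-ov")
tokenizer.save_pretrained("llama-2-7b-chat-ov")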


I am not sure if OpenVINO Model Server is the correct tool for the job. It seems that it is, though, given that AMX is supported in OpenVINO.
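
As a sanity check on the AMX side, my understanding is that the CPU plugin's reported capabilities can be queried from Python, roughly like this (a sketch; I have not confirmed the exact output on Sapphire Rapids):

# Sketch: check what the OpenVINO CPU plugin reports on this machine.
# On Sapphire Rapids I would expect BF16/INT8 capabilities backed by AMX.
import openvino as ov

core = ov.Core()
print(core.get_property("CPU", "FULL_DEVICE_NAME"))
print(core.get_property("CPU", "OPTIMIZATION_CAPABILITIES"))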


I'd like to run an LLM like LLaMA2 from Meta on the model server. I've tried following https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md, but as documented in https://github.com/openvinotoolkit/model_server/issues/2218, I am unable to get it working.

 

I was able to get int8 quantization in the demo by running:

pip install nncf
python3 download_model.py

How could I get int4?
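
To frame what I am asking for: my understanding is that int4 here means weight-only compression of the exported IR with NNCF, roughly along these lines (assuming an NNCF release that exposes the int4 modes; the group_size and ratio values are guesses, not recommendations):

# Sketch: int4 weight-only compression of an exported IR with NNCF.
# Assumes a recent NNCF release with the INT4 modes; file paths are placeholders.
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("llama-2-7b-chat-ov/openvino_model.xml")

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,  # fraction of weights compressed to int4, the rest stay int8
)

ov.save_model(compressed, "llama-2-7b-chat-ov-int4/openvino_model.xml")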

In summary:

- How can I deploy an int4-optimized LLaMA2 model to OpenVINO Model Server?

- What do I need to do to get the demo working? Might someone be able to update it?

 

3 Replies
Hairul_Intel
Moderator
1,353 Views

Hi cphoward,

Thank you for reaching out to us.

 

We're investigating this issue and will update you on any findings as soon as possible.

 

 

Regards,

Hairul


cphoward
Beginner
1,328 Views

Hi Hairul_Intel,

Thank you for looking into this for me. I have not been able to get the chat demonstration working, but I have gotten other demonstrations working. Hopefully that rules out model server configuration issues on my side.

Hairul_Intel
Moderator
1,117 Views

Hi cphoward,

As mentioned in the GitHub thread, the demo that you're referring to was recently removed.

 

As suggested in the GitHub thread, you can check the new version of the demo, which uses the new MediaPipe Python calculator feature that makes it easier to serve Llama: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation
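
For reference, querying a served pipeline over the KServe gRPC API generally looks like the sketch below; the endpoint name, tensor names, and port are placeholders, so please refer to the demo's own client script for the actual values:

# Sketch: generic KServe gRPC client for an OVMS-served pipeline.
# "llm", "prompt", "completion" and the port are placeholders; the demo's
# client script is the reference for the real names it uses.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:9000")

prompt = np.array([b"What is OpenVINO?"], dtype=np.object_)
infer_input = grpcclient.InferInput("prompt", [1], "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llm", inputs=[infer_input])
print(result.as_numpy("completion"))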

 

This thread will no longer be monitored since we have provided the information. If you need any additional information from Intel, please submit a new question.

 

 

Regards,

Hairul

 

 

