Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Quantized LLM Models on AMX-supported Sapphire Rapids CPUs

cphoward
Beginner
1,383 Views

I am interested in deploying the int4 quantized models generated with https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/llm/runtime/graph#how-to-use-python-script on OpenVINO Model Server (https://github.com/openvinotoolkit/model_server).
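
My working assumption is that OVMS serves models in OpenVINO IR format, so the first step would be an export along these lines (just a sketch; the model ID and output directory are placeholders, and this by itself does not give int4):

# Sketch: export LLaMA2 to OpenVINO IR so it can be placed in an OVMS model repository.
# Assumes optimum-intel is installed (pip install optimum[openvino]) and that you have
# access to the gated meta-llama weights; the model ID and paths are placeholders.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the IR (.xml/.bin) plus tokenizer files for serving
model.save_pretrained("llama-2-7b-chat-ov")
tokenizer.save_pretrained("llama-2-7b-chat-ov")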


I am not sure if OpenVINO Model Server is the correct tool for the job. It seems that it is, though, given that AMX is supported in OpenVINO.
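
As a sanity check on the AMX side, my understanding is that the CPU plugin's reported capabilities can be queried from Python, roughly like this (a sketch; I have not confirmed the exact output on Sapphire Rapids):

# Sketch: check what the OpenVINO CPU plugin reports on this machine.
# On Sapphire Rapids I would expect BF16/INT8 capabilities backed by AMX.
import openvino as ov

core = ov.Core()
print(core.get_property("CPU", "FULL_DEVICE_NAME"))
print(core.get_property("CPU", "OPTIMIZATION_CAPABILITIES"))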


I'd like to run an LLM like LLaMA2 from Meta on the model server. I've tried following https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md, but as documented in https://github.com/openvinotoolkit/model_server/issues/2218, I am unable to get it working.

 

I was able to get int8 quantization in the demo by running:

pip install nncf
python3 download_model.py

How could I get int4?
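
To frame what I am asking for: my understanding is that int4 here means weight-only compression of the exported IR with NNCF, roughly along these lines (assuming an NNCF release that exposes the int4 modes; the group_size and ratio values are guesses, not recommendations):

# Sketch: int4 weight-only compression of an exported IR with NNCF.
# Assumes a recent NNCF release with the INT4 modes; file paths are placeholders.
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("llama-2-7b-chat-ov/openvino_model.xml")

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,  # fraction of weights compressed to int4, the rest stay int8
)

ov.save_model(compressed, "llama-2-7b-chat-ov-int4/openvino_model.xml")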

In summary:

- How can I deploy an int4-optimized LLaMA2 model to OpenVINO Model Server?

- What do I need to do to get the demo working? Might someone be able to update it?

 

3 Replies
Hairul_Intel
Moderator
1,353 Views

Hi cphoward,

Thank you for reaching out to us.

 

We're investigating this issue and will update you on any findings as soon as possible.

 

 

Regards,

Hairul


cphoward
Beginner
1,328 Views

Hi Hairul_Intel,

Thank you for looking into this for me. I have not been able to get the chat demonstration working, but I have gotten other demonstrations working. Hopefully that rules out model server configuration issues on my side.

Hairul_Intel
Moderator
1,117 Views

Hi cphoward,

As mentioned in the GitHub thread, the demo that you're referring to was recently removed.

 

As suggested in the GitHub thread, you can check the new version of the demo, which uses the new MediaPipe Python calculator feature that makes it easier to serve Llama: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation
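
For reference, querying a served pipeline over the KServe gRPC API generally looks like the sketch below; the endpoint name, tensor names, and port are placeholders, so please refer to the demo's own client script for the actual values:

# Sketch: generic KServe gRPC client for an OVMS-served pipeline.
# "llm", "prompt", "completion" and the port are placeholders; the demo's
# client script is the reference for the real names it uses.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:9000")

prompt = np.array([b"What is OpenVINO?"], dtype=np.object_)
infer_input = grpcclient.InferInput("prompt", [1], "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llm", inputs=[infer_input])
print(result.as_numpy("completion"))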

 

This thread will no longer be monitored since we have provided the information. If you need any additional information from Intel, please submit a new question.

 

 

Regards,

Hairul

 

 

