I am interested in deploying the int4 quantized models generated with https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/llm/runtime/graph#how-to-use-python-script on OpenVINO Model Server (https://github.com/openvinotoolkit/model_server).
I am not sure whether OpenVINO Model Server is the right tool for the job, but it seems likely, given that AMX is supported on the OpenVINO platform.
I'd like to run an LLM like LLaMA2 from Meta on the model server. I've tried following https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md, but as documented in https://github.com/openvinotoolkit/model_server/issues/2218, I am unable to get it working.
I was able to get int8 quantization working in the demo by running:
pip install nncf
python3 download_model.py
How could I get int4?
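For reference, I was expecting the int4 path to look something like the following NNCF weight-compression sketch. The model paths and parameters here are placeholders of my own, and I have not confirmed this is the intended workflow:

# Hypothetical sketch only: paths and parameters below are placeholders.
# Assumes an OpenVINO IR of the model already exists and an NNCF release
# that supports 4-bit weight compression via nncf.compress_weights.
import openvino as ov
import nncf

core = ov.Core()
model = core.read_model("llama-2-7b-chat/openvino_model.xml")  # placeholder path

# INT4_SYM compresses most weight matrices to 4-bit symmetric quantization;
# `ratio` keeps a fraction of the weights at int8 to limit accuracy loss.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
)

ov.save_model(compressed, "llama-2-7b-chat-int4/openvino_model.xml")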
In summary:
- How can I deploy an int4-optimized LLaMA2 model to OpenVINO Model Server?
- What do I need to do to get the demo working? Might someone be able to update it?
Hi cphoward,
Thank you for reaching out to us.
We're investigating this issue and will update you on any findings as soon as possible.
Regards,
Hairul
Hi Hairul_Intel,
Thank you for looking into this for me. I have not been able to get the chat demonstration working, but I have gotten other demonstrations working. Hopefully that rules out model server configuration issues on my side.
Hi cphoward,
As mentioned in the GitHub thread, the demo you're referring to was recently removed.
As suggested there, you can check the new version, which uses the new MediaPipe Python calculator feature that makes it easier to serve LLaMA: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation
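As a rough illustration only (the endpoint, model name, and tensor names below are assumptions and not taken from the demo; please refer to the demo's client script and README for the exact values), a gRPC request to the served pipeline could look roughly like this:

# Illustrative only: "python_model", "text" and "OUTPUT" are placeholder names,
# not the demo's actual model or tensor names.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:9000")

# String inputs are sent as a BYTES tensor holding the prompt text.
prompt = np.array(["What is OpenVINO?"], dtype=object)
infer_input = grpcclient.InferInput("text", list(prompt.shape), "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="python_model", inputs=[infer_input])
print(result.as_numpy("OUTPUT"))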
This thread will no longer be monitored since we have provided the requested information. If you need any additional information from Intel, please submit a new question.
Regards,
Hairul
