Hello,
We have followed the instructions in the link below to run Llama2-7b, and we have also tried the nightly version, but with both approaches we hit the same error: "File not found: openvino_tokenizer.xml". When we try to install openvino_tokenizers we get a lot of errors. Could you please let us know if there is another way to run a Llama model with openvino_genai?
https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html
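For reference, the pipeline code we are running follows that guide and looks roughly like this (the model directory below is a placeholder for our exported Llama2-7b folder):

import openvino_genai

# The pipeline loads openvino_model.xml together with openvino_tokenizer.xml and
# openvino_detokenizer.xml from the same directory; this is where it fails for us.
pipe = openvino_genai.LLMPipeline("Llama-2-7b-chat-hf-ov", "NPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))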
Thanks
Hi Shravanthi,
Thanks for reaching out. Can you share a screenshot of your TinyLlama directory? Are the openvino_tokenizer files (.xml and .bin) available in the directory? I exported TinyLlama myself, and the files are not available on my end either. When exporting LLM models, the directory should include the openvino_tokenizer files. Below are the files produced when exporting the mistral-7b-instruct-v0.1-int8-ov model:
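[screenshot: directory listing of the exported model] Illustratively, a correctly exported model directory should contain roughly these files (matching the TinyLlama listing shown later in this thread):

config.json
generation_config.json
openvino_detokenizer.bin
openvino_detokenizer.xml
openvino_model.bin
openvino_model.xml
openvino_tokenizer.bin
openvino_tokenizer.xml
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model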
Regards,
Aznie
Hi Aznie,
We are facing the same issue: when we export the TinyLlama and Llama2-7b models, the openvino_tokenizer and openvino_detokenizer files are not present in the output directory. We tried using the tokenizer files from another Llama2-7b source, but then we get the error "Cannot create SpecialTokensSplit layer". Below is a screenshot of the error.
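In case it is relevant, here is how we checked our installed package versions (a minimal sketch; as far as we understand, SpecialTokensSplit is a custom operation provided by the openvino_tokenizers package, so a version mismatch between the tokenizer IR and the installed package could cause this error):

import importlib.metadata as m

# All three packages should come from the same OpenVINO release.
for pkg in ("openvino", "openvino-genai", "openvino-tokenizers"):
    print(pkg, m.version(pkg))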
Thanks
Hi Shravanthi,
We are checking this with the development team and will get back to you soon.
Regards,
Aznie
Hi Aznie,
Do you have any update on this?
Thanks
Shravanthi
Hello Shravanthi,
Your case is currently with me. I have opened an issue with the OpenVINO developers to discuss the details. I will get back to you as soon as I know more.
Hello Shravanthi,
I haven't received a response from the developers yet. Please bear with us.
Our developer was able to replicate your case; here's the output:
(venv20245) apaniuko@IRL-ODT-08:~/python/openvino_tokenizers/benchmark$ optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --group-size 128 --ratio 1.0 TinyLlama
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
/home/apaniuko/python/openvino_tokenizers/benchmark/venv20245/lib/python3.10/site-packages/transformers/cache_utils.py:458: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
or len(self.key_cache[layer_idx]) == 0 # the layer has no cache
/home/apaniuko/python/openvino_tokenizers/benchmark/venv20245/lib/python3.10/site-packages/optimum/exporters/openvino/model_patcher.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
/home/apaniuko/python/openvino_tokenizers/benchmark/venv20245/lib/python3.10/site-packages/transformers/cache_utils.py:443: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
elif len(self.key_cache[layer_idx]) == 0: # fills previously skipped layers; checking for tensor causes errors
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym │ 12% (2 / 156) │ 0% (0 / 154) │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_sym │ 88% (154 / 156) │ 100% (154 / 154) │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 0:00:40 • 0:00:00
(venv20245) apaniuko@IRL-ODT-08:~/python/openvino_tokenizers/benchmark$ ls TinyLlama/
config.json openvino_detokenizer.bin openvino_model.bin openvino_tokenizer.bin special_tokens_map.json tokenizer.json
generation_config.json openvino_detokenizer.xml openvino_model.xml openvino_tokenizer.xml tokenizer_config.json tokenizer.model
We suspect this is an environment issue. Could you please check your packages? Here's the list from our side (a minimal install command for the OpenVINO-related pins follows the list):
about-time==4.2.1
aiohappyeyeballs==2.4.3
aiohttp==3.11.8
aiosignal==1.3.1
alive-progress==3.2.0
async-timeout==5.0.1
attrs==24.2.0
autograd==1.7.0
certifi==2024.8.30
charset-normalizer==3.4.0
cma==3.2.2
coloredlogs==15.0.1
contourpy==1.3.1
cycler==0.12.1
datasets==3.1.0
Deprecated==1.2.15
dill==0.3.8
filelock==3.16.1
fonttools==4.55.0
frozenlist==1.5.0
fsspec==2024.9.0
grapheme==0.6.0
huggingface-hub==0.26.2
humanfriendly==10.0
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jstyleson==0.0.2
kiwisolver==1.4.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
natsort==8.4.0
networkx==3.3
ninja==1.11.1.2
nncf==2.14.0
numpy==2.1.3
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
onnx==1.17.0
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel==1.20.1
packaging==24.2
pandas==2.2.3
pillow==11.0.0
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
pyarrow==18.1.0
pydot==2.0.0
Pygments==2.18.0
pymoo==0.6.1.3
pyparsing==3.2.0
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.21.0
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentencepiece==0.2.0
six==1.16.0
sympy==1.13.1
tabulate==0.9.0
threadpoolctl==3.5.0
tokenizers==0.20.3
torch==2.5.1
tqdm==4.67.1
transformers==4.46.3
triton==3.1.0
typing_extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
wrapt==1.17.0
xxhash==3.5.0
yarl==1.18.0
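If it helps, the OpenVINO-related pins from the list above can be installed directly (versions copied from the list; most of the remaining packages are pulled in as dependencies):

pip install openvino==2024.5.0 openvino-genai==2024.5.0.0 openvino-tokenizers==2024.5.0.0 optimum==1.23.3 optimum-intel==1.20.1 nncf==2.14.0 transformers==4.46.3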
You could also try converting the tokenizers separately with this command:
(venv20245) apaniuko@IRL-ODT-08:~/python/openvino_tokenizers/benchmark$ convert_tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --with-detokenizer --left-padding -o TinyLlama/
Loading Huggingface Tokenizer...
Converting Huggingface Tokenizer to OpenVINO...
Saved OpenVINO Tokenizer: TinyLlama/openvino_tokenizer.xml, TinyLlama/openvino_tokenizer.bin
Saved OpenVINO Detokenizer: TinyLlama/openvino_detokenizer.xml, TinyLlama/openvino_detokenizer.bin
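Once the tokenizer files are in place, a quick way to sanity-check them is to load the tokenizer model directly (a minimal sketch; importing openvino_tokenizers registers the custom tokenizer operations, including SpecialTokensSplit, with OpenVINO):

import openvino as ov
import openvino_tokenizers  # registers the tokenizer custom operations

core = ov.Core()
tokenizer = core.read_model("TinyLlama/openvino_tokenizer.xml")
compiled = core.compile_model(tokenizer, "CPU")
print("Tokenizer compiled successfully")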