Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

pyannote/speaker-diarization-3.1

handsome
Beginner

This issue has been troubling me for quite some time, and I'm curious why this model won't run on my iGPU through OpenVINO (while it runs fine on CPU).

I've followed a more traditional approach: converting to ONNX first and then to IR files. Since Intel's collaboration with Hugging Face focuses primarily on transformer-type models, I haven't found OpenVINO versions of the pyannote models available for direct download. I'm eager to learn whether there are better suggested methods.

Here's my code:

from pyannote.audio import Pipeline
import numpy as np
import openvino as ov
import torch

core = ov.Core()
onnx_path = "pyannote_segmentation.onnx"  # where the exported ONNX model is written

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_aaaaaaaaa")

# Export the segmentation model to ONNX with dynamic batch and wave-length axes
torch.onnx.export(pipeline._segmentation.model, torch.zeros((1, 1, 80000)), onnx_path,
                  input_names=["chunks"], output_names=["outputs"],
                  dynamic_axes={"chunks": {0: "batch_size", 2: "wave_len"}})

# Convert the ONNX model to OpenVINO IR and compile it for the iGPU
ov_speaker_segmentation = ov.convert_model(onnx_path)
device = "GPU.0"  # this is the iGPU
ov_seg_model = core.compile_model(ov_speaker_segmentation, device)
ov_seg_out = ov_seg_model.output(0)

def infer_segm(chunks: torch.Tensor) -> np.ndarray:
    """
    Run speaker segmentation inference using OpenVINO.
    Parameters:
        chunks (torch.Tensor): input audio chunks
    Returns:
        segments (np.ndarray)
    """
    res = ov_seg_model(chunks)
    return res[ov_seg_out]

# Swap the pipeline's PyTorch segmentation inference for the OpenVINO one
pipeline._segmentation.infer = infer_segm
diarization = pipeline(audio_file_path)  # audio_file_path: path to the input audio

I've also observed two things. With OpenVINO 2024.0.0, it throws an error during inference with the following message:

RuntimeError: Exception from src\inference\src\cpp\infer_request.cpp:223:
Check 'TRShape::merge_into(output_shape, in_copy)' failed at src\core\shape_inference\include\concat_shape_inference.hpp:49:
While validating node 'opset1::Concat lstmcell:LSTMCell_18558_inputConcat () -> ()' with friendly_name 'lstmcell:LSTMCell_18558_inputConcat':
Shape inference input shapes {[32,60],[1,128]}
Argument shapes are inconsistent; they must have the same rank, and must have equal dimension everywhere except on the concatenation axis (axis 1).

And with OpenVINO 2023.2.0, it throws an error during compilation with the following message:

Traceback (most recent call last):
  File "C:\Users\handsome\Desktop\OpenVINO\Speaker_recognition\OpenVINO_speaker_recognition.py", line 135, in <module>
    ov_seg_model = core.compile_model(ov_speaker_segmentation, device)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\handsome\.conda\envs\OpenVINO_pyannote_audio3\Lib\site-packages\openvino\runtime\ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\core.cpp:116:
[ GENERAL_ERROR ] get_shape was called on a descriptor::Tensor with dynamic shape

It seems to be a shape issue, but I'm unsure how to address it or where to find documentation on modifying the shapes. For example, where can I reshape the input?

Thanks, everyone.

Aznie_Intel
Moderator

Hi Handsome,

Thanks for reaching out.

There is a speaker diarization tutorial in Jupyter Notebook that might serve as a reference. As of now, pyannote-related models are not available for direct download.

Regarding the errors: please make sure you use a model IR with the matching OpenVINO runtime, so if your IR is from a 2023 release, run inference with OpenVINO 2023. The second error indicates that you are using a dynamic-shape model; currently, OpenVINO only supports static shapes when running inference on Intel GPUs. You may also refer to Optimum Inference with OpenVINO.
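
For example, one way to satisfy the static-shape requirement is to fix the input dimensions with Model.reshape before compiling. This is a minimal sketch, not a verified recipe; the "chunks" input name, the ONNX file name, and the [1, 1, 80000] shape are assumed from the export code in your post:

import openvino as ov

core = ov.Core()
ov_speaker_segmentation = ov.convert_model("pyannote_segmentation.onnx")

# Pin the dynamic batch and wave-length axes to concrete values so the GPU
# plugin sees a fully static model; [1, 1, 80000] matches the dummy input
# used in the ONNX export.
ov_speaker_segmentation.reshape({"chunks": ov.PartialShape([1, 1, 80000])})

ov_seg_model = core.compile_model(ov_speaker_segmentation, "GPU.0")

With a fixed shape, every chunk passed to the compiled model must match it exactly, so the inference wrapper may need to pad or batch its inputs accordingly; alternatively, exporting to ONNX without dynamic_axes yields a static model from the start.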

Regards,

Aznie


Aznie_Intel
Moderator

Hi Handsome,


This thread will no longer be monitored since we have provided information. If you need any additional information from Intel, please submit a new question.



Regards,

Aznie

