msdd_model.diarize() RuntimeError: shape '[138, 50, 16, 192]' is invalid for input of size 84787200 #130

Open
Ko8rah opened this issue Nov 24, 2023 · 4 comments

Comments

@Ko8rah

Ko8rah commented Nov 24, 2023

Hello,

I have an issue while running the notebook with the msdd_model.diarize() method:

[NeMo I 2023-11-24 10:01:36 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-11-24 10:01:36 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-11-24 10:01:36 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2023-11-24 10:01:36 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-11-24 10:01:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2023-11-24 10:01:38 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2023-11-24 10:01:38 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false
    
[NeMo I 2023-11-24 10:01:38 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:38 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:39 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-11-24 10:01:39 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:40 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-11-24 10:01:40 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-11-24 10:01:40 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2023-11-24 10:01:40 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-11-24 10:01:40 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true
    
[NeMo W 2023-11-24 10:01:40 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true
    
[NeMo W 2023-11-24 10:01:40 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0
    
[NeMo I 2023-11-24 10:01:40 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:40 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-11-24 10:01:40 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1, 1]
[NeMo I 2023-11-24 10:01:40 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo W 2023-11-24 10:01:40 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-11-24 10:01:40 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-11-24 10:01:40 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  1.88it/s][NeMo I 2023-11-24 10:01:41 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2023-11-24 10:01:41 classification_models:272] Perform streaming frame-level VAD
[NeMo I 2023-11-24 10:01:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:41 collections:302] Dataset loaded with 12 items, total duration of  0.16 hours.
[NeMo I 2023-11-24 10:01:41 collections:304] # 12 files loaded accounting to # 1 labels

vad: 100%|██████████| 12/12 [00:04<00:00,  2.46it/s][NeMo I 2023-11-24 10:01:46 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.

creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.43it/s][NeMo I 2023-11-24 10:01:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2023-11-24 10:01:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:46 collections:302] Dataset loaded with 381 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:46 collections:304] # 381 files loaded accounting to # 1 labels

[1/6] extract embeddings: 100%|██████████| 6/6 [00:01<00:00,  3.13it/s][NeMo I 2023-11-24 10:01:48 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:48 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2023-11-24 10:01:48 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:48 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:48 collections:302] Dataset loaded with 457 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:48 collections:304] # 457 files loaded accounting to # 1 labels

[2/6] extract embeddings: 100%|██████████| 8/8 [00:02<00:00,  3.86it/s][NeMo I 2023-11-24 10:01:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2023-11-24 10:01:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:50 collections:302] Dataset loaded with 574 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:50 collections:304] # 574 files loaded accounting to # 1 labels

[3/6] extract embeddings: 100%|██████████| 9/9 [00:02<00:00,  3.30it/s][NeMo I 2023-11-24 10:01:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2023-11-24 10:01:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:53 collections:302] Dataset loaded with 764 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:53 collections:304] # 764 files loaded accounting to # 1 labels

[4/6] extract embeddings: 100%|██████████| 12/12 [00:03<00:00,  3.92it/s][NeMo I 2023-11-24 10:01:56 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:56 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2023-11-24 10:01:56 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:56 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:56 collections:302] Dataset loaded with 1148 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:56 collections:304] # 1148 files loaded accounting to # 1 labels

[5/6] extract embeddings: 100%|██████████| 18/18 [00:03<00:00,  4.69it/s][NeMo I 2023-11-24 10:02:00 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:02:00 clustering_diarizer:287] Subsegmentation for embedding extraction: scale5, /content/temp_outputs/speaker_outputs/subsegments_scale5.json
[NeMo I 2023-11-24 10:02:00 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:02:00 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:02:00 collections:302] Dataset loaded with 2296 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:02:00 collections:304] # 2296 files loaded accounting to # 1 labels

[6/6] extract embeddings: 100%|██████████| 36/36 [00:05<00:00,  6.98it/s]
[NeMo I 2023-11-24 10:02:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
clustering: 100%|██████████| 1/1 [00:00<00:00,  1.37it/s][NeMo I 2023-11-24 10:02:06 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory

[NeMo W 2023-11-24 10:02:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:5 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale5_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale5_cluster.label
[NeMo I 2023-11-24 10:02:07 collections:617] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-11-24 10:02:07 collections:620] Total 3 session files loaded accounting to # 3 audio clips
  0%|          | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-8cafa8c83657> in <cell line: 3>()
      1 # Initialize NeMo MSDD diarization model
      2 msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cuda")
----> 3 msdd_model.diarize()
      4 
      5 del msdd_model

12 frames
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/modules/msdd_diarizer.py in conv_forward(self, conv_input, conv_module, bn_module, first_layer)
    417         conv_out = conv_module(conv_input)
    418         conv_out = conv_out.permute(0, 2, 1, 3) if not first_layer else conv_out
--> 419         conv_out = conv_out.reshape(self.batch_size, self.length, self.cnn_output_ch, self.emb_dim)
    420         conv_out = conv_out.unsqueeze(2).flatten(0, 1)
    421         conv_out = bn_module(conv_out.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)

RuntimeError: shape '[138, 50, 16, 192]' is invalid for input of size 84787200

Do you have any hints on how to solve this issue?
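
For anyone reading the traceback above, a quick arithmetic check may help. This is only a sketch using the two numbers printed in the RuntimeError; it does not call NeMo. It compares the element count implied by the target shape with the reported input size. A whole-number ratio would suggest that one dimension the model assumes (batch size, length, channels, or embedding size) differs from what the tensor actually holds, though it cannot say which one.

```python
import math

# Both values are copied from the RuntimeError message in the traceback.
target_shape = (138, 50, 16, 192)  # (batch_size, length, cnn_output_ch, emb_dim)
input_size = 84_787_200            # number of elements in the tensor being reshaped

# reshape() needs the product of the target dimensions to equal the element count.
expected = math.prod(target_shape)
ratio = input_size / expected

print(f"elements implied by target shape: {expected}")
print(f"elements actually present:        {input_size}")
print(f"ratio (present / expected):       {ratio}")
```

Here the tensor holds exactly four times as many elements as the reshape expects, which points to a mismatch between the dimensions the model was built with and the data it was fed, rather than to corrupted audio.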

@MahmoudAshraf97
Owner

Can you upload the audio you are using so I can reproduce it?

@Ko8rah
Author

Ko8rah commented Nov 24, 2023

Yes, of course. Thank you for the quick response.
The audio is a French podcast, and I'm running the notebook on Colab with the free T4 GPU.

podcast.mp3.zip

@mjsteele12

@Ko8rah did you ever find a solution? I am having the same problem.

@MahmoudAshraf97
Owner

The problem is caused by using the meeting or general config. Neither is supported for now, so stick to telephonic.
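
One way to avoid hitting this by accident is to fail early when a non-telephonic domain is selected. The snippet below is a minimal sketch, not the notebook's actual code. The helper `config_file_name`, the `SUPPORTED_DOMAINS` set, and the `diar_infer_<domain>.yaml` naming pattern are assumptions made for illustration; check them against the `create_config` function you are running.

```python
# Hypothetical guard; the names below are illustrative and not part of NeMo or the notebook.
SUPPORTED_DOMAINS = {"telephonic"}  # per the maintainer's comment above


def config_file_name(domain_type: str) -> str:
    """Return the inference config file name for a domain, rejecting unsupported ones."""
    if domain_type not in SUPPORTED_DOMAINS:
        raise ValueError(
            f"Domain '{domain_type}' is not supported for now; "
            f"use one of {sorted(SUPPORTED_DOMAINS)}."
        )
    # Assumed naming pattern for the diarization inference configs.
    return f"diar_infer_{domain_type}.yaml"


print(config_file_name("telephonic"))
```

Calling `config_file_name("meeting")` or `config_file_name("general")` raises a `ValueError` immediately, instead of failing later inside `msdd_model.diarize()` with a reshape error.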
