whisper-large-v3 (in demo code) VS whisper-large-v2 (in kaggle notebook) #193

Binah-CS · 2024-05-12T23:02:25Z

@sanchit-gandhi - the demo uses the v3 checkpoint, but the Kaggle notebook and the README both reference v2.
In our tests, v3 gives significantly better output than v2 on our test audio files, but when I update the notebook to v3, the pipeline fails while loading the processor, before anything is compiled (v2 loaded and compiled just fine). Can you point me in the right direction?

ValueError Traceback (most recent call last)
Cell In[7], line 4
1 from whisper_jax import FlaxWhisperPipline
2 import jax.numpy as jnp
----> 4 pipeline = FlaxWhisperPipline("openai/whisper-large-v3", dtype=jnp.bfloat16, batch_size=16)

File /usr/local/lib/python3.8/site-packages/whisper_jax/pipeline.py:82, in FlaxWhisperPipline.__init__(self, checkpoint, dtype, batch_size, max_length)
79 self.checkpoint = checkpoint
80 self.dtype = dtype
---> 82 self.processor = WhisperProcessor.from_pretrained(self.checkpoint)
83 self.feature_extractor = self.processor.feature_extractor
84 # potentially load fast tokenizer if available

File /usr/local/lib/python3.8/site-packages/transformers/processing_utils.py:226, in ProcessorMixin.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, **kwargs)
223 if token is not None:
224 kwargs["token"] = token
--> 226 args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
227 return cls(*args)

File /usr/local/lib/python3.8/site-packages/transformers/processing_utils.py:270, in ProcessorMixin._get_arguments_from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
267 else:
268 attribute_class = getattr(transformers_module, class_name)
--> 270 args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
271 return args

File /usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
1851 else:
1852 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
1855 resolved_vocab_files,
1856 pretrained_model_name_or_path,
1857 init_configuration,
1858 *init_inputs,
1859 token=token,
1860 cache_dir=cache_dir,
1861 local_files_only=local_files_only,
1862 _commit_hash=commit_hash,
1863 _is_local=is_local,
1864 **kwargs,
1865 )

File /usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:2066, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
2062 current_index = len(tokenizer) + len(tokens)
2063 if has_tokenizer_file and index != current_index and tokenizer.convert_tokens_to_ids(token) != index:
2064 # Tokenizer fast: added token needs to either be in the vocabulary with the proper index or the
2065 # index is the current length of the tokenizer (not in vocabulary)
-> 2066 raise ValueError(
2067 f"Wrong index found for {token}: should be {tokenizer.convert_tokens_to_ids(token)} but found "
2068 f"{index}."
2069 )
2070 elif not has_tokenizer_file and index != current_index:
2071 # Tokenizer slow: added token cannot already be in the vocabulary so its index needs to be the
2072 # current length of the tokenizer.
2073 raise ValueError(
2074 f"Non-consecutive added token '{token}' found. "
2075 f"Should have index {current_index} but has index {index} in saved vocabulary."
2076 )

ValueError: Wrong index found for <|0.02|>: should be None but found 50366.

Thanks
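
For anyone reading the traceback, here is a minimal sketch of the index check that appears to raise the error. It is a simplified paraphrase of the lines shown at `tokenization_utils_base.py:2062-2076`, not the actual transformers code: the function `check_added_token` and the `toy_vocab` dictionary are hypothetical names made up for illustration, and a plain dict stands in for the tokenizer. It suggests that "should be None" means the loaded vocabulary does not contain `<|0.02|>` at all, while the saved files expect it at index 50366.

```python
from typing import Dict, Optional


def check_added_token(
    vocab: Dict[str, int],
    token: str,
    index: int,
    has_tokenizer_file: bool = True,
) -> None:
    """Simplified stand-in for the added-token index check in the traceback."""
    # Next free id if the token were appended to the vocabulary.
    current_index = len(vocab)
    # None when the token is unknown, mirroring "should be None" in the error.
    known_index: Optional[int] = vocab.get(token)

    if has_tokenizer_file and index != current_index and known_index != index:
        # Fast-tokenizer branch: the saved index must match the vocabulary
        # entry or be exactly the next free id.
        raise ValueError(
            f"Wrong index found for {token}: should be {known_index} but found {index}."
        )
    elif not has_tokenizer_file and index != current_index:
        # Slow-tokenizer branch: the saved index must be the next free id.
        raise ValueError(
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {current_index} but has index {index} in saved vocabulary."
        )


# Hypothetical three-entry vocabulary that lacks the timestamp token.
toy_vocab = {"<|startoftranscript|>": 0, "<|en|>": 1, "<|transcribe|>": 2}

try:
    check_added_token(toy_vocab, "<|0.02|>", 50366)
except ValueError as err:
    print(err)
```

With the toy vocabulary the check fails the same way as in the notebook, because the token is neither present in the vocabulary nor placed at the next free index. Passing `index=3` instead would succeed, since 3 equals the vocabulary length.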
