whisper-large-v3 (in demo code) VS whisper-large-v2 (in kaggle notebook) #193

Open
Binah-CS opened this issue May 12, 2024 · 0 comments
@sanchit-gandhi - the demo uses the v3 model (whisper-large-v3), but the Kaggle notebook and the README both reference v2.
In our tests, v3 gives significantly better output than v2 on our test audio files, but when I update the notebook to v3 it doesn't even get as far as compiling: constructing the pipeline fails with the traceback below (v2 compiled just fine). Can you point me in the right direction?


ValueError Traceback (most recent call last)
Cell In[7], line 4
1 from whisper_jax import FlaxWhisperPipline
2 import jax.numpy as jnp
----> 4 pipeline = FlaxWhisperPipline("openai/whisper-large-v3", dtype=jnp.bfloat16, batch_size=16)

File /usr/local/lib/python3.8/site-packages/whisper_jax/pipeline.py:82, in FlaxWhisperPipline.__init__(self, checkpoint, dtype, batch_size, max_length)
79 self.checkpoint = checkpoint
80 self.dtype = dtype
---> 82 self.processor = WhisperProcessor.from_pretrained(self.checkpoint)
83 self.feature_extractor = self.processor.feature_extractor
84 # potentially load fast tokenizer if available

File /usr/local/lib/python3.8/site-packages/transformers/processing_utils.py:226, in ProcessorMixin.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, **kwargs)
223 if token is not None:
224 kwargs["token"] = token
--> 226 args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
227 return cls(*args)

File /usr/local/lib/python3.8/site-packages/transformers/processing_utils.py:270, in ProcessorMixin._get_arguments_from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
267 else:
268 attribute_class = getattr(transformers_module, class_name)
--> 270 args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
271 return args

File /usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
1851 else:
1852 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
1855 resolved_vocab_files,
1856 pretrained_model_name_or_path,
1857 init_configuration,
1858 *init_inputs,
1859 token=token,
1860 cache_dir=cache_dir,
1861 local_files_only=local_files_only,
1862 _commit_hash=commit_hash,
1863 _is_local=is_local,
1864 **kwargs,
1865 )

File /usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:2066, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
2062 current_index = len(tokenizer) + len(tokens)
2063 if has_tokenizer_file and index != current_index and tokenizer.convert_tokens_to_ids(token) != index:
2064 # Tokenizer fast: added token needs to either be in the vocabulary with the proper index or the
2065 # index is the current length of the tokenizer (not in vocabulary)
-> 2066 raise ValueError(
2067 f"Wrong index found for {token}: should be {tokenizer.convert_tokens_to_ids(token)} but found "
2068 f"{index}."
2069 )
2070 elif not has_tokenizer_file and index != current_index:
2071 # Tokenizer slow: added token cannot already be in the vocabulary so its index needs to be the
2072 # current length of the tokenizer.
2073 raise ValueError(
2074 f"Non-consecutive added token '{token}' found. "
2075 f"Should have index {current_index} but has index {index} in saved vocabulary."
2076 )

ValueError: Wrong index found for <|0.02|>: should be None but found 50366.

Thanks
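
To make the final error message easier to read, here is a toy re-creation of the index check that appears in the traceback (lines 2062 to 2076 of `tokenization_utils_base.py`). It is only a sketch: `check_added_token`, `toy_vocab`, and the `pending` parameter are hypothetical names, a plain dict stands in for the real tokenizer, and it does not confirm what is actually wrong in this environment. It only shows how a message of the form "should be None but found N" can arise: the token is missing from the loaded vocabulary, and the saved index is not the next free slot.

```python
def check_added_token(vocab, token, index, pending=0, has_tokenizer_file=True):
    """Toy stand-in for the added-token index check shown in the traceback."""
    # Index the token would receive if it were appended to the vocabulary.
    current_index = len(vocab) + pending
    # A dict lookup plays the role of tokenizer.convert_tokens_to_ids here;
    # it yields None when the token is not in the vocabulary.
    known_index = vocab.get(token)
    if has_tokenizer_file and index != current_index and known_index != index:
        raise ValueError(
            f"Wrong index found for {token}: should be {known_index} but found {index}."
        )
    elif not has_tokenizer_file and index != current_index:
        raise ValueError(
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {current_index} but has index {index} in saved vocabulary."
        )
    return index


# Hypothetical three-entry vocabulary, so the next free index is 3.
toy_vocab = {"a": 0, "b": 1, "c": 2}

# Accepted: the token is new and its saved index is the next free slot.
check_added_token(toy_vocab, "<|0.02|>", 3)

# Accepted: the token already exists at exactly the saved index.
check_added_token(toy_vocab, "b", 1)

# Rejected: the token is unknown (lookup gives None) and 7 is not the next slot.
try:
    check_added_token(toy_vocab, "<|0.02|>", 7)
except ValueError as err:
    print(err)
```

Read this way, the reported message says that the tokenizer being loaded does not know `<|0.02|>` at all, while the saved added-token entry places it at index 50366, which is not the next free index either.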
