Regression bug: NonMatchingSplitsSizesError for (possibly) overwritten dataset #6896

Open
finiteautomata opened this issue May 13, 2024 · 0 comments

Describe the bug

While trying to load the dataset https://huggingface.co/datasets/pysentimiento/spanish-tweets-small, I get this error:

---------------------------------------------------------------------------
NonMatchingSplitsSizesError               Traceback (most recent call last)
<ipython-input-1-d6a3c721d3b8> in <cell line: 3>()
      1 from datasets import load_dataset
      2 
----> 3 ds = load_dataset("pysentimiento/spanish-tweets-small")

3 frames
/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2150 
   2151     # Download and prepare data
-> 2152     builder_instance.download_and_prepare(
   2153         download_config=download_config,
   2154         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    946                         if num_proc is not None:
    947                             prepare_split_kwargs["num_proc"] = num_proc
--> 948                         self._download_and_prepare(
    949                             dl_manager=dl_manager,
    950                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1059 
   1060         if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:
-> 1061             verify_splits(self.info.splits, split_dict)
   1062 
   1063         # Update the info object with the splits.

/usr/local/lib/python3.10/dist-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
     98     ]
     99     if len(bad_splits) > 0:
--> 100         raise NonMatchingSplitsSizesError(str(bad_splits))
    101     logger.info("All the splits matched successfully.")
    102 

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=82649695458, num_examples=597433111, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=3358310095, num_examples=24898932, shard_lengths=[3626991, 3716991, 4036990, 3506990, 3676990, 3716990, 2616990], dataset_name='spanish-tweets-small')}]

I think I updated this dataset at some point, so this might be related to #6271.

It works fine as late as 2.10.0, but fails from 2.13.0 onwards.
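
For reference, here is a minimal sketch of what the failing check appears to do, based only on the traceback above. It is not the library's code: `SplitInfo`, `NonMatchingSplitsSizesError`, and `verify_splits` below are simplified stand-ins that reuse the names from the traceback, and comparing only `num_examples` is an assumption. The numbers are the ones from the error message.

```python
from dataclasses import dataclass


@dataclass
class SplitInfo:
    # Simplified stand-in for the SplitInfo shown in the error message.
    name: str
    num_examples: int


class NonMatchingSplitsSizesError(ValueError):
    # Stand-in for the exception named in the traceback.
    pass


def verify_splits(expected_splits, recorded_splits):
    # Collect every split whose recorded size differs from the expected one.
    # Assumption: only num_examples is compared in this sketch.
    bad_splits = [
        {"expected": expected_splits[name], "recorded": recorded_splits[name]}
        for name in expected_splits
        if name in recorded_splits
        and expected_splits[name].num_examples != recorded_splits[name].num_examples
    ]
    if len(bad_splits) > 0:
        raise NonMatchingSplitsSizesError(str(bad_splits))


# "expected" is what the stored dataset metadata claims;
# "recorded" is what was actually counted while preparing the data.
expected = {"train": SplitInfo("train", 597433111)}
shard_lengths = [3626991, 3716991, 4036990, 3506990, 3676990, 3716990, 2616990]
recorded = {"train": SplitInfo("train", sum(shard_lengths))}

try:
    verify_splits(expected, recorded)
    mismatch = None
except NonMatchingSplitsSizesError as err:
    mismatch = str(err)

print(mismatch)
```

In this sketch the recorded count is the sum of the shard lengths from the error message (24898932), while the expected count (597433111) would come from metadata that may be stale if the dataset was overwritten. The traceback also shows that the check only runs when `verification_mode` is `BASIC_CHECKS` or `ALL_CHECKS`, so a different verification mode would presumably skip it, though that would hide the mismatch rather than fix it.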

Steps to reproduce the bug

from datasets import load_dataset

ds = load_dataset("pysentimiento/spanish-tweets-small")

You can run it in this notebook

Expected behavior

The dataset should load without any error.

Environment info

  • datasets version: 2.13.0
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • PyArrow version: 14.0.2
  • Pandas version: 2.0.3