
Unpin hfh #6876

Merged
merged 5 commits into from
May 27, 2024
Conversation

@lhoestq (Member) commented May 6, 2024

Needed to use those in dataset-viewer:

close #6863

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova (Member) left a comment

As I explained in the corresponding issue, which I self-assigned (#6863), I was planning to unpin the upper bound huggingface-hub < 0.23.0 only once the huggingface/transformers#30618 fix was merged and released in a new transformers version; otherwise it breaks our CI. The fix has been merged but not released yet.

Also note that datasets 2.19.1 (the version currently installed in the dataset-viewer) does not include the huggingface-hub pin:
https://github.com/huggingface/datasets/blob/2.19.1/setup.py#L138

"huggingface-hub>=0.21.2",

If we urgently need some dev feature for dataset-viewer, I would suggest pushing the feature (cherry-picked) to a dedicated branch with 2.19.1 as its starting point (without opening a PR), and install datasets from that branch.

@lhoestq (Member, Author) commented May 7, 2024

transformers 4.40.2 was released yesterday, but I am not sure whether it contains the fix.

@albertvillanova (Member) commented May 7, 2024

@lhoestq yes, I knew transformers 4.40.2 was released yesterday, but I had checked that it does not contain the fix: it only includes 2 bug fixes. That is why our CI keeps failing in this PR. We will have to wait for the next minor version.

@albertvillanova (Member)

> If we urgently need some dev feature for dataset-viewer, I would suggest pushing the feature (cherry-picked) to a dedicated branch with 2.19.1 as its starting point (without opening a PR), and install datasets from that branch.

I have done so:

@lhoestq (Member, Author) commented May 22, 2024

hfh 0.23.1 and transformers 4.41.0 are out, let's unpin, no?

@albertvillanova (Member)

I have re-run the CI to check that it is green first.

@albertvillanova (Member) left a comment

The CI is red.

@lhoestq (Member, Author) commented May 23, 2024

The errors were coming from transformers raising FutureWarnings when loading models or tokenizers. I disabled the warnings for the transformers-related calls, since they are not related to datasets.
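The workaround can be sketched as follows (a minimal illustration with hypothetical names, not the actual test changes; `load_tokenizer_quietly` and `load_fn` are invented here):

```python
import warnings


def load_tokenizer_quietly(load_fn):
    """Call load_fn with FutureWarning suppressed.

    Hypothetical helper: the FutureWarning comes from transformers
    internals rather than datasets, so it is silenced only around the
    transformers-related call, leaving other warnings untouched.
    """
    with warnings.catch_warnings():
        # This filter is inserted in front of any existing filters
        # (including pytest's "error" escalation) and is restored on exit.
        warnings.simplefilter("ignore", FutureWarning)
        return load_fn()
```

Because `catch_warnings` restores the previous filters on exit, pytest's `error::FutureWarning` escalation still applies everywhere else in the test suite.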

@albertvillanova (Member) left a comment

Thanks for the investigation.

Do you know why a FutureWarning was transformed into an OSError? These are the raised errors:

FAILED tests/test_metric_common.py::LocalMetricTest::test_load_metric_bertscore - OSError: Can't load the configuration of 'roberta-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-large' is the correct path to a directory containing a config.json file
FAILED tests/test_fingerprint.py::TokenizersHashTest::test_hash_tokenizer - OSError: Can't load the configuration of 'bert-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'bert-base-uncased' is the correct path to a directory containing a config.json file

@albertvillanova

@lhoestq (Member, Author) commented May 24, 2024

It's because the error from the FutureWarning happened while running cached_file() from transformers, which has code that wraps the call in a try/except and re-raises it as an OSError.
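A minimal sketch of that chain (a hypothetical simplification of transformers' file-caching error handling, not the real code):

```python
import warnings


def cached_file_sketch():
    """Hypothetical stand-in for transformers' cached-file helper."""
    try:
        # Inside transformers, a hub call emits a FutureWarning (e.g. for a
        # deprecated argument). Under pytest's "error::FutureWarning" filter,
        # warnings.warn() raises the warning as an exception instead.
        warnings.warn("`resume_download` is deprecated", FutureWarning)
        return "config.json"
    except Exception as exc:
        # A broad except clause re-raises whatever went wrong as an OSError,
        # which is the error type seen in the failing tests.
        raise OSError(
            "Can't load the configuration of 'roberta-large'."
        ) from exc
```

With the default warning filters the function simply returns; with `warnings.simplefilter("error", FutureWarning)` in effect, the same call surfaces as an OSError, matching the CI failures quoted above.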

@albertvillanova (Member) left a comment

The error is raised because of our pytest settings:

datasets/pyproject.toml (lines 17 to 20 at b12a2c5):

    # Test fails if a FutureWarning is thrown by `huggingface_hub`
    filterwarnings = [
        "error::FutureWarning:huggingface_hub*",
    ]

I find it contradictory to have this setting (which turns huggingface-hub FutureWarnings into errors) while at the same time ignoring those warnings, as you implemented in 889a48d.

CC: @Wauplin, who introduced the pytest settings:

@Wauplin (Contributor) commented May 24, 2024

Opened huggingface/transformers#31007 to fix the FutureWarning in transformers. Sorry, I thought it was fixed by huggingface/transformers#30618, but that was clearly an oversight on my side.

Regarding the pytest config: yes, I remember adding it, and in general I still think it's a good idea to have it. I will be more careful next time to update transformers before a huggingface_hub release, and not the other way around (this is the first time it has happened since I set this value 😬). As a temporary fix in datasets, I would rather temporarily disable the filterwarnings setting than add filters in the test code.
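That temporary fix would amount to commenting out the escalation entry in pyproject.toml (a sketch of the change, assuming the filter lines quoted earlier in this thread):

```toml
[tool.pytest.ini_options]
filterwarnings = [
    # Temporarily disabled until transformers ships the FutureWarning fix:
    # "error::FutureWarning:huggingface_hub*",
]
```

Re-enabling the line later restores the original behavior of failing tests on any huggingface_hub FutureWarning.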

@lhoestq (Member, Author) commented May 27, 2024

Alright, I disabled the errors on FutureWarning. Do you see anything else, @albertvillanova, or can we merge?

@albertvillanova (Member) left a comment

Thanks. It looks good from my side.

@lhoestq merged commit b442aa2 into main on May 27, 2024 (12 checks passed) and deleted the unpin-hfh23 branch at 10:14.
[Benchmark bot report: updated benchmarks for PyArrow==8.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json; the auto-generated new/old timing tables are omitted.]

Successfully merging this pull request may close these issues.

Revert temporary pin huggingface-hub < 0.23.0
5 participants