
Unpin hfh #6876

Merged
merged 5 commits into from
May 27, 2024
Conversation

@lhoestq (Member) commented May 6, 2024

Needed to use those in dataset-viewer:

close #6863

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova (Member) left a comment

As I explained in the corresponding issue, which I self-assigned (#6863), I was planning to unpin the upper bound huggingface-hub < 0.23.0 only once the huggingface/transformers#30618 fix was merged and released in a new transformers version; otherwise it breaks our CI. The fix has been merged but not released yet.

Also note that datasets 2.19.1 (the version currently installed in the dataset-viewer) does not include the huggingface-hub pin:
https://github.com/huggingface/datasets/blob/2.19.1/setup.py#L138

"huggingface-hub>=0.21.2",

If we urgently need some dev feature for dataset-viewer, I would suggest pushing the feature (cherry-picked) to a dedicated branch with 2.19.1 as its starting point (without opening a PR), and install datasets from that branch.

@lhoestq (Member, Author) commented May 7, 2024

transformers 4.40.2 was released yesterday, but I am not sure whether it contains the fix.

@albertvillanova (Member) commented May 7, 2024

@lhoestq yes, I knew transformers 4.40.2 was released yesterday, but I had checked that it does not contain the fix: it only includes 2 bug fixes. That is why our CI keeps failing in this PR. We will have to wait for the next minor version.

@albertvillanova (Member)

> If we urgently need some dev feature for dataset-viewer, I would suggest pushing the feature (cherry-picked) to a dedicated branch with 2.19.1 as its starting point (without opening a PR), and install datasets from that branch.

I have done so:

@lhoestq (Member, Author) commented May 22, 2024

hfh 0.23.1 and transformers 4.41.0 are out, let's unpin, no?

@albertvillanova (Member)

I have re-run the CI to check that it is green first.

@albertvillanova (Member) left a comment

The CI is red.

@lhoestq (Member, Author) commented May 23, 2024

The errors were coming from transformers raising FutureWarnings when loading models or tokenizers. I disabled the warnings for the transformers-related calls, since they are not related to datasets.
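The workaround can be sketched as follows (a minimal illustration with hypothetical names, not the actual test changes; `load_tokenizer_quietly` and `load_fn` are invented here):

```python
import warnings


def load_tokenizer_quietly(load_fn):
    """Call load_fn with FutureWarning suppressed.

    Hypothetical helper: the FutureWarning comes from transformers
    internals rather than datasets, so it is silenced only around the
    transformers-related call, leaving other warnings untouched.
    """
    with warnings.catch_warnings():
        # This filter is inserted in front of any existing filters
        # (including pytest's "error" escalation) and is restored on exit.
        warnings.simplefilter("ignore", FutureWarning)
        return load_fn()
```

Because `catch_warnings` restores the previous filters on exit, pytest's `error::FutureWarning` escalation still applies everywhere else in the test suite.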

@albertvillanova (Member) left a comment

Thanks for the investigation.

Do you know why a FutureWarning was transformed into an OSError? These are the raised errors:

FAILED tests/test_metric_common.py::LocalMetricTest::test_load_metric_bertscore - OSError: Can't load the configuration of 'roberta-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-large' is the correct path to a directory containing a config.json file
FAILED tests/test_fingerprint.py::TokenizersHashTest::test_hash_tokenizer - OSError: Can't load the configuration of 'bert-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'bert-base-uncased' is the correct path to a directory containing a config.json file

@albertvillanova

@lhoestq (Member, Author) commented May 24, 2024

It's because the error from the FutureWarning happened while running cached_file() from transformers, which has code that wraps the call in a try/except and re-raises it as an OSError.
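A minimal sketch of that chain (a hypothetical simplification of transformers' file-caching error handling, not the real code):

```python
import warnings


def cached_file_sketch():
    """Hypothetical stand-in for transformers' cached-file helper."""
    try:
        # Inside transformers, a hub call emits a FutureWarning (e.g. for a
        # deprecated argument). Under pytest's "error::FutureWarning" filter,
        # warnings.warn() raises the warning as an exception instead.
        warnings.warn("`resume_download` is deprecated", FutureWarning)
        return "config.json"
    except Exception as exc:
        # A broad except clause re-raises whatever went wrong as an OSError,
        # which is the error type seen in the failing tests.
        raise OSError(
            "Can't load the configuration of 'roberta-large'."
        ) from exc
```

With the default warning filters the function simply returns; with `warnings.simplefilter("error", FutureWarning)` in effect, the same call surfaces as an OSError, matching the CI failures quoted above.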

@albertvillanova (Member) left a comment

The error is raised because of our pytest settings:

datasets/pyproject.toml (lines 17 to 20 at b12a2c5):

    # Test fails if a FutureWarning is thrown by `huggingface_hub`
    filterwarnings = [
        "error::FutureWarning:huggingface_hub*",
    ]

I find it contradictory to have this setting (which turns huggingface-hub FutureWarnings into errors) while at the same time ignoring those warnings, as you implemented in 889a48d.

CC: @Wauplin, who introduced the pytest settings:

@Wauplin (Contributor) commented May 24, 2024

Opened huggingface/transformers#31007 to fix the FutureWarning in transformers. Sorry, I thought it was fixed by huggingface/transformers#30618, but that was clearly an oversight on my side.

Regarding the pytest config: yes, I remember adding it, and in general I still think it's a good idea to have it. I will be more careful next time to update transformers before a huggingface_hub release, and not the other way around (this is the first time it has happened since I set this value 😬). As a temporary fix in datasets, I would rather temporarily disable the filterwarnings setting than add filters in the test code.
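That temporary fix would amount to commenting out the escalation entry in pyproject.toml (a sketch of the change, assuming the filter lines quoted earlier in this thread):

```toml
[tool.pytest.ini_options]
filterwarnings = [
    # Temporarily disabled until transformers ships the FutureWarning fix:
    # "error::FutureWarning:huggingface_hub*",
]
```

Re-enabling the line later restores the original behavior of failing tests on any huggingface_hub FutureWarning.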

@lhoestq (Member, Author) commented May 27, 2024

Alright, I disabled the errors on FutureWarning. Do you see anything else, @albertvillanova, or can we merge?

@albertvillanova (Member) left a comment

Thanks. It looks good from my side.

@lhoestq merged commit b442aa2 into main on May 27, 2024 (12 checks passed) and deleted the unpin-hfh23 branch at 10:14.
[Benchmark bot report: updated benchmarks for PyArrow==8.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json; the auto-generated new/old timing tables are omitted.]

Successfully merging this pull request may close these issues.

Revert temporary pin huggingface-hub < 0.23.0
5 participants