Support WebDataset containing file basenames with dots #6888

Closed
wants to merge 4 commits into from

Conversation

albertvillanova
Member

Support WebDataset containing file basenames with dots.

Fix #6880.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq
Member

lhoestq commented May 10, 2024

I think webdataset splits the file name and extension at the first dot, no?

https://github.com/webdataset/webdataset/blob/945b251a872ec0d337be8f9ea17a9c5b0d017ff3/webdataset/tariterators.py#L226

That link points to this function, which splits on the first dot:

import re


def base_plus_ext(path):
    """Split off all file extensions.

    Returns base, allext.

    Args:
        path: path with extensions

    Returns:
        path with all extensions removed
    """
    match = re.match(r"^((?:.*/|)[^.]+)[.]([^/]*)$", path)
    if not match:
        return None, None
    return match.group(1), match.group(2)

@lhoestq
Member

lhoestq commented May 10, 2024

So maybe the original issue is actually caused by one of the files having a dot in its file name that is not part of the extension:

>>> base_plus_ext("15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png")
('15_Cohen_1-s2', '0-S0929664620300449-gr3_lrg-b.png')
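
For contrast, here is a small sketch comparing this first-dot split with Python's `os.path.splitext`, which splits at the last dot. `split_first_dot` is a hypothetical helper that reuses the regex quoted above; it is not part of webdataset's API.

```python
import os.path
import re


def split_first_dot(path):
    # Hypothetical helper: same regex as the base_plus_ext function quoted above.
    match = re.match(r"^((?:.*/|)[^.]+)[.]([^/]*)$", path)
    if not match:
        return None, None
    return match.group(1), match.group(2)


name = "15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png"

# First-dot split: everything after the first dot is treated as the extension.
first = split_first_dot(name)
# Last-dot split: only the final ".png" is treated as the extension.
last = os.path.splitext(name)

print(first)
print(last)
```

With the first-dot rule the extension is no longer `png`, while the last-dot rule keeps the whole basename intact.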

@albertvillanova
Member Author

Thanks for your review, @lhoestq.

I was not aware that webdataset requires filenames without dots in their basenames.

@lhoestq
Member

lhoestq commented May 10, 2024

They can have dots in the extension (which becomes the column name), but not in the key used to group files into samples.
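
A minimal sketch of why a dot in the key matters, assuming samples are built by grouping files on the part before the first dot and using the rest as the column name. `group_into_samples` and the file names are hypothetical illustrations, not the actual loader code in webdataset or datasets.

```python
from collections import defaultdict


def group_into_samples(filenames):
    """Group file names into samples keyed by the text before the first dot."""
    samples = defaultdict(dict)
    for filename in filenames:
        # Everything before the first dot is the key; the rest is the column name.
        key, _, column = filename.partition(".")
        samples[key][column] = filename
    return dict(samples)


# Hypothetical archive members: one clean sample, one with a dot in its basename.
files = ["0001.png", "0001.json", "fig_1-s2.0-b.png", "fig_1-s2.0-b.json"]
samples = group_into_samples(files)

print(sorted(samples["0001"]))      # columns of the clean sample
print(sorted(samples["fig_1-s2"]))  # columns of the sample with a dotted basename
```

In this sketch the second sample ends up with columns `0-b.json` and `0-b.png` and no `png` column, which would be consistent with the `KeyError: 'png'` reported in the linked issue.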

Successfully merging this pull request may close these issues.

Webdataset: KeyError: 'png' on some datasets when streaming