
ENH: adding parquet.votable #16375

Draft
wants to merge 17 commits into
base: main

Member

bsipocz commented May 2, 2024

This PR adds a new format, parquet.votable, to be able to read parquet files that include votable metadata.

I am opening this as a draft for a first round of comments while I add some narrative documentation.

cc @tomdonaldson @gpdf @afaisst

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions bot commented May 2, 2024

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

github-actions bot commented May 2, 2024

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

Member Author

bsipocz commented May 2, 2024

Fixed up the codestyle advisories, but two remain and are related to XML. Where did we end up standing on defusedxml? Should I just ignore these for now?


# Convert the VOTable object into a Byte string to create an
# XML that we can add to the Parquet metadata
xml_bstr = io.BytesIO()

Contributor

Does this need to be a byte string? The votable content will already be utf-8. Are the bytes needed for parquet?
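
To make the question above concrete, here is a minimal, stdlib-only sketch of the bytes-versus-text distinction. It is not the PR's code: `xml.etree.ElementTree` is used as a stand-in for the VOTable writer, the document content is a hypothetical placeholder, and nothing here touches Parquet, so it does not answer whether the Parquet side requires bytes.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a VOTable document; the PR builds this
# with astropy's VOTable machinery instead.
root = ET.Element("VOTABLE")
ET.SubElement(root, "RESOURCE")

# Serialize into an in-memory bytes buffer, mirroring the
# `xml_bstr = io.BytesIO()` line under review.
xml_bstr = io.BytesIO()
ET.ElementTree(root).write(xml_bstr, encoding="utf-8", xml_declaration=True)

# The buffer holds UTF-8 encoded bytes; decoding yields the same
# document as text, and encoding the text gives the bytes back.
xml_bytes = xml_bstr.getvalue()
xml_text = xml_bytes.decode("utf-8")
```

Since the round trip is lossless, the choice between a byte string and a text string mostly comes down to what the metadata-writing call accepts.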

Data table that is to be written to output.
output : str or path-like
The filename to write the table to.
metadata : dict

Contributor

Is the user expected to generate this metadata object? How does it relate to the .meta info that may already be in the table or its columns?

It looks like this metadata is intended to supplement what the votable writer supplies from the table. I'm concerned that, especially if the table was created by reading an actual votable, the supplied metadata could conflict with the existing metadata in the table.

Member Author

I have issues with this kwarg, too, and wrote up a few test cases. E.g. I think we should reuse the column metadata if the Table already has it (the table could have been read in from a votable, or be the result of an astroquery query). So I would instead add another kwarg for metadata_overwrite, and then use the metadata kwarg to overwrite some of the column metadata (but still, it would be nice to allow partial overwrite).

As far as I can see, we do nothing with the table metadata, but isn't that already the case when reading in from a votable? (E.g. with this gaia file, it's not clear how to get the QUERY info out of it, etc.)
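
The partial-overwrite idea in this comment could look roughly like the sketch below. Everything in it is an assumption for illustration: `merge_column_metadata` is a hypothetical helper, not part of astropy or of this PR, and the metadata is modelled as a plain dict of per-column dicts, with made-up units and UCDs.

```python
from copy import deepcopy


def merge_column_metadata(existing, overrides=None):
    """Return existing per-column metadata with overrides applied on top.

    Hypothetical helper: only the fields named in `overrides` are
    replaced; every other field and column is kept as-is.
    """
    merged = deepcopy(existing)  # never mutate the caller's metadata
    for column, fields in (overrides or {}).items():
        merged.setdefault(column, {}).update(fields)
    return merged


# Metadata the table already carries (e.g. from reading a votable).
existing = {
    "mass": {"unit": "solMass", "ucd": "phys.mass"},
    "sfr": {"unit": "solMass / yr"},
}

# The user only wants to change one field of one column.
overrides = {"mass": {"ucd": "phys.mass;stat.mean"}}

merged = merge_column_metadata(existing, overrides)
```

With this shape, passing no overrides simply reuses what the table already has, which would also make the kwarg optional rather than mandatory.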

Member Author

In votable.parquet we call this column_metadata, so I suppose we need to do the same here. It is similarly a mandatory kwarg, which IMO needs to change so that existing metadata can be used if/when it is sufficient.

mass = np.random.uniform(low=1e8, high=1e10, size=number_of_objects)
sfr = np.random.uniform(low=1, high=100, size=number_of_objects)

input_table = Table(

Contributor

It would be interesting to have test cases where the input astropy table was created by reading an actual votable with its own metadata.

Member Author

Added a test example for this.

Member Author

bsipocz commented May 22, 2024

@pllim - FWIW, this again ran into some coverage shenanigans, something other than the previous upload failure.

Member

pllim commented May 23, 2024

You have to rebase anyway, I think, to pick up the fix for RTD. Also maybe address the pre-commit checks? I don't know why there are two different codecov/project statuses here. Hopefully it will make more sense once you rebase. 🤞
