Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: improve pd.io.json_normalize #57811

Closed
wants to merge 42 commits into from
Closed

Conversation

slavanorm
Copy link

@slavanorm slavanorm commented Mar 11, 2024

@datapythonista
Copy link
Member

@slavanorm in case you haven't seen it, your changes are making the tests fail: https://github.com/pandas-dev/pandas/actions/runs/8239141116/job/22531710877?pr=57811#step:8:53

@datapythonista datapythonista added Bug IO JSON read_json, to_json, json_normalize labels Mar 12, 2024
@WillAyd
Copy link
Member

WillAyd commented Mar 13, 2024

I'm not sure about this change - the point of the errors argument is to ignore missing keys. Shouldn't the test case you added still create the column but with all empty data?

@slavanorm
Copy link
Author

yes it should but it creates rows only for the dictionaries with record path.
i will edit the fixture and assertion in order to get into this case of if

@slavanorm
Copy link
Author

This docs check is failing, and its not fixable. Wonder what should we do now

doc/source/whatsnew/v2.2.2.rst Outdated Show resolved Hide resolved
pandas/tests/io/json/test_normalize.py Show resolved Hide resolved
Copy link
Author

@slavanorm slavanorm Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd
please see lines 449-452. this code is highly related to errors=ignore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea but why should it not work for errors="raise"? The errors argument is specifically for testing the presence of a key, which exists here. My major point is that we are mixing up two different types of errors if we don't separate the concepts of missing keys versus empty values

Copy link
Author

@slavanorm slavanorm Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it is not very different for current code when we traverse and make early stop returning n/a.
I guess you propose raising errors only for missing keys? And missing values should be always safe?

I guess if we make such a feature, it should provide choice to user, which case should raise

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think the error should just be for missing keys. There should be no such thing as a missing value in well-formed JSON

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wanted to point that key or value is arbitrary in our case.
value is just the end of json walking we defined in meta or record_path

@slavanorm slavanorm requested a review from WillAyd March 14, 2024 19:46
@slavanorm
Copy link
Author

I fixed the code, wish someone reviewed it

pandas/tests/io/json/test_normalize.py Outdated Show resolved Hide resolved
@slavanorm slavanorm requested a review from WillAyd March 28, 2024 10:17
@slavanorm slavanorm reopened this Apr 1, 2024
@slavanorm
Copy link
Author

sorry i closed it by mistake, just reopened it

@slavanorm
Copy link
Author

@WillAyd could you please review the code again, it's been 2 weeks i think

@@ -663,6 +676,22 @@ def test_missing_nested_meta(self):
errors="raise",
)

def test_missing_nested_meta_traverse_empty_list_errors_ignore(self):
# If errors="ignore" and nested metadata is nullable, return nan
data = {"meta": "foo", "nested_meta": [], "value": [{"rec": 1}, {"rec": 2}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data = {"meta": "foo", "nested_meta": [], "value": [{"rec": 1}, {"rec": 2}]}
data = {"meta": "foo", "nested_meta": {}, "value": [{"rec": 1}, {"rec": 2}]}

I have lost track a bit on what we are trying to accomplish but this doesn't seem right to ignore an error when the meta argument starts peering into in an array instead of an object. Is this a TypeError on main or a KeyError? I feel like we are mixing the two up when we should not be

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this May 15, 2024
@slavanorm
Copy link
Author

thank you too.
now the code is obsolete vs main branch but its ok.

on second thought, the code does not actually require merging to pandas library.
it got stale because of lack of interaction.
everything was done on my side and all requests were adressed, but were not reviewed on time.

best regards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.json_normalize improvement
4 participants