BUG: Add np.uintc to _factorizers in merge.py to fix KeyError when merging DataFrames with uintc columns #58727

Tirthchoksi22 · 2024-05-15T07:09:11Z

closes BUG: pd.merge fail with numpy.uintc on Windows #58713
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Tirthchoksi22 · 2024-05-15T09:11:13Z

Hi @myles, the PR is ready to merge.

simonjayhawkins · 2024-05-15T09:15:09Z

thanks @Tirthchoksi22

the same issue was fixed for np.intc in #53175.

This is only for Windows and is it a regression from a previous release?

if so could probably use the previous fix/tests/release note for this PR?

Tirthchoksi22 · 2024-05-15T09:44:22Z

Yes, it appears that there is indeed a regression on Windows machines. The issue seems to have resurfaced after being resolved in a previous release.

Tirthchoksi22 · 2024-05-15T09:52:55Z

@simonjayhawkins After reviewing the code and considering the previous fix, it appears that the solution implemented in this pull request is indeed identical to the one used in the previous resolution of the regression.

Tirthchoksi22 · 2024-05-15T09:55:01Z

@simonjayhawkins Also, please guide me on what to do next: do I have to create a new PR with the previous fix/tests/release note, or is this one OK? This would be my first open source contribution.

simonjayhawkins · 2024-05-15T10:05:23Z

You can make changes to your branch and push them to update this PR. No need to close and open a new PR.

All bug fixes and regression fixes need a test to ensure that the issue does not resurface. In this case, looking at the fix for np.intc, the parameterization of test_join_multi_dtypes was updated. Perhaps you could do the same here?
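
A regression check along those lines could be sketched as follows. This is only an illustration, not the actual test_join_multi_dtypes: the helper names and the dtype list are hypothetical, and in the pandas test suite the loop would normally be expressed through pytest.mark.parametrize or an existing fixture.

```python
import numpy as np
import pandas as pd

# Hypothetical dtype list for illustration; the real test relies on pandas' own fixtures.
CANDIDATE_DTYPES = [np.int32, np.uint32, np.int64, np.uint64, np.intc, np.uintc]


def merge_on_dtype(dtype):
    """Outer-merge two small frames whose shared integer column has the given dtype."""
    left = pd.DataFrame({"a": ["foo", "bar"], "b": np.array([1, 2], dtype=dtype)})
    right = pd.DataFrame({"a": ["foo", "baz"], "b": np.array([3, 4], dtype=dtype)})
    return left.merge(right, how="outer")


def failing_dtypes(dtypes=CANDIDATE_DTYPES):
    """Return the scalar-type names for which the merge raises KeyError."""
    failures = []
    for dtype in dtypes:
        try:
            merge_on_dtype(dtype)
        except KeyError:
            failures.append(dtype.__name__)
    return failures


if __name__ == "__main__":
    print("dtypes that fail to merge:", failing_dtypes())
```

On a platform and pandas version affected by the bug, the returned list would be expected to name the C-style types; elsewhere it should be empty.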

For the release note: I can't easily check a Windows-only bug, so if you can determine in which pandas release this regressed (or whether that was many releases ago), we can decide whether the note belongs under the bug fixes or the regression section of the release notes. A bug fix would go in the 3.0 what's new. A regression would go in the 2.2.x release notes (though that depends on whether we are doing another patch release).

cc @lithomas1

Tirthchoksi22 · 2024-05-15T10:45:41Z

@simonjayhawkins Thank you for your feedback. Upon further review, I realized that the changes made to the merge.py file were minimal and consisted of adding just one extra line (np.uintc: libhashtable.UInt32Factorizer) to the _factorizers dictionary. The majority of the changes were indeed in the test file, where I added tests to ensure proper handling of numpy.uintc columns.

Additionally, I'd like to confirm that the release note has already been included in the bug fixes and regression section, documenting the resolution of the issue.

Given this information, I believe that the changes made align with the expectations outlined in your feedback. Please let me know if there are any further adjustments or considerations needed.

Thank you for your continued guidance and support throughout this process.
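
To make the one-line change concrete, here is a minimal sketch of a lookup table keyed by NumPy scalar type. It is not the pandas source: the dictionary, the helper pick_factorizer, and the string values are stand-ins for _factorizers and the libhashtable classes, used here so the example runs with NumPy alone.

```python
import numpy as np

# Stand-in table: strings take the place of the real factorizer classes.
factorizers = {
    np.uint32: "UInt32Factorizer",
    np.uint64: "UInt64Factorizer",
}


def pick_factorizer(values, table):
    """Look up a handler by the array's scalar type; raises KeyError if it is not a key."""
    return table[values.dtype.type]


keys = np.array([1, 2], dtype=np.uintc)

# Where np.uintc is the same class as np.uint32 this lookup already succeeds;
# where they are distinct classes it raises KeyError.
try:
    before = pick_factorizer(keys, factorizers)
except KeyError:
    before = None

# The one-line fix: register np.uintc explicitly. Where np.uintc is np.uint32
# this simply rewrites the existing entry with the same value.
factorizers[np.uintc] = "UInt32Factorizer"
after = pick_factorizer(keys, factorizers)
print(before, after)
```

The value of before depends on the platform, which is the point of the example; after is the same everywhere once the entry is added.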

simonjayhawkins · 2024-05-15T11:00:19Z

ci is failing,

/pre-commit

Tirthchoksi22 · 2024-05-15T11:03:14Z

So what should I do now?

simonjayhawkins · 2024-05-15T11:06:32Z

I thought that comment should have triggered the bot.

simonjayhawkins · 2024-05-15T11:06:42Z

/pre-commit

simonjayhawkins · 2024-05-15T11:14:23Z

@github-actions pre-commit

Tirthchoksi22 · 2024-05-15T11:16:54Z

/pre-commit

I don't think there's a command like this.

simonjayhawkins · 2024-05-15T11:20:52Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

simonjayhawkins · 2024-05-15T11:33:04Z

pandas/tests/reshape/merge/test_merge.py

    @pytest.mark.parametrize("d2", [np.int64, np.float64, np.float32, np.float16])
-    def test_join_multi_dtypes(self, any_int_numpy_dtype, d2):
-        dtype1 = np.dtype(any_int_numpy_dtype)
+    def test_join_multi_dtypes(self, d1, d2):


So it appears that this has been updated since the last fix.

I doubt that this should be changed. It's more likely that any_int_numpy_dtype (and therefore tm.UNSIGNED_INT_NUMPY_DTYPES) should be updated.

However, tm.UNSIGNED_INT_NUMPY_DTYPES does not appear to contain np.intc either (which may be how the regression went uncaught).

So it looks like we will need to add both np.intc and np.uintc to tm.UNSIGNED_INT_NUMPY_DTYPES.

OK, I will add it to tm.UNSIGNED_INT_NUMPY_DTYPES.

Can you explain the issue to me (as I don't have the platform to test)?

Although my previous comment was to align better with the previous fix, I don't know why we need to explicitly cater for np.intc and np.uintc since we don't need to for the other Python API “C-like” names such as numpy.short.

So adding these does not really make sense to me as I don't really understand the issue.

@simonjayhawkins What I think is that the issue arises because certain functionalities within pandas were not properly handling np.intc and np.uintc data types. As a result, when these data types were encountered, it led to unexpected KeyErrors. That's what I feel, but I don't know why we don't need to do this for the other Python API names.

@simonjayhawkins I still don't understand why the regression occurs on the Windows side. Perhaps a compatibility issue?

@simonjayhawkins If you think this should be changed to match the last fix, tell me; this fix will also work.

functionalities within pandas were not properly handling np.intc

Well, it works on other platforms, so maybe it's not a pandas issue?

Tirthchoksi22 · 2024-05-15T12:38:29Z

@simonjayhawkins Could you kindly confirm if the changes proposed in this pull request are satisfactory, or if there are any additional adjustments needed before merging?

simonjayhawkins

Needs a release note. Put it in the 3.0.0 notes for now.

pandas/core/reshape/merge.py

simonjayhawkins · 2024-05-15T12:59:17Z

functionalities within pandas were not properly handling np.intc

Well, it works on other platforms, so maybe it's not a pandas issue?

Tirthchoksi22 · 2024-05-15T13:45:13Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

cmjcharlton · 2024-05-15T13:56:43Z

I was browsing recent pull requests and have a Windows machine so thought I would give this a quick test for you. It looks like the difference is that on Windows np.uintc is not aliased to any of the types already in the _factorizers dictionary:

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.uintc is np.int64
False
>>> np.uintc is np.longlong
False
>>> np.uintc is np.int32
False
>>> np.uintc is np.int16
False
>>> np.uintc is np.int8
False
>>> np.uintc is np.uint64
False
>>> np.uintc is np.uint32
False
>>> np.uintc is np.uint16
False
>>> np.uintc is np.uint8
False
>>> np.uintc is np.bool_
False
>>> np.uintc is np.float64
False
>>> np.uintc is np.float32
False
>>> np.uintc is np.complex64
False
>>> np.uintc is np.complex128
False
>>> np.uintc is np.object_
False

Whereas at least in Linux it is aliased to the np.uint32 type:

Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.uintc is np.int64
False
>>> np.uintc is np.longlong
False
>>> np.uintc is np.int32
False
>>> np.uintc is np.int16
False
>>> np.uintc is np.int8
False
>>> np.uintc is np.uint64
False
>>> np.uintc is np.uint32
True
>>> np.uintc is np.uint16
False
>>> np.uintc is np.uint8
False
>>> np.uintc is np.bool_
False
>>> np.uintc is np.float64
False
>>> np.uintc is np.float32
False
>>> np.uintc is np.complex64
False
>>> np.uintc is np.complex128
False
>>> np.uintc is np.object_
False

This is probably related to Windows using the LLP64 64-bit data model vs LP64 used elsewhere (https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models). https://stackoverflow.com/questions/76155091/np-uint32-np-uintc-on-windows gives a possible explanation.
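
The manual "is" checks above can be automated with a short script. This is a sketch, assuming only that NumPy is installed; the helper names unaliased_c_types and same_width_fixed_type are made up for illustration and are not part of NumPy or pandas.

```python
import numpy as np

# Fixed-width integer scalar types, i.e. the kind of keys a dtype-keyed
# lookup table would normally contain.
FIXED_WIDTH = [np.int8, np.int16, np.int32, np.int64,
               np.uint8, np.uint16, np.uint32, np.uint64]

# C-named integer scalar types whose aliasing varies by platform and NumPy version.
C_NAMED = [np.intc, np.uintc, np.longlong, np.ulonglong]


def unaliased_c_types():
    """Names of C-named types that are distinct class objects from every fixed-width type."""
    return [t.__name__ for t in C_NAMED if not any(t is f for f in FIXED_WIDTH)]


def same_width_fixed_type(t):
    """Return the fixed-width type with the same kind (signed/unsigned) and item size as t."""
    dt = np.dtype(t)
    for f in FIXED_WIDTH:
        fdt = np.dtype(f)
        if fdt.kind == dt.kind and fdt.itemsize == dt.itemsize:
            return f
    raise LookupError(f"no fixed-width match for {t!r}")


if __name__ == "__main__":
    print("distinct on this platform:", unaliased_c_types())
    for t in C_NAMED:
        print(t.__name__, "->", same_width_fixed_type(t).__name__)
```

Running it on each platform shows which C-named types would miss a table keyed only by the fixed-width names, and which fixed-width type each one matches in size and signedness.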

Tirthchoksi22 · 2024-05-16T17:45:59Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

simonjayhawkins · 2024-05-16T17:52:19Z

@Tirthchoksi22 on a procedural note, the pre-commit.ci autofix is useful for maintainers to correct the code. As the PR author, you can install pre-commit in your local dev environment to check the code changes you make before pushing them to the PR.

Tirthchoksi22 · 2024-05-16T17:55:10Z

OK, sorry, I didn't know about that.

Tirthchoksi22 · 2024-05-16T17:56:13Z

Can you please use that comment now, because I don't have pre-commit installed in my local environment right now?

simonjayhawkins · 2024-05-16T17:58:54Z

feel free to issue that command yourself. I was not suggesting that you could not use it. (However, if you look at the pandas documentation, you will see a section for setting up a development environment)

pandas/core/reshape/merge.py

Co-authored-by: William Ayd <william.ayd@icloud.com>

Tirthchoksi22 · 2024-05-16T18:24:24Z

pre-commit.ci autofix

simonjayhawkins · 2024-05-16T18:38:01Z

doc/source/whatsnew/v3.0.0.rst

@@ -474,6 +474,7 @@ Groupby/resample/rolling
 Reshaping
 ^^^^^^^^^
 - Bug in :meth:`DataFrame.join` inconsistently setting result index name (:issue:`55815`)
+- Fixed issue in `pd.merge` (`#58713`) where merging DataFrames with `np.intc` or `np.uintc` data types caused unexpected behavior or errors. Comprehensive testing now ensures consistent behavior across diverse data type combinations, enhancing stability and robustness of data merging operations.


can you format like the notes around it.

start with "Bug in ...", keep the description short (one line) but be specific (mention "Windows"), and end with (:issue:`58713`)

Tirthchoksi22 · 2024-05-16T19:27:18Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

Tirthchoksi22 · 2024-05-17T17:03:04Z

@simonjayhawkins @WillAyd @mroeschke Any update?

Tirthchoksi22 · 2024-05-22T14:52:34Z

@simonjayhawkins ??

cmjcharlton · 2024-05-23T15:07:35Z

I thought that it might be interesting to look at the differences in aliases between Windows and Linux for these integer types:

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from typing import reveal_type
>>> import numpy as np
>>> reveal_type(np.uintc(0))
Runtime type is 'uintc'
0
>>> reveal_type(np.intc(0))
Runtime type is 'intc'
0
>>> reveal_type(np.uint(0))
Runtime type is 'uint32'
0
>>> reveal_type(np.longlong(0))
Runtime type is 'int64'
0
>>> reveal_type(np.ulonglong(0))
Runtime type is 'uint64'
0

Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from typing import reveal_type
>>> import numpy as np
>>> reveal_type(np.uintc(0))
Runtime type is 'uint32'
0
>>> reveal_type(np.intc(0))
Runtime type is 'int32'
0
>>> reveal_type(np.uint(0))
Runtime type is 'uint64'
0
>>> reveal_type(np.longlong(0))
Runtime type is 'longlong'
0
>>> reveal_type(np.ulonglong(0))
Runtime type is 'ulonglong'
0

This implies that on Linux, but not on Windows, you will encounter the same behaviour as the initial bug report if the ulonglong type is used, since it isn't aliased on Linux. This is indeed the case:

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'a':['foo','bar'],'b':np.array([1,2], dtype=np.ulonglong)})
>>> df2 = pd.DataFrame({'a':['foo','baz'],'b':np.array([3,4], dtype=np.ulonglong)})
>>> df3=df1.merge(df2, how = 'outer')
>>> print(df3)
     a  b
0  bar  2
1  baz  4
2  foo  1
3  foo  3

Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'a':['foo','bar'],'b':np.array([1,2], dtype=np.ulonglong)})
>>> df2 = pd.DataFrame({'a':['foo','baz'],'b':np.array([3,4], dtype=np.ulonglong)})
>>> df3=df1.merge(df2, how = 'outer')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 10487, in merge
    return merge(
           ^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 183, in merge
    return op.get_result(copy=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 883, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1133, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1105, in _get_join_indexers
    return get_join_indexers(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1703, in get_join_indexers
    zipped = zip(*mapped)
             ^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1700, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 2477, in _factorize_keys
    klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 2556, in _convert_arrays_and_get_rizer_klass
    klass = _factorizers[lk.dtype.type]
            ~~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: <class 'numpy.ulonglong'>

This also suggests that, if you were following the same pattern as with intc/uintc, the longlong assignment in _factorizers should be conditional.
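
One way to picture a conditional registration is sketched below. It is an assumption about what such a pattern could look like, not the code in merge.py: the function add_c_named_aliases is hypothetical, and the table maps to strings rather than the real factorizer classes so that it runs with NumPy alone.

```python
import numpy as np


def add_c_named_aliases(table):
    """Register C-named integer types that are not already keys of the table.

    Each missing type reuses the handler of the registered type with the
    same kind (signed/unsigned) and item size, if there is one.
    """
    for ctype in (np.intc, np.uintc, np.longlong, np.ulonglong):
        if ctype in table:
            continue  # already aliased to a registered type on this platform
        want = np.dtype(ctype)
        for key in list(table):
            have = np.dtype(key)
            if (have.kind, have.itemsize) == (want.kind, want.itemsize):
                table[ctype] = table[key]
                break
    return table


# Stand-in table: strings take the place of the real factorizer classes.
table = add_c_named_aliases({
    np.int32: "Int32Factorizer",
    np.int64: "Int64Factorizer",
    np.uint32: "UInt32Factorizer",
    np.uint64: "UInt64Factorizer",
})

# After registration, lookups by dtype.type succeed on either platform.
print(table[np.array([1], dtype=np.ulonglong).dtype.type])
print(table[np.array([1], dtype=np.uintc).dtype.type])
```

Because the registration skips types that are already keys, the same code adds entries only where the platform leaves a C-named type unaliased.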

Add np.uintc to _factorizers in merge.py to fix KeyError when merging…

1774489

… DataFrames with uintc columns

Tirthchoksi22 mentioned this pull request May 15, 2024

BUG: pd.merge fail with numpy.uintc on Windows #58713

Tirthchoksi22 changed the title ~~Add np.uintc to _factorizers in merge.py to fix KeyError when merging…~~ BUG FIXED: Add np.uintc to _factorizers in merge.py to fix KeyError when merging… May 15, 2024

simonjayhawkins added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Windows Windows OS labels May 15, 2024

add np.uintc to _factorizers in merge.py

5372107

simonjayhawkins changed the title ~~BUG FIXED: Add np.uintc to _factorizers in merge.py to fix KeyError when merging…~~ BUG: Add np.uintc to _factorizers in merge.py to fix KeyError when merging DataFrames with uintc columns May 15, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

3d75d94

for more information, see https://pre-commit.ci

simonjayhawkins reviewed May 15, 2024

simonjayhawkins requested changes May 15, 2024

Tirthchoksi22 added 2 commits May 15, 2024 19:11

changes according to review

0f79322

Merge branch 'main' of https://github.com/Tirthchoksi22/pandas

2c18b6b

[pre-commit.ci] auto fixes from pre-commit.com hooks

1373e05

for more information, see https://pre-commit.ci

pre-commit-ci bot and others added 2 commits May 16, 2024 17:47

[pre-commit.ci] auto fixes from pre-commit.com hooks

90e0b93

for more information, see https://pre-commit.ci

indentation change

1b9e3d0

Tirthchoksi22 added 2 commits May 16, 2024 23:31

indentation error solved

4da0b86

error solved

95bca2c

WillAyd requested changes May 16, 2024

Update pandas/core/reshape/merge.py

a37151e

Co-authored-by: William Ayd <william.ayd@icloud.com>

Tirthchoksi22 added 2 commits May 17, 2024 00:06

update

c5a3ccc

Merge branch 'main' of https://github.com/Tirthchoksi22/pandas

402c33c

simonjayhawkins reviewed May 16, 2024

Tirthchoksi22 and others added 4 commits May 17, 2024 00:14

update as said

7438297

upadte

9105c97

Merge branch 'main' into main

e3b76a1

update

8506f78

[pre-commit.ci] auto fixes from pre-commit.com hooks

621c8d1

for more information, see https://pre-commit.ci

Tirthchoksi22 requested review from simonjayhawkins and mroeschke May 17, 2024 09:59

Tirthchoksi22 closed this May 19, 2024

Tirthchoksi22 force-pushed the main branch from c27f733 to 593113a Compare May 19, 2024 08:36

Merge branch 'main' of https://github.com/Tirthchoksi22/pandas

50fe143

Tirthchoksi22 reopened this May 19, 2024
