Check for duplicate records #67

kwinkunks · 2023-09-23T10:22:11Z

Duplicate records can be a problem; could we check for this? E.g. check for unique rows, or make a set? It could be nasty for a large dataset though. It's easy enough to experiment with some random data (presumably the worst-case scenario).

function
sklearn checker
pandas df accessor
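
A rough way to try the unique-rows idea on random data is sketched below. It assumes NumPy and a purely numeric 2D array; the function name `count_duplicate_rows` and the array sizes are made up for illustration and are not part of any existing module.

```python
import time

import numpy as np


def count_duplicate_rows(X):
    """Return the number of rows that repeat some earlier row of a 2D array."""
    X = np.asarray(X)
    unique_rows = np.unique(X, axis=0)  # Sorts the rows and drops repeats.
    return X.shape[0] - unique_rows.shape[0]


rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 10))  # Random floats: chance duplicates are very unlikely.
X_dup = np.vstack([X, X[:500]])     # Append 500 rows that are known duplicates.

start = time.perf_counter()
n_dup = count_duplicate_rows(X_dup)
elapsed = time.perf_counter() - start
print(f"{n_dup} duplicate rows found in {elapsed:.3f} s")
```

The timing will vary by machine, so this is only a starting point for judging whether the approach is fast enough.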

bhoomikaagrawal16 · 2023-10-26T10:35:59Z

Hello, I would like to work on this. Can you elaborate more on what is expected?

kwinkunks · 2023-10-27T13:01:48Z

@bhoomikaagrawal16 hello, and thanks for thinking of contributing!

I guess there's at least a couple of scenarios:

Duplicate rows in any dataset -- not a good thing.
Rows in data that appeared in training -- really not good.
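
The second scenario could be checked with a plain Python set of row tuples, as in the sketch below. The function name `rows_seen_in_training` is hypothetical, and the comparison is exact (no tolerance for floating-point noise), which is an assumption about what counts as "the same row".

```python
import numpy as np


def rows_seen_in_training(X_train, X_test):
    """Boolean mask over X_test: True where the row also occurs in X_train."""
    # Convert rows to tuples of Python scalars so they can be hashed.
    seen = set(map(tuple, np.asarray(X_train).tolist()))
    return np.array([tuple(row) in seen for row in np.asarray(X_test).tolist()])


X_train = [[1, 2], [3, 4], [5, 6]]
X_test = [[3, 4], [7, 8], [1, 2]]
print(rows_seen_in_training(X_train, X_test))  # [ True False  True]
```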

There are 3 places I put things:

Functions in a module like duplicates.py -- I would start here
sklearn transformers, both supervised and unsupervised, in sklearn.py (usually trying to use functions from whatever modules)
pandas accessors, in pandas.py (usually trying to use functions from whatever modules)
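
For the pandas side, an accessor might look roughly like the sketch below. It uses pandas' `register_dataframe_accessor` and `DataFrame.duplicated`; the accessor name `dupes` and the method names are invented here and do not reflect how the project's pandas.py is actually organised.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("dupes")  # Hypothetical accessor name.
class DuplicateAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def mask(self):
        """Boolean Series: True for each row that repeats an earlier row."""
        return self._obj.duplicated(keep="first")

    def count(self):
        """Number of rows that repeat an earlier row."""
        return int(self.mask().sum())


df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})
print(df.dupes.count())  # 1
```

In practice the accessor would probably delegate to a function in the duplicates module rather than calling pandas directly.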

So a good place to start might be to create a module with an experimental 'duplicate detecting' function. As a rule of thumb, it needs to run reasonably fast on at least 100k records.

Write simple docstrings and doctests please (see the other modules).
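
As a starting point, a function with a simple docstring and doctest might look like the sketch below. The name `duplicate_rows`, the docstring layout, and the choice to flag only the repeats (not the first occurrence) are all assumptions; the other modules should be the guide for the real conventions.

```python
import numpy as np


def duplicate_rows(X):
    """
    Flag rows that repeat an earlier row.

    Args:
        X (array-like): 2D array of shape (n_samples, n_features).

    Returns:
        ndarray: Boolean array of shape (n_samples,), True for each row
            that duplicates an earlier row.

    Examples:
        >>> duplicate_rows([[1, 2], [3, 4], [1, 2], [1, 2]]).tolist()
        [False, False, True, True]
    """
    X = np.asarray(X)
    # first_idx holds the index of the first occurrence of each unique row;
    # inverse maps every row of X to its unique row.
    _, first_idx, inverse = np.unique(
        X, axis=0, return_index=True, return_inverse=True
    )
    inverse = inverse.reshape(-1)  # Guard against shape differences across NumPy versions.
    # A row is a duplicate if it is not the first occurrence of its unique row.
    return np.arange(X.shape[0]) != first_idx[inverse]
```

The doctest can be run with `python -m doctest duplicates.py` if the function lives in a file of that name.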

Does this help? Let me know if you need more.

kwinkunks added the enhancement and hacktoberfest labels Sep 23, 2023

kwinkunks removed the hacktoberfest label Dec 9, 2023
