Check for duplicate records #67

Open
kwinkunks opened this issue Sep 23, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@kwinkunks
Member

Duplicate records can be a problem. Could we check for this? E.g. check for unique rows, or make a set? It could be slow for a large dataset, though. It should be easy enough to experiment with some random data (presumably the worst-case scenario).

  • function
  • sklearn checker
  • pandas df accessor
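
To make the two ideas above concrete, here is a minimal sketch, assuming the records live in a 2D numeric NumPy array. The function names are hypothetical and not part of the project. Random floats are used because they almost never collide, so every row has to be examined.

```python
import numpy as np


def has_duplicates_unique(X):
    """True if any row of 2D array X occurs more than once (sort-based, via np.unique)."""
    X = np.asarray(X)
    return np.unique(X, axis=0).shape[0] < X.shape[0]


def has_duplicates_set(X):
    """Same check, but hashing each row as a tuple into a Python set."""
    X = np.asarray(X)
    return len({tuple(row) for row in X.tolist()}) < X.shape[0]


rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))   # Random floats: no repeated rows expected.
X_dup = np.vstack([X, X[:3]])    # Append copies of the first three rows.

print(has_duplicates_unique(X), has_duplicates_unique(X_dup))
print(has_duplicates_set(X), has_duplicates_set(X_dup))
```

Both versions only detect exact matches; near-duplicates (e.g. values differing by floating-point noise) would need rounding or a tolerance first.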
@kwinkunks added the enhancement and hacktoberfest labels on Sep 23, 2023
@bhoomikaagrawal16

Hello, I would like to work on this. Can you elaborate more on what is expected?

@kwinkunks
Member Author

@bhoomikaagrawal16 hello, and thanks for thinking of contributing!

I guess there are at least a couple of scenarios:

  • Duplicate rows in any dataset -- not a good thing.
  • Rows in data that appeared in training -- really not good.
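
As a rough illustration of these two scenarios, here is a sketch using pandas, assuming both DataFrames have the same columns. The function names are made up for this example. `DataFrame.duplicated` handles the first case; a left merge with `indicator=True` handles the second.

```python
import pandas as pd


def repeated_rows(df):
    """Boolean mask: True for each row that repeats an earlier row of df."""
    return df.duplicated().to_numpy()


def rows_seen_in_training(df, train):
    """Boolean mask: True for each row of df that also occurs in train.

    Assumes df and train share the same column names, so the merge
    matches on all columns.
    """
    # De-duplicate train first so the left merge yields exactly one row per row of df.
    merged = df.merge(train.drop_duplicates(), how="left", indicator=True)
    return (merged["_merge"] == "both").to_numpy()


train = pd.DataFrame({"a": [1, 3, 1], "b": [2, 4, 2]})
test = pd.DataFrame({"a": [5, 3], "b": [6, 4]})

print(repeated_rows(train).tolist())                # [False, False, True]
print(rows_seen_in_training(test, train).tolist())  # [False, True]
```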

There are 3 places I put things:

  • Functions in a module like duplicates.py -- I would start here
  • sklearn transformers, both supervised and unsupervised, in sklearn.py (usually trying to use functions from whatever modules)
  • pandas accessors, in pandas.py (usually trying to use functions from whatever modules)
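
To show how one function could feed the other two places, here is a hedged sketch. `count_duplicates`, `DuplicateDetector` and the `dupes` accessor name are all hypothetical and are not the project's actual API. It assumes purely numeric data; the only library calls are scikit-learn's `BaseEstimator` and `TransformerMixin` and pandas' `register_dataframe_accessor`.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


def count_duplicates(X):
    """Number of rows in 2D numeric array X that repeat another row."""
    X = np.asarray(X)
    return X.shape[0] - np.unique(X, axis=0).shape[0]


class DuplicateDetector(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer: warns about duplicates, returns X unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        n = count_duplicates(X)
        if n:
            warnings.warn(f"Found {n} duplicate rows.")
        return X


@pd.api.extensions.register_dataframe_accessor("dupes")  # Accessor name is an assumption.
class DuplicatesAccessor:
    """Hypothetical accessor exposing the same function as df.dupes.count()."""

    def __init__(self, df):
        self._df = df

    def count(self):
        return count_duplicates(self._df.to_numpy())
```

With this in place, `DuplicateDetector()` could sit at the start of a `Pipeline`, and `df.dupes.count()` would give the same number for a numeric DataFrame.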

So a good place to start might be to create a module with an experimental 'duplicate detecting' function. As a rule of thumb, it needs to run reasonably fast on at least 100k records.
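
One way to check the 100k rule of thumb is to time a candidate on random data of that size. This is only a sketch: `duplicate_mask` is a hypothetical name, it assumes a 2D numeric array, and it leans on pandas' hash-based `DataFrame.duplicated`. Timings will vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd


def duplicate_mask(X):
    """Boolean mask, True where a row of 2D array X repeats an earlier row."""
    return pd.DataFrame(np.asarray(X)).duplicated().to_numpy()


rng = np.random.default_rng(0)
base = rng.normal(size=(100_000, 10))   # 100k random records, no repeats expected.
X = np.vstack([base, base[:50]])        # Plant 50 known duplicates at the end.

start = time.perf_counter()
mask = duplicate_mask(X)
elapsed = time.perf_counter() - start

print(f"Flagged {mask.sum()} duplicate rows in {elapsed:.3f} s")
```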

Write simple docstrings and doctests please (see the other modules).
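
For illustration, a function with a simple docstring and doctests might look like the sketch below. The function name and the docstring layout are assumptions; the project's other modules are the real guide to its conventions.

```python
import doctest


def duplicate_indices(X):
    """
    Find the positions of rows that repeat an earlier row.

    Args:
        X (iterable): A sequence of rows, each itself a sequence of
            hashable values.

    Returns:
        list: Indices of rows that have already appeared earlier in X.

    Examples:
        >>> duplicate_indices([[1, 2], [3, 4], [1, 2]])
        [2]
        >>> duplicate_indices([[1, 2], [3, 4]])
        []
    """
    seen = set()
    dupes = []
    for i, row in enumerate(X):
        key = tuple(row)  # Tuples are hashable, so they can go in a set.
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


if __name__ == "__main__":
    doctest.testmod()
```

Running the file directly executes the examples in the docstring; `pytest --doctest-modules` does the same across a package.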

Does this help? Let me know if you need more.

@kwinkunks removed the hacktoberfest label on Dec 9, 2023