Duplicate records can be a problem; could we check for this? E.g. check for unique rows, or make a set of row tuples. Could be slow on a large dataset though. Easy enough to experiment with some random data (presumably the worst case); see the sketch below. We'd probably want:

- a function
- an sklearn checker
- a pandas df accessor
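One quick way to run that experiment, as a rough sketch (nothing here is part of the library yet): generate random rows, then compare pandas's vectorized check against a plain Python set.

```python
import numpy as np
import pandas as pd

# All-random floats are close to the worst case: almost every row
# is unique, so neither approach gets to bail out early.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((100_000, 10)))

# Option 1: pandas -- vectorized, returns a boolean mask.
# keep=False flags every member of a duplicated group.
mask = df.duplicated(keep=False)

# Option 2: a set of row tuples -- pure Python, but simple.
seen, dupes = set(), set()
for row in df.itertuples(index=False, name=None):
    if row in seen:
        dupes.add(row)
    seen.add(row)

print(mask.sum(), len(dupes))  # both 0 for purely random floats
```

The pandas route is typically faster at this scale, since the hashing happens in C, but either should handle 100k rows comfortably.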
Duplicate rows in any dataset -- not a good thing. Rows in validation or test data that also appeared in training -- really not good.
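For that second, leakage-style check, a merge with an indicator column is one option; this is a minimal sketch assuming both splits are DataFrames (the name `rows_in_train` is a placeholder, not an existing API):

```python
import pandas as pd

def rows_in_train(test, train):
    """Boolean mask over ``test``: True where the row also occurs in ``train``."""
    # Left-merge on all columns; '_merge' == 'both' marks exact row matches.
    # Deduplicating `train` first keeps the result aligned with `test`.
    merged = test.merge(train.drop_duplicates(), how='left', indicator=True)
    return pd.Series(merged['_merge'].eq('both').to_numpy(), index=test.index)

train = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
test = pd.DataFrame({'a': [2, 9], 'b': [5, 9]})
print(rows_in_train(test, train).tolist())  # [True, False]
```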
There are three places I put things:

- Functions in a module like `duplicates.py` -- I would start here.
- sklearn transformers, both supervised and unsupervised, in `sklearn.py` (usually just wrapping functions from the plain modules).
- pandas accessors, in `pandas.py` (again, usually wrapping functions from the plain modules).
So a good place to start might be a module with an experimental duplicate-detecting function; there is a sketch of one below. As a rule of thumb, it needs to run reasonably fast on at least 100k records.
Please write simple docstrings and doctests (see the other modules for examples).
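Here is a minimal sketch of what that module function might look like, with a docstring and doctest in roughly the requested style (the name `duplicated` and its signature are assumptions, not the final API):

```python
"""Functions for detecting duplicate records (sketch for a duplicates.py module)."""
import pandas as pd


def duplicated(a, keep=False):
    """
    Boolean mask of duplicated records in ``a`` (a 2D array-like).

    With the default ``keep=False``, every member of a duplicated
    group is flagged, which is usually what you want for QC.

    Examples
    >>> duplicated([[1, 2], [3, 4], [1, 2]]).tolist()
    [True, False, True]
    """
    # pandas does the hashing in C, so this copes with ~100k rows easily.
    return pd.DataFrame(a).duplicated(keep=keep).to_numpy()
```

Once a function like this exists, the sklearn transformer in `sklearn.py` and the pandas accessor in `pandas.py` can both just delegate to it, per the list above.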