Add function to check duplicated names in DWCA fields and rename them #81

Open
wants to merge 1 commit into
base: master
from


@zedomel zedomel commented Dec 3, 2019

Hi @niconoe,

While using this great Python package I ran into a problem when reading some DWC-A files with duplicated fields. I know that this is a formatting error in those DWC-A files, but I still want to be able to read them.
For reference, here is an example of one of these archives with duplicated fields: Allan Herbarium (CHR).

The solution I found was to adapt the pandas _maybe_dedup_names method from pandas/pandas/io/parsers.py to rename duplicated field names.

The pandas.read_csv method accepts a names argument and, as stated in the documentation, the names parameter must not contain duplicated values:

Duplicates in this list are not allowed. Documentation

So the adapted _maybe_dedup_names method just checks for duplicates and renames them by adding a sequential number ("X.1", "X.2", ... "X.N"), which is what pandas read_csv does when the mangle_dupe_cols parameter is set to True and no names argument is provided.
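To illustrate the renaming scheme described above, here is a minimal standalone sketch. It is not the code from this pull request nor the pandas implementation; the function name dedup_names and the sample field names are hypothetical, and it only assumes the "X.1, X.2, ... X.N" convention.

```python
def dedup_names(names):
    """Return a copy of names where repeats become X.1, X.2, ... (sketch only)."""
    used = set()        # names already emitted
    next_suffix = {}    # next numeric suffix to try for each original name
    result = []
    for name in names:
        candidate = name
        n = next_suffix.get(name, 1)
        # Keep trying suffixes until the candidate does not clash with
        # anything already emitted (including a literal "X.1" field).
        while candidate in used:
            candidate = f"{name}.{n}"
            n += 1
        next_suffix[name] = n
        used.add(candidate)
        result.append(candidate)
    return result


# Hypothetical field list with one duplicated field.
fields = ["id", "catalogNumber", "recordedBy", "catalogNumber"]
print(dedup_names(fields))
```

The first occurrence keeps its original name, so existing code that looks up a field by name keeps working; only later repeats get a suffix.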

But since python-dwca-reader ignores any names value passed in kwargs and uses the names from the DWC-A meta file, pandas throws an exception when those names (a.k.a. DWC-A fields) are not unique:

Traceback (most recent call last):
  File "dwca-reader.py", line 39, in <module>
    ext_df = dwca.pd_read(e.file_descriptor.file_location, parse_dates = False, mangle_dupe_cols = True)
  File "/home/jose/.local/lib/python3.6/site-packages/dwca/read.py", line 198, in pd_read
    df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    _validate_names(kwds.get("names", None))
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 421, in _validate_names
    raise ValueError("Duplicate names are not allowed.")
ValueError: Duplicate names are not allowed.

Maybe a better solution would be to merge the duplicated fields. For now, mangle_dupe_cols = False is not supported by pandas, and supporting it is considered complex to implement in pandas (please see Pandas Issue 13262). But for the scope of this package, a merging solution might be easier to implement.
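As a rough idea of what merging could look like, here is a sketch that uses only the standard library. The merge rule is an assumption (for each row, keep the first non-empty value among the columns that share a name), and the function read_merging_duplicates, the tab delimiter, and the sample data are all hypothetical; nothing here is existing python-dwca-reader or pandas behaviour.

```python
import csv
import io


def read_merging_duplicates(text, names, delimiter="\t"):
    """Read delimited text, collapsing columns that share a name (sketch only)."""
    # Map each distinct field name to every column position that carries it.
    positions = {}
    for index, name in enumerate(names):
        positions.setdefault(name, []).append(index)

    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        row = {}
        for name, indexes in positions.items():
            # Assumed rule: the first non-empty value wins.
            values = [record[i] for i in indexes
                      if i < len(record) and record[i] != ""]
            row[name] = values[0] if values else ""
        rows.append(row)
    return rows


# Hypothetical data file where "locality" is declared twice.
data = "1\tWellington\t\n2\t\tLincoln\n"
print(read_merging_duplicates(data, ["id", "locality", "locality"]))
```

A real implementation would still have to decide what to do when two duplicated columns hold different non-empty values in the same row; this sketch silently keeps the first one.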

Please let me know if I can help develop a merging solution, if you agree that it would be better than just renaming the duplicated fields.

Thanks.

best regards.


csbrown commented Nov 8, 2023

Probably there should be pytest tests for this?


2 participants