Add function to check duplicated names in DWCA fields and rename them #81

zedomel · 2019-12-03T15:04:50Z

While using this great Python package, I ran into a problem when reading some DWC-A files with duplicated fields. I know this is a formatting error in those archives, but I still want to be able to read them.
For reference, here is an example of one of these archives with duplicated fields: Allan Herbarium (CHR).

The solution I found was to adapt the pandas _maybe_dedup_names method from pandas/pandas/io/parsers.py to rename duplicated field names.

The pandas.read_csv method accepts a names argument and, as stated in the documentation, the names parameter must not contain duplicated values:

"Duplicates in this list are not allowed." (pandas read_csv documentation)

So the adapted _maybe_dedup_names method just checks for duplicates and renames them by appending a sequential number ("X.1, X.2, ... X.N"), which is what pandas does when the mangle_dupe_cols parameter is set to True and no names argument is passed to read_csv.
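The renaming described above can be sketched in a few lines of plain Python. This is only an illustration of the "X.1, X.2, ... X.N" scheme, not the code from this pull request and not a copy of the pandas internals; the function name dedup_names and the sample field names are assumptions made for the example.

```python
def dedup_names(names):
    """Return a copy of `names` in which repeated entries get a '.N' suffix.

    Hypothetical helper: the first occurrence keeps its name, later
    occurrences become 'name.1', 'name.2', and so on.
    """
    seen = set()      # every name already emitted
    counts = {}       # last suffix number tried for each original name
    result = []
    for name in names:
        candidate = name
        n = counts.get(name, 0)
        # Bump the suffix until the candidate does not clash with
        # anything emitted so far (including literal 'x.1' columns).
        while candidate in seen:
            n += 1
            candidate = f"{name}.{n}"
        counts[name] = n
        seen.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    fields = ["id", "locality", "locality", "locality"]
    print(dedup_names(fields))
```

The resulting list contains no duplicates, so it could be passed as the names argument of read_csv without triggering the duplicate-name check.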

However, since python-dwca-reader ignores the names kwarg and uses the field names from the DWC-A meta file, it throws an exception when those names (a.k.a. DWC-A fields) are not unique:

Traceback (most recent call last):
  File "dwca-reader.py", line 39, in <module>
    ext_df = dwca.pd_read(e.file_descriptor.file_location, parse_dates = False, mangle_dupe_cols = True)
  File "/home/jose/.local/lib/python3.6/site-packages/dwca/read.py", line 198, in pd_read
    df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    _validate_names(kwds.get("names", None))
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 421, in _validate_names
    raise ValueError("Duplicate names are not allowed.")
ValueError: Duplicate names are not allowed.

A better solution might be to merge the duplicated fields, but for now mangle_dupe_cols = False is not supported by pandas, and a fix is considered complex to implement in pandas (please see pandas issue 13262). For the scope of this package, however, a merge solution might be easier to implement.
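To make the merge idea concrete, here is one possible shape it could take, working on already-parsed rows rather than inside pandas. Everything in it is an assumption: the description does not say how values from duplicated fields should be combined, so this sketch simply joins the distinct non-empty values with a separator. The function name merge_duplicate_fields, the "|" separator, and the sample data are all hypothetical.

```python
def merge_duplicate_fields(names, rows, sep="|"):
    """Collapse columns that share a name into a single column.

    Hypothetical sketch: for each row, the distinct non-empty values of
    the duplicated columns are joined with `sep`. Returns a tuple of
    (unique_names, merged_rows).
    """
    # Map each distinct name to the column positions where it occurs,
    # keeping first-seen order (dicts preserve insertion order).
    positions = {}
    for index, name in enumerate(names):
        positions.setdefault(name, []).append(index)

    merged_rows = []
    for row in rows:
        merged = []
        for indexes in positions.values():
            values = []
            for index in indexes:
                value = row[index]
                # Skip empty cells and exact repeats of an earlier value.
                if value and value not in values:
                    values.append(value)
            merged.append(sep.join(values))
        merged_rows.append(merged)
    return list(positions), merged_rows


if __name__ == "__main__":
    names = ["id", "locality", "locality"]
    rows = [["1", "Lincoln", ""], ["2", "Lincoln", "Canterbury"]]
    print(merge_duplicate_fields(names, rows))
```

Other policies, such as keeping only the first non-empty value, would fit the same structure; which one is appropriate depends on what the duplicated fields in real archives actually contain.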

Please let me know if I can help develop a merging solution, should you agree that it would be better than just renaming the duplicated fields.

Thanks.

best regards.


csbrown · 2023-11-08T18:00:27Z

Probably there should be pytest tests for this?
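As a rough idea of what such tests might look like, the sketch below uses plain pytest-style test functions. The dedup_names function in it is a hypothetical stand-in, included only so the file runs on its own; real tests would import the function added by this pull request instead, and the expected suffix format is assumed to follow the "X.1, X.2" scheme from the description.

```python
def dedup_names(names):
    """Hypothetical stand-in for the renaming function under test."""
    seen, counts, result = set(), {}, []
    for name in names:
        candidate, n = name, counts.get(name, 0)
        while candidate in seen:
            n += 1
            candidate = f"{name}.{n}"
        counts[name] = n
        seen.add(candidate)
        result.append(candidate)
    return result


def test_unique_names_are_unchanged():
    names = ["id", "scientificName", "locality"]
    assert dedup_names(names) == names


def test_duplicates_get_sequential_suffixes():
    names = ["id", "locality", "locality", "locality"]
    expected = ["id", "locality", "locality.1", "locality.2"]
    assert dedup_names(names) == expected


def test_result_never_contains_duplicates():
    # A literal 'locality.1' column must not collide with a generated name.
    names = ["locality", "locality.1", "locality"]
    result = dedup_names(names)
    assert len(result) == len(set(result)) == len(names)


def test_input_list_is_not_modified():
    names = ["a", "a"]
    dedup_names(names)
    assert names == ["a", "a"]
```

Saved under a name such as test_dedup_names.py (also hypothetical), the file would be picked up by running pytest in the same directory.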

Commit 04343b3: Add function to check duplicated names in DWCA fields and rename the duplicated names using sequence numbers
