Add function to check duplicated names in DWCA fields and rename them #81
Hi @niconoe,
While using this great Python package, I ran into a problem when reading some DWC-A files with duplicated fields. I know this is a formatting error in those archives, but I still want to be able to read them.
For reference, here is an example of one of these archives with duplicated fields: Allan Herbarium (CHR).
The solution I found was to adapt the pandas `_maybe_dedup_names` method from pandas/pandas/io/parsers.py to rename duplicated field names.

The `pandas.read_csv` method accepts a `names` argument and, as stated in the documentation, the `names` parameter must not contain duplicated values.

So the adapted `_maybe_dedup_names` method just checks for duplicates and renames them by appending a sequential number ("X.1, X.2, ... X.N"), which is what pandas does when the `mangle_dupe_cols` parameter is set to `True` and no `names` argument is passed to `read_csv`.

However, since python-dwca-reader ignores any `names` kwarg and uses the names from the DWC-A meta file instead, pandas throws an exception when `names` (a.k.a. the DWC-A fields) are not unique:

    Traceback (most recent call last):
      File "dwca-reader.py", line 39, in <module>
        ext_df = dwca.pd_read(e.file_descriptor.file_location, parse_dates = False, mangle_dupe_cols = True)
      File "/home/jose/.local/lib/python3.6/site-packages/dwca/read.py", line 198, in pd_read
        df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
      File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
        _validate_names(kwds.get("names", None))
      File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 421, in _validate_names
        raise ValueError("Duplicate names are not allowed.")
    ValueError: Duplicate names are not allowed.
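To illustrate the renaming idea, here is a minimal sketch of the de-duplication step. It is modelled on the general approach of the pandas helper, not copied from this PR's diff. The function name `dedup_names` and the sample field names are hypothetical, and it assumes the first occurrence keeps its original name while later ones get a `.N` suffix.

```python
from collections import defaultdict


def dedup_names(names):
    """Return a copy of `names` in which repeated entries get a '.N' suffix.

    Hypothetical helper for illustration: the first occurrence of a name is
    kept unchanged, later occurrences become 'X.1', 'X.2', ... 'X.N'.
    """
    names = list(names)  # work on a copy, leave the caller's list untouched
    counts = defaultdict(int)

    for i, col in enumerate(names):
        cur_count = counts[col]
        # Keep appending a suffix until the candidate name is unused so far.
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]
        names[i] = col
        counts[col] = cur_count + 1

    return names


# Example with made-up field names (not taken from a real archive):
fields = ["id", "scientificName", "scientificName", "locality", "scientificName"]
print(dedup_names(fields))
# ['id', 'scientificName', 'scientificName.1', 'locality', 'scientificName.2']
```

The resulting list contains only unique values, so it could be passed as `names` to `read_csv` without triggering the `ValueError` shown above.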
A better solution might be to merge duplicated fields, but `mangle_dupe_cols = False` is not currently supported by pandas, and the fix is considered complex to implement there (please see Pandas Issue 13262). Within the scope of this package, however, a merge solution may be easier to implement.

Please let me know if I can help develop a merging solution, should you agree that it would be better than just renaming the duplicated fields.
Thanks.
Best regards.