
[BUG] reported.tsv has broken rows due to LF & TAB characters in sentence and reason fields #4429

Open
HarikalarKutusu opened this issue Apr 7, 2024 · 2 comments
Labels
Bug, Text Corpus (Bugs or feature requests that are related to Text Corpus)

Comments

@HarikalarKutusu
Contributor

Describe the bug

This is similar to #4406 .

The reported.tsv files are broken for many language/dataset versions, i.e. they do not conform to the expected four-column (sentence_id, sentence, locale, reason) structure, because of LF, CRLF, and TAB characters inside cells.

These are caused not only by the text in the "sentence" field, but also by the "reason" field, where people can write their own reasons in the textarea presented to them if they select the "Other" option.

To Reproduce

It occurs in many files, but just open the one in the v17.0 en dataset to see multiple forms of the problem.
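For a quick check, here is a minimal sketch that counts the broken rows by splitting on TAB directly (the file path is just a placeholder for a local copy of the dataset):

```python
from pathlib import Path

# Placeholder path: point this at a local copy of the dataset.
path = Path("cv-corpus-17.0-2024-03-15/en/reported.tsv")

bad = 0
with path.open(encoding="utf-8", newline="") as f:
    f.readline()  # skip header: sentence_id, sentence, locale, reason
    for line in f:
        # An embedded LF/CRLF splits a record across lines, an embedded TAB
        # adds extra columns - either way the field count is no longer 4.
        if len(line.rstrip("\r\n").split("\t")) != 4:
            bad += 1

print(f"{bad} malformed rows")
```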

Expected behavior

  • The sentences should be cleaned at entry (Write page & bulk submission).
  • The "reason" entered through the "Other" option should be an input field (not a textarea) and/or cleaned at entry, before being inserted into the database.
  • During creation of reported.tsv, both the sentence and reason fields should be cleaned (one should not surround them with quotation marks, as this would change the actual values).
  • Ideally, the database entries should also be cleaned (and hash values recalculated and updated in a cascading manner).

Screenshots

[Screenshot attached]

Additional context

I recognized the problem after I posted #4406. In my library I had a pandas pd.read_csv call where malformed lines were skipped. I changed that setting to "warn" and scanned all reported.tsv files during analysis, which output thousands of warnings. This also means my analysis results are wrong, or at least incomplete.
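A minimal sketch of that kind of read, assuming pandas >= 1.3 (the path is a placeholder, not the exact code from my library):

```python
import csv
import pandas as pd

# "warn" reports each malformed row on stderr instead of silently
# skipping it ("skip") or raising an exception ("error").
df = pd.read_csv(
    "cv-corpus-17.0-2024-03-15/en/reported.tsv",
    sep="\t",
    on_bad_lines="warn",
    quoting=csv.QUOTE_NONE,  # do not treat quote characters as special
    dtype=str,
)
print(len(df), "rows parsed")
```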

This does not happen in the validated, invalidated, other, train, dev, and test .tsv files, as the CorporaCreator that produces them has pre-processing (cleaning) routines which prevent it.

On the other hand, whenever a sentence is processed/cleaned after it has been inserted into the database, the crypto-hash value in sentence_id becomes invalid (it loses the assumed 1-to-1 relationship).

I'd recommend a single sentence clean-up library routine, implemented in TypeScript AND Python and used everywhere consistently. This would also help downstream libraries/applications to see what is done / what should be done.
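As a starting point, a minimal sketch of what such a routine might look like on the Python side (the name clean_field and the exact rules are only assumptions to be agreed on, and would need to be mirrored 1:1 in TypeScript):

```python
import re
import unicodedata

_WS_RE = re.compile(r"\s+")

def clean_field(text: str) -> str:
    """Hypothetical clean-up: make a value safe to store and to write
    into a TSV cell without breaking the row structure."""
    # Turn control characters (TAB, LF, CR, ...) into plain spaces ...
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    # ... then collapse whitespace runs into a single space and trim the ends.
    return _WS_RE.sub(" ", text).strip()

print(clean_field("A sentence\twith\r\nembedded breaks"))
# -> "A sentence with embedded breaks"
```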

@jessicarose
Collaborator

Thank you so much for this. I'm currently researching all the text corpus bugs and working with the team on getting them triaged. This is an incredibly detailed and helpful issue; the rest of the team and I really appreciate it.

@jessicarose added the Text Corpus label on Apr 15, 2024
@HarikalarKutusu
Contributor Author

If you need to work on a reported.tsv with such problems, here is my read function, which solves MOST of the problems - not ALL of them, because problems caused by the two different fields can combine unpredictably.

https://github.com/HarikalarKutusu/cv-tbox-dataset-compiler/blob/74a788a012fa64a241b4248805ded7a59f196bac/lib.py#L282
