
[BUG] reported.tsv has broken rows due to LF & TAB characters in sentence and reason fields #4429

Open
HarikalarKutusu opened this issue Apr 7, 2024 · 2 comments
Labels
Bug, Text Corpus (Bugs or feature requests that are related to Text Corpus)

Comments

@HarikalarKutusu
Contributor

Describe the bug

This is similar to #4406 .

The reported.tsv files are broken for many language/dataset versions, i.e. they do not conform to the expected four-column (sentence_id, sentence, locale, reason) structure, because of LF, CRLF, and TAB characters inside cells.

These are caused not only by the text in the "sentence" field, but also by the "reason" field, where people can write their own reasons in the textarea presented to them if they select the "Other" option.

To Reproduce

It occurs in many files, but just open the one in the v17.0 en dataset to see multiple forms of the problem.
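For a quick check, here is a minimal sketch that counts the broken rows by splitting on TAB directly (the file path is just a placeholder for a local copy of the dataset):

```python
from pathlib import Path

# Placeholder path: point this at a local copy of the dataset.
path = Path("cv-corpus-17.0-2024-03-15/en/reported.tsv")

bad = 0
with path.open(encoding="utf-8", newline="") as f:
    f.readline()  # skip header: sentence_id, sentence, locale, reason
    for line in f:
        # An embedded LF/CRLF splits a record across lines, an embedded TAB
        # adds extra columns - either way the field count is no longer 4.
        if len(line.rstrip("\r\n").split("\t")) != 4:
            bad += 1

print(f"{bad} malformed rows")
```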

Expected behavior

  • The sentences should be cleaned at entry (Write page & bulk submission).
  • The "reason" entered through the "Other" option should be an input field (not a textarea) and/or cleaned at entry, before being inserted into the database.
  • During creation of reported.tsv, both the sentence and reason fields should be cleaned (one should not surround them with quotation marks, as this would change the actual values).
  • Ideally, the database entries should also be cleaned (and hash values recalculated and updated in a cascading manner).

Screenshots

[Screenshot attached]

Additional context

I recognized the problem after I posted #4406. In my library I had a pandas pd.read_csv call where malformed lines were skipped. I changed that setting to "warn" and scanned all reported.tsv files during analysis, which output thousands of warnings. This also means my analysis results are wrong, or at least incomplete.
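A minimal sketch of that kind of read, assuming pandas >= 1.3 (the path is a placeholder, not the exact code from my library):

```python
import csv
import pandas as pd

# "warn" reports each malformed row on stderr instead of silently
# skipping it ("skip") or raising an exception ("error").
df = pd.read_csv(
    "cv-corpus-17.0-2024-03-15/en/reported.tsv",
    sep="\t",
    on_bad_lines="warn",
    quoting=csv.QUOTE_NONE,  # do not treat quote characters as special
    dtype=str,
)
print(len(df), "rows parsed")
```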

This does not happen in the validated, invalidated, other, train, dev, and test .tsv files, as the CorporaCreator that produces them has pre-processing (cleaning) routines which prevent it.

On the other hand, whenever a sentence is processed/cleaned after it has been inserted into the database, the crypto-hash value in sentence_id becomes invalid (it loses the assumed 1-to-1 relationship).

I'd recommend a single sentence clean-up library routine, implemented in TypeScript AND Python and used everywhere consistently. This would also help downstream libraries/applications to see what is done / what should be done.
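As a starting point, a minimal sketch of what such a routine might look like on the Python side (the name clean_field and the exact rules are only assumptions to be agreed on, and would need to be mirrored 1:1 in TypeScript):

```python
import re
import unicodedata

_WS_RE = re.compile(r"\s+")

def clean_field(text: str) -> str:
    """Hypothetical clean-up: make a value safe to store and to write
    into a TSV cell without breaking the row structure."""
    # Turn control characters (TAB, LF, CR, ...) into plain spaces ...
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    # ... then collapse whitespace runs into a single space and trim the ends.
    return _WS_RE.sub(" ", text).strip()

print(clean_field("A sentence\twith\r\nembedded breaks"))
# -> "A sentence with embedded breaks"
```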

@jessicarose
Collaborator

Thank you so much for this. I'm currently researching all the text corpus bugs and working with the team on getting them triaged. This is an incredibly detailed and helpful issue; the rest of the team and I really appreciate it.

@jessicarose added the Text Corpus label on Apr 15, 2024
@HarikalarKutusu
Contributor Author

If you need to work on a reported.tsv with such problems, here is my read function, which solves MOST of the problems - not ALL of them, because problems caused by the two different fields can combine unpredictably.

https://github.com/HarikalarKutusu/cv-tbox-dataset-compiler/blob/74a788a012fa64a241b4248805ded7a59f196bac/lib.py#L282
