Describe the bug
This is similar to #4406.
The reported.tsv files are broken for many language/dataset versions, i.e. they do not conform to the expected four-column (sentence_id, sentence, locale, reason) structure, because of LF, CRLF, and TAB characters in cells.
These are caused not only by text in the "sentence" field, but also by the "reason" field, where people can write their own reasons in the textarea presented to them when they select the "Other" option.
To Reproduce
The problem exists in many files, but just open the one in the v17.0 en dataset to see multiple forms of it.
Expected behavior
Sentences should be cleaned at entry (write page & bulk submission).
The "reason" given when reporting through "Other" should be an input field and/or cleaned at entry, before it is inserted into the database.
During creation of reported.tsv, both the sentence and reason fields should be cleaned (one should not simply surround them with quotation marks, as this would change the real values).
Ideally the database entries should also be cleaned (with hash values recalculated and updated in a cascading manner).
Screenshots
Additional context
I recognized the problem after I posted #4406. In my library I had a pandas pd.read_csv call where malformed lines were skipped. I changed it to "warn" and scanned all reported.tsv files during analysis, and it output thousands of warnings. This also means my analysis results are wrong, or at least incomplete.
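The skip-versus-warn behavior described above can be reproduced with a small sketch. The file contents below are invented for illustration; the point is that a row with a stray TAB in the "reason" cell has five fields instead of four, and on_bad_lines="warn" reports it instead of dropping it silently:

```python
import csv
import io

import pandas as pd

# A miniature reported.tsv: one well-formed row, and one row broken by a
# stray TAB inside the "reason" cell (contents invented for illustration).
broken = (
    "sentence_id\tsentence\tlocale\treason\n"
    "abc123\tHello world\ten\tgrammar-or-spelling\n"
    "def456\tAnother one\ten\tmy own\treason with a tab\n"
)

df = pd.read_csv(
    io.StringIO(broken),
    sep="\t",
    quoting=csv.QUOTE_NONE,   # the dumps are not quoted
    on_bad_lines="warn",      # warn about malformed rows instead of failing
)

print(df.shape)  # the malformed row is skipped, so only one row survives
```

With on_bad_lines="skip" (the original setting) the broken row vanishes without a trace, which is how the analysis ended up silently incomplete.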
This does not happen in the validated, invalidated, other, train, dev, and test .tsv files, as the CorporaCreator that produces them has pre-processing (cleaning) routines which prevent it.
On the other hand, whenever a sentence is processed/cleaned after it is inserted into the database, the crypto-hash value in sentence_id becomes invalid (loses the assumed 1-1 relationship).
I'd recommend a single sentence clean-up library routine, implemented in TypeScript AND Python and used everywhere consistently. This would also help downstream libraries/applications see what is done / what should be done.
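As a sketch of what such a shared routine might look like: the replacement rules below (collapsing any whitespace run, including TAB/CR/LF, into a single space) are an assumption for illustration, not the project's actual or proposed behavior:

```python
import re

# Hypothetical shared clean-up routine: collapse any character sequence that
# would break a TSV row (TAB, CR, LF, and other whitespace runs) into a
# single space, and trim the ends.
_WHITESPACE = re.compile(r"\s+")

def clean_field(text: str) -> str:
    """Return a TSV-safe version of a sentence or reason field."""
    return _WHITESPACE.sub(" ", text).strip()

print(clean_field("my own\treason\r\nwith breaks"))
# my own reason with breaks
```

A routine like this would have to be byte-for-byte identical in the TypeScript and Python ports, since the sentence_id hash depends on the exact cleaned text.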
Thank you so much for this, I'm currently researching all the text corpus bugs and working with the team on getting these triaged. This is an incredibly detailed and helpful issue; the rest of the team and I really appreciate it.
If you need to work on reported.tsv files with this problem, here is my read function, which solves MOST of the problems - not ALL of them, because problems caused by the two different fields interacting can be unpredictable.