[BUG] Sentence input is not fully cleaned in "write", thus errors in "*_sentences.tsv" #4406

HarikalarKutusu · 2024-03-22T15:37:19Z

Describe the bug
I found this while checking the new validated_sentences.tsv, where some rows are split into two or more lines. This happens in many languages. It seems to be connected to the <textarea> tag on the new write page. Earlier sources (.txt files from GitHub, the old Sentence Collector, etc.) do not have this problem, as far as I can see.

Because the write page presents a textarea, people can enter CR, LF, and/or TAB characters and spread an entry over multiple lines. As these characters are not visible, the problem is hard to pinpoint until you output the text to a file. So, why would anyone make a multi-line entry?

People tend to press Enter after typing a line (and then press the submit button).
As multi-line entry is allowed, they might think they can enter multiple sentences.
They could have collected the sentences in a Word document and copy-pasted them into the web page, which might introduce unseen characters.
They might just write downwards, line by line, for fun.
...

To Reproduce
Check your dataset, and/or try pressing ENTER on the write page.
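
One way to check a dataset for this problem is to count the tab-separated fields on each physical line and compare the count with the header. The sketch below is only an illustration: the function name `find_malformed_rows` and the three-column sample are hypothetical and do not reproduce the real *_sentences.tsv layout.

```python
def find_malformed_rows(tsv_text: str) -> list[tuple[int, int]]:
    """Return (line_number, field_count) for lines that do not match the header width."""
    lines = tsv_text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string left by a trailing newline
    if not lines:
        return []
    expected = lines[0].count("\t") + 1
    malformed = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.count("\t") + 1
        if fields != expected:
            malformed.append((number, fields))
    return malformed


# Hypothetical sample: one clean row, one row broken by LF, one with extra TABs.
sample = (
    "sentence_id\tsentence\tsource\n"
    "a1\tA clean sentence.\tself\n"
    "b2\tFirst line\n"
    "second line\tself\n"
    "c3\tTabs\tinside\tself\n"
)
print(find_malformed_rows(sample))  # [(3, 2), (4, 2), (5, 4)]
```

A row split by an embedded LF shows up as two short lines, and a row with embedded TABs shows up as one line that is too wide. A stray CR on its own does not change the field count, so it would need a separate check.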

Expected behavior
The invisible characters such as CR and LF should not enter the database.

Screenshots

Taken from Turkish v17.0 dataset.

From the lv locale, as an example of a multi-sentence entry (id: 1abe332a32c15a932b18eb7f9a4548e93578b1b44d2b162ddccf829486c4de5c):

From the lg locale, where the sentence contains many tab characters, which confuse .tsv readers/parsers (id: 62fb289d81ac03a947a4e46071fae5e29b73c6157dab536540c46b7f89e2821e):

Additional context

There are three points where you can correct this:

Convert the textarea tag to an input tag, so that multi-line text cannot be entered.
After submission, clean the invisible characters out of the entry (\n and \t are the most common).
Clean up before outputting the *_sentences.tsv files.
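
As a rough illustration of the second point, a submitted sentence could either be rejected when it contains CR, LF, or TAB, or be normalized to a single line. This is a minimal sketch, not the project's actual validation code; `is_single_line` and `clean_sentence` are hypothetical names.

```python
import re

LINE_BREAK_OR_TAB = re.compile(r"[\r\n\t]")
WHITESPACE_RUN = re.compile(r"\s+")


def is_single_line(raw: str) -> bool:
    """True when the entry contains no CR, LF, or TAB character."""
    return LINE_BREAK_OR_TAB.search(raw) is None


def clean_sentence(raw: str) -> str:
    """Collapse every whitespace run (including CR, LF, TAB) into one space."""
    return WHITESPACE_RUN.sub(" ", raw).strip()


entry = "First line\r\nsecond\tline \n"
print(is_single_line(entry))        # False
print(repr(clean_sentence(entry)))  # 'First line second line'
```

Note that normalizing joins a multi-sentence entry into one long sentence, so rejecting such entries with a message may be the better choice for the write page. This sketch also covers whitespace only; other invisible characters (for example zero-width spaces) would need their own rule.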

In my opinion, the correct course of action is:

Use an input tag for a single line; it is more intuitive.
Re-check the input validation for invisible characters.
Clean up the current database for such instances (CR, LF, CR/LF, TAB).
After these, a conversion/filtering step would probably not be needed in the export phase, but it could still be added.
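
The database clean-up step could look roughly like the sketch below. Everything about the schema is an assumption made for illustration: the `sentences` table, its `id` and `text` columns, and the use of an in-memory SQLite database as a stand-in for the production database.

```python
import re
import sqlite3

WHITESPACE_RUN = re.compile(r"\s+")


def clean_stored_sentences(conn: sqlite3.Connection) -> int:
    """Rewrite rows whose text contains TAB, LF, or CR; return how many were changed."""
    rows = conn.execute(
        "SELECT id, text FROM sentences "
        "WHERE instr(text, ?) > 0 OR instr(text, ?) > 0 OR instr(text, ?) > 0",
        ("\t", "\n", "\r"),
    ).fetchall()
    for sentence_id, text in rows:
        cleaned = WHITESPACE_RUN.sub(" ", text).strip()
        conn.execute("UPDATE sentences SET text = ? WHERE id = ?", (cleaned, sentence_id))
    conn.commit()
    return len(rows)


# Hypothetical table and data, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (id TEXT PRIMARY KEY, text TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [
        ("a1", "A clean sentence."),
        ("b2", "First line\r\nsecond line"),
        ("c3", "Tabs\tinside"),
    ],
)
changed = clean_stored_sentences(conn)
result = conn.execute("SELECT id, text FROM sentences ORDER BY id").fetchall()
print(changed, result)
```

Only the affected rows are selected and updated, so clean rows are left untouched.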

PS: We checked this over PM with @moz-dfeller, who was able to run a DB query. I am posting it here for completeness.


jessicarose · 2024-04-04T12:17:47Z

Thanks so much for raising this, I'll bring this to the team for further investigation but you've given us just a fantastic amount of context here to start from.

HarikalarKutusu · 2024-04-15T09:27:30Z

For those who need to work on the released text corpora, here is my custom load script, which fixes ALL problems in v17.0, although it is bulky and slow.

https://github.com/HarikalarKutusu/cv-tbox-dataset-compiler/blob/74a788a012fa64a241b4248805ded7a59f196bac/lib.py#L192
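
For readers who only want the general idea, here is a much simpler tolerant loader. It is not the linked script, only a sketch under two assumptions: every record starts with a 64-character lowercase hex id followed by a tab (as in the ids quoted above), and any stray LF or TAB sits inside the sentence column. The name `load_sentences` and the three-column sample are hypothetical.

```python
import re

# Assumption: a new record begins with a 64-character hex id and a tab.
RECORD_START = re.compile(r"[0-9a-f]{64}\t")


def load_sentences(tsv_text: str, sentence_col: int = 1) -> list[list[str]]:
    """Parse TSV text, repairing rows broken by line breaks or TABs in the sentence."""
    lines = tsv_text.splitlines()
    if not lines:
        return []
    width = len(lines[0].split("\t"))  # header defines the expected field count

    # Step 1: glue continuation lines back onto the record they belong to.
    records: list[str] = []
    for line in lines[1:]:
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line

    # Step 2: fold surplus fields back into the sentence column.
    rows: list[list[str]] = []
    for record in records:
        fields = record.split("\t")
        extra = len(fields) - width
        if extra > 0:
            end = sentence_col + extra + 1
            merged = " ".join(fields[sentence_col:end])
            fields = fields[:sentence_col] + [merged] + fields[end:]
        if len(fields) > sentence_col:
            # Collapse leftover whitespace runs inside the sentence.
            fields[sentence_col] = " ".join(fields[sentence_col].split())
        rows.append(fields)
    return rows


id_a, id_b, id_c = "a" * 64, "b" * 64, "c" * 64
sample = (
    "sentence_id\tsentence\tsource\n"
    f"{id_a}\tA clean sentence.\tself\n"
    f"{id_b}\tFirst line\nsecond line\tself\n"
    f"{id_c}\tTabs\tinside\tself\n"
)
rows = load_sentences(sample)
for row in rows:
    print(row[1])
```

The heuristic fails if a break or TAB occurs in a column other than the sentence, so it should be treated as a starting point rather than a complete fix.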

HarikalarKutusu added the Bug label Mar 22, 2024

HarikalarKutusu mentioned this issue Mar 30, 2024

[PR] Complete rework for changes in CV v17.0 HarikalarKutusu/cv-tbox-dataset-compiler#34

Merged

HarikalarKutusu mentioned this issue Apr 7, 2024

[BUG] reported.tsv has broken rows due to LF & TAB characters in sentence and reason fields #4429

Open
