Describe the bug
I found this while checking the new validated_sentences.tsv, where some rows are split into two or more lines. This happens in many languages. It seems to be connected to the <textarea> tag on the new write page. No prior source (.txt files from GitHub, the old Sentence Collector, etc.) has this problem, as far as I can see.
Because the write page presents a textarea, people can enter CR, LF, and/or TAB characters and spread an entry over multiple lines. As these characters are not visible, the problem is hard to pinpoint until you output the data to a file. So, why would anyone make a multi-line entry?
People tend to press Enter after typing a line (and then press the submit button).
As multi-line entry is allowed, they might think they can enter multiple sentences.
They could have collected the sentences in a Word document and copy-pasted them to the web, which might introduce unseen characters.
They might simply write downwards, line by line, for fun.
...
To Reproduce
Check your dataset and/or try pressing ENTER on the write page.
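As a rough aid for the "check your dataset" step, here is a minimal Python sketch that flags physical lines whose field count differs from the header's. It is not an official tool. The function name `find_malformed_rows` and the three-column sample are hypothetical; the real validated_sentences.tsv has its own header. It only catches embedded LF and TAB characters, since a lone CR does not change the field count.

```python
def find_malformed_rows(tsv_text: str) -> list[int]:
    """Return 1-based physical line numbers whose field count differs from the header's."""
    # Split on LF only, so an embedded line break shows up as an extra physical line.
    lines = tsv_text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the newline that terminates the file
    if not lines:
        return []
    expected = len(lines[0].rstrip("\r").split("\t"))
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if len(line.rstrip("\r").split("\t")) != expected
    ]


# Hypothetical sample: one clean row, one row split by LF, one row with extra TABs.
sample = (
    "sentence_id\tsentence\tsource\n"
    "a1\tA clean sentence.\tself\n"
    "b2\tFirst line\n"
    "second line\tself\n"
    "c3\tTabs\tinside\tthe text\tself\n"
)
print(find_malformed_rows(sample))  # [3, 4, 5]
```

Both halves of the split row (lines 3 and 4) are reported, because each half has too few fields on its own.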
Expected behavior
The invisible characters such as CR and LF should not enter the database.
Screenshots
Taken from Turkish v17.0 dataset.
From the lv locale, as an example of a multi-sentence entry (id: 1abe332a32c15a932b18eb7f9a4548e93578b1b44d2b162ddccf829486c4de5c):
From lg, where the sentence contains many tab characters, which confuse .tsv readers/parsers (id: 62fb289d81ac03a947a4e46071fae5e29b73c6157dab536540c46b7f89e2821e):
Additional context
There are three points where you can correct this:
Convert the textarea tag to an input tag, so that multi-line text cannot be entered.
After submission, strip invisible characters from the entry (\n and \t are the most common).
Clean up before outputting the *_sentences.tsv files.
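The second point (cleaning up after submission) could look roughly like the Python sketch below. It is only an illustration, not the project's actual validation code, and the name `sanitize_sentence` is hypothetical. It assumes that CR, LF, TAB, vertical tab, and form feed should each become a single space, and it deliberately leaves other Unicode spaces (such as the non-breaking space) untouched.

```python
import re

# CR, LF, TAB, vertical tab, and form feed: the characters that break a one-line TSV record.
_CONTROL_RUN = re.compile(r"[\t\n\r\x0b\x0c]+")
_SPACE_RUN = re.compile(r" {2,}")


def sanitize_sentence(raw: str) -> str:
    """Replace line breaks and tabs with single spaces, then trim the ends."""
    flattened = _CONTROL_RUN.sub(" ", raw)
    return _SPACE_RUN.sub(" ", flattened).strip()


print(repr(sanitize_sentence("First line\r\nsecond\tline \n")))  # 'First line second line'
```

Joining the lines is a design choice. For entries that contain several sentences, rejecting the submission or splitting it into separate entries may be more appropriate than silently merging them.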
In my opinion, the correct course of action is:
Use an input tag for a single line; it is more intuitive.
Re-check the input validation for invisible characters.
Clean up the current database for such instances (CR, LF, CR/LF, TAB).
After these steps, a conversion/filtering step would probably not be needed in the export phase, but it could still be added.
PS: We checked this with @moz-dfeller via PM, who could run a DB query. I am posting it here for completeness.
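If a filter is added in the export phase anyway, a minimal Python sketch could look like this. It assumes the exporter receives each row as a sequence of strings. The names `clean_field` and `write_tsv` and the two-column sample are hypothetical and do not describe the project's actual export code.

```python
import io
import re
from typing import Iterable, TextIO

# Characters that would split a record across lines or columns in a TSV file.
_CONTROL_RUN = re.compile(r"[\t\n\r\x0b\x0c]+")


def clean_field(value: str) -> str:
    """Replace each run of line breaks or tabs with one space and trim the ends."""
    return _CONTROL_RUN.sub(" ", value).strip()


def write_tsv(rows: Iterable[Iterable[str]], out: TextIO) -> None:
    """Write one physical line per row, with fields separated by a single TAB."""
    for row in rows:
        out.write("\t".join(clean_field(field) for field in row) + "\n")


buffer = io.StringIO()
write_tsv([["sentence_id", "sentence"], ["a1", "First line\nsecond\tline"]], buffer)
print(repr(buffer.getvalue()))  # 'sentence_id\tsentence\na1\tFirst line second line\n'
```

Because every field is cleaned just before it is written, each record is guaranteed to occupy exactly one line with the expected number of columns, even if unclean rows remain in the database.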
Thanks so much for raising this. I'll bring it to the team for further investigation; you've given us a fantastic amount of context to start from.
For those who need to work on the released text corpora, here is my custom load script, which fixes ALL problems in v17.0, although it is bulky and slow.