[BUG] Sentence input is not fully cleaned in "write", thus errors in "*_sentences.tsv" #4406

Open
HarikalarKutusu opened this issue Mar 22, 2024 · 2 comments

@HarikalarKutusu
Contributor

HarikalarKutusu commented Mar 22, 2024

Describe the bug
I found this while checking the new validated_sentences.tsv, where some rows are split into two or more lines. This happens in many languages. It seems to be connected to the <textarea> tag on the new write page. As far as I can see, none of the earlier sources (.txt files from GitHub, the old Sentence Collector, etc.) have this problem.

Because the write page presents a textarea, people can enter CR, LF, and/or TAB characters and spread their input over multiple lines. As these characters are not visible, the problem is hard to pinpoint until you write the data to a file. So, why would anyone enter multiple lines?

  • People tend to press enter after inputting a line (then they press the submit button)
  • As multi-line entry is allowed, they might think they can enter multiple sentences
  • They could have collected the sentences in a Word document and copy-pasted them into the web page, which might introduce unseen characters.
  • They can just write downwards, line by line for fun.
  • ...

To Reproduce
Check your dataset and/or try pressing ENTER on the write page.
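
One way to check a released file is to count the tab-separated fields on every physical line. This is a minimal sketch, not the project's tooling: the function name `find_suspect_rows`, the three-column layout, and the sample rows are made up for illustration. The real *_sentences.tsv files have their own column count, which you would pass as `expected_fields`.

```python
def find_suspect_rows(lines, expected_fields):
    """Return (line_number, field_count) for every physical line whose
    tab-separated field count differs from expected_fields."""
    suspects = []
    for line_number, line in enumerate(lines, start=1):
        # Strip only the line terminator; tabs must stay so they can be counted.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_fields:
            suspects.append((line_number, len(fields)))
    return suspects


# Hypothetical three-column sample (not taken from the real dataset).
sample_lines = [
    "sentence_id\tsentence\tsource\n",
    "id1\tA clean sentence.\tsrc\n",
    "id2\tFirst line\n",            # an embedded LF split this row in two
    "second line\tsrc\n",
    "id3\tTabs\tinside\tsrc\n",     # an embedded TAB added a field
]

for line_number, count in find_suspect_rows(sample_lines, expected_fields=3):
    print(f"line {line_number}: {count} fields")
```

With the sample above, lines 3, 4, and 5 are reported. For a real file, any iterable of text lines works, for example an open file object.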

Expected behavior
Invisible characters such as CR and LF should not enter the database.

Screenshots

image
Taken from Turkish v17.0 dataset.

From the lv locale, as an example of a multi-sentence entry (id: 1abe332a32c15a932b18eb7f9a4548e93578b1b44d2b162ddccf829486c4de5c):
image

From lg, where the sentence contains many tab characters, which confuse .tsv readers/parsers (id: 62fb289d81ac03a947a4e46071fae5e29b73c6157dab536540c46b7f89e2821e):
image

Additional context

There are three points where you can correct this:

  • Convert the textarea tag to an input tag, so that multi-line text cannot be entered.
  • After submission, strip invisible characters from the entry (\n and \t are the most common).
  • Clean up before writing the *_sentences.tsv files.

In my opinion, the correct course of action is:

  • Use an input tag for a single line; it is more intuitive.
  • Re-check the input validation for invisible characters.
  • Clean up the current database for such instances (CR, LF, CR/LF, TAB).
  • After these steps, conversion/filtering in the export phase would probably not be needed, but it could be added anyway.
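
The clean-up step mentioned above (after submission or before export) could look roughly like the following. This is only a sketch under my own assumptions: `clean_sentence` is a hypothetical helper, not a function in the Common Voice code base, and it handles only the characters named in this issue (CR, LF, TAB).

```python
import re

# Characters named in this issue: CR, LF, and TAB (runs of them are treated as one).
_BREAKING_WHITESPACE = re.compile(r"[\r\n\t]+")
_MULTI_SPACE = re.compile(r" {2,}")


def clean_sentence(text: str) -> str:
    """Replace CR/LF/TAB runs with a single space and trim the result."""
    # Substitute a space instead of deleting, so words on adjacent lines do not fuse.
    text = _BREAKING_WHITESPACE.sub(" ", text)
    # Collapse any double spaces created by the substitution above.
    text = _MULTI_SPACE.sub(" ", text)
    return text.strip()


print(repr(clean_sentence("Hello\r\nworld\t!\n")))  # 'Hello world !'
```

Note that this only keeps a row on one line. An entry that really contains several sentences would still end up joined into a single row, so deciding whether to split or reject such entries is a separate question.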

PS: We checked this with @moz-dfeller via PM, who could run a DB query. I am posting it here for completeness.

@jessicarose
Collaborator

Thanks so much for raising this, I'll bring this to the team for further investigation but you've given us just a fantastic amount of context here to start from.

@HarikalarKutusu
Contributor Author

For those who need to work with the released text corpora, here is my custom load script, which fixes ALL problems in v17.0, although it is bulky and slow.

https://github.com/HarikalarKutusu/cv-tbox-dataset-compiler/blob/74a788a012fa64a241b4248805ded7a59f196bac/lib.py#L192
