[FR] Detail unvalidated text corpus status #4428

Open
HarikalarKutusu opened this issue Apr 5, 2024 · 0 comments
Labels
Enhancement: An idea to enhance an existing feature or process on Common Voice
Text Corpus: Bugs or feature requests that are related to Text Corpus

Comments

@HarikalarKutusu
Contributor

Is your feature request related to a problem? Please describe.
In the first iteration of the text-corpora release in v17.0, the status of each sentence is not clear. AFAIK, there are 4 states for a sentence:

  1. Added, not yet validated (needed for community planning)
  2. Added, invalidated (one can revisit these sentences, correct the mistakes, and repost them)
  3. Added, validated: can be recorded (and may already be recorded)
  4. Added, validated (possibly recorded), then disabled (one may want to remove these from the voice corpus to get better quality)
  • States 1 & 2 are both in the unvalidated_sentences.tsv file, but we have no way of distinguishing them.
  • States 3 & 4 are both in the validated_sentences.tsv file. From the clips_count column we can tell whether a sentence has been recorded (if > 0), and from the is_used field we can tell whether it was later disabled (if = 0).
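As an illustration of the second bullet, here is a minimal sketch of how states 3 and 4 could be separated today. It relies only on the clips_count and is_used columns mentioned above. The other column names, the sample rows, and the state labels are assumptions made up for the example, not the actual release format.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; only `is_used` and `clips_count` come from the issue text.
# `sentence_id` and `sentence` are assumed column names for illustration.
SAMPLE_VALIDATED_TSV = (
    "sentence_id\tsentence\tis_used\tclips_count\n"
    "a1\tFirst sentence.\t1\t0\n"
    "a2\tSecond sentence.\t1\t3\n"
    "a3\tThird sentence.\t0\t2\n"
    "a4\tFourth sentence.\t0\t0\n"
)


def classify_validated(row):
    """Return a state label for one row of validated_sentences.tsv."""
    recorded = int(row["clips_count"]) > 0   # recorded if clips_count > 0
    disabled = int(row["is_used"]) == 0      # disabled if is_used = 0
    if disabled:
        return "disabled_recorded" if recorded else "disabled_unrecorded"
    return "validated_recorded" if recorded else "validated_unrecorded"


def count_validated_states(tsv_file):
    """Count rows per state from an open TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(classify_validated(row) for row in reader)


validated_counts = count_validated_states(io.StringIO(SAMPLE_VALIDATED_TSV))
for state, n in sorted(validated_counts.items()):
    print(state, n)
```

For a real release, `io.StringIO(...)` would be replaced by `open("validated_sentences.tsv", encoding="utf-8", newline="")`. No equivalent function can be written for unvalidated_sentences.tsv, which is the gap this request is about.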

During an analysis we need to distinguish all four states, but currently we cannot get these numbers:

  • How many sentences are waiting to be validated
  • How many sentences are invalidated

Describe the solution you'd like
IMO, the easiest way to do this is to add up/down vote count fields to the unvalidated_sentences.tsv file.
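A rough sketch of how such fields might be used, assuming they existed. The column names `up_votes` and `down_votes`, the sample rows, and the rejection threshold of 2 down votes are all hypothetical. The issue does not specify them, and the actual invalidation rule used by Common Voice may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical layout: `up_votes` and `down_votes` do not exist in the
# current unvalidated_sentences.tsv; they stand in for the proposed fields.
SAMPLE_UNVALIDATED_TSV = (
    "sentence_id\tsentence\tup_votes\tdown_votes\n"
    "b1\tNo votes yet.\t0\t0\n"
    "b2\tOne up vote so far.\t1\t0\n"
    "b3\tRejected by reviewers.\t0\t2\n"
    "b4\tMixed but rejected.\t1\t2\n"
)


def classify_unvalidated(row, reject_threshold=2):
    """Label a row as 'invalidated' or 'pending'.

    reject_threshold is an assumed cutoff, not a documented Common Voice rule.
    """
    if int(row["down_votes"]) >= reject_threshold:
        return "invalidated"
    return "pending"


def count_unvalidated_states(tsv_file, reject_threshold=2):
    """Count pending vs. invalidated rows from an open TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(classify_unvalidated(row, reject_threshold) for row in reader)


unvalidated_counts = count_unvalidated_states(io.StringIO(SAMPLE_UNVALIDATED_TSV))
print("waiting to be validated:", unvalidated_counts["pending"])
print("invalidated:", unvalidated_counts["invalidated"])
```

With the raw vote counts in the file, the two numbers listed earlier (sentences waiting for validation and sentences invalidated) could be derived with whatever threshold the project actually applies.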

Describe alternatives you've considered
One could also use invalidated_sentences.tsv and other_sentences.tsv files to distinguish them, as is done for the voice corpora, but the former method would be better because it provides more data for analysis and results in fewer files.

@jessicarose added the Enhancement and Text Corpus labels on Apr 15, 2024