[FR] Detail unvalidated text corpus status #4428

HarikalarKutusu · 2024-04-05T06:56:49Z

Is your feature request related to a problem? Please describe.
In the first iteration of the text-corpora release in v17.0, the status of sentences is not clear. AFAIK, a sentence can be in one of four states:

1. Added, not yet validated (needed for community planning)
2. Added, invalidated (one can revisit these sentences, correct the mistakes, and repost them)
3. Added, validated: can be recorded (maybe already recorded)
4. Added, validated (possibly recorded), then disabled (one may want to remove these from the voice corpus to get better quality)

- States 1 and 2 are in the unvalidated_sentences.tsv file, but we have no way of distinguishing them.
- States 3 and 4 are in the validated_sentences.tsv file. From the clips_count column we can tell whether a sentence has been recorded (if > 0), and from the is_used field whether it was later disabled (if = 0).
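To illustrate the second point, here is a minimal sketch of how states 3 and 4 can be separated today. Only the clips_count and is_used column names come from the release as described above; the other columns in the sample data, the state labels, and the function names are made up for illustration.

```python
import csv
import io
from collections import Counter


def classify_validated(row):
    """Return a state label for one row of validated_sentences.tsv.

    Uses only the two columns mentioned above: clips_count and is_used.
    The label strings are illustrative, not part of the dataset.
    """
    recorded = int(row["clips_count"] or 0) > 0
    disabled = row["is_used"].strip() == "0"
    if disabled:
        return "disabled"  # state 4
    return "validated_recorded" if recorded else "validated_unrecorded"  # state 3


def count_validated_states(tsv_file):
    """Count states over an open TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(classify_validated(row) for row in reader)


# Tiny in-memory sample; sentence_id and sentence are assumed column names.
SAMPLE = (
    "sentence_id\tsentence\tis_used\tclips_count\n"
    "a1\tFirst sentence.\t1\t3\n"
    "a2\tSecond sentence.\t1\t0\n"
    "a3\tThird sentence.\t0\t2\n"
)

validated_counts = count_validated_states(io.StringIO(SAMPLE))
print(validated_counts)
```

With a real release, one would pass an open handle to validated_sentences.tsv instead of the in-memory sample. Nothing comparable is possible for unvalidated_sentences.tsv, which is the gap this request is about.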

For analysis we need to distinguish all four states, but we currently cannot get these counts:

- How many sentences are waiting to be validated
- How many sentences are invalidated

Describe the solution you'd like
IMO, the easiest way to do this is to add up-vote and down-vote count fields to the unvalidated_sentences.tsv file.
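As a rough sketch of what this would enable: the column names up_votes and down_votes below are hypothetical (they do not exist in the current release), and the REJECT_AT threshold is a placeholder, not Common Voice's actual validation rule.

```python
import csv
import io
from collections import Counter

# Placeholder threshold: how many down votes mark a sentence as invalidated.
# This is an assumption for illustration, not the real Common Voice rule.
REJECT_AT = 2


def unvalidated_state(row, reject_at=REJECT_AT):
    """Classify one row of unvalidated_sentences.tsv as pending or invalidated.

    Relies on a hypothetical down_votes column; up_votes would be kept
    alongside it for further analysis but is not needed for this split.
    """
    down = int(row["down_votes"] or 0)
    return "invalidated" if down >= reject_at else "pending"


def count_unvalidated_states(tsv_file, reject_at=REJECT_AT):
    """Count pending vs. invalidated sentences in an open TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(unvalidated_state(row, reject_at) for row in reader)


# In-memory sample with the proposed (hypothetical) vote columns.
SAMPLE = (
    "sentence_id\tsentence\tup_votes\tdown_votes\n"
    "b1\tNo votes yet.\t0\t0\n"
    "b2\tOne up vote.\t1\t0\n"
    "b3\tTwo down votes.\t0\t2\n"
    "b4\tMostly down votes.\t1\t3\n"
)

unvalidated_counts = count_unvalidated_states(io.StringIO(SAMPLE))
print(unvalidated_counts)
```

With the raw counts in the file, anyone analysing the data could apply whatever threshold matches the actual validation logic, and could also see how close pending sentences are to a decision.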

Describe alternatives you've considered
One could also use invalidated_sentences.tsv and other_sentences.tsv files to distinguish them, as is done for the voice corpora, but the former method would be better, as it would provide more data for analysis and result in fewer files.


jessicarose added the Enhancement and Text Corpus labels on Apr 15, 2024
