[FR] Detail unvalidated text corpus status #4428
Labels
Enhancement
An idea to enhance an existing feature or process on Common Voice
Text Corpus
Bugs or feature requests that are related to Text Corpus
Is your feature request related to a problem? Please describe.
In the first iteration of the text-corpora release in v17.0, the status of sentences is not clear. AFAIK, there are 4 states for a sentence:
unvalidated_sentences.tsv file, but we have no way of distinguishing them
validated_sentences.tsv file, from the clips_count column we can find if they are recorded (if > 0), and from the is_used field we can see if they are later disabled (if = 0)
During an analysis we need to distinguish all states, but we cannot get these:
Describe the solution you'd like
IMO, the easiest way to do this is to add up/down vote count fields to the unvalidated_sentences.tsv file.
Describe alternatives you've considered
One could also use invalidated_sentences.tsv and other_sentences.tsv files to distinguish, like it is done for voice-corpora, but the former method would be better as it will have more data for analysis and will result in fewer files.
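For context, here is a rough sketch of the part that already seems possible today: labelling rows of validated_sentences.tsv from the clips_count and is_used columns mentioned above. The sample rows, the sentence_id and sentence column names, the state labels, and the assumption that is_used is stored as 0/1 are all made up for illustration; only clips_count and is_used come from the release as described here. Nothing comparable can be written for unvalidated_sentences.tsv without the vote counts.

```python
import csv
import io


def classify_validated(row):
    """Label one validated_sentences.tsv row (labels are hypothetical)."""
    # Assumption: is_used is stored as 0/1; 0 means the sentence was later disabled.
    if int(row["is_used"]) == 0:
        return "disabled"
    # clips_count > 0 means the sentence has been recorded at least once.
    if int(row["clips_count"]) > 0:
        return "recorded"
    return "not_recorded"


# Made-up sample standing in for the real file; sentence_id/sentence are assumed names.
sample = (
    "sentence_id\tsentence\tclips_count\tis_used\n"
    "a1\tFirst sentence.\t0\t1\n"
    "b2\tSecond sentence.\t3\t1\n"
    "c3\tThird sentence.\t2\t0\n"
)

# QUOTE_NONE keeps quote characters inside sentences from being treated as field quoting.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
states = [classify_validated(row) for row in reader]
print(states)
```

The disabled check comes first on purpose, since a sentence can be both recorded and later disabled; swap the order if the recorded state matters more for a given analysis.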