[BUG] Retire toponym corpus from Catalan Common Voice #4337
Comments
Thanks so much for getting in touch with this issue. When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?
Thank you for your question.
Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team. The fastest fix for rebalancing the Catalan dataset would be to dilute these sentences with fresh bulk uploads of Catalan sentences, which would give speakers and the dataset a more varied pool of sentences to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep it interesting for contributors, and create more useful data for dataset consumers?
Thank you for your response. What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to do that? As for adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus. Kind regards
@c-armentano, AFAIK there is one way. The If they are synthetically generated, like "We are going to [place]", they can also be manipulated using the sentence field.
Describe the bug
These sentences are far too repetitive:
https://github.com/common-voice/common-voice/blob/main/server/data/ca/frases_agenda.txt.
We created them to obtain a corpus with all the toponyms of the Catalan-speaking area, but we weren't aware that they would be recorded more than once. Some volunteers complained they were repetitive, and they may lead to a phonetically unbalanced corpus.
To Reproduce
N/A
Expected behavior
We would like to prevent them from being shown again for recording.
Screenshots
N/A
Desktop or Mobile (please complete the following information):
Additional Hardware (were you using headphones, an external speaker or an external microphone?):
Additional context
N/A