[BUG] Retire toponym corpus from Catalan Common Voice #4337

Open
c-armentano opened this issue Jan 23, 2024 · 5 comments
Labels
Bug, Text Corpus

Comments

@c-armentano
Contributor

Describe the bug
These sentences are far too repetitive:
https://github.com/common-voice/common-voice/blob/main/server/data/ca/frases_agenda.txt.

We created them to obtain a corpus with all the toponyms of the Catalan-speaking area, but we weren't aware that they would be recorded more than once. Some volunteers complained they were repetitive, and they may lead to a phonetically unbalanced corpus.

To Reproduce
N/A

Expected behavior
We would like to prevent them from being shown again for recording.

Screenshots
N/A

Desktop or Mobile (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional Hardware (were you using headphones, an external speaker or an external microphone?):

  • Type:
  • Model:

Additional context
N/A

@ftyers changed the title from "[BUG]" to "[BUG] Retire toponym corpus from Catalan Common Voice" on Jan 23, 2024
@jessicarose
Collaborator

Thanks so much for getting in touch with this issue.

When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?

@jessicarose added the Text Corpus label on Jan 24, 2024
@c-armentano
Contributor Author

Thank you for your question.
Some of these sentences have been shown to Catalan voice contributors (and recorded) more than once (up to 4 times in v.16). I see that some others (about 1100) have never been recorded, but since they are so similar (same sentence structure with only the place names changing), we see no value in recording them.
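For anyone who wants to check these counts against a release, here is a minimal sketch. It assumes the release ships a tab-separated clips file with a `sentence` column and that the toponym sentences are available as a plain list; the function name and the sample rows are hypothetical and only illustrate the counting.

```python
import csv
import io
from collections import Counter


def recording_counts(tsv_file, toponym_sentences):
    """Count how many clips exist for each sentence in toponym_sentences.

    Sentences with no clip at all are kept with a count of 0, so
    never-recorded sentences are visible in the result.
    """
    counts = Counter({s: 0 for s in toponym_sentences})
    # QUOTE_NONE: treat quote characters in sentences as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        sentence = row["sentence"].strip()
        if sentence in counts:
            counts[sentence] += 1
    return counts


# Hypothetical sample data standing in for a real clips file.
sample = (
    "client_id\tpath\tsentence\n"
    "c1\ta.mp3\tWe are going to place A\n"
    "c2\tb.mp3\tWe are going to place A\n"
    "c3\tc.mp3\tAn unrelated sentence\n"
)
toponyms = ["We are going to place A", "We are going to place B"]
counts = recording_counts(io.StringIO(sample), toponyms)
repeated = [s for s, n in counts.items() if n > 1]
never_recorded = [s for s, n in counts.items() if n == 0]
```

With a real release, `io.StringIO(sample)` would be replaced by an open file handle, and `toponyms` by the lines of the toponym sentence file.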

@jessicarose
Collaborator

Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team.

The fastest fix for re-balancing the Catalan dataset would be to dilute these sentences with fresh bulk uploads of Catalan sentences, which would give speakers and the dataset a more varied pool of sentences to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep it interesting for contributors, and create more useful data for dataset consumers?

@c-armentano
Contributor Author

Thank you for your response.

What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to get it?

Regarding adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus.

Kind regards

@HarikalarKutusu
Contributor

Is there any way to get it?

@c-armentano, AFAIK there is one way. The is_used field in the sentences table controls it. If it is set to 0 (false), the sentence will not be shown for new recordings. The catch is that you need to collect the sentence_id values of all these sentences and make a PR changing the database.

If they are synthetically generated, like "We are going to [place]", they can also be manipulated using the sentence field.
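A rough sketch of both approaches, using an in-memory SQLite table as a stand-in. Everything here besides the `sentences` table name and the `is_used` flag is an assumption for illustration: the column names `id` and `sentence`, the helper names, and the sample rows are hypothetical, and the real database and the way changes are submitted to it may look quite different.

```python
import sqlite3


def ids_matching_template(conn, like_pattern):
    """Return ids of sentences whose text matches a SQL LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM sentences WHERE sentence LIKE ?", (like_pattern,)
    )
    return [r[0] for r in rows]


def retire_sentences(conn, sentence_ids):
    """Set is_used = 0 for the given ids; return how many rows changed."""
    if not sentence_ids:
        return 0
    placeholders = ",".join("?" for _ in sentence_ids)
    cur = conn.execute(
        f"UPDATE sentences SET is_used = 0 "
        f"WHERE id IN ({placeholders}) AND is_used = 1",
        list(sentence_ids),
    )
    conn.commit()
    return cur.rowcount


# Toy table with an assumed schema, not the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "id TEXT PRIMARY KEY, sentence TEXT, is_used INTEGER DEFAULT 1)"
)
conn.executemany(
    "INSERT INTO sentences (id, sentence) VALUES (?, ?)",
    [
        ("s1", "We are going to place A"),
        ("s2", "We are going to place B"),
        ("s3", "An unrelated sentence"),
    ],
)

# Select by template text, then retire by id.
ids = ids_matching_template(conn, "We are going to %")
changed = retire_sentences(conn, ids)
remaining = [
    r[0]
    for r in conn.execute("SELECT id FROM sentences WHERE is_used = 1 ORDER BY id")
]
```

Note that nothing is deleted: the rows stay in the table and only the flag changes, which matches the request to stop proposing the sentences without removing them from the validated data.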

3 participants