[BUG] Retire toponym corpus from Catalan Common Voice #4337

c-armentano · 2024-01-23T08:24:58Z

Describe the bug
These sentences are far too repetitive:
https://github.com/common-voice/common-voice/blob/main/server/data/ca/frases_agenda.txt.

We created them to obtain a corpus with all the toponyms of the Catalan-speaking area, but we weren't aware that they would be recorded more than once. Some volunteers complained they were repetitive, and they may lead to a phonetically unbalanced corpus.

To Reproduce
N/A

Expected behavior
We would like to prevent them from being shown again for recording.

Screenshots
N/A

Desktop or Mobile (please complete the following information):

OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [e.g. 22]

Additional Hardware (were you using headphones, an external speaker or an external microphone?):

Type:
Model:

Additional context
N/A

jessicarose · 2024-01-24T14:26:02Z

Thanks so much for getting in touch with this issue.

When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?

c-armentano · 2024-01-26T10:48:22Z

Thank you for your question.
Some of these sentences have been shown to Catalan voice contributors (and recorded) more than once (up to 4 times in v.16). I see that some others (about 1100) have never been recorded, but since they are too similar (same sentence structures with only the place names changing), we see no value in recording them.
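
Figures like these could be checked against a dataset release with something like the sketch below. It assumes a tab-separated release file with a `sentence` column. The helper name `recording_counts` and the sample rows are made up for illustration and are not taken from the real corpus.

```python
import csv
import io
from collections import Counter


def recording_counts(tsv_file, candidate_sentences):
    """Return {sentence: number of clips} for each candidate sentence.

    Assumes the TSV has a header row with a `sentence` column.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row["sentence"] for row in reader)
    return {s: counts.get(s, 0) for s in candidate_sentences}


# Made-up sample standing in for a release TSV (hypothetical rows).
sample_tsv = io.StringIO(
    "path\tsentence\n"
    "clip_1.mp3\tAnem a Girona\n"
    "clip_2.mp3\tAnem a Girona\n"
    "clip_3.mp3\tEl gat dorm al sofà\n"
)

# Sentences from the toponym file would go here (hypothetical examples).
candidates = ["Anem a Girona", "Anem a Reus"]

counts = recording_counts(sample_tsv, candidates)
repeated = [s for s, n in counts.items() if n > 1]
never_recorded = [s for s, n in counts.items() if n == 0]
```

With a real release, `sample_tsv` would be replaced by the opened release file and `candidates` by the lines of the toponym sentence file.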

jessicarose · 2024-02-22T12:10:18Z

Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team.

The fastest fix for re-balancing the Catalan dataset would be to dilute these sentences with fresh bulk uploads of Catalan sentences, giving speakers and the dataset a more varied pool to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep it interesting for contributors, and create more useful data for dataset consumers?

c-armentano · 2024-03-04T15:46:03Z

Thank you for your response.

What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to get it?

Regarding adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus.

Kind regards

HarikalarKutusu · 2024-05-26T10:03:42Z

Is there any way to get it?

@c-armentano, AFAIK there is one way. The is_used field in the sentences table controls this. If it is set to 0 (false), the sentence will not be shown for new recordings. However, you would need to collect the sentence_id values of all these sentences and open a PR that changes the database.

If they are synthetically generated, like "We are going to [place]", they can also be selected by matching on the sentence field.
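
A minimal sketch of that idea, using an in-memory SQLite table as a stand-in for the real database. Only the `sentences` table name and the `is_used` flag come from the comment above. The `id` and `text` column names, the `retire_sentences` helper, and the sample sentences are assumptions for illustration. The actual change would have to go through a PR against the project's own database, whose schema may differ.

```python
import sqlite3


def retire_sentences(conn, like_patterns):
    """Set is_used = 0 for every active sentence whose text matches a LIKE pattern.

    Returns the number of rows that were switched off.
    """
    retired = 0
    for pattern in like_patterns:
        cur = conn.execute(
            "UPDATE sentences SET is_used = 0 WHERE is_used = 1 AND text LIKE ?",
            (pattern,),
        )
        retired += cur.rowcount
    conn.commit()
    return retired


# Simplified, hypothetical schema; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "id TEXT PRIMARY KEY, text TEXT NOT NULL, is_used INTEGER NOT NULL DEFAULT 1)"
)
conn.executemany(
    "INSERT INTO sentences (id, text) VALUES (?, ?)",
    [
        ("s1", "Anem a Girona"),        # template sentence (made up)
        ("s2", "Anem a Reus"),          # template sentence (made up)
        ("s3", "El gat dorm al sofà"),  # unrelated sentence (made up)
    ],
)

# Retire everything generated from the "We are going to [place]" style template.
retired = retire_sentences(conn, ["Anem a %"])
still_offered = [
    row[0]
    for row in conn.execute("SELECT id FROM sentences WHERE is_used = 1 ORDER BY id")
]
```

Matching on a template pattern avoids collecting every identifier by hand, but it only works if the pattern is specific enough not to catch unrelated sentences, so the matched rows should be reviewed before anything is changed.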

c-armentano added the Bug label Jan 23, 2024

ftyers changed the title ~~[BUG]~~ [BUG] Retire toponym corpus from Catalan Common Voice Jan 23, 2024

jessicarose added the Text Corpus Bugs or feature requests that are related to Text Corpus label Jan 24, 2024
