Multi-orthography for Konkani - linking sentences collected in the gom and knn datasets #4454

chasingdragonflies · 2024-04-22T18:58:40Z

I wish for the merging of both the gom and knn locales. However, this might not be possible in its entirety.

But, as a start, the sentences collected by both locales (romi and devanagari) could be linked to each other in their respective datasets. (User Interface wise, maybe a small switch could allow people to switch between the gom/knn version of the same sentence.)

For example, these two sentences are the same, but they are written with different pronunciation as well as the writting script.

gom:

Kott'tti, zholli, pixvi sogllem ekuch tor tum tache pixvent ek futtki kouddi voddoina ani voilean tachim fokannam korta?

knn:

कट्टी, झोळी, पीषवी सगळें एकूच तर तूूं ताचे पषिवेंत एक फुटकी कवडी वडयना आनी वयल्यान ताचीं फकाणाूं करता?

If the UI and datasets of Konkani locales "romi" and "devanagari" cannot be merged, then at least the sentences could be linked. It would provide an easier way to further use both datasets for speech recognition as well as uses such as transliterating between scripts of the same language.

How?
In the roman dataset, (in sentences.tsv) a new column could be added "devanagari_sentence_id". This should be the sentence ID of the matching devanagari sentence present in the devanagari dataset.
Same, in the devanagari dataset, "roman_sentence_id".

A suggestion about website navigation:
On the website, when users click "Konkani" in the language options, it would be nice if the user went to a page (commonvoice.mozilla.org/kok, being konkani's inclusive language code), where they can be asked "Which type of konkani do you speak?". And two options one greeting them in the Devanagari orthography. And other option greeting the user in Roman orthography. This would help in setting the language preference of the user before recording begins. But also, this means the entire website needs to be translated in their respective writing scripts.

Which means, merging of locale cannot be done on the website UI level.

chasingdragonflies · 2024-04-22T19:17:57Z

Related to #3266

alvynabranches · 2024-04-23T05:08:28Z

Giving two options is not possible on a same dataset is not possible hence there is a separate language created for Devanagari and Roman scripts. If this has to be created, we would have to make a custom platform for it. @chasingdragonflies

chasingdragonflies · 2024-04-23T07:24:23Z

Tools that can aid in transliteration and linking of sentences:
• https://konkanverter.com, developed by World Konkani Center in Mangalore (automatic transliteration tool, although it's not perfect)
• English to Devanagari-konkani dictionary (official Goa Konkani Basha Mandal app co-developed with shabdkosh)
• Modern English to Roman-konkani dictionary, a PDF file uploaded to wikimedia commons.

chasingdragonflies · 2024-04-23T07:31:02Z

Transliteration of konkani is not as simple as converting devanagari words to english pronunciation.

Here are some examples for the understanding of non-konkani speakers. (Use hindi text-to-speech for pronunciation)

For "sweet taste":
In gom, we say "ghod" (pronunciation: घोड)
In knn, we say "ghad" (pronunciation: घड)
For "my":
In gom, we say "mojea" (pronunciation: म्हौज्या)
In knn, we say "mhajea" (pronunciation: म्हज्या)

However, there are some words which have same pronunciation in both konkani scripts..
3. For "is happening":
In gom, we say "zata"
In knn, we also say "zata" (pronunciation: जाता)

chasingdragonflies · 2024-04-23T07:38:11Z

Mozilla common voice can introduce a new tab, "Link", alongside the speak, listen, write, review tabs for the purpose of linking the devanagari and roman sentences across the dataets. This will also boost sentence collection for both datasets

The flow could be: Write -> Review (linguistically correct and copyright-free?) -> Link (type the corresponding devanagari/roman sentence) -> Record -> Validate

Once linked, the other language script can have the flow:
Review (linguistically correct?) -> Record -> Validate.

This way we can collect sentences for both the datasets together!

In the contribution guidelines, for linking purpose, we can suggest the tools i mentioned here which will make it easier to convert the sentences between the devanagari and roman scripts.

chasingdragonflies · 2024-04-23T13:51:44Z

https://discourse.mozilla.org/t/open-call-join-common-voice-in-introducing-variant-support-for-language-variants/128998

Checkout this post on MCV multi-orthography feature @alvynabranches @anniedempe

chasingdragonflies · 2024-04-24T04:02:24Z

@anniedhempe

chasingdragonflies · 2024-04-24T04:02:39Z

Last date is 25 april 2024

ginamoape · 2024-04-24T06:55:54Z

Hi @chasingdragonflies

Thank you for the recommended adjustments and for the valuable insights you provided. While we're eager to implement the proposed modifications, it's important for us to consult with the community members and translators who requested the addition of the languages (knn and gom) to Common Voice, this is to ensure that we're accommodating the linguistic preferences of the entire community.

chasingdragonflies · 2024-04-30T17:09:50Z

Hello @thak123 @anniedhempe and @alvynabranches

As you already know, Konkani in देवनागरी (devanagari) is the official standard for reading and writing konkani in Goa. However there are variants or dialects of it spoken in smaller communities in the konkan region. Meaning, a person living in maharastra's konkan region will speak differently from person living in margao, and both of these people will speak differently than a person living in mangalore. Also, not all people will be able to read sentences in देवनागरी because mangalorean people read in kannada script. While many catholics in Goa prefer to read/speak in Romi-konkani.

There was a Mozilla Discourse post in July 2022 which talked about the introduction of Language Variants. Please watch the video linked in the post.

There is also another MCV post in March 2024 Multi-Orthography for language variants introducing support for languages with different writing system.

I am aware that currently the dataset is split into two locales for konkani. But, I think whole-heartedly that combining both of these datasets is possible because the difference of Romi and Devanagari is not much at all! The difference is mainly only in the pronunciation and writing script. With the upcoming introduction of multi-orthography for Language Variants that have multiple writing system/script, it is possible to use one locale (same dataset) for languages with multiple variants! Refer the 2024 discourse post.

Mozilla is already trying to support languages with multiple orthography. With the features i have suggested on the main post of this issue, it would enhance the participation of everyone in building together the konkani common voice dataset.

I am trying to plan in advance before all the work of translation and sentence collection is done for the konkani language.

Please send a comment if you are okay with combining the dataset on the grounds that users can choose a "language variant" + "writing system" before they record their voices. With your support Mozilla will hopefully make the appropriate changes..

तुमचें वीचार एकदम गरजेचे आसा.

This comment was marked as resolved.

Sign in to view

This comment was marked as duplicate.

Sign in to view

This comment was marked as outdated.

Sign in to view

chasingdragonflies changed the title ~~Mutli-orthography for Konkani - merging the gom and knn locales~~ Mutli-orthography for Konkani - linking sentences in the gom and knn datasets Apr 30, 2024

chasingdragonflies changed the title ~~Mutli-orthography for Konkani - linking sentences in the gom and knn datasets~~ Mutli-orthography for Konkani - linking sentences collected in the gom and knn datasets Apr 30, 2024

chasingdragonflies changed the title ~~Mutli-orthography for Konkani - linking sentences collected in the gom and knn datasets~~ Multi-orthography for Konkani - linking sentences collected in the gom and knn datasets May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-orthography for Konkani - linking sentences collected in the gom and knn datasets #4454

Multi-orthography for Konkani - linking sentences collected in the gom and knn datasets #4454

chasingdragonflies commented Apr 22, 2024 •

edited

This comment was marked as resolved.

chasingdragonflies commented Apr 22, 2024

alvynabranches commented Apr 23, 2024 •

edited

This comment was marked as duplicate.

chasingdragonflies commented Apr 23, 2024 •

edited

chasingdragonflies commented Apr 23, 2024 •

edited

chasingdragonflies commented Apr 23, 2024 •

edited

This comment was marked as outdated.

chasingdragonflies commented Apr 23, 2024

chasingdragonflies commented Apr 24, 2024

chasingdragonflies commented Apr 24, 2024

ginamoape commented Apr 24, 2024

chasingdragonflies commented Apr 30, 2024 •

edited

Multi-orthography for Konkani - linking sentences collected in the gom and knn datasets #4454

Multi-orthography for Konkani - linking sentences collected in the gom and knn datasets #4454

Comments

chasingdragonflies commented Apr 22, 2024 • edited

This comment was marked as resolved.

chasingdragonflies commented Apr 22, 2024

alvynabranches commented Apr 23, 2024 • edited

This comment was marked as duplicate.

chasingdragonflies commented Apr 23, 2024 • edited

chasingdragonflies commented Apr 23, 2024 • edited

chasingdragonflies commented Apr 23, 2024 • edited

This comment was marked as outdated.

chasingdragonflies commented Apr 23, 2024

chasingdragonflies commented Apr 24, 2024

chasingdragonflies commented Apr 24, 2024

ginamoape commented Apr 24, 2024

chasingdragonflies commented Apr 30, 2024 • edited

chasingdragonflies commented Apr 22, 2024 •

edited

alvynabranches commented Apr 23, 2024 •

edited

chasingdragonflies commented Apr 23, 2024 •

edited

chasingdragonflies commented Apr 23, 2024 •

edited

chasingdragonflies commented Apr 23, 2024 •

edited

chasingdragonflies commented Apr 30, 2024 •

edited