Describe the bug
When I analyzed the text corpus files from v17.0, I found that some sentence_id values are duplicated in many of the locales. I don't know the exact cause or whether there is a systematic source.
If someone can point me to the relevant code, I can look into the cause.
To Reproduce
Steps to reproduce the behavior: Load the validated_sentences.tsv files with pandas and list the rows whose sentence_id occurs more than once.
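A minimal sketch of that check, in case it helps with reproducing. The helper names and the inline sample data are made up for illustration, and the `sentence` column is an assumption about the file layout; only `sentence_id`, `is_used` and `clips_count` are taken from this report. For a real run, pass the path to a locale's validated_sentences.tsv instead of the in-memory sample.

```python
import csv
import io

import pandas as pd


def load_sentences(source):
    # Read everything as strings and disable quote handling, so that
    # quote characters inside sentences do not merge or split rows.
    return pd.read_csv(source, sep="\t", dtype=str, quoting=csv.QUOTE_NONE)


def duplicated_sentence_ids(df):
    # keep=False marks every occurrence of a repeated sentence_id,
    # not only the second and later ones.
    mask = df.duplicated(subset="sentence_id", keep=False)
    return df[mask].sort_values("sentence_id")


# Hypothetical stand-in for a validated_sentences.tsv file.
sample = io.StringIO(
    "sentence_id\tsentence\tis_used\tclips_count\n"
    "aaa\tfirst sentence\t1\t2\n"
    "aaa\tfirst sentence\t0\t0\n"
    "bbb\tsecond sentence\t1\t1\n"
)

df = load_sentences(sample)
dupes = duplicated_sentence_ids(df)
print(dupes)
```

Using `keep=False` shows all rows of a duplicated id side by side, which makes it easier to see where the other columns disagree.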
Expected behavior
A sentence_id should appear only once (assuming there is no hash collision, which is unlikely).
Screenshots
Example from ka locale:
sentence_id values in the screenshot:
01c596b07467cfe5c99b5a1341891404ee80d3bcb81521f6421507d464cd50de
01c6552a75163a62e3ca06f8ca68e024083351a03a7d0213d5c0a0947cd95464
Additional context
This can be deduplicated at the application level, but be careful: in some of the duplicates the is_used and/or clips_count values seem to differ. So, for example, when using pandas DataFrames, one should not deduplicate on whole rows, but take the sentence_id values and make them unique.
Note that with this procedure some information loss can occur, because only one of the conflicting is_used/clips_count values survives.
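A rough sketch of that application-level workaround, again with made-up sample data. It makes the ids unique and also lists which ids have conflicting is_used or clips_count values, so the information loss is at least visible. This is only one possible way to do it, not a proposed fix for the export itself.

```python
import pandas as pd

# Hypothetical rows: "aaa" is duplicated with different values,
# "ccc" is duplicated with identical values.
df = pd.DataFrame(
    {
        "sentence_id": ["aaa", "aaa", "bbb", "ccc", "ccc"],
        "is_used": ["1", "0", "1", "1", "1"],
        "clips_count": ["2", "0", "1", "3", "3"],
    }
)

# Dropping duplicates on whole rows would still leave "aaa" twice,
# because its rows differ in is_used and clips_count.
whole_row_dedup = df.drop_duplicates()

# Making only the ids unique gives one entry per sentence.
unique_ids = df["sentence_id"].drop_duplicates()

# Ids whose duplicated rows disagree in is_used or clips_count.
per_id = df.groupby("sentence_id")[["is_used", "clips_count"]].nunique()
conflicting = per_id.gt(1).any(axis=1)
conflict_ids = conflicting[conflicting].index.tolist()

print(len(whole_row_dedup), len(unique_ids), conflict_ids)
```

For the ids reported as conflicting, the caller has to decide which row to trust (or how to combine them) before dropping the rest.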