Given a text, return its language or a map of languages with confidence scores.
Example:
Text:
This is a test.
Output:
'en': 0.95, 'af': 0.03
Existing tools:
- language-detection library based on character LM
Create a software tool that can be run from the Unix command line and, given a text, prints to standard output the map of detected languages with confidence scores in JSON format.
Example usage:
$ detect-lang file.txt <-- console invocation
{"en": 0.67, "bg": 0.3} <-- standard output
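A minimal sketch of such a command-line tool, NOT the language-detection library and not a competitive solution: it scores the seven target languages by counting characters distinctive for each. The hint character sets below are rough illustrative assumptions; a serious entry would train character n-gram models on Wikipedia text instead.

```python
import json
import sys
from collections import Counter

# Illustrative hint characters per language (assumed, not tuned);
# "en" has no diacritic hints and serves as the plain-Latin fallback.
HINTS = {
    "ru": "ыэъё",
    "ua": "іїєґ",
    "bg": "ъщ",
    "pl": "ąęłńśźż",
    "de": "äöüß",
    "af": "êôû",
    "en": "",
}

def detect(text):
    counts = Counter(text.lower())
    # Count hint characters per language; the epsilon keeps every
    # language present in the output map.
    raw = {lang: sum(counts[c] for c in chars) + 1e-6
           for lang, chars in HINTS.items()}
    # Text with no distinctive diacritics defaults toward English.
    if all(v < 1e-3 for lang, v in raw.items() if lang != "en"):
        raw["en"] += 1.0
    total = sum(raw.values())
    return {lang: round(v / total, 4) for lang, v in raw.items()}

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(json.dumps(detect(f.read())))
```

Saved as detect-lang and made executable, this matches the invocation above: it reads the file named on the command line and emits one JSON map of confidences.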
The text can be in one of the following languages: Afrikaans (af), English (en), German (de), Russian (ru), Ukrainian (ua), Polish (pl), Bulgarian (bg).
A test data set of documents extracted randomly from Wikipedia will be published. Evaluation will be performed on a similarly distributed held-out data set from the same source. Sizes of documents will vary from 3 to 3000 words.
Evaluation formula (in Python):
score = sum(map(lambda x, y: x.get(y, 0), detected_langs, correct_langs))
where detected_langs is the list of maps produced by the tool and correct_langs is the list of correct languages for the given texts. (A missing language contributes 0 to the score.)
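A worked example of this scoring, assuming detected_langs is a list of confidence maps and correct_langs is a parallel list of language codes; .get is used so a language absent from a map contributes 0 rather than raising KeyError:

```python
def evaluate(detected_langs, correct_langs):
    # Each document contributes the confidence the tool assigned to
    # its correct language (0 if that language is missing from the map).
    return sum(d.get(y, 0) for d, y in zip(detected_langs, correct_langs))

detected = [{"en": 0.95, "af": 0.03}, {"bg": 0.6, "ru": 0.4}]
correct = ["en", "ru"]
score = evaluate(detected, correct)  # 0.95 + 0.4 = 1.35
```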
To be eligible for prizes you should beat the accuracy of language-detection:
- on test data: 0.8584281
- on evaluation data: 0.8660753
Note that you have an unfair advantage over it: it detects 53 languages, while you need to distinguish only among 7. :)
File test.zip contains 1000 text snippets extracted from Wikipedia that were used for testing. The evaluation set of documents will be similar.
Submission should be performed in the form of a pull request to this repository:
- Fork the repository to your GitHub account.
- Add a folder named after your team and put in it a text file README listing all participants and, if needed, comments on how to set up your solution.
- Optionally, but ideally, include a description of your approach in the README.
- After finishing the task, send a pull request.
Submission deadline: 00:00 14 Aug