Given a text, return its language or a map of languages with confidence scores.
Example:
Text:
This is a test.
Output:
'en': 0.95, 'af': 0.03
Existing tools:
- language-detection library based on character LM
Create a software tool that can be run from the Unix command line and, given a text, prints to standard output the map of detected languages with confidence scores in JSON format.
Example usage:
$ detect-lang file.txt <-- console invocation
{"en": 0.67, "bg": 0.3} <-- standard output
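A minimal sketch of such a command-line tool, NOT the language-detection library and not a competitive solution: it scores the seven target languages by counting characters distinctive for each. The hint character sets below are rough illustrative assumptions; a serious entry would train character n-gram models on Wikipedia text instead.

```python
import json
import sys
from collections import Counter

# Illustrative hint characters per language (assumed, not tuned);
# "en" has no diacritic hints and serves as the plain-Latin fallback.
HINTS = {
    "ru": "ыэъё",
    "ua": "іїєґ",
    "bg": "ъщ",
    "pl": "ąęłńśźż",
    "de": "äöüß",
    "af": "êôû",
    "en": "",
}

def detect(text):
    counts = Counter(text.lower())
    # Count hint characters per language; the epsilon keeps every
    # language present in the output map.
    raw = {lang: sum(counts[c] for c in chars) + 1e-6
           for lang, chars in HINTS.items()}
    # Text with no distinctive diacritics defaults toward English.
    if all(v < 1e-3 for lang, v in raw.items() if lang != "en"):
        raw["en"] += 1.0
    total = sum(raw.values())
    return {lang: round(v / total, 4) for lang, v in raw.items()}

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(json.dumps(detect(f.read())))
```

Saved as detect-lang and made executable, this matches the invocation above: it reads the file named on the command line and emits one JSON map of confidences.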
The text can be in one of the following languages: Afrikaans (af), English (en), German (de), Russian (ru), Ukrainian (ua), Polish (pl), Bulgarian (bg).
A test data set of documents extracted randomly from Wikipedia will be published. Evaluation will be performed on a similarly distributed held-out data set from the same source. Sizes of documents will vary from 3 to 3000 words.
Evaluation formula (in Python):
score = sum(map(lambda x, y: x.get(y, 0), detected_langs, correct_langs))
where detected_langs is the list of maps produced by the tool and correct_langs is the list of correct languages for the given texts. (A missing language contributes 0 to the score.)
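A worked example of this scoring, assuming detected_langs is a list of confidence maps and correct_langs is a parallel list of language codes; .get is used so a language absent from a map contributes 0 rather than raising KeyError:

```python
def evaluate(detected_langs, correct_langs):
    # Each document contributes the confidence the tool assigned to
    # its correct language (0 if that language is missing from the map).
    return sum(d.get(y, 0) for d, y in zip(detected_langs, correct_langs))

detected = [{"en": 0.95, "af": 0.03}, {"bg": 0.6, "ru": 0.4}]
correct = ["en", "ru"]
score = evaluate(detected, correct)  # 0.95 + 0.4 = 1.35
```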
To be eligible for prizes you should beat the accuracy of language-detection:
- on test data: 0.8584281
- on evaluation data: 0.8660753
Note that you have an unfair advantage over it: it detects 53 languages, while you need to distinguish only among 7. :)
File test.zip contains 1000 text snippets extracted from Wikipedia that were used for testing. The evaluation set of documents will be similar.
Submission should be performed in the form of a pull request to this repository:
- Fork the repository to your GitHub account.
- Add a folder named after your team and put in it a text file README listing all participants and, if needed, comments on how to set up your solution.
- Optionally, but ideally, include a description of your approach in the README.
- After finishing the task, send a pull request.
Submission deadline: 00:00 14 Aug