Ignotus/langdetect

Shared task for AACIMP ACS 2014

Text language detection

Given a text, return its language or a map of languages with confidence scores.

Example:

Text:
This is a test.

Output:
'en': 0.95, 'af': 0.03

Existing tools:

Task

Create a software tool that can be run from the Unix command line and, given a text, prints to standard output the map of detected languages with confidence scores in JSON format.

Example usage:

$ detect-lang file.txt    <-- console invocation
{"en": 0.67, "bg": 0.3}    <-- standard output

The text can be in one of the following languages: Afrikaans (af), English (en), German (de), Russian (ru), Ukrainian (ua), Polish (pl), Bulgarian (bg).
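As a rough illustration of the expected command-line shape, here is a minimal Python sketch. It is not a reference solution: the stopword lists, the `detect` function, and the overlap-based scoring are all hypothetical placeholders chosen only to show reading a file and printing a JSON map of confidences.

```python
#!/usr/bin/env python3
"""Toy detect-lang sketch: stopword-overlap scoring, printed as JSON."""
import json
import re
import sys

# Hypothetical, tiny stopword lists (assumption for illustration only);
# a competitive solution would need a much richer model.
STOPWORDS = {
    "af": {"die", "en", "van", "het", "is", "nie"},
    "en": {"the", "and", "of", "is", "a", "this"},
    "de": {"der", "und", "das", "ist", "nicht", "ein"},
    "ru": {"и", "в", "не", "на", "что", "это"},
    "ua": {"і", "в", "не", "на", "що", "це"},
    "pl": {"i", "w", "nie", "na", "się", "jest"},
    "bg": {"и", "в", "не", "на", "се", "е"},
}


def detect(text):
    """Return {language: confidence}; confidences sum to 1, or {} if no hits."""
    words = re.findall(r"\w+", text.lower())
    hits = {
        lang: sum(word in stop for word in words)
        for lang, stop in STOPWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in hits.items() if n > 0}


def main(path):
    """Read the file at `path` and print the detected-language map as JSON."""
    with open(path, encoding="utf-8") as f:
        print(json.dumps(detect(f.read())))


# Only act as a CLI when a file argument is actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved as an executable named `detect-lang`, this would be invoked as in the example usage above. Normalizing the hit counts so they sum to 1 is just one simple way to produce confidence-like values.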

Submission guidelines

A test data set of documents extracted randomly from Wikipedia will be published. Evaluation will be performed on a similarly distributed held-out data set from the same source. Document sizes will vary from 3 to 3000 words.

Evaluation formula (in Python):

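score = sum(map(lambda x, y: x.get(y, 0), detected_langs, correct_langs))

Here detected_langs is the list of maps produced by the tool, one per text, and correct_langs is the list of correct languages for the same texts, in the same order. A text whose correct language is missing from its map contributes 0.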
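To make the scoring concrete, here is a small worked example in Python. The three detection maps and labels are made-up illustration data, and dividing by the number of documents to get a value comparable to the accuracy figures is an assumption, since the formula above only defines a sum.

```python
def score(detected_langs, correct_langs):
    """Sum the confidence each output map assigns to the correct language."""
    return sum(d.get(c, 0) for d, c in zip(detected_langs, correct_langs))


# Hypothetical tool outputs for three texts and their true languages.
detected = [
    {"en": 0.95, "af": 0.03},  # correct language "en" gets 0.95
    {"en": 0.67, "bg": 0.3},   # correct language "bg" gets 0.3
    {"de": 1.0},               # correct language "ru" is absent, so 0
]
correct = ["en", "bg", "ru"]

total = score(detected, correct)

# Assumption: normalize by document count to get a per-document average.
average = total / len(correct)

print(total, average)
```

Under this scheme, a confident wrong answer earns nothing, while spreading probability across several languages earns partial credit whenever the correct one is among them.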

To be eligible for prizes you should beat the accuracy of language-detection:

  • on test data it is: 0.8584281
  • on evaluation data it is: 0.8660753

Note that you have an unfair advantage over it, because it detects 53 languages and you need to distinguish among only 7. :)

File test.zip contains 1000 text snippets extracted from Wikipedia that were used for testing. The evaluation set of documents will be similar.

Submission should be performed in the form of a pull request to this repository.

  1. You should fork the repository to your GitHub account.
  2. You should add a folder named after your team and put in it a text file README listing all participants and, optionally, comments on how to set up your solution if needed.
  3. After finishing the task, send a pull request.
  4. Optionally, but ideally, please include a description of your approach in the README.

Submission deadline: 00:00 14 Aug
