Links to Data

We downloaded the Stack Overflow data dump, which is hosted on this link. The data are in XML format. We parsed the XML files and loaded the data into a database using the DumpSO.py script.
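The parsing step can be illustrated with a minimal sketch. This is not the DumpSO.py script itself, whose contents are not shown here. It assumes the dump's Posts.xml layout, in which each post is a `row` element carrying `Id`, `PostTypeId`, `Title` and `Body` attributes, and it assumes SQLite as the database; the function name `load_posts` and the table schema are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_posts(xml_source, conn):
    """Stream <row> elements from a Posts.xml-style source into a SQLite table.

    xml_source may be a file path or a binary file object.
    The table name and columns are illustrative assumptions.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, post_type INTEGER, title TEXT, body TEXT)"
    )
    count = 0
    # iterparse reads the file incrementally instead of loading it all at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        conn.execute(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
            (
                int(elem.get("Id")),
                int(elem.get("PostTypeId")),
                elem.get("Title"),  # None for posts without a title
                elem.get("Body"),
            ),
        )
        count += 1
        elem.clear()  # drop the attributes once the row has been stored
    conn.commit()
    return count
```

A call such as `load_posts("Posts.xml", sqlite3.connect("so.db"))` would then fill the table; both file names are placeholders.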


C0: Raw data approach (Data, alignment and usages)

C1: Thread title approach (Data, alignment and usages)

C2: Standard NLP approach (Data, alignment and usages)

C3: Software engineering task approach (Data, alignment and usages)

Each data set consists of three folders: Alignment, Corpora and Usages. The contents of each folder are as follows:

Folder	Content
Alignment	CSV file containing the per-word entropy values plotted in Figure 2 of the paper.
Corpora	Two text files, eng.txt and code.txt, containing the English corpus and the code corpus respectively.
Usages	Two files with the .dict extension, one for English tokens and one for code tokens. Each file lists the usage frequency of every token in three columns, which are explained in the following section.

Processing Data with OpenNMT

cd ~/OpenNMT

For tokenization, run from the command line:

th tools/tokenize.lua -mode space < ~/data/eng.txt > ~/data/eng.tok

th tools/tokenize.lua -mode space < ~/data/code.txt > ~/data/code.tok

To create the dictionaries, run from the command line:

th preprocess.lua -train_src ~/data/eng.tok -train_tgt ~/data/code.tok -keep_frequency true -save_data ~/data/dictionary

For each corpus, this produces a dictionary file with three tab-separated columns.

Example:

Token	1-indexed token ID	Occurrence frequency
Drawable.Drawable	5	658

The first four entries of each dictionary are the following special tokens:

<blank>

<unk>

<s>

</s>

These entries are added by OpenNMT and do not come from the corpora, so we remove them.
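Reading a dictionary file and dropping the four special entries can be sketched as follows. This is a minimal illustration and not part of the repository's scripts. It assumes the three tab-separated columns described above; the function name `read_usages` is hypothetical.

```python
SPECIAL_TOKENS = {"<blank>", "<unk>", "<s>", "</s>"}


def read_usages(lines):
    """Return {token: frequency} from dictionary lines, skipping special tokens."""
    usages = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, _token_id, freq = parts
        if token in SPECIAL_TOKENS:
            continue  # special entries, not corpus tokens
        usages[token] = int(freq)
    return usages
```

The function accepts any iterable of lines, so an open file handle for one of the .dict files can be passed directly.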

Besides creating the dictionaries (the .dict files), the previous command creates a binary file with a .t7 extension. We do not need that file for our purposes, so it can be deleted:

rm ~/data/*.t7

Align the corpora with Berkeley Aligner

From the command line, run: cd ~/SOParallelCorpusReplication/BerkeleyAligner
Split your corpus into two parts: 80% for training and 20% for testing.
Run the following commands one after another:

mkdir -p ~/SOParallelCorpusReplication/BerkeleyAligner/data/train

mkdir ~/SOParallelCorpusReplication/BerkeleyAligner/data/test

Populate the BerkeleyAligner/data/train folder with train.en and train.cd
Populate the BerkeleyAligner/data/test folder with test.en and test.cd
Run: java -Xms2g -Xmx4g -jar berkeleyaligner.jar ++.conf configuration.conf

The last command creates the folder ~/SOParallelCorpusReplication/BerkeleyAligner/output and writes many files into it. Of these, we need only stage2.1.params.txt, which is passed as the first argument to AlignmentEntropyStat.py.
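The 80/20 split in the steps above can be sketched as follows. This is an illustrative helper and not a script from the repository. It assumes that line i of eng.txt corresponds to line i of code.txt, and it shuffles the pairs with a fixed seed before splitting, which is an assumption because the steps do not say how the split is drawn. The names `split_parallel` and `write_pairs` are hypothetical.

```python
import random


def split_parallel(eng_lines, code_lines, train_fraction=0.8, seed=0):
    """Split aligned English/code lines into (train, test) lists of pairs."""
    if len(eng_lines) != len(code_lines):
        raise ValueError("corpora must have the same number of lines")
    pairs = list(zip(eng_lines, code_lines))
    # Shuffling is an assumption; a fixed seed keeps the split reproducible.
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


def write_pairs(pairs, en_path, cd_path):
    """Write each side of the pairs to its own file, one sentence per line."""
    with open(en_path, "w", encoding="utf-8") as en, \
            open(cd_path, "w", encoding="utf-8") as cd:
        for eng, code in pairs:
            en.write(eng.rstrip("\n") + "\n")
            cd.write(code.rstrip("\n") + "\n")
```

The training pairs would be written to data/train/train.en and data/train/train.cd, and the test pairs to data/test/test.en and data/test/test.cd, matching the file names used in the steps.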
