Bughunt!

Introduction

This repository contains scripts and Jupyter notebooks for Bughunt! 2019, a guided project undertaken by Faculty of Education students as part of the Cambridge Digital Humanities Learning programme.

The project focuses on the distant reading of English children's literature 1789-1914 for the representation of insects and other creepy crawlies.

Participants collated and manually cleaned a corpus of out-of-copyright books from a number of sources, which may be found in the 1-initial-manual-clean folder.

Code details

The aim of the code in this repository is to show basic text mining techniques to participants who are complete beginners to both text mining and coding. As such, the script code is simple, linear and heavily commented to aid understanding, and the notebooks are designed to be run as a teaching aid, not to generate analysis results. The notebooks formed the basis of the one of the workshops held with the group on 5 March 2019.

The main libraries used in this repository are:

Natural Language Toolkit (NLTK)
Pandas
Matplotlib

Getting started

Jupyter notebooks

The easiest way to run the Jupyter notebooks in this repository is to click on the Binder button above. This will launch a virtual environment in your browser where you can open and run the notebooks without installing anything.

If you want to run the notebooks locally, follow the instructions for cloning and installing requirements as in the section below, and then run the notebooks in the way you would normally on your machine. (If you are not familiar with Jupyter notebooks yet then just use the Binder button to start with.)

There are 3 notebooks under the notebooks folder:

1-intro-to-strings.ipynb - a generic introduction to Jupyter notebooks and basic string functions in Python.
2-text-processing-corpus.ipynb - a walk through of the principles behind the first part of the script insect-freq-unigram.py, from tokenising the files to generating the frequency distribution.
3-visualising-data.ipynb - a walk through of the principles behind the second part of insect-freq-unigram.py whereby the basic graph is plotted.

Analysis scripts

To run the analysis scripts themselves, clone the repository and install the required libraries from requirements.txt.

$ pip install -r requirements.txt

It is strongly recommended that you do so in a venv virtual environment. (RealPython has an excellent tutorial on virtual environments.)

You will also need to download the corpus of stopwords and the punkt tokenizer from nltk_data. See the official documentation: Installing NLTK Data

Run the scripts in this order (with the root of the repo as the working directory):

files-by-decade.py - takes the cleaned corpus and splits it into files arranged by decade of publication.
insect-freq-unigram.py - tokenises the corpus, runs a frequency distibution and outputs a CSV file of the data and a PNG of the data plotted on a line graph.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
corpora/bughunt		corpora/bughunt
notebooks		notebooks
output		output
scripts		scripts
wordlists		wordlists
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

corpora/bughunt

corpora/bughunt

notebooks

notebooks

output

output

scripts

scripts

wordlists

wordlists

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

Bughunt!

Introduction

Code details

Getting started

Jupyter notebooks

Analysis scripts

About

Releases

Packages

Languages

License

mchesterkadwell/bughunt-analysis

Folders and files

Latest commit

History

Repository files navigation

Bughunt!

Introduction

Code details

Getting started

Jupyter notebooks

Analysis scripts

About

Topics

Resources

License

Stars

Watchers

Forks

Languages