Newspaper Topic Modelling 📰 🔍

NLP topic modelling of UK newspaper headlines, with analysis of topics over time and sentiment analysis of the polarity and subjectivity of the language used. The repository combines Python data analysis with a React JSX website presenting that analysis, which is live here: https://czboop.github.io/Newspaper-Topic-Modelling/

Project Summary

This project uses several natural language processing techniques to explore seven of the top newspapers in the UK. The data analysed for all sources covers the period from just before the start of the COVID-19 pandemic (late November 2019) until early January 2023. The newspapers analysed were:

  • The Express
  • The Daily Mail
  • The Sun
  • The Mirror
  • The Telegraph
  • The Guardian
  • Metro

Headlines were used exclusively, rather than the main body of articles, for all sources and all analysis. BERTopic (a Python package using SBERT, UMAP, HDBSCAN, CountVectorizer and c-TF-IDF to cluster text data into topics) was used to create topic clusters, as well as to perform other topic-modelling-related analysis. SpaCy TextBlob (a SpaCy Universe package implementing TextBlob sentiment analysis with SpaCy) was also used to analyse the subjectivity (how factual or opinionated the language is) and polarity (how emotionally positive or negative it is) of the different newspapers. Plotly was used to create new plots representing this, as well as to manipulate the plots created by BERTopic, as sketched below.
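As a rough sketch of that pipeline (assuming recent versions of bertopic and spacytextblob; the sample headlines are invented, and a real run needs a much larger corpus than this):

```python
from bertopic import BERTopic
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob  # registers the "spacytextblob" pipe

# Invented sample data; the real project loads headlines from private .csv files,
# and clustering needs far more documents than this to work well
headlines = [
    "Government announces new lockdown measures",
    "Scientists hail vaccine breakthrough",
    "Energy bills set to rise again this winter",
]

# Fit topic clusters (BERTopic wraps SBERT, UMAP, HDBSCAN, CountVectorizer and c-TF-IDF)
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(headlines)
print(topic_model.get_topic_info())

# The topic cluster visualisation is a Plotly figure, so it can be saved and tweaked
fig = topic_model.visualize_topics()
fig.write_html("topic_clusters.html")

# Polarity and subjectivity per headline via SpaCy TextBlob
# (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")
for doc in nlp.pipe(headlines):
    # older spacytextblob versions expose doc._.polarity / doc._.subjectivity instead
    print(doc.text, doc._.blob.polarity, doc._.blob.subjectivity)
```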

Alongside the Python data analysis, a React web app was created to present many of the findings, along with the graphs that visualise the data.

This repository therefore contains both a Python data directory and a React web app directory that presents some of the findings in a more visual and user-friendly way.

Data was scraped from the internet over a period of time, with a limited number of requests per minute. More information on the dataset can be found below. The dataset used as the basis of this analysis is not public and is not intended to be made public. The scraping scripts are not part of this or any other public repository.

Dataset Details and Limitations

The dataset used was collected from the websites of each of the respective newspapers, with slightly different techniques for some newspapers. Different newspapers had very different numbers of total documents, with The Daily Mail having by far the highest number of documents, and The Guardian having the lowest.

There were also varying levels of completeness in terms of what percentage of all headlines made it into the dataset, depending on source.

For The Daily Mail, a complete set of all headlines from this time period was collected. However, the extremely high number of documents initially produced a model too large to hold in memory (around 17GB), so the script would error while trying to fit the model. Because of this, some types of articles were removed before training and analysis. This included a large chunk of documents re-published by The Daily Mail that came from other sources such as Reuters or the Associated Press; Showbiz, Sport and Lifestyle articles were also removed. Even after this filtering, The Daily Mail still had by far the most documents.
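The filtering scripts are not public, so purely as a hypothetical sketch: because Daily Mail URLs embed a category segment (such as /tvshowbiz/ or /sport/), this kind of pre-training filtering could be approximated by pattern matching on the stored 'url' column. The file name and the exact URL patterns below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical file name; the real dataset is private
df = pd.read_csv("daily_mail_headlines.csv")

# Assumed category segments for wire (Reuters/AP), showbiz, sport and lifestyle articles
excluded = ["/wires/", "/tvshowbiz/", "/sport/", "/femail/"]
mask = df["url"].str.contains("|".join(excluded), case=False, na=False)
df_filtered = df[~mask]
print(f"Kept {len(df_filtered)} of {len(df)} articles")
```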

Some other newspapers had their high-level categories limited at the point of data collection, but all of the collected categories were then analysed. This applied to The Telegraph, The Guardian, The Mirror and The Sun. These categories were largely based on the main news categories that each newspaper used for its articles.

The categories collected and analysed (or not) for these newspapers can be seen in the table below. Note that in some cases the absence of a category may mean the newspaper does not flag articles with that label, while in other cases it may be a gap in the dataset. Categories not shown in the table (such as sport) can be assumed to be excluded for all of these sources:

The Sun The Mirror The Telegraph The Guardian
Politics ✔️ ✔️ ✔️ ✔️
Science ✔️ ✔️ ✔️ ✔️
Technology ✔️ ✔️
UK News ✔️ ✔️ ✔️
World News ✔️ ✔️ ✔️
US News ✔️
Health ✔️ ✔️
Environment ✔️ ✔️
Education ✔️ ✔️
Royal Family ✔️ ✔️
Business ✔️
Society ✔️
'More Hopeful' ✔️
Defence ✔️
Opinion ✔️

In contrast, the Metro and the Daily Express had what should be a complete set of their headlines both collected and analysed.

Repository Contents

Some of the key repository contents:

  • 📁 data: Python files for data analysis

    • 📁 src: main data content
      • 📁 plots: where data visualisations are saved
      • 📄 data_processor.py - a class used within other objects to load in and process data files
      • 📄 general_analyser.py - performs basic analysis on data e.g. ratio of documents by source, number of articles by month
      • 📄 multi_source_modeller.py - performs topic modelling on multiple sources one after the other
      • 📄 multi_source_sentiments.py - performs sentiment analysis on multiple sources one after the other
      • 📄 representative_docs.py - adds representative documents to the hover tooltips of the JSON file visualising topics
      • 📄 sentiment.py - analyses subjectivity and polarity, including over time, and creates visualisations of these
      • 📄 topic_modeller.py - finds topics from data and saves results as plots
    • 📁 tests: unit tests for the files in the data/src folder
  • 📁 client_side/web-app: react web app to display analysis results

    • 📁 src: main web app content
      • 📁 __tests__: smoke tests checking that components render and that navigation works
      • 📁 components: components used within the web app, including stylesheets for them
        • 📁 graph_data: json files of data visualisations to be imported into components
        • 📁 text_data: json files containing text content to be used in components
    • 📁 public: web app html, icon, manifest and robots files

Tools Used

Languages: Python, JavaScript (JSX), CSS and HTML

Libraries/Frameworks:

Front-End

  • React JSX - primary framework for creating the web app
  • React Router - to create multiple routes/pages within the app
  • React Plotly JS - to represent and manipulate Plotly graphs within the web app
  • React Resize Detector - to handle page resizing, including altering page content depending on screen size
  • CSS (including media queries) - for web styling, and handling mobile/screen size responsiveness
Testing (Front End)
  • React Testing Library (including Jest DOM) - to render components, select elements from the page, and create tests (primarily smoke tests, as there is little user interaction with the page)

Data

  • BERTopic (including UMAP, HDBSCAN and scikit-learn) - for topic modelling and many elements of analysis, such as:
    • Finding topics (including in order of frequency, with counts of occurrence)
    • Creation of topic cluster and topics over time visualisations
    • Getting representative documents per topic
    • Getting topics over time
  • SpaCy (including SpaCy TextBlob) - for stopword removal and polarity/subjectivity analysis
  • Pandas - for creation of dataframes to store and manipulate data
  • Plotly - for saving and adjusting the plots created by BERTopic, as well as creating new plots based on sentiment analysis
  • Beautiful Soup - for scraping data to be analysed
  • httplib2 - for making requests as part of data scraping
  • glob - for pattern-based file and path selection (to read in data stored across multiple files)
  • datetime and dateutil - to select data from time ranges and iterate over time deltas
  • pathlib, sys, shutil and os - for selecting, creating and deleting files and directories
  • json - for encoding and decoding JSON files and data
Testing (Data)
  • unittest - primary unit test framework, with test suites created as classes of type unittest.TestCase (see the sketch after this list)
  • Pytest - to run tests from the command line
  • pandas.testing - to assert DataFrame equality
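A minimal sketch of that pattern (not a test from the repository; the frames and names here are invented):

```python
import unittest
import pandas as pd
from pandas.testing import assert_frame_equal

class TestHeadlineProcessing(unittest.TestCase):
    # Hypothetical test: a processed frame should match the expected frame exactly
    def test_processed_frame_matches_expected(self):
        expected = pd.DataFrame({"headline": ["Example headline"], "date": ["2020-01-01"]})
        actual = pd.DataFrame({"headline": ["Example headline"], "date": ["2020-01-01"]})
        assert_frame_equal(actual, expected)

if __name__ == "__main__":
    unittest.main()
```

Pytest will also collect unittest-style suites, which is what running the tests from the command line relies on.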

How to Install and Run the Data Analysis

To get set up to run the Python/data portion of the project:

  • If Python is not installed, download it from python.org.
  • Clone this repository, then navigate to the directory it is in.
  • Set up a virtual environment using:
    $ python -m venv <environment_name>
  • Activate the virtual environment. For Windows, this is done using:
    $ <environment_name>\Scripts\activate.bat
    The Python venv documentation shows how to do this for other operating systems.
  • Install dependencies using:
    $ pip install -r requirements.txt
  • After navigating to the directory with the desired file, one of the Python files can be run using:
    $ python <filename>.py

NOTE: At least one of the dependencies may have issues running with the latest version of Python. Downgrading to version 3.7 in your virtual environment may be required. This can be done by downloading Python 3.7 and creating the virtual environment with that version specified: $ python3.7 -m venv <environment_name>

The scripts are made up of classes whose constructors take a path to a directory expected to contain .csv files with the data to be analysed. This should be updated to reflect wherever your local data files are stored; the default can be changed in the Python files that define the classes, or a different path can be passed when creating an instance of the class.

The scripts also assume certain columns are present in the data ('headline', 'date' and 'url'); these assumptions should likely be updated to match any new data they are run on. A sketch of the expected layout follows below.
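For reference, a minimal sketch of that data contract: a directory of .csv files, each with 'headline', 'date' and 'url' columns, read in together via glob and pandas. The directory path is a placeholder, not the repository's actual default.

```python
import glob
import pandas as pd

# Placeholder path; point this at wherever your local data files are stored
DATA_DIR = "path/to/data"

# Each .csv file is expected to contain 'headline', 'date' and 'url' columns
frames = [pd.read_csv(path) for path in glob.glob(f"{DATA_DIR}/*.csv")]
df = pd.concat(frames, ignore_index=True)

# Parse dates so data can be selected from time ranges
df["date"] = pd.to_datetime(df["date"])
print(df[["headline", "date", "url"]].head())
```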

How to Use the Web App

Check out the React website hosted on GitHub Pages, which presents many of the findings of the topic modelling and sentiment analysis, as well as data visualisations: https://czboop.github.io/Newspaper-Topic-Modelling/