Progress: █████████████████████████████████████████████████░░░░░
(90%)
Street Manager Monthly Permit Pipeline example:
Important
Currently working on
-
Finalising the Evidence dashboard.
-
Integrating Geoplace's SWA Code List.
-
Speeding up SRWR processing and improving its integration into the project.
This repository contains an efficient data pipeline for processing:
-
DfT's Street Manager archived permit data.
-
Scottish Road Works Register (SRWR) archived permit data.
Note
The aim of this project is simple...
Reduce the time it takes to deliver value from open street works data.
- It's fast
- Process an entire month of Street Manager archived permit data ready for analysis in 5 minutes.
- Process an entire year of Street Manager archived permit data ready for anylsis in around 1 hour.
- Process all Street Manager archived pemit data from 2020 to 2024 in the morning and be ready to analyse the data in the afternoon.
- The pipeline utilises batch processing so no need to download, unzip, and deal with saving files to disk - everything is kept in memory.
- It's not fussy
- Run it where you want!
- Run it locally with Docker or a Python Venv if you want.
- Run it on a Google Cloud Function/AWS Lambda Function (with some caveats).
- Run it as Fargate Task on AWS or a Google Compute Engine.
- It's flexible
- The project is modular so you can customise it to fit your own needs.
- Don't want to use AWS Secrets Manager for environment variables? Fine! Use another provider or a simple .env file (recommended for local dev only).
- Don't want to use MotherDuck as your data warehouse? Fine! Add in a function so the end destination is Google Big Query instead.
- Only want to focus on Street Manager data? Fine! Launch the entry point that doesn't process SRWR data.
- You can integrate other tools from the Modern Data Stack such as DLT, DBT, or orchestrators like Airflow and Mage if you want more functionality.
- You can run several instances of the project for different purposes.
Both DfT's Street Manager and Scotland's SRWR are the authoritative sources of street work permit data for England and Scotland.
They make available large quantities of archived permit data every month and have done so for several years.
This equates to a lot of data and processing it can be slow and painful if you're not careful.
This project can help you:
- Maintain a consistent and structured way to develop, test, and deploy street work permit data pipelines.
- Automate your development and deployment so you can focus on analysis and delivering value from the data.
- Utilise the power of Cloud Compute to process data faster.
- Utilise elements of the Modern Data Stack to allow for easy Dashboarding.
Important
This is only meant as a quickstart guide. It's not perfect and I'll try to make it more detailed as time goes on :)
You will need to complete a few extra steps before being able to fully use this pipeline so please read the pre-requisites.
-
You'll need Python installed locally on your system - I use Python 3.11.
-
You'll need a MotherDuck account.
-
You'll need an AWS service account.
-
You'll need both the AWS CLI and Terraform correctly configured on your local system.
-
You'll need to ensure that you have the correct environment variables.
If you follow my set up you'll need...
- AWS Secret Manager.
- A MotherDuck connection token.
- A MotherDuck database name.
If you want to use MotherDuck as your end destination then you'll need to make sure that you have env variables set up that point to the correct database and schemas.
git clone https://github.com/CHRISCARLON/Open-Street-Works-Data-Pipeline.git
cd Open-Street-Works-Data-Pipeline
This will create a local copy of the repository on your machine and take you into it.
python -m venv venv
source venv/bin/activate
pip install poetry
poetry install
Poetry will read the pyproject.toml file and install the required packages in your newly created virtual environment.
Create an AWS account if you don't have one already.
If you're not comfortable using the AWS CLI then navigate to the AWS Secrets Manager console in your browser and create a new secret to store your environment variables.
At runtime, call the below function in the code to retrieve your secrets using the AWS SDK (boto3).
Please note that there are charges when using AWS Secrets Manager - but it's peanuts.
Alternatively, you can use a different method to manage environment variables, such as a .env file or another secrets management tool.
Sign up for a MotherDuck account at MotherDuck.
Obtain the necessary credentials and connection details for your MotherDuck account - a MotherDuck Token, for example.
You can use the default my_db that is provided with your new MotherDuck account, but I recommend that you create a specific database for your permit data.
Create database schemas for each year - this is where you'll load your data tables for each month of the relevant year.
For example...
Note
For the most recent permit data use monthly_permit_main.py and for bulk historic data use historic_permit_main.py.
For this example use the most recent permit data.
Important
Make sure that you have configured everything before running.
I'd recommend going through the monthly_permit_main.py file and become comfortable with it first.
Ensure that your MotherDuck token is accessible at runtime and that you have the correct schemas set up. For local development, you could EXPORT the token via the terminal.
Run the following commands to launch the pipeline and process your data.
python src/monthly_permit_main.py
or if you want to use Docker
docker-compose up
This docker-compose.yml is already set up and so is the Dockerfile.
This will execute the pipeline, process the Street Manager permit data, and then load it into MotherDuck for further processing and/or analytics.
This project leverages the power of DBT to make analysis quick, easy, and scalable.
If you run this in a Python venv then you will need to run the run_dbt_jobs.sh seperately to run the DBT models. If don't run the DBT models you will only have the raw data tables in MotherDuck.
There are 10+ DBT models that automate the creation of different aggregated analysis tables - each one is focused on a different type of insight.
To run the dbt shell script open up your terminal cd into the DBT folder and run:
chmod +x run_dbt_jobs.sh
and then
./run_dbt_jobs.sh
If you run this in Docker then the DBT models automatically run.
You can use MotherDuck's UI to run further queries on the data if you'd like to!
MotherDuck connects into many modern BI tools such as HEX and Preset.
Check out the videos below from MotherDuck for some cool tutorials on how to build dashboards with data in MotherDuck.
Note
I'm still writing this up!
It will contain instructions on:
-
Deploying a Docker Container to an AWS ECR (making use the of the handy Makefile!)
-
Using Terraform to configure and deploy an AWS Fargate Task on AWS ECS.