Dagster & DBT simple example

By the time of the first commit of this project, I have worked at AnglicareSA for a year and wanted to do something to remember the anniversary. I read the article "The Unbundling of Airflow" by Gorkem Yurtseven at Fal.ai by a chance a few months ago. Coincidentally, I was looking for an on-premises alternative to SSIS. I then finished a prototype of that architecture using Dagster, DBT, Metabase, which are things listed in the article, at my organisation. As the architecture is simple and easy to setup plus my friends and colleagues are interested in Data Engineering, I'd like to share it with everyone who has the same interest. I, however, used the actual data in the org system for that prototype, so I cannot share the prototype straight away but create this project with open data, which is IMDB.

Imagine you have these constraints for your system to run a data warehouse: Windows servers only, everything must run on-premise, and commercial or cloud-based tools may be unaffordable. Or, you are just a student who wants pursue a career as a Data Engineer and just don't want to spend too much time on learning and setting up a whole architecture on cloud. Cloud services setup sometimes is not for everyone and can hinder you from learning something quickly. For those reasons, I have come up with this architecture as a personal, easy-to-setup architecture.

Dagster is the best alternative to Airflow, which only runs on Linux, and really easy to setup to run locally on a Windows machine. It has cool features for automating jobs such as Schedules, Sensor, Partitions, and Backfills. I used a simple idea for the data pipeline in IMDB project that comprises 3 steps:

Downloading 3 zip files from IMDB.
Load them to a staging database.
Trigger DBT operation to transform data in the staging and load final result to final database, considered a Data Warehouse.

The project also demonstrates how to parameterise connections, configure tasks, and schedule a job in Dagster. The project merely uses Schedule feature, but you add a new scenario to use the Sensor feature where job now listens to changes of an SFTP/FTP/folder to trigger the pipeline.

DBT is the rising star in Data Transformation. If you're a Data Analyst, a Data Engineer, or a Data Scientist, SQL is compulsory right?! Instead of using low-code solutions like SSIS/Azure Data Factory that don't show you the actual transformation logic and requires triggering the pipeline to debug and view intermediate output, DBT is like an SQL template engine where you can compile, debug, test, and run your SQL statements. You know exactly what will happen in the Data Transformation step regardless of ETL or ELT. What if a Data Analyst has a new idea for their BI report and doesn't want to bother you with implementing the transformation logic for them? They can test their logic in SQL Management studio and ask you to put it into the DBT project, or they just do it themselves :). The beauty here is Data Analyst would be able to collaborate on a data pipeline development. Moreover, DBT has a built-in documentation generator so that you can make your work more transparent to your colleagues. The documentation site even generates a model lineage! Have you ever had to take over an SSIS project and asked yourself where to start to fix a model in a package? If the answer is yes, you'll definitely love DBT's documentation site.

Here is the screenshot of this architecture. You can install your favourite Python packages to this project and play with it. As I didn't want to write too much code for SQLite and want my logic portable, I used SQLalchemy and pandas for data extraction. It means you can replace SQLite with your selected one with very less code changes. Some ideas couldn't be implemented due to using SQLite. Descriptive Statistics tables were there because Metabase's SQLite driver or implementation is just wonky.

Dataset

This example uses datasets that are shared by IMDB for non-commercial and personal use. You can find more about its license here.

Prerequisites

Visual Studio Code
Python 3.9.x
Docker Desktop (Community Edition)

Folder structure

Dagster project is put into "imdb" folder.
DBT project lives in "imdb_dbt".
"raw_files" folder will be generated when the "setup.ps1" script is running.
"metabase" stores information of the IMDB Dashboard and Questions. It also stores the account to access to Metabase as it doesn't have a default account. Please use this username and password to login: admin@metabase.local:config1.

Setup

Please follow these steps:

Setup virtual environment and required package installation.

.\setup.ps1

Start Dagster Daemon and UI. You should see the UI server hosted at port 3000

.\start_dagster.ps1

Run this script to generate the documentation site that shows model lineage and models' description. Note that, You should run this tool after running the data pipeline in Dagster UI.

.\start_dbt_doc.ps1

Run Metabase. Note that, You should run this tool after running the data pipeline in Dagster UI.

.\start_metabase.ps1

After starting Dagster, please go to localhost:3000. You should see 2 repositories like in the screenshot.

Then choose "local_job_etl" job from "imdb_local". Select "Launch Pad" tab, then you'll see the configuration of this job and its scheduled time.

When you click on "Schedule: At 06:00 ACST", you'll see the details of this scheduled task like in the screenshot below.

Go back to the previous screen and trigger the pipeline. Dagster will move to the job run details page as this screenshot.

Now go run start_dbt_doc.ps1 script and go to localhost:8080 to see information of staging and final models as well as the model lineage as in the screenshot below.

Next please run start_metabase.ps1 script to start Metabase. Please start Docker prior to this step. Go to localhost:50000 then use the credential given in Folder structure section to login. Click on "IMDB" then "IMDB Dashboard" to view the dashboard as below.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.vscode		.vscode
images		images
imdb		imdb
imdb_dbt		imdb_dbt
metabase		metabase
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
bi_stack.yml		bi_stack.yml
dagster.yaml		dagster.yaml
requirements.txt		requirements.txt
setup.ps1		setup.ps1
start_dagster.ps1		start_dagster.ps1
start_dbt_doc.ps1		start_dbt_doc.ps1
start_metabase.ps1		start_metabase.ps1
workspace.yaml		workspace.yaml

License

daihuynh/dagster_dbt_metabase_simple_solution

Folders and files

Latest commit

History

Repository files navigation

Dagster & DBT simple example

Dataset

Prerequisites

Folder structure

Setup

About

Topics

Resources

License

Stars

Watchers

Forks

Languages