
Project: Data Pipelines with Airflow

Introduction

A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to its data warehouse ETL pipelines and has come to the conclusion that the best tool to achieve this is Apache Airflow.

The expected deliverables are high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. Sparkify has also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and wants to run tests against the datasets after the ETL steps have executed to catch any discrepancies.

The source data resides in S3 and needs to be processed in Sparkify’s data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

Data Sets

For this project, you’ll be working with two datasets. Here are the S3 links for each:

  • Log data: s3://udacity-dend/log_data
  • Song data: s3://udacity-dend/song_data

Project Template

The project consists of three major components:

  • The DAG template, which has all the imports and task dependencies.

  • The operators folder with operator templates:

    • The stage operator loads any JSON-formatted files from S3 into Amazon Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided; its parameters specify where in S3 the files are loaded from and what the target table is (see the sketch after this list).

    • The dimension and fact operators utilize the provided SQL helper class to run data transformations. Most of the logic is within the SQL transformations, and the operator takes as input a SQL statement and the target database against which to run the query.

    • The final operator is the data quality operator, which is used to run checks on the data itself. The operator’s main functionality is to receive one or more SQL-based test cases along with their expected results and execute the tests. For each test, the actual result is compared with the expected result; if they do not match, the operator raises an exception and the task retries and eventually fails (a sketch follows this list).

  • A helper class for the SQL transformations
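
To make these descriptions concrete, here is a minimal sketch of the stage operator and the data quality operator, assuming a recent Airflow 2.x installation with the Amazon and Postgres providers. Class names, parameters, defaults, and SQL below are illustrative assumptions, not the project’s actual code.

```python
# Sketch of a stage operator: builds and runs a Redshift COPY statement from its
# parameters. Names and defaults here are assumptions for illustration only.
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


class StageToRedshiftOperator(BaseOperator):
    copy_sql = """
        COPY {table}
        FROM 's3://{bucket}/{key}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_path}'
    """

    def __init__(self, redshift_conn_id="redshift", aws_credentials_id="aws_credentials",
                 table="", s3_bucket="", s3_key="", json_path="auto", **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.json_path = json_path

    def execute(self, context):
        # Pull the AWS keys from the aws_credentials connection and run the COPY.
        credentials = AwsBaseHook(aws_conn_id=self.aws_credentials_id,
                                  client_type="s3").get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        redshift.run(self.copy_sql.format(
            table=self.table,
            bucket=self.s3_bucket,
            key=self.s3_key,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_path=self.json_path,
        ))
```

And a corresponding sketch of the data quality operator:

```python
# Sketch of a data quality operator: runs SQL test cases and compares the first
# value returned with the expected result, raising on any mismatch so Airflow's
# retry settings take over. Names are assumptions for illustration only.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    def __init__(self, redshift_conn_id="redshift", tests=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Each test is a dict such as:
        # {"sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected": 0}
        self.tests = tests or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for test in self.tests:
            actual = redshift.get_records(test["sql"])[0][0]
            if actual != test["expected"]:
                # Failing the task here lets Airflow retry it and eventually fail it.
                raise ValueError(
                    f"Data quality check failed: {test['sql']} "
                    f"returned {actual}, expected {test['expected']}"
                )
            self.log.info("Data quality check passed: %s", test["sql"])
```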

Dependencies are set so that the graph view of the DAG follows the flow shown in the project’s graph image.
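
For reference, the snippet below sketches how such a DAG might wire its tasks together with Airflow’s bit-shift operators. It assumes Airflow 2.4 or newer, and the DAG id, task IDs, schedule, default arguments, and exact ordering are illustrative placeholders (EmptyOperator stands in for the custom operators described above), not the project’s actual DAG.

```python
# Illustrative-only DAG skeleton: EmptyOperator placeholders wired in a
# stage -> fact -> dimensions -> quality-check order.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    "owner": "sparkify",
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    "sparkify_etl",                      # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2019, 1, 12),
    schedule="@hourly",
    catchup=False,
) as dag:
    begin = EmptyOperator(task_id="Begin_execution")
    stage_events = EmptyOperator(task_id="Stage_events")
    stage_songs = EmptyOperator(task_id="Stage_songs")
    load_songplays = EmptyOperator(task_id="Load_songplays_fact_table")
    load_dimensions = [
        EmptyOperator(task_id=f"Load_{name}_dim_table")
        for name in ("user", "song", "artist", "time")
    ]
    quality_checks = EmptyOperator(task_id="Run_data_quality_checks")
    end = EmptyOperator(task_id="Stop_execution")

    # Bit-shift operators express the task dependencies shown in the graph view.
    begin >> [stage_events, stage_songs] >> load_songplays
    load_songplays >> load_dimensions >> quality_checks >> end
```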

Airflow Connections

For AWS credentials, enter the following values in Airflow’s UI:

  • Conn Id: Enter aws_credentials.
  • Conn Type: Enter Amazon Web Services.
  • Login: Enter your Access key ID from the IAM User credentials you downloaded earlier.
  • Password: Enter your Secret access key from the IAM User credentials you downloaded earlier.

Use the following values in Airflow’s UI to configure the connection to Redshift:

  • Conn Id: Enter redshift.
  • Conn Type: Enter Postgres.
  • Host: Enter the endpoint of your Redshift cluster, excluding the port at the end. You can find this by selecting your cluster on the Clusters page of the Amazon Redshift console. IMPORTANT: Make sure NOT to include the port at the end of the Redshift endpoint string.
  • Schema: Enter dev. This is the Redshift database you want to connect to.
  • Login: Enter awsuser.
  • Password: Enter the password you created when launching your Redshift cluster.
  • Port: Enter 5439.
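
If you prefer to script this instead of clicking through the UI, the sketch below registers both connections programmatically. It assumes Airflow 2.x with its metadata database already initialised; the placeholder values must be replaced with your own credentials and cluster endpoint.

```python
# Register the aws_credentials and redshift connections in Airflow's metadata
# database. Placeholder values are assumptions; substitute your own.
from airflow import settings
from airflow.models import Connection

aws_credentials = Connection(
    conn_id="aws_credentials",
    conn_type="aws",                      # shown as "Amazon Web Services" in the UI
    login="<IAM access key ID>",
    password="<IAM secret access key>",
)

redshift = Connection(
    conn_id="redshift",
    conn_type="postgres",
    host="<redshift-cluster-endpoint-without-port>",
    schema="dev",
    login="awsuser",
    password="<redshift-password>",
    port=5439,
)

session = settings.Session()
for conn in (aws_credentials, redshift):
    # Skip connections that already exist so the script can be re-run safely.
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
session.commit()
```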

Instructions

  • Run /opt/airflow/start.sh to start the Airflow webserver.
  • Once the Airflow web server is ready, you can access the Airflow UI by clicking on the blue Access Airflow button.

This project was completed as part of the Udacity Data Engineer Nanodegree program.

Written with StackEdit.
