Real-time streaming data quality validation project using NYC Taxi Rides datasets, leveraging Kafka, Flink, and StreamDQ.

zy969/streaming-data-quality-validation

Streaming Data Quality Validation

Overview

This project validates the quality of streaming data in real time, using the NYC Taxi Rides datasets as a case study. It uses Kafka for streaming, Flink for data processing, and StreamDQ for data quality validation, with the goal of ensuring high-quality training data for data-driven applications.

Requirements

  • Docker
  • Docker Compose (the docker-compose command)
  • Python 3
  • Java 11
  • Maven
  • StreamDQ

Framework

  • Google Cloud Storage (for data storage)
  • Kafka (for streaming)
  • Flink (for processing)
  • StreamDQ (for data quality validation)

Workflow

Datasets

Our experiments use the NYC green taxi trip record datasets from July 2022 to December 2023. These datasets record taxi activity in New York City, including travel patterns, fares, and service usage. The script upload-file-to-gcp.py automates downloading the datasets and uploading them to Google Cloud Storage.
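
As a rough illustration of the date range above, the sketch below lists one monthly file name per month from July 2022 to December 2023. The naming scheme green_tripdata_YYYY-MM.parquet and the helper names are assumptions made for this example; the repository's upload-file-to-gcp.py may name or fetch the files differently, and the download and upload steps themselves are not shown.

```python
from datetime import date


def month_range(start: date, end: date):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def green_taxi_file_names(start: date, end: date) -> list[str]:
    """Build one file name per month in the range.

    Assumption: files follow the pattern green_tripdata_YYYY-MM.parquet.
    """
    return [f"green_tripdata_{d:%Y-%m}.parquet" for d in month_range(start, end)]


if __name__ == "__main__":
    names = green_taxi_file_names(date(2022, 7, 1), date(2023, 12, 1))
    # The range covers 6 months of 2022 plus 12 months of 2023.
    print(len(names), names[0], names[-1])
```

Keeping the name generation separate from any network code makes it easy to check that the full 18-month range is covered before anything is downloaded or uploaded.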

Usage

Before you start, make sure everything listed in the Requirements section is installed on your system.

  1. Clone the repository:

    git clone https://github.com/zy969/streaming-data-quality-validation.git
  2. Replace the streamdq JAR in your local Maven repository with our modified version from src/main/resources/lib/streamdq-1.0-SNAPSHOT.jar, and update key.json with your own Google Cloud key.

  3. Build the Docker image and run Docker Containers:

    ./scripts/build-and-run.sh
  4. To monitor the logs of the running containers:

    docker-compose logs
  5. To stop and remove containers:

    docker-compose down

Troubleshooting

  • Java Version Issue: Run mvn --version and check that the Java version it reports is 11.

  • Bash Script Issue: If you encounter bash script errors, make sure the script uses LF (Unix) line endings rather than CRLF (Windows). Use a text editor or dos2unix to convert the line endings.
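
If dos2unix is not available, the line-ending fix above can be approximated with a few lines of Python, which is already a requirement of this project. This is a minimal sketch rather than part of the repository; the function names are hypothetical, and the path in the usage note assumes the script location given in the Usage section.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file in place with LF endings.

    Returns True if the file was changed, False if it had no CRLF pairs.
    """
    p = Path(path)
    original = p.read_bytes()
    converted = crlf_to_lf(original)
    if converted == original:
        return False
    p.write_bytes(converted)
    return True
```

For example, convert_file("scripts/build-and-run.sh") would convert the build script and report whether anything changed. Working on bytes rather than text avoids any encoding guesswork.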
