Streaming Data Quality Validation

Overview

This project aims to validate the quality of streaming data in real-time, using NYC Taxi Rides datasets as a case study. It leverages the power of Kafka for streaming, Flink for data processing, and StreamDQ for data quality validation, ensuring high-quality training data for modern data-driven applications.

Requirements

Docker
Docker-compose
Python 3
Java 11
Maven
StreamDQ

Framework

Google Cloud Storage (for data storage)
Kafka (for streaming)
Flink (for processing)
StreamDQ (for data quality validation)

Datasets

Our experiments rely on the NYC green taxi trip records datasets spanning from July 2022 to December 2023. These datasets provide a comprehensive view of taxi activities in New York City, uncovering travel patterns, fare insights, and service usage. We automate the download and upload of these datasets to Google Cloud Storage using the script upload-file-to-gcp.py.

Usage

To ensure a smooth execution, please make sure all the requirements listed in the Requirements section are properly installed on your system.

Clone the repository:

git clone https://github.com/zy969/streaming-data-quality-validation.git

Replace the local Maven streamdq JAR with our modified version from src/main/resources/lib/streamdq-1.0-SNAPSHOT.jar and update key.json with your own Google Cloud key.
Build the Docker image and run Docker Containers:
```
./scripts/build-and-run.sh
```
To monitor the logs of the running containers:
```
docker-compose logs
```
To stop and remove containers:
```
docker-compose down
```

Troubleshooting

Java Version Issue: Run mvn --version to ensure your Java version is 11.
Bash Script Issue: If encountering bash script errors, ensure the script uses LF (Unix) line endings instead of CRLF (Windows). Use a text editor or dos2unix to convert the line endings.

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
scripts		scripts
src/main		src/main
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
key.json		key.json
pom.xml		pom.xml
workflow.png		workflow.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts

scripts

src/main

src/main

.gitignore

.gitignore

Dockerfile

Dockerfile

README.md

README.md

docker-compose.yml

docker-compose.yml

key.json

key.json

pom.xml

pom.xml

workflow.png

workflow.png

Repository files navigation

Streaming Data Quality Validation

Overview

Requirements

Framework

Datasets

Usage

Troubleshooting

About

Releases

Packages

Contributors 3

Languages

zy969/streaming-data-quality-validation

Folders and files

Latest commit

History

Repository files navigation

Streaming Data Quality Validation

Overview

Requirements

Framework

Datasets

Usage

Troubleshooting

About

Topics

Resources

Stars

Watchers

Forks

Languages