This project aims to validate the quality of streaming data in real-time, using NYC Taxi Rides datasets as a case study. It leverages the power of Kafka for streaming, Flink for data processing, and StreamDQ for data quality validation, ensuring high-quality training data for modern data-driven applications.
- Docker
- Docker-compose
- Python 3
- Java 11
- Maven
- StreamDQ
- Google Cloud Storage (for data storage)
- Kafka (for streaming)
- Flink (for processing)
- StreamDQ (for data quality validation)
Our experiments rely on the NYC green taxi trip records datasets spanning from July 2022 to December 2023. These datasets provide a comprehensive view of taxi activities in New York City, uncovering travel patterns, fare insights, and service usage. We automate the download and upload of these datasets to Google Cloud Storage using the script upload-file-to-gcp.py
.
To ensure a smooth execution, please make sure all the requirements listed in the Requirements section are properly installed on your system.
-
Clone the repository:
git clone https://github.com/zy969/streaming-data-quality-validation.git
-
Replace the local Maven
streamdq
JAR with our modified version fromsrc/main/resources/lib/streamdq-1.0-SNAPSHOT.jar
and updatekey.json
with your own Google Cloud key. -
Build the Docker image and run Docker Containers:
./scripts/build-and-run.sh
-
To monitor the logs of the running containers:
docker-compose logs
-
To stop and remove containers:
docker-compose down
-
Java Version Issue: Run
mvn --version
to ensure your Java version is 11. -
Bash Script Issue: If encountering bash script errors, ensure the script uses LF (Unix) line endings instead of CRLF (Windows). Use a text editor or
dos2unix
to convert the line endings.