Spark-ETL-Data-Pipeline-using-SparkStreaming-HDFS-Kafka-Hive

OBJECTIVE

The objectives of this project are to gain experience coding with:

  • Spark
  • Spark SQL
  • Spark Streaming
  • Kafka
  • Scala and functional programming

DATA SET

The data set is the STM (Société de transport de Montréal) GTFS data, the same data set analyzed in Course 1.

PROBLEM STATEMENT

STM data arrives every day, and an ETL pipeline must be run to enrich it for reporting and analysis in real time. The data is split into two parts:

  1. A set of tables that build the dimensions (batch style)
  2. Stop times that need to be enriched for analysis and reporting (streaming); hedged sketches of both parts follow
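A minimal sketch of the batch part is shown below. It assumes the static GTFS files sit under a hypothetical HDFS path and that the dimension tables land in a Hive database named `stm`; the actual paths and table names in this repository may differ.

```scala
import org.apache.spark.sql.SparkSession

object BuildDimensions {

  def main(args: Array[String]): Unit = {
    // Hive support so the dimension tables can be queried later by the streaming job.
    val spark = SparkSession.builder()
      .appName("stm-build-dimensions")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical HDFS location of the static GTFS files; adjust to the real layout.
    val basePath = "hdfs:///user/stm/gtfs"

    // GTFS static files that typically serve as dimensions.
    val dimensionFiles = Seq("trips", "routes", "stops", "calendar_dates")

    dimensionFiles.foreach { name =>
      val df = spark.read
        .option("header", "true")        // GTFS files ship with a header row
        .option("inferSchema", "true")
        .csv(s"$basePath/$name.txt")

      // Overwrite the Hive dimension table on each daily batch run.
      df.write
        .mode("overwrite")
        .saveAsTable(s"stm.$name")
    }

    spark.stop()
  }
}
```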

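And a sketch of the streaming part, assuming stop-time records are published to a Kafka topic named `stoptimes` as CSV lines and that Spark 3's `from_csv` is available; the broker address, topic name, schema subset, and output paths are placeholders, not the repository's actual configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object EnrichStopTimes {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stm-enrich-stop-times")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Subset of the GTFS stop_times columns assumed to be carried on the Kafka topic.
    val stopTimeSchema = StructType(Seq(
      StructField("trip_id", StringType),
      StructField("arrival_time", StringType),
      StructField("departure_time", StringType),
      StructField("stop_id", StringType),
      StructField("stop_sequence", IntegerType)
    ))

    // Read raw stop-time events from Kafka (placeholder broker and topic).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "stoptimes")
      .load()

    // Kafka delivers bytes; parse each CSV value into typed columns.
    val stopTimes = raw
      .selectExpr("CAST(value AS STRING) AS csv")
      .select(from_csv($"csv", stopTimeSchema, Map.empty[String, String]).as("st"))
      .select("st.*")

    // Dimension tables built by the batch job (static side of a stream-static join).
    val trips  = spark.table("stm.trips")
    val routes = spark.table("stm.routes")

    // Enrich each stop-time event with its trip and route attributes.
    val enriched = stopTimes
      .join(trips, Seq("trip_id"))
      .join(routes, Seq("route_id"))

    // Persist the enriched stream as Parquet on HDFS for reporting (placeholder paths).
    val query = enriched.writeStream
      .format("parquet")
      .option("path", "hdfs:///user/stm/enriched_stop_times")
      .option("checkpointLocation", "hdfs:///user/stm/checkpoints/enriched_stop_times")
      .start()

    query.awaitTermination()
  }
}
```

The join here is a stream-static join: the Kafka stream is joined against the Hive dimension tables produced by the batch job, which Structured Streaming supports for inner joins without extra state management.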
