The Data Pipeline using Google Cloud Dataproc, Cloud Storage and BigQuery

bozzlab/pyspark-dataproc-gcs-to-bigquery

PySpark Example Pipeline using Google Cloud Dataproc

Prerequisites

Python >= 3.6
A Google Cloud Platform project

Follow the steps below.

  1. Create a bucket for initialization actions, then copy the install script to the bucket
gsutil mb gs://<bucket_name>
gsutil cp initz_action/install.sh gs://<bucket_name>/initz_action/install.sh
  2. Enable the Dataproc API service

  3. Enable the BigQuery API service

  4. Generate mock data

python generator_sentance.py 

This generates a new text file, text_sample.txt. Copy it to GCS:

gsutil cp text_sample.txt gs://<bucket_name>/text_sample.txt
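
The contents of generator_sentance.py are not reproduced in this README. As a rough idea of what such a generator could look like, the sketch below writes random sentences to a text file, one per line. The word list, the function names and the default line count are hypothetical and may differ from the real script.

```python
import random

# Hypothetical vocabulary; the real script may use different words.
WORDS = ["spark", "dataproc", "bigquery", "cloud", "storage", "pipeline", "data"]


def make_sentence(rng, min_words=3, max_words=8):
    """Build one sentence from randomly chosen words."""
    length = rng.randint(min_words, max_words)
    return " ".join(rng.choice(WORDS) for _ in range(length))


def write_sample(path="text_sample.txt", lines=100, seed=None):
    """Write `lines` random sentences to `path`, one per line."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as handle:
        for _ in range(lines):
            handle.write(make_sentence(rng) + "\n")
    return path
```

Calling write_sample() with no arguments would create text_sample.txt in the current directory, ready to be copied to the bucket.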
  5. Create the Dataproc cluster, then wait until it has been created
bash dataproc_cluster_scripts/create.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <INITIALIZATION_ACTION_GCS_LOCATION>

For instance,

bash dataproc_cluster_scripts/create.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/initz_action/install.sh
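
create.sh itself is not shown here. Assuming it wraps `gcloud dataproc clusters create`, the sketch below illustrates how the three arguments could map onto that command. The helper name build_create_command and the bucket name are hypothetical, and the real script may pass additional flags.

```python
import shlex


def build_create_command(cluster_name, region, gcs_uri):
    """Return the gcloud argv for creating a cluster with one init action."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("initialization action must be a gs:// URI")
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region={}".format(region),
        "--initialization-actions={}".format(gcs_uri),
    ]


cmd = build_create_command(
    "bozzlab-spark-cluster",
    "asia-southeast1",
    "gs://my-bucket/initz_action/install.sh",  # hypothetical bucket name
)
# Print the command in a copy-pasteable, shell-quoted form.
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True).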
  6. Set the environment variables for the Python environment
export DRIVER=yarn # Use "local" instead to run in local development
export PROJECT_ID=<PROJECT_ID> # The Google Cloud project ID
export DATASET=<DATASET> # The BigQuery dataset name
export TABLE=<TABLE> # The BigQuery table name
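
How the job reads these variables is not shown in this README. A minimal sketch of one way to do it, assuming the four variables above are the only inputs: the function name load_settings, the returned keys and the "local" fallback for DRIVER are assumptions made for illustration.

```python
import os


def load_settings(env=None):
    """Collect pipeline settings from environment variables."""
    env = os.environ if env is None else env
    required = ("PROJECT_ID", "DATASET", "TABLE")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        # Spark master: "yarn" on Dataproc, "local" for local development.
        "master": env.get("DRIVER", "local"),
        # BigQuery table written as project.dataset.table.
        "table": "{}.{}.{}".format(env["PROJECT_ID"], env["DATASET"], env["TABLE"]),
    }
```

Failing early on a missing variable keeps a misconfigured job from reaching the cluster before the problem is noticed.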
  7. Submit the PySpark job to Dataproc
bash exec.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri gs://<bucket_name>/text_sample.txt

For instance,

bash exec.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/text_sample.txt
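
The PySpark job that exec.sh submits is not included in this README. To illustrate a GCS to BigQuery flow of this kind, the sketch below counts words in the text file and writes the counts to a BigQuery table. The word-count logic, the function names, the temp_bucket parameter and the "overwrite" mode are all assumptions. It also assumes the spark-bigquery connector is available on the cluster.

```python
import re
from operator import add


def tokenize(line):
    """Lowercase a line and split it into alphabetic words."""
    return re.findall(r"[a-z']+", line.lower())


def run_job(gcs_uri, table, temp_bucket, master="yarn"):
    """Count words in a text file on GCS and write the counts to BigQuery."""
    # Imported here so tokenize() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master(master)
        .appName("gcs-to-bigquery")
        .getOrCreate()
    )
    lines = spark.read.text(gcs_uri)  # one row per line, in column "value"
    counts = (
        lines.rdd.flatMap(lambda row: tokenize(row.value))
        .map(lambda word: (word, 1))
        .reduceByKey(add)
        .toDF(["word", "count"])
    )
    # The connector stages data in a GCS bucket before loading it into BigQuery.
    (
        counts.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)
        .mode("overwrite")
        .save(table)
    )
    spark.stop()
```

With master="local" the same function could be tried during local development, provided the connector and credentials are available there.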
  8. (Optional) Delete the cluster
bash dataproc_cluster_scripts/delete.sh --cluster_name <CLUSTER_NAME> --region <REGION>

For instance,

bash dataproc_cluster_scripts/delete.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1
