
Development Notes

This repository stores useful solutions for problems that I faced during development.

Amazon Web Services offers several ways to query data in DynamoDB:

  1. AWS Console for DynamoDB
  2. NoSQL Workbench for DynamoDB
  3. AWS CLI
  4. AWS SDK

There is a lack of flexibility when querying data using the AWS Console or NoSQL Workbench. Thus, another way to process data from DynamoDB is to build a script using the AWS SDK. For this situation, I used the boto3 library for Python.

This option allows me to build a simple script to retrieve data from a DynamoDB table and process it using common libraries such as pandas, as shown in the sketch below. This approach gave me a lot of freedom to test locally and process data in an intuitive way, avoiding the addition of a new component to the infrastructure, such as an AWS Lambda function.
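As an illustration, here is a minimal sketch of such a script; the table name, partition key name, and key value are hypothetical placeholders:

```python
import boto3
import pandas as pd
from boto3.dynamodb.conditions import Key

# Hypothetical table and key names -- replace with your own.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")

# Query all items for one partition key, following pagination
# until DynamoDB stops returning a LastEvaluatedKey.
items = []
query_kwargs = {"KeyConditionExpression": Key("pk").eq("customer#123")}
while True:
    response = table.query(**query_kwargs)
    items.extend(response["Items"])
    if "LastEvaluatedKey" not in response:
        break
    query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

# Load the results into pandas for local, interactive processing.
df = pd.DataFrame(items)
print(df.describe(include="all"))
```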

Another important point: DynamoDB calls made through the SDK do not have extra costs associated with them, while querying data through the Console or Workbench does.

Deequ is an AWS library built on top of Apache Spark for defining unit tests for data, which measure data quality in large datasets. Some of the data quality metrics available:

  • Completeness
  • Uniqueness
  • Consistency
  • Size

Deequ provides features like:

  • Data Profiling: supports single-column profiling of the data;
  • Constraint Suggestions: provides built-in functionality to assist users in finding reasonable constraints for their data;
  • Metrics Computation: once we know what to test, we can run the tests and compute the metrics;
  • Constraint Verification: we can define test cases and get results to be used for reporting.

This tool can be integrated with AWS services such as AWS Glue and SageMaker, or used locally through the PyDeequ library for Python (see the sketch after the list below). The evaluation of data quality metrics can be performed at several stages of an ETL or ELT pipeline, such as:

  • Validation of source data
  • Validation of source data load
  • Validation of transformed data
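As a minimal local sketch of constraint verification with PyDeequ: the dataset path and the `order_id` column are hypothetical, and the exact Spark setup may differ in your environment:

```python
import os
os.environ["SPARK_VERSION"] = "3.3"  # assumption: PyDeequ uses this to pick the matching Deequ jar

from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

# Hypothetical input: any Spark DataFrame works here.
df = spark.read.parquet("s3://my-bucket/orders/")

check = Check(spark, CheckLevel.Error, "orders data quality")

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.isComplete("order_id")          # completeness
             .isUnique("order_id")            # uniqueness
             .hasSize(lambda size: size > 0)  # size
    )
    .run()
)

# Turn the verification result into a DataFrame for reporting.
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```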
