Python function to generate a mask analysis
Updated Jul 22, 2017 - Jupyter Notebook
Simple Spark wrapper for validating data
Generates a score from 0 to 100 for how closely two individuals' full names match, where 100 is the closest match. The scoring is based on a series of tests, algorithms, AI, and an ever-growing body of machine-learning-generated knowledge.
FIMUS imputes numerical and categorical missing values by using a data set’s existing patterns including co-appearances of attribute values, correlations among the attributes and similarity of values belonging to an attribute.
Project for the "Data and Information Quality" course at Politecnico di Milano - AY 2023/2024 - Data Issues: Duplication, Variable Types - ML Task: Classification
Jam MA-plots, volcano plots, other relevant genomics visualizations
🚚 Agile Data Science Workflows made easy with PySpark
DsProfiling – Dataset Profiling
Implementation of data typology for imbalanced datasets.
This repository provides R scripts for reproducing the virtual species generation, species distribution modeling, and final figures of the associated published manuscript.
PySpark Acceleration, Capgemini 2022
This GitHub repository provides a comprehensive set of tools and algorithms for detecting fraud anomalies in various data sources. Fraudulent activities can have severe consequences, impacting businesses and individuals alike. With this repository, we aim to empower researchers with effective techniques to identify and prevent fraudulent behavior.
Data quality checks in your dbt flow
PoC for Soda Contracts against Vertica DB
DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)
Profiles the fields to generate statistics on each column specified.
DsFeatFreqComp – Dataset Feature-Frequency Comparison R Package
Scripts I wrote at my job which could be helpful to others
Guidelines to help you manage your Antarctic biodiversity data
This is a tool developed in Python to assist with the data governance process, particularly during the Mainframe>MDM>PIC migration project. The team checks the integrity of the data and evaluates whether business rules are being fulfilled by synchronizing the data between the MDM platform and the current item information on the mainframe. This tool's purpose…