The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated May 14, 2024 - Python
The open-source tool for building high-quality datasets and computer vision models
A Doctor for your data
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at an unparalleled scale.
Interactively explore unstructured datasets from your dataframe.
A curated, but incomplete, list of data-centric AI resources.
Curated list of open source tooling for data-centric AI on unstructured data.
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
Lesson guide and textbook for "Data as a Science" course.
A tool for downloading from public image boards (which allow scraping), previewing your images & tags, and editing your images & tags. Additional tabs let you download other code repositories as well as S.O.T.A. diffusion and CLIP models. Custom datasets can be added!
Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)
Client interface for all things Cleanlab Studio
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
Curated list of known efforts in collecting and/or curating chemical/materials data
AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.
Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity" (Nature Communications, 2020)
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.