Resources for tackling record linkage / deduplication / data matching problems

ropeladder/record-linkage-resources

Record Linkage Resources

Resources for tackling record linkage (also known as deduplication, data matching, entity resolution)

Note: If you're looking for file deduplication software, you're in the wrong place! This page focuses on deduplicating datasets.

Also note: Nor is this page about deduplication software used in backup and storage.

Record linkage attempts to identify duplicate records in messy data. It is a thorny problem that crops up wherever data about real-world entities (most often people) must be reconciled: in census and statistical bureaus, medical organizations, the social sciences, and of course commercial business.

For example, are these records the same person? Record linkage is how you make the computer decide--quickly.

| Name | Address | Phone |
| --- | --- | --- |
| Bill Smith | 123 N. Main St. | 555-1235 |
| Smith, William K. | 123 Main | - |
| W. K. Smith | North Main Street | 222-555-1234 |
| Bill Schmidt | 1230 Main St. | 542-1235 |
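
To make the question above concrete, here is a minimal sketch of one naive way to score how similar two of these records are, using only Python's standard library. It is not the method of any tool listed on this page. The helper names (`normalize`, `similarity`, `record_score`) and the field weights are assumptions chosen for illustration; real record linkage software uses more careful comparison functions, blocking, and statistically estimated weights.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# The four example records from the table above: (name, address, phone).
RECORDS = [
    ("Bill Smith", "123 N. Main St.", "555-1235"),
    ("Smith, William K.", "123 Main", "-"),
    ("W. K. Smith", "North Main Street", "222-555-1234"),
    ("Bill Schmidt", "1230 Main St.", "542-1235"),
]

def normalize(text):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))

def similarity(a, b):
    """Return a 0..1 string similarity; 0.0 if either value is missing."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def record_score(r1, r2, weights=(0.5, 0.3, 0.2)):
    """Weighted average of per-field similarities (weights are illustrative)."""
    return sum(w * similarity(x, y) for w, x, y in zip(weights, r1, r2))

# Compare every pair of records and print a score for each pair.
for (i, r1), (j, r2) in combinations(enumerate(RECORDS), 2):
    print(f"{i} vs {j}: {record_score(r1, r2):.2f}")
```

A score like this still needs a threshold to turn it into a match / non-match decision. Plain string similarity also knows nothing about nicknames (Bill vs. William) or abbreviations (N. vs. North), which is part of why the problem is thorny.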

Background

Documents

Talks

Books

Free software

(last updated, stars)

Python

Java

R

Spark

Other

Commercial software and solutions

For SAS

Data Cleaning

Name Parsers

Python

JavaScript

Papers

Organizations

Misc

To Do

  • list compatible data sources for software (CSV, SQL DB, JSON, data frame, etc.)
  • GUI or not?
  • list algorithms and techniques for software (deterministic, probabilistic, graph, etc.)

Suggestions / contributions welcome! I am not an expert on record linkage; this is simply a list of things I've found when working on a difficult deduplication problem for Thicket.io.
