
Gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.

Main Features

  • 64-, 32-, 16-, and 8-bit MinHash
  • 64- and 128-bit SimHash (see the conceptual sketch after this list)
  • Fast implementation in Rust
  • Multi-threaded thanks to rayon
  • Python bindings
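
For background on the SimHash feature, here is a minimal conceptual sketch of the algorithm from [2] (also used for near-duplicate detection of web pages in [3]) in plain Python. This is not Gaoya's API; the whitespace tokenization and the blake2b token hash are assumptions made purely for illustration.

# Conceptual SimHash [2] in plain Python; this is not Gaoya's API.
# Each token is hashed to 64 bits; every bit position accumulates +1 or -1,
# and the sign of each accumulator becomes one bit of the fingerprint.
import hashlib

def simhash64(tokens):
    counts = [0] * 64
    for token in tokens:
        # Stable 64-bit token hash (blake2b chosen here for illustration).
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a, b):
    # Similar documents yield fingerprints with a small Hamming distance [3].
    return bin(a ^ b).count("1")

a = simhash64("this is the first document".split())
b = simhash64("this is the second document".split())
print(hamming(a, b))  # small for near-duplicates; ~32 expected for unrelated docs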

Python Example

>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32, 
                                             jaccard_threshold=0.5, 
                                             num_bands=42, 
                                             band_size=3,
                                             num_hashes=42*3,
                                             analyzer='word', 
                                             lowercase=True, 
                                             ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> 
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
... 
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>> 
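
A note on the parameters: with num_bands bands of band_size hashes each, a pair of documents with Jaccard similarity s collides in at least one band with probability 1 - (1 - s^band_size)^num_bands ([1], Chapter 3). The sketch below (plain Python, independent of Gaoya's API) shows why the example's 42 bands of width 3 recall virtually all pairs at similarity 0.5; presumably jaccard_threshold then prunes the lower-similarity candidates that the bands let through.

# Probability that a pair with Jaccard similarity s lands in the same
# bucket in at least one of b bands of r minhashes: P(s) = 1 - (1 - s**r)**b.
# Standalone illustration; does not use Gaoya's API.
def candidate_probability(s, b=42, r=3):
    return 1.0 - (1.0 - s ** r) ** b

for s in (0.2, 0.3, 0.5, 0.7):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}")
# similarity 0.2 -> candidate probability 0.286
# similarity 0.3 -> candidate probability 0.683
# similarity 0.5 -> candidate probability 0.996
# similarity 0.7 -> candidate probability 1.000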

Installation

$ pip3 install gaoya

Examples

Document Deduplication with Gaoya

Rust Example

use gaoya::minhash::{MinHashIndex, MinHasher, MinHasher32};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;

fn main() {
    let corpus = [
        "This is the first document.",
        "This document is the second document.",
        "And this is the third document.",
        "Is this the first document?",
        "This not the first nor the second nor the third, but the fourth document",
    ];
    let (num_bands, band_width) = (42, 3);
    let minhasher = MinHasher32::new(num_bands * band_width);
    let mut index = MinHashIndex::new(num_bands, band_width, 0.5);

    // Index each document under its position in the corpus.
    for (i, doc) in corpus.iter().enumerate() {
        index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
    }

    // The first four documents are near-duplicates of one another;
    // the fifth is similar only to itself.
    for (i, doc) in corpus.iter().enumerate() {
        let mut expected = FxHashSet::default();
        if i < 4 {
            expected.extend(vec![0, 1, 2, 3]);
        } else {
            expected.insert(4);
        }
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    }
}

References

[1] Leskovec, Rajaraman, Ullman. Mining of Massive Datasets, Chapter 3.

[2] Charikar. Similarity Estimation Techniques from Rounding Algorithms.

[3] Manku, Jain, Das Sarma. Detecting Near-Duplicates for Web Crawling.