bgp-d2

This repository contains our implementation of a distributed D2 distance algorithm for k-mers, built on the Apache Hadoop framework.

Introduction

Among the several alignment-free methods for computing the similarity between two strings, we chose D2, which is based on word statistics, specifically the frequency of each word (k-mer) in a sequence.

Once all k-mers occurring in the two sequences have been determined, the distance between the sequences is calculated with the D2 function.
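
As an illustration, the following is a minimal sketch of a D2 score computed from two k-mer count tables. The formula is not spelled out in this README, so the sketch assumes the common inner-product definition, where D2 is the sum over all k-mers w of countA(w) * countB(w). Some authors instead use the squared Euclidean distance between the two count vectors. The class name `D2Sketch` and the method `d2` are hypothetical and are not taken from the repository.

```java
import java.util.Map;

/** Sketch of a D2 score, ASSUMING the inner-product definition. */
final class D2Sketch {

    /**
     * Sums countA(w) * countB(w) over every k-mer w present in both maps.
     * A k-mer missing from either map contributes zero, so iterating over
     * the smaller map is sufficient.
     */
    static long d2(Map<String, Long> countsA, Map<String, Long> countsB) {
        Map<String, Long> small = countsA.size() <= countsB.size() ? countsA : countsB;
        Map<String, Long> large = (small == countsA) ? countsB : countsA;
        long score = 0L;
        for (Map.Entry<String, Long> entry : small.entrySet()) {
            Long other = large.get(entry.getKey());
            if (other != null) {
                score += entry.getValue() * other;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Long> a = Map.of("ACG", 2L, "CGT", 1L, "GTA", 3L);
        Map<String, Long> b = Map.of("ACG", 4L, "GTA", 1L, "TAC", 5L);
        System.out.println(d2(a, b)); // 2*4 + 3*1 = 11
    }
}
```

Only k-mers shared by both sequences contribute under this definition, which is why the loop can skip every key that is absent from the other table.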

Both the sequential and the distributed implementations of the D2 algorithm take the output of the KMC tool as input. KMC counts k-mers in one or more genomic sequences; in our case, it was used to count k-mers for every k from 3 to 13. For each sequence, a k-mer occurrence file was generated.
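
The sketch below shows one way such an occurrence file could be loaded into memory. It assumes a plain-text dump with one k-mer per line, followed by whitespace and its count (for example `ACG<TAB>2`). The exact file format consumed by this project is not described here, so treat the layout, the class name `KmerCountParser` and the method `parse` as assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a parser for k-mer count lines, ASSUMING "KMER<whitespace>COUNT" per line. */
final class KmerCountParser {

    /** Builds a k-mer -> count table from already-read text lines. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = line.split("\\s+");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Malformed line: " + raw);
            }
            // A k-mer that appears more than once is summed defensively.
            counts.merge(fields[0], Long.parseLong(fields[1]), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = parse(List.of("ACG\t2", "CGT\t1", "", "ACG\t3"));
        System.out.println(counts.get("ACG")); // 5
        System.out.println(counts.get("CGT")); // 1
    }
}
```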

Distributed D2

The distributed D2 implementation consists of a first MapReduce phase, which reads k-mer occurrences from the KMC output files and calculates partial D2 scores, and an optional second phase, which sums the partial scores and runs only if the first phase created more than one task.
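
The two phases can be pictured with a plain-Java simulation that uses no Hadoop classes. It is only a sketch under stated assumptions: the score is taken to be the sum of count products over shared k-mers, k-mers are assigned to tasks by hash (an illustrative choice), and the names `TwoPhaseD2Sketch`, `partialScores` and `combine` are hypothetical rather than taken from the `distributed/d2d` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of "partial scores, then sum"; not the Hadoop job itself. */
final class TwoPhaseD2Sketch {

    /** Phase 1: each simulated task scores only the k-mers assigned to it. */
    static List<Long> partialScores(Map<String, Long> a, Map<String, Long> b, int tasks) {
        if (tasks < 1) {
            throw new IllegalArgumentException("tasks must be >= 1");
        }
        long[] partial = new long[tasks];
        for (Map.Entry<String, Long> entry : a.entrySet()) {
            Long other = b.get(entry.getKey());
            if (other == null) {
                continue; // k-mer absent from the second sequence
            }
            // ASSUMPTION: hash-based assignment of k-mers to tasks.
            int task = Math.floorMod(entry.getKey().hashCode(), tasks);
            partial[task] += entry.getValue() * other;
        }
        List<Long> result = new ArrayList<>();
        for (long p : partial) {
            result.add(p);
        }
        return result;
    }

    /** Phase 2: needed only when phase 1 produced more than one partial score. */
    static long combine(List<Long> partials) {
        if (partials.size() == 1) {
            return partials.get(0); // single task: nothing to sum
        }
        long total = 0L;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> a = Map.of("ACG", 2L, "CGT", 1L, "GTA", 3L);
        Map<String, Long> b = Map.of("ACG", 4L, "GTA", 1L, "TAC", 5L);
        System.out.println(combine(partialScores(a, b, 3))); // 11, regardless of task count
    }
}
```

Because addition is associative, the final score does not depend on how many tasks the first phase is split into, which is what makes the second phase optional.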

Hadoop cluster configuration

Hardware

Each machine in the test cluster had the following configuration:

CPU: Intel Xeon E3-12xx v2 (Ivy Bridge), 8 cores
RAM: 32 GB
OS: Ubuntu 16.04.4 LTS

Hadoop configuration

Each Hadoop node had the following configuration:

yarn-site.xml
- yarn.nodemanager.resource.memory-mb: 30720
- yarn.nodemanager.resource.cpu-vcores: 8
hdfs-site.xml
- dfs.replication: 1
- dfs.blocksize: 64m
mapred-site.xml
- mapreduce.map.memory.mb: 4096
- mapreduce.reduce.memory.mb: 7168
- mapreduce.map.java.opts: -Xmx3276M
- mapreduce.reduce.java.opts: -Xmx5734M
- mapreduce.[map|reduce].cpu.vcores: 2
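
To make the figures above easier to read, the following sketch works out how many containers of each kind fit on one node and how the JVM heap sizes relate to the container sizes. It is only arithmetic on the listed values. Whether the vcore limit is actually enforced depends on the scheduler's resource calculator, which is not stated here, and the ApplicationMaster container is ignored. The class name `ContainerBudgetSketch` is hypothetical.

```java
/** Arithmetic on the configuration values listed above; no Hadoop classes involved. */
final class ContainerBudgetSketch {

    /** Containers that fit on a node when only memory is considered. */
    static int byMemory(int nodeMb, int containerMb) {
        return nodeMb / containerMb;
    }

    /** Containers that fit on a node when only vcores are considered. */
    static int byVcores(int nodeVcores, int containerVcores) {
        return nodeVcores / containerVcores;
    }

    /** JVM heap as a rounded percentage of the container size. */
    static long heapPercent(int heapMb, int containerMb) {
        return Math.round(100.0 * heapMb / containerMb);
    }

    public static void main(String[] args) {
        System.out.println(byMemory(30720, 4096));   // 7 map containers by memory
        System.out.println(byMemory(30720, 7168));   // 4 reduce containers by memory
        System.out.println(byVcores(8, 2));          // 4 containers by vcores
        System.out.println(heapPercent(3276, 4096)); // 80
        System.out.println(heapPercent(5734, 7168)); // 80
    }
}
```

Under these numbers, a scheduler that accounts for vcores would cap each node at 4 concurrent tasks, while a memory-only scheduler would allow up to 7 map tasks. In both cases the heap is about 80% of the container, leaving the rest for non-heap JVM memory.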

Repository

The bgp-d2 repository consists of three main folders:

- benchmark
- distributed/d2d
- sequential/d2

Installation

Both sequential and distributed projects can be built by running the following command:

mvn clean compile javadoc:javadoc
