
bissim/bgp-d2


bgp-d2

This is our implementation of the distributed D2 distance algorithm for k-mers, built on the Apache Hadoop framework.

Introduction

Among the several alignment-free methods for computing the similarity between two strings, we chose D2, which is based on word statistics, specifically the frequency of each word (k-mer) in a sequence.

Once all k-mers occurring in the two sequences have been determined, the D2 function is used to calculate the distance between the sequences.

Both the sequential and the distributed implementations of the D2 algorithm take the output of the KMC tool as input. KMC counts the k-mers in one or more genomic sequences; in our case, it was run for every k from 3 to 13. For each sequence, a k-mer occurrence file was generated.
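As a rough illustration of the D2 function, here is a minimal sketch that assumes the common definition of the D2 statistic: the sum, over every k-mer shared by the two sequences, of the product of its two occurrence counts. This README does not state the exact formula used in the repository, and other variants exist (for example, the squared Euclidean distance between count vectors), so treat the formula as an assumption. The class `D2Sketch` and its methods are hypothetical names, not code from the repository.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not taken from the bgp-d2 sources.
class D2Sketch {

    // Counts every overlapping k-mer in a sequence (a stand-in for KMC on tiny inputs).
    static Map<String, Long> countKmers(String sequence, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1L, Long::sum);
        }
        return counts;
    }

    // Assumed definition: D2 = sum over k-mers w of countX(w) * countY(w).
    // K-mers missing from either sequence contribute zero and are skipped.
    static long d2(Map<String, Long> x, Map<String, Long> y) {
        long score = 0L;
        for (Map.Entry<String, Long> entry : x.entrySet()) {
            Long other = y.get(entry.getKey());
            if (other != null) {
                score += entry.getValue() * other;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Long> x = countKmers("ACGTACGT", 3);
        Map<String, Long> y = countKmers("ACGTTACG", 3);
        System.out.println(d2(x, y));
    }
}
```

For the two toy sequences in `main`, the shared 3-mers are ACG, CGT and TAC, so the sketch prints 2*2 + 2*1 + 1*1 = 7.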
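To make the input format concrete, the sketch below parses a k-mer occurrence file into a map. It assumes a plain-text dump with one record per line, holding the k-mer and its count separated by whitespace; the exact layout of the files used by this project is not shown in this README, so that layout is an assumption. `KmerCountReader` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for illustration; not taken from the bgp-d2 sources.
class KmerCountReader {

    // Assumed record layout: "<k-mer><whitespace><count>", one record per line.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = line.split("\\s+");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Malformed record: " + raw);
            }
            counts.put(fields[0], Long.parseLong(fields[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("ACG\t2", "CGT\t1", "TAC\t1");
        System.out.println(parse(sample));
    }
}
```

For a real file, the lines could be loaded with `java.nio.file.Files.readAllLines` before calling `parse`, although for large k a streaming reader would be the safer choice.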

Distributed D2

The distributed D2 implementation consists of a first MapReduce phase, which reads the k-mer occurrences from the KMC output files and calculates partial D2 scores, and an optional second phase, run only if more than one task was created, which sums the partial scores.
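The two phases can be pictured with a plain-Java simulation that uses no Hadoop API. It is only a sketch under stated assumptions: the score is taken to be the sum of count products over shared k-mers, and k-mers are assigned to tasks by hash, which may differ from how the real job splits its input. `PartialD2Sketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation for illustration; the real job runs on Hadoop MapReduce.
class PartialD2Sketch {

    // Phase 1 (simulated): each task handles the k-mers that hash to its partition
    // and produces one partial score.
    static List<Long> partialScores(Map<String, Long> x, Map<String, Long> y, int tasks) {
        long[] partial = new long[tasks];
        for (Map.Entry<String, Long> entry : x.entrySet()) {
            Long other = y.get(entry.getKey());
            if (other == null) {
                continue; // k-mer absent from the second sequence contributes nothing
            }
            int task = Math.floorMod(entry.getKey().hashCode(), tasks);
            partial[task] += entry.getValue() * other;
        }
        List<Long> result = new ArrayList<>();
        for (long p : partial) {
            result.add(p);
        }
        return result;
    }

    // Phase 2 (simulated): only needed when phase 1 produced more than one partial score.
    static long sum(List<Long> partials) {
        long total = 0L;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> x = new HashMap<>();
        x.put("ACG", 2L);
        x.put("CGT", 2L);
        x.put("TAC", 1L);
        Map<String, Long> y = new HashMap<>();
        y.put("ACG", 2L);
        y.put("CGT", 1L);
        y.put("TAC", 1L);
        List<Long> partials = partialScores(x, y, 3);
        System.out.println(partials.size() > 1 ? sum(partials) : partials.get(0));
    }
}
```

Because addition is associative and commutative, the total does not depend on how many tasks the first phase used, which is why the second phase can be skipped when there is a single task.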

Hadoop cluster configuration

Hardware

Test cluster machines had the following configuration:

  • CPU: Intel Xeon E3-12xx v2 (Ivy Bridge), 8 cores
  • RAM: 32 GB
  • OS: Ubuntu 16.04.4 LTS

Hadoop configuration

Each Hadoop node had the following configuration:

  • yarn-site.xml
    • yarn.nodemanager.resource.memory-mb: 30720
    • yarn.nodemanager.resource.cpu-vcores: 8
  • hdfs-site.xml
    • dfs.replication: 1
    • dfs.blocksize: 64m
  • mapred-site.xml
    • mapreduce.map.memory.mb: 4096
    • mapreduce.reduce.memory.mb: 7168
    • mapreduce.map.java.opts: -Xmx3276M
    • mapreduce.reduce.java.opts: -Xmx5734M
    • mapreduce.[map|reduce].cpu.vcores: 2
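The values above are mutually consistent: each JVM heap (`-Xmx`) is 80% of its container size, rounded down. The sketch below reproduces that arithmetic and estimates how many containers of each kind fit on one node. The estimate is an assumption-laden upper bound: it ignores the ApplicationMaster container, and YARN only takes vcores into account when the scheduler is configured with a CPU-aware resource calculator. `ClusterMath` is a hypothetical name.

```java
// Hypothetical helper for illustration; not part of Hadoop or of the bgp-d2 sources.
class ClusterMath {

    // Heap size as an integer percentage of the container size, rounded down.
    static int heapMb(int containerMb, int percent) {
        return containerMb * percent / 100;
    }

    // Upper bound on concurrent containers per node, limited by memory and by vcores.
    static int containersPerNode(int nodeMb, int nodeVcores, int containerMb, int containerVcores) {
        return Math.min(nodeMb / containerMb, nodeVcores / containerVcores);
    }

    public static void main(String[] args) {
        System.out.println("map heap MB:    " + heapMb(4096, 80));
        System.out.println("reduce heap MB: " + heapMb(7168, 80));
        System.out.println("map containers per node:    " + containersPerNode(30720, 8, 4096, 2));
        System.out.println("reduce containers per node: " + containersPerNode(30720, 8, 7168, 2));
    }
}
```

With the listed settings, memory alone would allow seven map containers per node (30720 / 4096), but two vcores per task on eight vcores caps the count at four when vcores are enforced.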

Repository

The bgp-d2 repository consists of three main folders:

Installation

Both sequential and distributed projects can be built by running the following command:

mvn clean compile javadoc:javadoc

References

Authors