Open Omics Acceleration Framework

Intel Labs' open-source data science framework for accelerating digital biology

Introduction

We are in an era of digital biology, fueled by the convergence of three revolutions: 1) measurement of biological systems at high resolution, producing massive multi-modal, multi-scale, unstructured, distributed data; 2) novel data science (AI and data management) techniques applied to this data; and 3) widespread cloud use, enabling massive compute, public data repositories, and large collaborative projects and consortia. This transformation will require computing and data management at unprecedented scale and speed. However, performance alone will not suffice if it significantly compromises the productivity of the biologists and data scientists at the forefront of this transformation.

To provide a performant, cost-effective, and productive platform, we are building the Open Omics Acceleration Framework: a one-click, containerized, customizable, open-source framework for accelerating digital biology research. Its modular design reflects the different ways users may want to interact with it. As shown in the following block diagram, it consists of three layers:

  • Pipeline layer: for users who are looking for a one-click solution to run standard pipelines. Currently, we support the following pipelines:
  • Toolkit (applications) layer: for users who want to use individual tools or create their own custom pipelines by combining various tools.
  • Building blocks (lib) layer: for tool developers. This layer consists of key building blocks (biology-specific and generic AI algorithms and data structures) that can replace the ones used in existing tools to accelerate them, or serve as ingredients for building new, efficient tools.
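
Each layer corresponds to a top-level directory of the repository, using the directory names given in the Getting Started section. As a minimal sketch, the mapping can be written as a small shell helper. The function name `layer_dir` and the short layer keys are hypothetical and are not part of the framework.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the framework): print the directory,
# relative to the repository root, that holds a given layer.
layer_dir() {
    case "$1" in
        pipeline) echo "pipelines" ;;      # one-click standard pipelines
        toolkit)  echo "applications" ;;   # individual tools
        lib)      echo "lib" ;;            # building blocks, e.g. lib/tal
        *)        echo "unknown layer: $1" >&2; return 1 ;;
    esac
}
```

From the repository root, `cd "$(layer_dir toolkit)"` would then enter the toolkit directory.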


To provide a one-stop platform, this framework brings the following digital biology repositories of ours under one umbrella:

  • Architecture-efficient versions of several popular applications as part of our toolkit layer (under 'applications' folder)

    | Original application | Our architecture-efficient version |
    | --- | --- |
    | Short read sequence mapping tool, BWA-MEM | BWA-MEM2 |
    | Long read sequence mapping tool, minimap2 | mm2-fast |
    | Deep learning based variant calling tool, DeepVariant | Open-Omics-DeepVariant |
    | Deep learning based tool for protein structure prediction, AlphaFold2 | Open-Omics-AlphaFold |
    | Tool for biological sequence analysis using profile HMMs, HMMER | IntelLabs HMMER |
    | Tool for HMM based sensitive protein sequence searching, HH-suite | IntelLabs HH-suite |

  • Trans-Omics Acceleration Library: As part of our building blocks layer (under 'lib' folder), this is a library containing architecture-efficient versions of key algorithms and data structures used for Omics analysis.

In addition, we use several existing AI libraries: oneDNN, oneDAL, oneCCL, Katana Graph, and LIBXSMM.

Getting Started

# Option 1: download the 2.1 release (source code with submodules)
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/2.1/Source_code_with_submodules.tar.gz
tar -xzf Source_code_with_submodules.tar.gz

# Option 2: clone master together with its submodules
git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework
cd Open-Omics-Acceleration-Framework

# The cd commands below are alternatives; each path is relative to the repository root.

# Go to the pipelines directory
cd pipelines
# For running a specific pipeline, follow the instructions in the respective pipeline's README file.

# Go to the directory with toolkit
cd applications

# Go to the directory with biology building blocks to access Trans-Omics Acceleration Library
cd lib/tal
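
Since each pipeline is documented in its own README, it can help to list those files before picking one. The following is a minimal sketch, assuming each pipeline lives in its own subdirectory of `pipelines` (the exact layout is an assumption) and that `find` supports `-maxdepth` and `-iname`, as the GNU and BSD versions do. The function name `list_pipeline_readmes` is hypothetical.

```shell
#!/bin/sh
# Sketch: list README files found in pipelines/ and its immediate
# subdirectories. Assumes one subdirectory per pipeline (not confirmed).
list_pipeline_readmes() {
    root="${1:-.}"   # repository root; defaults to the current directory
    if [ ! -d "$root/pipelines" ]; then
        echo "no pipelines directory under $root" >&2
        return 1
    fi
    # -maxdepth 2 covers pipelines/README* and pipelines/<name>/README*
    find "$root/pipelines" -maxdepth 2 -type f -iname 'README*' | sort
}
```

Run from the repository root as `list_pipeline_readmes .`, or pass the path to a checkout as the first argument.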

Blogs & Related News

Publications

  • GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis. Yufeng Gu, Arun Subramaniyan, Tim Dunn, Alireza Khadem, Kuan-Yu Chen, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das. Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA); June, 2023. https://dl.acm.org/doi/abs/10.1145/3579371.3589060.
  • Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs. Narendra Chaudhary, Alexander Pivovar, Pavel Yakovlev, Andrey Gorshkov and Sanchit Misra. arXiv preprint arXiv:2212.11506; Dec, 2022; doi: https://doi.org/10.48550/arXiv.2212.11506.
  • Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data. Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman and Bharat Kaul. 21st IEEE International Workshop on High Performance Computational Biology (HiCOMB); May 30, 2022. https://ieeexplore.ieee.org/abstract/document/9835674.
  • Accelerating minimap2 for long-read sequencing applications on modern CPUs. Saurabh Kalikar, Chirag Jain, Md Vasimuddin, Sanchit Misra. Nature Computational Science 2 (2), 78-83, Feb, 2022. https://rdcu.be/cHVAK.
  • GenomicsBench: A Benchmark Suite for Genomics. Arun Subramaniyan, Yufeng Gu, Timothy Dunn, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021. https://ieeexplore.ieee.org/document/9408208.
  • LISA: Learned indexes for sequence analysis. Darryl Ho, Saurabh Kalikar, Sanchit Misra, Jialin Ding, Vasimuddin Md, Nesime Tatbul, Heng Li, Tim Kraska. bioRxiv 2020.12.22.423964; doi: https://doi.org/10.1101/2020.12.22.423964.
  • Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. https://ieeexplore.ieee.org/document/8820962.
  • Performance extraction and suitability analysis of multi- and many-core architectures for next generation sequencing secondary analysis. Sanchit Misra, Tony Pan, Kanak Mahadik, George Powley, Priya N Vaidya, Md Vasimuddin, Srinivas Aluru. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2018. https://dl.acm.org/doi/abs/10.1145/3243176.3243197.
  • Identification of Significant Computational Building Blocks through Comprehensive Deep Dive of NGS Secondary Analysis Methods. Md Vasimuddin, Sanchit Misra, Srinivas Aluru. bioRxiv 2018 301903. https://www.biorxiv.org/content/10.1101/301903v3.abstract.