NAUniSeq: A fast computational pipeline to search unique sequences for microbial diagnostics

Gulshan-gaur/NAUniSeq

NAUniSeq Manual

  1. Introduction
  2. Principle of operation
  3. Download of sequence data
  4. Collection in MongoDB
  5. K-mer strategies
  6. Generation of unique sequences

Introduction

The NAUniSeq (short for Nucleotide and Amino Acid Unique Sequences) pipeline was developed to search for unique nucleotide and amino acid sequences. The original aim was to find unique sites for designing primers, probes, and antigenic sites for the specific detection of microorganisms. The pipeline takes target sequences (the microorganism for which unique nucleotide and amino acid sequences are sought) and non-target sequences (complete genome and protein sequences from the NCBI RefSeq FTP server, minus the target sequences), generates their k-mers, and cross-references them to find sequences that are unique to the target genome(s). The main purpose of the pipeline is to find unique sequences for diagnostic use, but its use cases are not limited to this: it can also be extended to plant diseases. We have used this method to design unique nucleotide and amino acid sequences, and the pipeline is equally applicable to designing unique RNA sequences. Complete transcriptomic data can be downloaded as rna.fna.gz files from NCBI RefSeq.
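The cross-referencing idea described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual implementation: the function names `kmers` and `unique_kmers` are hypothetical, the sequences are toy strings, and the sketch assumes that a k-mer counts as unique when it occurs in every target sequence and in no non-target sequence.

```python
from typing import Iterable, Iterator, Set


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of `sequence` (sliding window, step 1)."""
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def unique_kmers(targets: Iterable[str], non_targets: Iterable[str], k: int) -> Set[str]:
    """Return k-mers found in every target and in no non-target (assumed rule)."""
    target_sets = [set(kmers(seq, k)) for seq in targets]
    if not target_sets:
        return set()
    shared = set.intersection(*target_sets)
    for seq in non_targets:
        # Drop any candidate that also occurs in a non-target sequence.
        shared.difference_update(kmers(seq, k))
    return shared


# Toy sequences for illustration only, not real genomic data.
targets = ["ACGTACGGA", "TTACGGACC"]
non_targets = ["ACGTACGTT"]
result = unique_kmers(targets, non_targets, k=4)
print(sorted(result))  # ['ACGG', 'CGGA']
```

Holding whole k-mer sets in memory works for short examples like this one; for complete RefSeq data a database-backed store such as the MongoDB collection described below is the more realistic choice.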

System requirement and dependencies

For the present study we used a desktop system with the following configuration: Intel Core i7-10700K, PCI graphics card RTX 1660, and 2 x 16 GB RAM. NAUniSeq is implemented in Python 3.7 and runs on the Linux or Unix command line. Ubuntu 20.04, MongoDB 5.0.6, and VS Code 1.67.2 were used in the development of this pipeline.

Repository Contents

Phylogeny Analysis Code Repository

This repository contains code for analyzing phylogenetic data using a combination of network analysis, seedmer creation, and unique sequence generation. The code is implemented in Python and utilizes various libraries for data processing and analysis.

Usage

Clone this repo

git clone https://github.com/Gulshan-gaur/NAUniSeq.git 

Use cd to navigate to the test_data folder inside the directory where you cloned the repo

cd $(pwd)/NAUniSeq/test_data

Test Data

The test_data folder contains:

  • taxadb.csv: CSV file containing taxonomy data.
  • refseq.csv: CSV file containing reference sequence data.
  • ng_url.txt: Text file containing FTP links for genome sequences of Neisseria gonorrhoeae. Use the parallel command below to download the files. The command must be run from the test_data folder. (This only works on Ubuntu.)
parallel -j 4 wget < ng_url.txt
cd ..
Here 4 is the number of parallel processes; choose a value that suits your system.
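On systems where GNU parallel is not available, a similar download can be sketched with the Python standard library. This is an illustrative alternative rather than part of the repository: the helper names `read_urls` and `download_all` are hypothetical, and the sketch assumes that ng_url.txt holds one URL per line.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List
from urllib.parse import urlparse


def read_urls(path: str) -> List[str]:
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(
    urls: List[str],
    dest_dir: str = ".",
    workers: int = 4,
    fetch: Callable = urllib.request.urlretrieve,
) -> List[str]:
    """Download every URL into dest_dir with `workers` threads; return saved paths."""
    os.makedirs(dest_dir, exist_ok=True)

    def download_one(url: str) -> str:
        # Name the local file after the last component of the URL path.
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest_dir, filename)
        fetch(url, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as the input URLs.
        return list(pool.map(download_one, urls))
```

Calling `download_all(read_urls("ng_url.txt"), workers=4)` from the test_data folder would mirror the `-j 4` setting above. The `fetch` parameter exists only so the download step can be swapped out, for example during testing.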

Add data to the MongoDB database

Make sure you have MongoDB installed on your system.

# Install pymongo and pandas with pip
pip install pymongo pandas

Run insert_refseq_to_mongodb.py to add the RefSeq data to your local MongoDB instance

python insert_refseq_to_mongodb.py
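For readers who want to see what such a loading step might look like, here is a rough sketch. It does not reproduce insert_refseq_to_mongodb.py: the helper names, the batch size, the CSV path, and the database and collection names (`nauniseq`, `refseq`) are all assumptions made for illustration. The standard `csv` module is used for reading so that only the final step depends on pymongo.

```python
import csv
from typing import Dict, Iterable, Iterator


def read_records(csv_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header line."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def insert_in_batches(collection, records: Iterable[dict], batch_size: int = 1000) -> int:
    """Insert records through collection.insert_many in batches; return the count."""
    batch, total = [], 0
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        collection.insert_many(batch)
        total += len(batch)
    return total


def load_refseq(csv_path: str = "test_data/refseq.csv",
                uri: str = "mongodb://localhost:27017/") -> int:
    """Load the CSV into a local MongoDB instance (names below are assumed)."""
    from pymongo import MongoClient  # imported here so the helpers work without it

    client = MongoClient(uri)
    collection = client["nauniseq"]["refseq"]  # hypothetical database/collection
    return insert_in_batches(collection, read_records(csv_path))
```

Batching keeps memory use bounded for large CSV files, since rows are streamed from disk instead of being loaded all at once.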

Docker Installation

Make sure you have Docker installed on your system. You can download and install Docker from Docker's official website.

Running the Containerized Application

1. Pull the Docker Image

docker pull gaurgulshan/nauniseq:latest

2. Tag the image with just the repository name:

docker tag gaurgulshan/nauniseq:latest nauniseq:latest

3. NoSQL Method

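To run the NoSQL method, execute the following command. Please refer to the individual script files for more detailed comments and explanations of the code. Note that inside a container, localhost refers to the container itself, so a MongoDB instance running on the host is reachable at this URI only if the container shares the host network (for example with Docker's --network host option on Linux).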

docker run -v $(pwd)/test_data:/app/test_data -it nauniseq python main.py no-sql --mongodb-uri 'mongodb://localhost:27017/' --taxid 485 --k 100

4. Phylogeny Analysis

To run the phylogeny analysis method, execute the following command:

docker run -v $(pwd)/test_data:/app/test_data -it nauniseq python main.py phylogeny --taxadb-csv 'taxa_db.csv' --refseq-csv 'refseq.csv' --taxid 485 --k 100
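The two commands above suggest a command-line interface with one subcommand per method. The sketch below shows how such an interface could be declared with argparse. It is inferred from the flags shown in this README only; the actual main.py may be organized differently, and the defaults and help texts here are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `no-sql` and `phylogeny` subcommands (assumed layout)."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="method", required=True)

    no_sql = subparsers.add_parser("no-sql", help="read data from MongoDB")
    no_sql.add_argument("--mongodb-uri", default="mongodb://localhost:27017/")

    phylogeny = subparsers.add_parser("phylogeny", help="read data from CSV files")
    phylogeny.add_argument("--taxadb-csv", required=True)
    phylogeny.add_argument("--refseq-csv", required=True)

    # Options shared by both methods.
    for sub in (no_sql, phylogeny):
        sub.add_argument("--taxid", type=int, required=True,
                         help="NCBI taxonomy ID of the target organism")
        sub.add_argument("--k", type=int, default=100, help="k-mer length")
    return parser


# Parse the same arguments as the NoSQL command shown above.
args = build_parser().parse_args(
    ["no-sql", "--mongodb-uri", "mongodb://localhost:27017/", "--taxid", "485", "--k", "100"]
)
print(args.method, args.taxid, args.k)  # no-sql 485 100
```

Here `--taxid 485` is the NCBI taxonomy ID of Neisseria gonorrhoeae, matching the test data, and `--k 100` sets the k-mer length.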

Repository Structure

  • noSql.py: This script implements the NoSQL approach for phylogeny analysis using MongoDB as the database. It connects to the MongoDB server, retrieves taxonomic and genomic data, and performs the necessary steps for seedmer creation and unique sequence generation.

  • operationKmer.py: This module provides functions related to k-mer operations, such as creating seedmers and generating unique sequences.

  • phylogenyTree.py: This module defines the PhylogenyTree class, which represents the phylogenetic tree and provides methods for accessing taxonomic and genomic data.

  • phylogenyUS.py: This script implements the phylogeny analysis using the PhylogenyTree class. It creates the phylogenetic tree, performs seedmer creation and unique sequence generation for target and non-target taxa, and displays the unique k-mers.

  • seedmer_data.py: This module defines the seedmer dictionary, which stores k-mers and their associated information.

  • seedmerCreation.py: This module contains the function for creating seedmers from genomic files using a sliding window technique.

  • countUniqueSeedmer.py: This module counts unique sequences that overlap.

  • qblast.py: This script runs blastp and blastn searches.

  • README.md: This file provides an overview of the code repository, its structure, and usage instructions.
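To make the roles of the seedmer modules more concrete, the following sketch shows one possible shape for a seedmer dictionary and for handling overlapping unique k-mers. It is a guess at the general technique, not the code in seedmerCreation.py, seedmer_data.py, or countUniqueSeedmer.py: the function names, the choice to store start positions, and the rule for merging overlaps are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_seedmers(sequence: str, k: int) -> Dict[str, List[int]]:
    """Slide a window of length k over the sequence; map each k-mer to its starts."""
    seedmers = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        seedmers[sequence[start:start + k]].append(start)
    return dict(seedmers)


def merge_overlapping(seedmers: Dict[str, List[int]], unique: Set[str],
                      k: int) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent unique k-mers into half-open (start, end) regions."""
    starts = sorted(pos for kmer in unique for pos in seedmers.get(kmer, []))
    regions: List[Tuple[int, int]] = []
    for start in starts:
        if regions and start <= regions[-1][1]:
            # This k-mer touches the previous region, so extend that region.
            regions[-1] = (regions[-1][0], max(regions[-1][1], start + k))
        else:
            regions.append((start, start + k))
    return regions


# Toy sequence and a made-up set of "unique" k-mers, for illustration only.
sequence = "ACGTACGGATTC"
seedmers = build_seedmers(sequence, k=4)
regions = merge_overlapping(seedmers, {"ACGT", "GGAT", "GATT"}, k=4)
print(regions)  # [(0, 4), (6, 11)]
```

In this example the k-mers GGAT and GATT overlap, so they are reported as the single region `sequence[6:11]`, which is `GGATT`. Counting the merged regions then gives the number of distinct unique stretches instead of the number of individual k-mers.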

Authors

Licence

MIT License

We Build The Future❤️
