Phylogeny Analysis Code Repository

NAUniSeq: A fast computational pipeline to search unique sequences for microbial diagnostics

NAUniSeq Manual

Introduction
Principle of operation
Download of sequence data
Collection in MongoDB
K-mer strategies
Generation of unique sequences

Introduction

NAUniSeq (short for Nucleotide and Amino Acid Unique Sequences) is a pipeline for finding unique nucleotide and amino acid sequences. The original aim was to find unique sites for designing primers, probes, and antigenic sites for the specific detection of microorganisms. The pipeline takes target sequences (the microorganism for which unique nucleotide and amino acid sequences are wanted) and non-target sequences (the complete genome and protein sequences from the NCBI RefSeq FTP server, minus the target sequences), generates k-mers for both, and cross-references them to find sequences that occur only in the target genome(s). The primary purpose is diagnostic use, but the use cases are not limited to this: the pipeline can also be extended to plant diseases. We have used this method to find unique nucleotide and amino acid sequences, and it is equally applicable to unique RNA sequences; complete transcriptomic data can be downloaded as rna.fna.gz files from NCBI RefSeq.
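The cross-referencing idea can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names `kmers` and `unique_kmers` are hypothetical, sequences are held in memory as plain strings, and requiring a k-mer to be present in every target sequence is an assumption made here for illustration.

```python
from typing import Iterable, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    sequence = sequence.upper()
    # range() is empty when the sequence is shorter than k, giving an empty set.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmers(targets: Iterable[str], non_targets: Iterable[str], k: int) -> Set[str]:
    """K-mers found in every target sequence and in no non-target sequence.

    Assumption: a useful diagnostic k-mer should be conserved across all targets.
    """
    target_sets = [kmers(seq, k) for seq in targets]
    if not target_sets:
        return set()
    shared = set.intersection(*target_sets)
    # Cross-reference: drop every k-mer that also occurs in a non-target.
    for seq in non_targets:
        shared -= kmers(seq, k)
    return shared
```

For real genomes the non-target set is far too large to hold in memory this way, which is presumably why the pipeline stores its data in MongoDB; the sketch only shows the set logic.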

System requirement and dependencies

For the present study we used a desktop system with the following configuration: Intel Core i7-10700K, PCI graphics card RTX 1660, and 32 GB of RAM (2 x 16 GB). NAUniSeq is implemented in Python 3.7 and runs on the Linux or Unix command line. Ubuntu 20.04, MongoDB 5.0.6, and VS Code 1.67.2 were used during development.

Repository Contents

Phylogeny Analysis Code Repository

This repository contains code for analyzing phylogenetic data using a combination of network analysis, seedmer creation, and unique sequence generation. The code is implemented in Python and utilizes various libraries for data processing and analysis.

Usage

Clone this repo

git clone https://github.com/Gulshan-gaur/NAUniSeq.git

Use cd to navigate to the test_data folder inside the cloned repo

cd $(pwd)/NAUniSeq/test_data

Test Data

The test_data folder contains:

taxadb.csv: CSV file containing taxonomy data.
refseq.csv: CSV file containing reference sequence data.
ng_url.txt: Text file containing FTP links for genome sequences of Neisseria gonorrhoeae. Use the parallel command below to download the files; it must be run from inside the test_data folder. (This only works on Ubuntu.)

parallel -j 4 wget < ng_url.txt
cd ..

Here 4 is the number of parallel download processes; choose a value to suit your system.
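On systems where GNU parallel is not available, a similar download could be done from Python. The following is a rough sketch that uses only the standard library, not part of the repository; the helper names `read_urls`, `local_name`, `download_one`, and `download_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_urls(list_file: str) -> List[str]:
    """Read one URL per line, skipping blank lines."""
    with open(list_file) as handle:
        return [line.strip() for line in handle if line.strip()]


def local_name(url: str) -> str:
    """File name to save a URL under (the last component of its path)."""
    return Path(urlparse(url).path).name


def download_one(url: str) -> str:
    """Download a single URL into the current directory and return the file name."""
    destination = local_name(url)
    urlretrieve(url, destination)
    return destination


def download_all(urls: List[str], jobs: int = 4,
                 fetch: Callable[[str], str] = download_one) -> List[str]:
    """Run up to `jobs` downloads at a time, similar to `parallel -j 4 wget`."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


# Example (run from inside test_data):
#     download_all(read_urls("ng_url.txt"), jobs=4)
```

Threads are used here rather than processes because the work is network-bound; the `jobs` argument plays the same role as the `-j` value above.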

Add data to the MongoDB database

Make sure you have MongoDB installed on your system.

# Install pymongo and pandas with pip
pip install pymongo pandas

Run insert_refseq_to_mongodb.py to add the RefSeq data to your local MongoDB instance

python insert_refseq_to_mongodb.py
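For orientation, loading a CSV file into MongoDB can look roughly like the sketch below. It is not the content of insert_refseq_to_mongodb.py: the database name `nauniseq`, the collection name `refseq`, and the function names are assumptions for illustration, and the standard csv module is used in place of pandas to keep the example small.

```python
import csv
from typing import Dict, Iterable, List


def rows_to_documents(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Turn CSV lines (header first) into one dict per row."""
    return [dict(row) for row in csv.DictReader(lines)]


def csv_to_documents(csv_path: str) -> List[Dict[str, str]]:
    """Read a CSV file from disk into a list of documents."""
    with open(csv_path, newline="") as handle:
        return rows_to_documents(handle)


def insert_documents(documents: List[Dict[str, str]],
                     uri: str = "mongodb://localhost:27017/",
                     db_name: str = "nauniseq",          # assumed name
                     collection_name: str = "refseq"     # assumed name
                     ) -> int:
    """Insert documents into a MongoDB collection and return how many were written."""
    # Imported here so the CSV helpers can be used without pymongo installed.
    from pymongo import MongoClient

    if not documents:
        return 0  # insert_many rejects an empty list
    client = MongoClient(uri)
    try:
        result = client[db_name][collection_name].insert_many(documents)
        return len(result.inserted_ids)
    finally:
        client.close()


# Example: insert_documents(csv_to_documents("test_data/refseq.csv"))
```

All values are inserted as strings in this sketch; numeric fields such as taxonomy IDs would need converting before they can be queried as integers.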

Docker Installation

Make sure you have Docker installed on your system. You can download and install Docker from Docker's official website.

Running the Containerized Application

1. Pull the Docker Image

docker pull gaurgulshan/nauniseq:latest

2. Tag the image with just the repository name:

docker tag gaurgulshan/nauniseq:latest nauniseq:latest

3. NoSQL Method

To run the NoSQL method, execute the following command. Refer to the individual script files for more detailed comments and explanations of the code.

docker run -v $(pwd)/test_data:/app/test_data -it nauniseq python main.py no-sql --mongodb-uri 'mongodb://localhost:27017/' --taxid 485 --k 100

4. Phylogeny Analysis

To run the phylogeny analysis method, execute the following command:

docker run -v $(pwd)/test_data:/app/test_data -it nauniseq python main.py phylogeny --taxadb-csv 'taxa_db.csv' --refseq-csv 'refseq.csv' --taxid 485 --k 100
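The two commands above follow a subcommand pattern. As a hedged illustration of how such a command line could be parsed with argparse, consider the sketch below. It is not the actual main.py: only the subcommand and flag names are taken from the commands above, while the defaults, required settings, and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `no-sql` and `phylogeny` subcommands."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="method", required=True)

    no_sql = subparsers.add_parser("no-sql", help="MongoDB-backed method")
    no_sql.add_argument("--mongodb-uri", default="mongodb://localhost:27017/")

    phylogeny = subparsers.add_parser("phylogeny", help="CSV-backed method")
    phylogeny.add_argument("--taxadb-csv", required=True)
    phylogeny.add_argument("--refseq-csv", required=True)

    # Options shared by both methods.
    for sub in (no_sql, phylogeny):
        sub.add_argument("--taxid", type=int, required=True,
                         help="NCBI taxonomy ID of the target")
        sub.add_argument("--k", type=int, default=100,
                         help="k-mer length")
    return parser


# Example:
#     args = build_parser().parse_args(["no-sql", "--taxid", "485", "--k", "100"])
#     args.method, args.taxid, args.k
```

argparse converts dashes in flag names to underscores, so `--mongodb-uri` is read back as `args.mongodb_uri`.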

Repository Structure

noSql.py: This script implements the NoSQL approach for phylogeny analysis using MongoDB as the database. It connects to the MongoDB server, retrieves taxonomic and genomic data, and performs the necessary steps for seedmer creation and unique sequence generation.
operationKmer.py: This module provides functions related to k-mer operations, such as creating seedmers and generating unique sequences.
phylogenyTree.py: This module defines the PhylogenyTree class, which represents the phylogenetic tree and provides methods for accessing taxonomic and genomic data.
phylogenyUS.py: This script implements the phylogeny analysis using the PhylogenyTree class. It creates the phylogenetic tree, performs seedmer creation and unique sequence generation for target and non-target taxa, and displays the unique k-mers.
seedmer_data.py: This module defines the seedmer dictionary, which stores k-mers and their associated information.
seedmerCreation.py: This module contains the function for creating seedmers from genomic files using a sliding window technique.
countUniqueSeedmer.py: This module counts the unique sequences that overlap.
qblast.py: This script performs blastp and blastn searches.
README.md: This file provides an overview of the code repository, its structure, and usage instructions.
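To make the sliding-window and seedmer-dictionary descriptions above more concrete, here is a small standalone sketch. It does not reproduce seedmerCreation.py or seedmer_data.py: the function names, the `step` parameter, and the choice to store (record ID, start position) pairs for each k-mer are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


def build_seedmers(records: Iterable[Tuple[str, str]], k: int,
                   step: int = 1) -> Dict[str, List[Tuple[str, int]]]:
    """Slide a window of length k over each sequence.

    Returns a dictionary mapping each k-mer to the list of
    (record_id, start_position) pairs where it occurs.
    """
    seedmers: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for record_id, sequence in records:
        for start in range(0, len(sequence) - k + 1, step):
            seedmers[sequence[start:start + k]].append((record_id, start))
    return dict(seedmers)
```

For a compressed genome file, the lines can be supplied with `gzip.open(path, "rt")` from the standard library, since `read_fasta` accepts any iterable of text lines.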
