CSBFinder-S

Overview
Prerequisites
Running CSBFinder-S
Input files formats
Output files
Example of running CSBFinder-S
User interface features
License
Author
Credit

Overview

CSBFinder-S is a standalone desktop Java application with a graphical user interface that can also be executed via the command line.

TL;DR: Watch this video from the ISMB conference to understand what CSBFinder is all about, including some examples on real data.

CSBFinder-S implements a novel methodology for the discovery and ranking of colinear syntenic blocks (CSBs) - groups of genes that are consistently located close to each other, in the same order, across a wide range of taxa. CSBFinder-S incorporates efficient algorithms that identify CSBs in large genomic datasets. The discovered CSBs are ranked according to a probabilistic score and clustered to families according to their gene content similarity.

The overall toolkit includes two components, implementing two distinct algorithms and released in separate versions. The first, denoted CSBFinder (Svetlitsky et al., 2018, cited below), incorporated a suffix-tree-based algorithm and was optimized to seek single-operon CSBs. The second version, CSBFinder-S (Svetlitsky et al., 2020, cited below), generalizes the tool to cross-strand, multi-operon CSBs and incorporates an algorithm based on match-point arithmetic to support these generalizations efficiently.

March 27, 2019 update

CSBFinder-S for the discovery of cross-strand multi-operon CSBs is released

In this version, the user can decide whether to segment the input genomes into directons (consecutive genes on the same strand).
A novel exact algorithm that uses match-point arithmetic is proposed and implemented. The time and space complexities of the algorithm are insensitive to the number of insertions and to the maximal CSB length. The new algorithm is faster than the algorithm given in (Svetlitsky et al., 2018) when a larger number of insertions is allowed. Additional advantages of the new algorithm are its simplicity of implementation and the fact that it is easily parallelizable, yielding further scalability.

CSBFinder-S provides several novel mechanisms to help the user sort, filter, and interpret the discovered CSBs.

A ranking score that considers the genomic distances between the genomes in which the corresponding CSBs appear.
The user can constrain the structural features of the desired CSBs (length, abundance, etc.) and extract CSBs confined to specific functional semantic categories.
A taxonomic viewer of the genomes that contain instances of each CSB.
Many other improvements have been incorporated in the user interface

Workflow Description

The workflow of CSBFinder-S is given in the figure below.

(A) The input to the workflow is a dataset of input genomes, where each genome is modeled as a sequence of gene identifiers: A gene identifier indicates the corresponding gene orthology group as well as the strand (+/-) in which the gene is encoded. Additional input consists of user-specified parameters k (number of allowed insertions) and q (the quorum parameter). In our formulation, a CSB is a pattern that appears as a substring of at least one of the input genomes, and has instances in at least q of the input genomes, where each instance may vary from the CSB pattern by at most k gene insertions.

(B) The genomes are mined to identify all patterns that qualify as CSBs according to the user-specified parameters.

(C) All discovered CSBs are ranked according to a probabilistic score.

(D) The CSBs are clustered to families according to their gene content similarity, and the rank of a family is determined by the score of its highest scoring CSB.
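The roles of k and q in step (A) can be illustrated with a minimal brute-force sketch in Python. This is not the suffix-tree or match-point algorithm used by CSBFinder-S. The function names `has_instance` and `satisfies_quorum` are hypothetical, genes are plain strings, strands are ignored, and the sketch assumes that an instance starts with the first gene of the pattern. It also does not check that the pattern is an exact substring of at least one genome.

```python
def has_instance(genome, pattern, k):
    """Return True if `genome` (a list of gene IDs) contains the genes of
    `pattern` in order, with at most `k` other genes inserted between them."""
    if not pattern:
        return False
    m = len(pattern)
    for start, gene in enumerate(genome):
        if gene != pattern[0]:
            continue
        matched, insertions, pos = 1, 0, start + 1
        # Greedily match the remaining pattern genes; every skipped gene
        # counts as one insertion.
        while matched < m and pos < len(genome) and insertions <= k:
            if genome[pos] == pattern[matched]:
                matched += 1
            else:
                insertions += 1
            pos += 1
        if matched == m and insertions <= k:
            return True
    return False


def satisfies_quorum(genomes, pattern, k, q):
    """Return True if at least `q` genomes contain an instance of `pattern`."""
    count = sum(has_instance(genes, pattern, k) for genes in genomes.values())
    return count >= q
```

For example, with the pattern `["A", "B", "C"]` and k = 1, the genome `["A", "X", "B", "C"]` contains an instance (one insertion), while `["A", "X", "Y", "B", "C"]` does not.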

Citation

The following paper contains details regarding the first version of CSBFinder-S, denoted CSBFinder, that targeted the extraction of CSBs that correspond to operons. It contains details of the Suffix-Tree based algorithm for CSB extraction. The options to use the Suffix-Tree based algorithm, and the extraction of directon CSBs, are still available in the new CSBFinder-S tool.

If you used the tool as part of your research, please cite us:

When searching for cross-strand colinear syntenic blocks:

Dina Svetlitsky, Tal Dagan, Michal Ziv-Ukelson, Discovery of multi-operon colinear syntenic blocks in microbial genomes, Bioinformatics, Volume 36, Issue Supplement_1, July 2020, Pages i21–i29, https://doi.org/10.1093/bioinformatics/btaa503

When searching for colinear syntenic blocks that are conserved on the same strand:

Dina Svetlitsky, Tal Dagan, Vered Chalifa-Caspi, Michal Ziv-Ukelson, CSBFinder: discovery of colinear syntenic blocks across thousands of prokaryotic genomes, Bioinformatics, Volume 35, Issue 10, 15 May 2019, Pages 1634–1643, https://doi.org/10.1093/bioinformatics/bty861

Prerequisites

Java Runtime Environment (JRE) 8 or higher.

Running CSBFinder-S

Download

Download the latest release of CSBFinder-S installer.
The available options are Windows 64 or 32 bit, Unix and MacOS

CSBFinder-S has a user interface, but can be executed via the command line by executing the JAR file in the installation folder.

Running CSBFinder-S via User Interface

Just double click on the CSBFinder-S executable file in the installation folder, or (if you checked these options during installation) from the Start Menu / Desktop.

Note: If you are going to use a very large input dataset, you might need to increase the maximal memory that CSBFinder-S can use. Go to the installation folder and edit the file "CSBFinder-S.vmoptions" using a text editor. Change the Java option -Xmx500m to -Xmx[maximal heap size], depending on the available RAM. For example, -Xmx6g sets the maximal Java heap size to 6GB.

It is recommended to use at least 6GB for a large dataset. You can specify a higher number, depending on your RAM size.

Importing input files

Importing a file containing the input genomes:
1. Choose File->Import->Genomes File. If your dataset is large, this may take a few minutes. Sample input files are provided in the input directory in the installation folder.
2. The "Run" button should now be enabled. Click on this button to set the parameters.
3. A window appears. Hover over the question mark icon next to each parameter for an explanation of that parameter. After setting the parameters, click on "Run". This can take a few minutes, depending on the size of the dataset and on the parameters specified.
4. After the process is done, the lower panel will contain all the discovered CSBs.
Importing a saved session file:
If you have run CSBFinder-S and saved a session file, you can load it by choosing File->Import->Session File
Importing gene orthology group information (OPTIONAL):
Load it by choosing File->Import->Orthology Information file. This information will be displayed on the lower right panel.
Importing taxonomic information (OPTIONAL):
Load it by choosing File->Import->Taxonomy File. This information will be displayed in the Taxa View tab in the upper panel
Importing additional metadata (OPTIONAL):
Load it by choosing File->Import->Genome Metadata File. This information will be displayed in the Taxa View tab in the upper panel

Running CSBFinder-S via Command Line

CSBFinder-S can be executed via the command line by executing the JAR file in the installation folder.

In the terminal (Linux) or cmd (Windows) type:
```
java -jar CSBFinder-S-[version]-jar-with-dependencies.jar [options]
```
Note: If your input dataset is very large, add the argument -Xmx6g (6g might be enough, but you can specify a higher number, depending on your RAM size). For example:
```
java -Xmx6g -jar CSBFinder-S-[version]-jar-with-dependencies.jar [options]
```
Note: When running CSBFinder-S without command line arguments, the user interface will be launched.
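For scripted or repeated runs, the command can be assembled programmatically. The following is a minimal Python sketch, not part of CSBFinder-S: the helper names `build_csbfinder_command` and `run_csbfinder` are hypothetical, and the flags it emits (`-in`, `-q`, `-ins`, `-e`, `-cs`, `-Xmx`) are the ones documented in this README.

```python
import subprocess


def build_csbfinder_command(jar_path, input_file, quorum, insertions=0,
                            export_name=None, cross_strand=False,
                            max_heap=None):
    """Assemble a `java -jar` argument list from flags listed in this README."""
    cmd = ["java"]
    if max_heap:
        # e.g. max_heap="6g" becomes the JVM option -Xmx6g
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", jar_path,
            "-in", input_file,
            "-q", str(quorum),
            "-ins", str(insertions)]
    if export_name:
        cmd += ["-e", export_name]
    if cross_strand:
        cmd.append("-cs")
    return cmd


def run_csbfinder(cmd):
    """Run the assembled command; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.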

Sample input files are provided below

Options:

Mandatory:

-in INPUT_DATASET_FILE_NAME
Input file relative or absolute path. See Input files formats for more details.
-q QUORUM
The quorum parameter. Minimal number of input sequences that must contain a CSB instance.
Default: 1 Min Value: 1 Max Value: Total input sequences

Optional:

-ins INSERTIONS
Maximal number of insertions allowed in a CSB instance. Default: 0
-lmin MIN_CSB_LENGTH
Minimal length (number of genes) of a CSB
Default: 2 Min Value: 2 Max Value: Maximal sequence length
-lmax MAX_CSB_LENGTH
Maximal length (number of genes) of a CSB
Default: Maximal sequence length Min Value: 2 Max Value: Maximal sequence length
--export-file EXPORT_FILE_NAME, -e EXPORT_FILE_NAME
EXPORT_FILE_NAME will be the prefix of the names of the .xlsx/.txt output files.
Default: dataset1
--patterns PATTERNS_FILE_NAME, -p PATTERNS_FILE_NAME
A name of a file, located in a directory named 'input', in the same directory as the jar file.
If this option is used, CSBs are no longer extracted from the input sequences.
The file should contain specific CSB patterns that the user is interested to find in the input sequences.
See Input files formats for more details.
-cog-info COG_INFO_FILE_NAME
A name of a file, located in a directory named 'input', in the same directory as the jar file.
This file should contain functional description of orthology groups.
See Input files formats for more details.
--cross-strand, -cs If this option is provided, cross-strand CSBs will be extracted.
-out OUTPUT_FILE_TYPE
Output file type
Default: XLSX
Possible Values: [TXT, XLSX, SESSION]
-out-dir OUT_DIR
Path to output directory Default: output
-alg ALG_NAME
Algorithm to use for finding CSBs Default: SUFFIX_TREE Possible Values: [SUFFIX_TREE, MATCH_POINTS]
-keep-all-patterns If this option is provided, keep all patterns, without removing sub-patterns with the same number of instances
--threshold THRESHOLD, -t THRESHOLD
Threshold for family clustering
Default: 0.8
Min Value: 0
Max Value: 1
-clust-by CLUSTER_BY
Cluster CSBs to families by: 'score' or 'length'
Default: SCORE
Possible Values: [LENGTH, SCORE]
-clust-denominator CLUST_DENOMINATOR
In the greedy CSB clustering to families, a CSB is added to an existing cluster if the (intersection between the CSB and the Cluster genes/X) is above a threshold. Choose X. Default: MIN_SET
Possible Values: [MIN_SET, MAX_SET, UNION]
-skip-cluster-step
If this option is provided, skip the clustering to families step
-procs NUM_OF_PROCS
Number of processes. 0 designates the maximal number of available processes Default: 1
-h, --help
Show usage
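The effect of the -clust-denominator and --threshold options can be illustrated with a small Python sketch. It rests on several assumptions that this README does not confirm: that MIN_SET and MAX_SET denote the size of the smaller and larger gene set, that UNION denotes the size of their union, that a cluster's gene set is the union of its members' genes, and that a ratio equal to the threshold is accepted. The function names are hypothetical and this is not the CSBFinder-S implementation.

```python
def similarity(csb_genes, cluster_genes, denominator="MIN_SET"):
    """Shared gene count divided by X, where X is chosen by `denominator`."""
    a, b = set(csb_genes), set(cluster_genes)
    shared = len(a & b)
    if denominator == "MIN_SET":
        size = min(len(a), len(b))
    elif denominator == "MAX_SET":
        size = max(len(a), len(b))
    elif denominator == "UNION":
        size = len(a | b)
    else:
        raise ValueError(f"unknown denominator: {denominator}")
    return shared / size if size else 0.0


def cluster_greedy(csbs, threshold=0.8, denominator="MIN_SET"):
    """Greedy clustering of `csbs` (gene lists, assumed sorted best-first).

    Each CSB joins the first cluster it is similar enough to; otherwise it
    starts a new cluster.
    """
    clusters = []  # each cluster: {"genes": set of genes, "members": [indices]}
    for idx, genes in enumerate(csbs):
        for cluster in clusters:
            if similarity(genes, cluster["genes"], denominator) >= threshold:
                cluster["members"].append(idx)
                cluster["genes"] |= set(genes)
                break
        else:
            clusters.append({"genes": set(genes), "members": [idx]})
    return clusters
```

Under these assumptions, a three-gene CSB fully contained in a four-gene CSB scores 1.0 with MIN_SET but 0.75 with MAX_SET or UNION, so with the default threshold of 0.8 only MIN_SET places the two in the same family.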

Input files formats

Input file containing input genome sequences

A text/fasta file containing all input genomes modeled as strings, where each character is an orthology group ID (for example, COG ID) that has been assigned to a corresponding gene

This is a mandatory input file
The path to this file is provided in:
- User Interface: Load this file by choosing File->Import->Genomes File
- Command Line: "-in" option

This file should use the following format:

>[genome name] | [ replicon name (e.g. plasmid or chromosome id)]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information] 
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information] 
....

All replicons of the same genome should be consecutive, i.e.:

>genomeA|replicon1
....
>genomeA|replicon2
...
>genomeB|replicon1
...

Genes that do not belong to any gene orthology group should be marked as 'X'

Example:

>Agrobacterium_H13_3_uid63403|NC_015183
COG1806	+
COG0424	+
COG0169	+
COG0237	+
COG0847	+
COG1952	-
COG3030	-
COG4395	+
COG2821	+
....
>Agrobacterium_H13_3_uid63403|NC_015508
X	+
X	+
COG1487	-
X	-
X	-
X	-
COG1525	-
X	+
COG2253	-
COG5340	-
....
>Agrobacterium_radiobacter_K84_uid58269|NC_011983
COG1192	+
COG1475	+
X	+
X	+
COG0715	+
COG0600	+
....
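A file in this format can be read with a few lines of Python, for instance to check a dataset before importing it. The sketch below is only an illustration: the function name `parse_genomes` and the nested-dictionary result are hypothetical choices, and any columns after the strand are ignored.

```python
def parse_genomes(lines):
    """Parse genome lines into {genome: {replicon: [(group_id, strand), ...]}}."""
    genomes = {}
    current = None  # gene list of the replicon currently being read
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">"):
            # Header: >[genome name]|[replicon name]
            genome, _, replicon = line[1:].partition("|")
            current = genomes.setdefault(genome.strip(), {}).setdefault(
                replicon.strip(), [])
        elif current is not None:
            fields = line.split("\t")
            if len(fields) < 2 or fields[1].strip() not in ("+", "-"):
                raise ValueError(f"malformed gene line: {line!r}")
            current.append((fields[0].strip(), fields[1].strip()))
        else:
            raise ValueError("gene line found before the first header")
    return genomes
```

Typical usage would be `with open(path) as handle: genomes = parse_genomes(handle)`, after which `len(genomes)` gives the number of genomes, which is the upper bound for the quorum parameter.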

Assigning genes to orthology group identifiers

You can annotate genes with any orthology group identifiers. The IDs can be numbers or symbols; the only restriction is that each orthology group must have a unique ID.

Examples

The STRING database contains COG and NOG annotations of many publicly available genomes
Newly sequenced genomes can be mapped to known orthology groups such as:
- COGs using CDD
- NOGs using eggNOG mapper
A tool such as Proteinortho detects orthologous genes within different species.
The paper "New Tools in Orthology Analysis: A Brief Review of Promising Perspectives" by Bruno T. L. Nichio et al. reviews several current tools for gene orthology detection

Input file with functional information of gene orthology group IDs

This is an optional input file
The path to this file is provided in:
- User Interface: File->Import->Orthology Information file
- Command Line: "-cog-info" option

COG information input file

If you are using COGs (Clusters of Orthologous Genes) as your gene orthology group identifiers, you can use the file cog_info.txt provided in the input directory in the installation folder (it can also be downloaded from here).

The functional description of gene orthology groups will appear in the legend (User Interface) or in the output catalog file (when choosing "Export" in the User Interface, or when executing via Command Line).

You can also use a custom file of your own. See instructions below.

Custom gene orthology group information input file

This file should use the following format:

COGID;COG description;[COG functional categories separated by a comma (e.g. "E,H"); COG functional category description 1; COG functional category description 2;...;geneID]

The text inside the brackets [] is optional

Example

COG0318;Acyl-CoA synthetase (AMP-forming)/AMP-acid ligase II;I,Q;Lipid transport and metabolism;Secondary metabolites biosynthesis, transport and catabolism;CaiC;
COG0319;ssRNA-specific RNase YbeY, 16S rRNA maturation enzyme;J;Translation, ribosomal structure and biogenesis;YbeY;
COG0320;Lipoate synthase;H;Coenzyme transport and metabolism;LipA;
...
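A single line of this format can be split into its parts as sketched below in Python. The function name `parse_cog_info_line` and the returned dictionary keys are hypothetical, and the sketch assumes there is exactly one category description per category letter, in the same order, as in the example lines above.

```python
def parse_cog_info_line(line):
    """Split one ';'-separated orthology-info line into named parts."""
    fields = [field.strip() for field in line.strip().split(";")]
    while fields and fields[-1] == "":
        fields.pop()  # drop the empty field left by a trailing ';'
    if len(fields) < 2:
        raise ValueError(f"expected at least an ID and a description: {line!r}")
    record = {
        "id": fields[0],
        "description": fields[1],
        "categories": [],
        "category_descriptions": [],
        "gene": None,
    }
    if len(fields) > 2:
        categories = fields[2].split(",")
        record["categories"] = categories
        # Assumed: one description per category letter, in the same order.
        record["category_descriptions"] = fields[3:3 + len(categories)]
        rest = fields[3 + len(categories):]
        record["gene"] = rest[0] if rest else None
    return record
```

Splitting on `;` rather than `,` matters because descriptions such as "Secondary metabolites biosynthesis, transport and catabolism" contain commas themselves.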

Input file containing CSB patterns

If this file is provided, CSBs are no longer extracted from the input sequences. This file should contain specific CSB patterns that the user is interested in finding in the input sequences.

This is an optional input text file
The path to this file is provided in:
- User Interface: In the window opened after clicking on the "Run" button
- Command Line: "--patterns" or "-p" option

This file should use the following format:

>[unique pattern ID, must be an integer]
[homology group IDs separated by commas]
>[unique pattern ID, must be an integer]
[homology group IDs separated by commas]

Example

>1
COG3736,COG3504,COG2948,COG0630
>564654
COG3736,COG3504,COG2948
....

If you are running in "cross-strand" mode, you should add a strand (+/-) to each homology group ID e.g. COG3736+,COG3504+,COG2948-,COG0630+
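The patterns format, including the strand suffix used in "cross-strand" mode, can be validated with a short Python sketch such as the one below. The function name `parse_patterns` and its `cross_strand` argument are hypothetical. In cross-strand mode each gene is returned as an (ID, strand) pair.

```python
def parse_patterns(lines, cross_strand=False):
    """Parse a patterns file into {pattern_id: [gene, ...]}."""
    patterns = {}
    pattern_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            pattern_id = int(line[1:])  # IDs must be integers
            if pattern_id in patterns:
                raise ValueError(f"duplicate pattern ID: {pattern_id}")
            patterns[pattern_id] = []
        elif pattern_id is not None:
            genes = [gene.strip() for gene in line.split(",") if gene.strip()]
            if cross_strand:
                for gene in genes:
                    if gene[-1] not in "+-":
                        raise ValueError(f"missing strand on gene: {gene!r}")
                # Split e.g. "COG2948-" into ("COG2948", "-")
                genes = [(gene[:-1], gene[-1]) for gene in genes]
            patterns[pattern_id] = genes
        else:
            raise ValueError("pattern line found before the first header")
    return patterns
```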

Input file containing Taxonomy information

If this file is provided, taxonomic distribution of each CSB will be displayed in the user interface.

This is an optional input text file
User Interface: Load this file by choosing File->Import->Taxonomy File

This file should use the following format:

HEADER
genome-name(as provided in input genomes file),kingdom,phylum,class,genus,species
genome-name(as provided in input genomes file),kingdom,phylum,class,genus,species
...

Missing data should be marked by "-"

Example

genome,kingdom,phylum,class,genus,species
Acaryochloris_marina_MBIC11017_uid58167,Bacteria,Cyanobacteria,-,Acaryochloris,Acaryochloris_marina
Acetobacter_pasteurianus_IFO_3283_01_uid59279,Bacteria,Proteobacteria,Alphaproteobacteria,Acetobacter,Acetobacter_pasteurianus
....
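A taxonomy file in this format can be loaded with Python's standard `csv` module, as in the sketch below. The function name `parse_taxonomy` is hypothetical. Values marked "-" are converted to `None` so that missing ranks are easy to detect.

```python
import csv


def parse_taxonomy(lines):
    """Parse taxonomy lines into {genome: {rank: value or None}}."""
    reader = csv.reader(lines)
    header = next(reader)
    ranks = header[1:]  # e.g. kingdom, phylum, class, genus, species
    taxonomy = {}
    for row in reader:
        if not row:
            continue
        taxonomy[row[0]] = {
            rank: (None if value == "-" else value)
            for rank, value in zip(ranks, row[1:])
        }
    return taxonomy
```

Comparing the keys of the result with the genome names in the genomes file is a quick way to spot names that do not match and would therefore show no taxonomy in the user interface.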

Input file containing additional metadata

If this file is provided, the metadata of each genome that contains a CSB or a CSB family will be displayed.

This is an optional input csv file
User Interface: Load this file by choosing File->Import->Genome Metadata File

This file should use the following format:

genome,col2,col3,...
genome-name(as provided in input genomes file),col2_data,col3_data,...
genome-name(as provided in input genomes file),col2_data,col3_data,...
...

The first column must contain the name of the genome, as provided in the input genomes file. You can choose your own column names.
There is no limit on the number of columns.

Example

genome,isolation_source,gram_stain
Acaryochloris_marina_MBIC11017_uid58167,soil,negative,...
Acetobacter_pasteurianus_IFO_3283_01_uid59279,plant,negative,...
....

Output files

After clicking on the "Export" menu option in the User Interface, or after execution via Command Line, two output files will be written to the specified directory.

File 1: A catalog of CSBs: an Excel file (or txt file) containing the discovered CSBs, named "[export file name].xlsx"
This file contains three sheets:
1. Catalog
  - Each line describes a single CSB
    - ID: unique CSB ID
    - Length: number of characters in the CSB
    - Score: a probabilistic ranking score, higher score indicates higher significance
    - Instance count: number of input sequences with an instance
    - CSB: a sequence of genes
    - Main_Category: if functional category was provided in the -cog-info file, this column contains the functional category of the majority of CSB gene families
    - Family_ID: CSBs with similar gene content will belong to the same family, indicated by a positive integer
2. Filtered CSBs
  - This sheet contains only the top scoring CSB from each family
3. CSBs description
  - Information about gene family IDs of each CSB
File 2: CSB instances: A FASTA file with the same name as the catalog file, only with the suffix "_instances"
- Each entry represents a CSB and all its instances in the input genomes
- Each entry is composed of a header (CSB ID and genes), followed by lines describing the instances
- Each line describes the locations of CSB instances in a specific input genome
- There can be more than one instance in each genome
- Each instance that is present in a replicon (e.g. chromosome/plasmid), begins from a specific index and can have different lengths, depending on the number of insertions allowed in the instance
- The index of the first gene in a replicon is 0
This file has the following format:
```
>[CSB ID] TAB [CSB genes]
[genome name] TAB [replicon name]|[[instance start index (inclusive),instance end index (exclusive)]]
[genome name] TAB [replicon name]|[[instance start index (inclusive),instance end index (exclusive)]]
...
>[CSB ID] TAB [CSB genes]
[genome name] TAB [replicon name]|[[instance start index (inclusive),instance end index (exclusive)]]
[genome name] TAB [replicon name]|[[instance start index (inclusive),instance end index (exclusive)]]
...
```
Example
```
>4539	COG1012 COG0665 
Rhizobium_leguminosarum_bv__trifolii_WSM2304_uid58997	NC_011368|[829,831]	NC_011368|[832,834]
Agrobacterium_vitis_S4_uid58249	NC_011981|[171,173]
```
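The instances file can be post-processed with a short script. The Python sketch below is an illustration only: the function name `parse_instances` is hypothetical, and it assumes, based on the example above, that the genes in the header are separated by spaces and that multiple instances in one genome are separated by tabs.

```python
import re

# Matches e.g. "NC_011368|[829,831]"
INSTANCE_RE = re.compile(r"^(?P<replicon>.+)\|\[(?P<start>\d+),(?P<end>\d+)\]$")


def parse_instances(lines):
    """Parse an instances file into {csb_id: {"genes": [...], "instances": [...]}}.

    Each instance is a tuple (genome, replicon, start, end), where start is
    inclusive, end is exclusive, and the first gene of a replicon has index 0.
    """
    csbs = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            csb_id, _, genes = line[1:].partition("\t")
            current = {"genes": genes.split(), "instances": []}
            csbs[csb_id.strip()] = current
        elif current is not None:
            genome, *locations = line.split("\t")
            for location in locations:
                match = INSTANCE_RE.match(location.strip())
                if match is None:
                    raise ValueError(f"malformed instance: {location!r}")
                current["instances"].append(
                    (genome, match["replicon"],
                     int(match["start"]), int(match["end"])))
    return csbs
```

Because the end index is exclusive, `end - start` is the length of an instance; a value larger than the number of CSB genes indicates that the instance contains insertions.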

Example of running CSBFinder-S

Sample input files are located in the input directory of the installation folder. You can also download by clicking on this link:

Sample_input_files.zip

The above zip file contains the following files, located inside a folder named 'input':

plasmid_genomes.fasta
Plasmid dataset - 471 prokaryotic genomes with at least one plasmid, chromosomes were removed.
chromosomal_genomes.fasta
Chromosomal dataset - 1,485 prokaryotic genomes with at least one chromosome, plasmids were removed.

Important: this is a huge dataset. See the instructions below on how to run CSBFinder-S with a large dataset.
cog_info.txt
Functional information of gene orthology groups
taxa_csbfinder.txt
Taxonomy information for the user interface
metadata.csv
Sample genome metadata information for the user interface

Execution of CSBFinder-S on the Plasmid Dataset of 471 microbial genomes

User Interface

Execute CSBFinder-S and choose File->Import->Genomes File
You can also import cog_info.txt and taxa_csbfinder.txt for additional displayed information
Click on the "Run" button, and a window will open.
Set the parameters (e.g. Quorum 10, Insertions Allowed 1). If you are interested in cross-strand CSBs, check the corresponding check-box. The algorithm for CSB extraction can also be selected.
Clicking on the "Run" button will start the computation of CSBs. This may take a few minutes. When the process is done, the results will be shown.

Output

You can export the resulting CSBs as a TXT or XLSX file. You can also save the results in a session file, that can be opened using the user interface.

Command Line

java -jar CSBFinder-S-[version]-jar-with-dependencies.jar -in input/plasmid_genomes.fasta -q 10 -ins 1 -e plasmids -cog-info input/cog_info.txt

Input parameters

The input genomes file is plasmid_genomes.fasta, located in the input directory.
The quorum parameter is set to 10 (i.e., each CSB must have instances in at least 10 input genomes).
The number of allowed insertions in a CSB instance is one.
The export file name is "plasmids"
The gene orthology input file is cog_info.txt located in the input directory

Output

The output files will now be located in the output directory

On a laptop computer with Intel model i7 processor and 8GB RAM, this execution should take a few seconds

Execution of CSBFinder-S on the Chromosomal Dataset of 1,485 prokaryotic genomes

User Interface

The file chromosomal_genomes.fasta contains ~1,500 genomes, so CSBFinder-S needs more heap memory when loading it.

Go to the installation folder and edit the file "CSBFinder-S.vmoptions" using a text editor. Change the Java option -Xmx500m to -Xmx[maximal heap size], depending on the available RAM. Changing to at least -Xmx6g is recommended (this sets the maximal Java heap size to 6GB).

Now execute CSBFinder-S and choose File->Import->Genomes File. It may take a few minutes to load the selected file.
You can also import cog_info.txt and taxa_csbfinder.txt for additional displayed information
Click on the "Run" button, and a window will open.
Set the parameters (e.g. Quorum 50, Insertions Allowed 1). If you are interested in cross-strand CSBs, check the corresponding check-box. The algorithm for CSB extraction can also be selected.
Clicking on the "Run" button will start the computation of CSBs. This may take a few minutes. When the process is done, the results will be shown.

Output

You can export the resulting CSBs as a TXT or XLSX file. You can also save the results in a session file, that can be opened using the user interface.

Command Line

In the installation directory, you will find a *.jar file, e.g., CSBFinder-S-0.6.1-jar-with-dependencies.jar

java -Xmx6g -jar CSBFinder-S-[version]-jar-with-dependencies.jar -in input/chromosomal_genomes.fasta -q 50 -ins 1 -e Chromosomes -cog-info input/cog_info.txt

Input parameters

This line will execute the jar file with maximal heap size (memory) of 6GB.
The input genomes file is chromosomal_genomes.fasta, located in the input directory.
The quorum parameter is set to 50 (i.e., each CSB must have instances in at least 50 input genomes).
The number of allowed insertions in a CSB instance is one.
The export file name is "Chromosomes"
The gene orthology input file is cog_info.txt located in the input directory

Output

The output files will now be located in the output directory

On a laptop computer with Intel model i7 processor and 8GB RAM, this execution should take less than 5 minutes

User interface features

Save - saves a session file (*.csb extension). This saves the current session; all filtered-out CSBs will be lost.
Double-clicking on a CSB gene aligns all other CSBs/instances according to this gene
Re-clustering to families after filtration
Re-computing CSB scores with different parameters

Properties files

The following files are present in the installation directory

config.properties:
Includes paths to the Session file, Taxonomy file, and Orthology info file. These files will be loaded automatically when launching the program
CSBFinder-S.vmoptions:
Increase the memory (RAM) used by the program, by changing MEM in -Xmx[MEM] (e.g., -Xmx6g)

CSBFinder-S uses install4j - a multi-platform installer builder
