Table of Contents

I. Introduction
II. Prerequisites
III. Download and Install
IV. Tutorial
  4.1 Common Rules
  4.2 Example
V. Get Help

I. Introduction

C & S's Toolkit (CSTK) is a package of Perl, R, Python and shell scripts. A single script, or several scripts combined into a pipeline, can meet general bioinformatics analysis requirements and handle miscellaneous tasks. The function modules of CSTK include, but are not limited to:

A. File format converting

B. Processing of standard format file

C. Table manipulation

D. Statistical testing

E. Survival analysis

F. Data visualization

G. Parsing of third-party tool result

H. Gene/region expression quantification

I. Alternative splicing identification and quantification

II. Prerequisites

CSTK mainly requires the following software/packages:

Software/Package Version Description
Perl >=5.010 The Perl language program.
BioPerl Not limited Needed only by fqConverter.pl and mafLiftOver.pl in current version.
R >=3.3.2 The R language program, mainly used for statistical analysis and data visualization.
ggplot2 2.x An R package for data visualization.
Python =2.7 The Python language program.
SAMTools >=1.5 The toolkit to manipulate BAM files.

The listed versions are suggestions rather than strict requirements; older versions may work, although we haven't tested them. Additional software/packages are script-specific: if a particular CSTK script requires extra software/packages, please install them according to the error message.

III. Download and Install

Please download the CSTK from the release page or clone it with git.

Installing CSTK is extremely easy, because all the source code is written in scripting languages and no compilation is needed. After installing the software/packages required in the "Prerequisites" section, CSTK is ready to use.

Before using CSTK, first add the path of CSTK to the environment variable $PATH, so that the scripts in CSTK can be used directly at the command line. Assuming CSTK is decompressed into /home/<yourUserName>/bin/CSTK, run the following commands:

PATH=/home/<yourUserName>/bin/CSTK:$PATH
chmod u+x /home/<yourUserName>/bin/CSTK/*.{pl,R,py,sh}

You can add the first command to your shell configuration file (typically /home/<yourUserName>/.bashrc), so that CSTK is available every time you log in:

cat <<'EOF' >>/home/<yourUserName>/.bashrc
PATH=/home/<yourUserName>/bin/CSTK:$PATH
EOF

Rather than downloading and installing manually, it's recommended to set up CSTK with Docker (https://hub.docker.com/r/zhangsjsky/cstk) or Singularity (https://cloud.sylabs.io/library/_container/5ed705dfd0ff9c878fea7e50). This way you don't need to install the dependencies yourself, as they are already installed in the container.

# For Docker
docker pull zhangsjsky/cstk:latest
# For Singularity
singularity pull library://zhangsjsky/default/cstk

IV. Tutorial

4.1 Common Rules

The following rules are commonly applied in CSTK:

A) For a script with only one input file, the input is read from STDIN (standard input) unless an input file is specified as an argument. For a script with only one output file, the output is printed to STDOUT (standard output), while STDERR (standard error) may be used for logging. For a script with two output files, the outputs may be printed to STDOUT and STDERR respectively, or to STDOUT and a file specified by an option. Preferring STDOUT/STDERR over option-specified files makes it convenient to chain analysis steps into a pipeline, which also speeds up the analysis and avoids writing intermediate files to disk (see the sketch after this list).

B) Help information for almost all scripts can be viewed with the -h, -help or --help option.

C) For options that take a value, Perl and shell scripts separate the option and its value with a space (e.g. -option value), while R scripts separate them with an equals sign (e.g. -option=value).
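
A minimal sketch of how these rules combine in practice (file names are placeholders, and whether a given script writes a log to STDERR varies by script):

# Rule A: read from STDIN, keep any log from STDERR, write to STDOUT, and chain steps with pipes
zcat input.fq.gz | fqConverter.pl 2>convert.log | gzip -c >output.fq.gz
# Rule C: Perl/shell scripts use '-option value', R scripts use '-option=value'
gpe2bed.pl -t 6 input.gpe >output.bed6
hist.R -p=hist.pdf <values.tsv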

4.2 Example

Only one or two scripts from each function module of CSTK are illustrated in the following examples; illustrations for more scripts will be added according to user feedback.

4.2.1 File Format Converting

  • fqConverter.pl
fqConverter.pl -h

Usage: perl fqConverter.pl input.fq >output.fq

    If input.fq isn't specified, input from STDIN

    If output.fq isn't specified, output to STDOUT

Option:

    -s --source  STR  The quality format of your source fastq file ([fastq-illumina], fastq-solexa, fastq)

    -t --target   STR  The quality format of your target fastq file (fastq-illumina, fastq-solexa, [fastq])

    -h --help          This help information screen

Since FASTQ is a common format in bioinformatics, the content of the input file is not shown here. For details of the FASTQ format, please visit the wiki.

Running example:

zcat myReads.illumina.fq.gz | fqConverter.pl >myReads.sanger.fq

The command converts a FASTQ file with Illumina quality encoding (the default of the -s option) to the widely used Sanger encoding (the default of the -t option).

A pipe is used in the command to illustrate the advantage of piping CSTK scripts. You can improve it further as:

zcat myReads.illumina.fq.gz | fqConverter.pl | gzip -c >myReads.sanger.fq.gz

In this way, the output FASTQ is stored as a compressed gz file. In this procedure, the CSTK script acts only as an adapter in the pipeline: it reads its input from the output of the previous step and prints its output to the next step as that step's input.

  • gpe2bed.pl
gpe2bed.pl -h

Usage: perl gpe2bed.pl INPUT.gpe >OUTPUT.bed

    If INPUT.gpe isn't specified, input from STDIN

    Output to STDOUT

Option:

    -b --bin            With bin column

    -t --bedType  INT  Bed type. It can be 3, 6, 9 or 12[12]

    -i --itemRgb  STR  RGB color[0,0,0]

    -g --gene          Output 'gene name' in INPUT.gpe as bed plus column

    -p --plus           Output bed plus when there are additional columns in gpe

    -h --help           Print this help information

Assuming there is an input file (example.gpe) with the following contents:

76 NM_015113 chr17 - 3907738 4046253 3910183 4046189 55 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, 0 ZZEF1 cmpl cmpl 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0,
147 NM_001308237 chr1 - 78028100 78149112 78031324 78105156 14 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78105133,78107068,78107206,78148946, 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78105287,78107131,78107340,78149112, 0 ZZZ3 cmpl cmpl 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1,
147 NM_015534 chr1 - 78028100 78148343 78031324 78099039 15 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269, 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343, 0 ZZZ3 cmpl cmpl 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1,-1,

As shown, there is a bin column (the first column) in the gpe file, so the -b option should be specified. For a description of the gpe format, see UCSC: http://genome.ucsc.edu/FAQ/FAQformat.html#format9.

Run the script:

gpe2bed.pl -b example.gpe >example.bed

The content of the output bed file:

chr17 3907738 4046253 NM_015113 0 - 3910183 4046189 0,0,0 55 2526,72,154,166,99,169,144,360,136,101,192,120,200,133,175,278,140,151,152,263,166,139,130,162,120,165,469,94,80,241,149,202,82,167,123,122,160,115,68,134,170,101,216,112,149,92,99,179,117,211,200,172,195,145,418 0,4438,5159,9004,9645,9902,11878,12926,13391,15224,16684,18264,20474,27681,28383,29570,37984,39779,45263,46336,47526,49612,51771,53549,54726,58308,59916,62002,62718,66239,68163,69705,70652,70818,72192,72423,73438,76931,77992,81225,82041,82989,84233,86274,91386,92164,97872,100188,101248,105208,108164,109854,112527,119462,138097
chr1 78028100 78149112 NM_001308237 0 - 78031324 78105156 0,0,0 14 3369,101,135,153,96,102,72,116,148,139,154,63,134,166 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,77033,78968,79106,120846
chr1 78028100 78148343 NM_015534 0 - 78031324 78099039 0,0,0 15 3369,101,135,153,96,102,72,116,148,139,1556,154,63,134,74 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,69434,77033,78968,79106,120169

The -t option defaults to 12, so the output bed file is in bed12 format by default. You can also try the -g or -p option to modify or add columns in the output.
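
For instance, to produce plain bed6 output or to append the gene name as a bed plus column, the options above can be combined as follows (a usage sketch based on the help text; output not shown):

gpe2bed.pl -b -t 6 example.gpe >example.bed6
gpe2bed.pl -b -g example.gpe >example.bed12+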

4.2.2 Processing of Standard Format File

  • gpeMerge.pl
gpeMerge.pl -h

Usage: perl gpeMerge.pl input.gpe >output.gpe

    If input.gpe not specified, input from STDIN

    Output to STDOUT

    -b --bin                  With bin column

    -l --locus                  Merge with locus (default: merge gene)

    -t --longTranscript         Overlap is against long transcript (default against short transcript)

    -n --name                Set the "gene name" column as "transcript name(s)" when the corresponding gene name unavailable

    -p --percent        DOU  Minimal overlap percent to merge transcript (default: 0)

    -h --help                  Print this help information

This script merges the different transcripts of the same gene (or of the same locus if the -l option is specified). The merging criterion is: for each site of a gene, if the site falls within an exon of any transcript, it is treated as an exonic site in the merged result; otherwise it is treated as an intronic site.

A diagram to intuitively illustrate the merging:

[Figure: Merging_Illustration, from C.]

In the figure, the top and middle lines are two transcripts of the same gene, and the bottom line is the result after merging.

Use the example.gpe file in the previous section as input to run this script:

gpeMerge.pl -b example.gpe >merged.gpe

The content of the output:

6 NM_015113 chr17 - 3907738 4046253 3910183 4046189 55 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, 0 ZZEF1 cmpl cmpl 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0,
147 NM_001308237,NM_015534 chr1 - 78028100 78149112 78028100 78149112 16 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269,78148946 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343,78149112 0 ZZZ3 unk unk -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,

As you can see, the two transcript records of the ZZZ3 gene have been merged into one record. The exon starts and ends are updated to the merged coordinates, and the other coordinate-related information is updated accordingly. The transcript name column is also updated to a comma-separated transcript list.
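
If you prefer to merge overlapping transcripts by locus rather than by gene, the -l flag from the help above can be added in the same way (a usage sketch; output not shown):

gpeMerge.pl -b -l example.gpe >mergedByLocus.gpe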

  • gpeFeature.pl
gpeFeature.pl -h

Usage: perl gpeFeature.pl OPTION INPUT.gpe >OUTPUT.bed

    If INPUT.gpe isn't specified, input from STDIN

    Example: perl gpeFeature.pl -b -g hg19.size --upstream 1000 hg19.refGene.gpe >hg19.refGene.bed

Option:

    -b --bin                With bin column

    -i --intron               Fetch introns in each transcript

    -e --exon               Fetch exons in each transcript

    -c --cds                 Fetch CDS in each transcript

    -u --utr                 Fetch UTRs in each transcript, 5'UTR then 3'UTR (or 3' first)

    -p --prime        INT   5 for 5'UTR, 3 for 3'UTR(force -u)

       --complete          Only fetch UTR for completed transcripts

       --upstream     INT  Fetch upstream INT intergenic regions(force -g)

       --downstream  INT  Fetch downstream INT intergenic regions(force -g)

    -g --chrSize       FILE  Tab-separated file with two columns: chr name and its length

    -s --single              Bundle all features into single line for each transcript

       --addIndex          Add exon/intron/CDS/UTR index as suffix of name in the 4th column

    -h --help               Print this help information

This script extracts specific features from a gpe file and outputs them in bed format.

For example, to extract exons:

gpeFeature.pl -b example.gpe -e >exon.bed6+
chr17 3907738 3910264 NM_015113 0 - NM_015113 2526 ZZEF1
chr17 3912176 3912248 NM_015113 0 - NM_015113 72 ZZEF1
chr17 3912897 3913051 NM_015113 0 - NM_015113 154 ZZEF1

Additional records of the output are omitted. The result is in bed6+ format, with each line representing an exon; the last two columns give the exon length and the gene name.
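
Other features can be fetched in the same way, for example (a sketch based on the options above; output omitted):

# Introns, with the intron index appended to the name in the 4th column
gpeFeature.pl -b -i --addIndex example.gpe >intron.bed6+
# CDS regions
gpeFeature.pl -b -c example.gpe >cds.bed6+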

4.2.3 Table Manipulation

  • tsvFilter.pl
tsvFilter.pl -h

Usage: perl tsvFilter.pl -o originFile.tsv -1 1,4 -m i|include targetFile.tsv >filtered.tsv

    If targetFile.tsv isn't specified, input is from STDIN

    Output to STDOUT

Option:

    -o --originFile    TSV  The original file containing fields (specified by --originFields) used to include or exclude lines in targetFile.tab

    -1 --originFields  STR  Comma-separated field list specifying which fields in the originFile.tab to be used to include or exclude, 1-based start [1]

                          The element of the list can be a single column number or a range with nonnumeric char as separator

                          To specify the last column left the range right margin blank

                          If continuous range specified like '1-3-6', the first range '1-3' will be output

                           e.g.:

                            -1 1,4     output columns 1,4

                            -1 1-4,6..8 output columns 1,2,3,4,6,7,8

                            -1 1,4,6-   output columns 1,4,6,7,... last column

                            -1 1-3-6   output columns 1,2,3

    -2 --targetFields  STR  Comma-separated field list specifying which fields in the targetFile.tab are used to include or exclude lines, 1-based start [1]

                          More description about --targetFields, see --originFields

    -m --mode       STR  To include or exclude lines in targetFile.tab, it can be i|include or e|exclude[e]

    -s --separator    STR  (Optional)A separator to join the fields specified, if necessary[Empty string]

    -h --help              Print this help information

This script is often used in table manipulation. Its function is to filter a table (the target table) according to one or more of its columns (the target columns), using one or more columns (the source columns) of another table (the source table). The filtering mode is whether the target columns are included in (or excluded from) the source columns.

Assume the first column of the source table (source.tsv) is the gene name and the second column is an up-/down-regulation mark:

Gene1 Up
Gene2 Down
Gene3 Up

The first column of the target table (target.tsv) is the gene name and the second column is the regulation fold-change:

Gene1 5
Gene2 10
Gene3 2

If you want to know the fold-change of the up-regulated genes, run the following command:

tsvFilter.pl -o <(awk '$2=="Up"' source.tsv) -m i target.tsv

The output is:

Gene1 5
Gene3 2

The command first fetches the up-regulated genes with awk and feeds them to the -o option to filter the target.tsv file. The -m option sets the filtering mode to "include". Because both the source column and the target column are the first column, the -1 and -2 options keep their default value of 1.
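
When the match involves more than one column, the column lists and a join separator can be given explicitly, for example (a sketch; keys.tsv and data.tsv are hypothetical files keyed by their first two columns):

tsvFilter.pl -o keys.tsv -1 1,2 -2 1,2 -s ':' -m i data.tsv >kept.tsv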

  • tsvJoin.sh
tsvJoin.sh

Usage: tsvJoin.sh OPTIONS [-1 field1][-2 fields2] input1.tsv [input2.tsv]

Note: input2.tsv can be omitted or be a "-" to input the data from STDIN.

      when input1.tsv is a "-" (i.e. from STDIN), input2.tsv must be specified.

Options:

-1|field1           STR  The field in input1.tsv used when joining[1]

                         Refer to the -1 option of linux join command

-2 field2           STR  The field in input2.tsv used when joining[1]

-i|inputDelimiter    STR  The delimiter of your input file[\t]

-j|joinDelimiter     STR  The delimiter used by the linux join command (i.e. the -t option of join command)[|]

-o|outputDelimiter  STR  The delimiter of the output[\t]

-a|unpairedFile      INT  Also print unpairable lines from file INT (1, 2)

Please specify at least one file

This script is also often used in table manipulation. Its function is to join two tables according to whether the values of one or more columns are equal. Unlike the join command built into Linux, this script can use the tab character as the field separator.

Take the previous source.tsv and target.tsv as examples, join them according to gene name:

tsvJoin.sh source.tsv target.tsv >join.tsv 2>/dev/null

The output join.tsv is:

Gene1 Up 5
Gene2 Down 10
Gene3 Up 2

In the command, '2>/dev/null' discards STDERR (which carries running information). Because the columns used for joining are the first column in both input tables, the -1 and -2 options keep their default value of 1.

In the output, the join column (i.e. the gene name column) is placed first, followed by the remaining columns of the first and second files (excluding the join column).
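
The delimiter and unpaired-line options can be combined in the same way, for example (a sketch; output not shown):

# Also print lines of source.tsv that have no match in target.tsv
tsvJoin.sh -a 1 source.tsv target.tsv >joinAll.tsv 2>/dev/null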

4.2.4 Statistical Testing

  • testT.R
testT.R -h

Usage: testT.R -option=value <input.lst|<input1.lst input2.lst|input1.lst input2.lst >pValue

Option:

    -p|pair      Pair

    -a|alt  STR  The alternative hypothesis: [two.sided], greater or less

    -h          Show help

This script is used to perform a t-test. '<input.lst|<input1.lst input2.lst|input1.lst input2.lst' means there are several input manners:

A) Input one file from STDIN;

B) Input from STDIN and file (specified by an argument), respectively;

C) Input from two files (specified by two arguments), respectively.

The commonly used manner is the last one. Assuming the content of the first file (value1.lst) is:

1
2
3
4
5

And that of the second file (value2.lst) is:

3
4
5
6
7
8

testT.R value1.lst value2.lst

0.0398030820241363

The output p-value indicates that there is a significant difference between the two lists of values.
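
Other input manners and alternative hypotheses follow the same pattern, for example (a sketch; p-values not shown):

# Input manner B: one list from STDIN, the other as an argument
testT.R value2.lst <value1.lst
# One-sided test (R-style option syntax, see rule C in "Common Rules")
testT.R -a=less value1.lst value2.lst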

  • testWilcoxon.R
testWilcoxon.R -h

Usage: testWilcoxon.R -option=value <input.lst|input.lst|input1.lst input2.lst >pValue

Options

    -a|alt    STR  The alternative hypothesis ([two.sided], greater, less)

                   You can specify just the initial letter

    -m|mu  DOU  A parameter used to form the null hypothesis for one-sample test[0]

    -h|help        Show help

testWilcoxon.R value1.lst value2.lst 2>/dev/null

0.06601543

The output p-value indicates that there isn't a significant difference between the two lists of values, a conclusion different from that of the t-test.
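
According to the help, -m sets the null-hypothesis value for a one-sample test; a sketch, assuming a single list from STDIN is treated as the one-sample case:

testWilcoxon.R -m=3 <value1.lst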

4.2.5 Survival Analysis

  • survival.R
survival.R -h

Usage: survival.R -option=value <input.tsv

Option:

    -p|pdf     PDF  The KM figure[KM.pdf]

    -w|width   INT  The figure width

       height  INT  The figure height

       header       With header

    -m|main   STR  The main title

    -x|xlab     STR  The xlab[Time]

    -y|ylab     STR  The ylab[Survival Probability]

    -h              Show help

Input (header isn't necessary):

  Example1:

    #time event
    1     TRUE
    2     TRUE
    2     TRUE
    8     FALSE
    5     TRUE
    10    FALSE

  Example2:

    #time event  group
    1     TRUE  male
    2     TRUE  male
    2     TRUE  male
    8     FALSE male
    5     TRUE  female
    10    FALSE female

As the help information shows, create an input file like Example2 (survival.tsv). Pass it to the script as input and specify the output PDF file:

survival.R -p=survival.pdf <survival.tsv >survival.log 2>survival.err

The survival curve (i.e. the Kaplan-Meier curve) and the log-rank p-value are drawn in the PDF; statistical measurements are stored in survival.log.
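
The figure options from the help can be added with the same R-style syntax, for example (a sketch):

survival.R -p=survival.pdf -m='Overall Survival' -x=Days <survival.tsv >survival.log 2>survival.err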

4.2.6 Data Visualization

  • bar.R
bar.R -h

Usage: bar.R -p=outputName.pdf <input.tsv

Option:

Common:

    -p|pdf    FILE  The output figure in pdf[figure.pdf]

    -w|width  INT  The figure width

    -height   INT  The figure height

    -m|main  STR  The main title

Contents omitted…

    -annoTxt   STRs  The comma-separated texts to be annotated

    -annoTxtX  INTs  The comma-separated X positions of text

    -annoTxtY  INTs  The comma-separated Y positions of text

    -annoTxtS  DOU  The annotated text size[5]

Skill:

    Legend title of alpha, color, etc can be set as the same to merge their guides

This script is used to draw a bar chart. Take the survival.tsv from the previous section as input:

awk 'BEGIN{OFS="\t"}{print NR,$1,$3}' survival.tsv | bar.R -p=bar.pdf -fillV=V3 -x='Patient ID' -y=Time -fillT=Gender

In the command, awk is used to process the input before passing the data to bar.R. The processed results are:

1 1 male
2 2 male
3 4 male
4 8 male
5 5 female
6 10 female

The patient ID is added as an extra first column and will be shown on the X axis. The second column is the survival time and will be shown on the Y axis. The third column is the gender and will be used to fill the bars with different colors. If the third column isn't present, all bars are filled in black by default.

Among the parameters of bar.R, -fillV=V3 tells it to use the third column (column name: V3) to color the bars; don't specify this option if there is no third column. -x and -y specify the labels of the X and Y axes, respectively. -fillT specifies the title of the fill legend.
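
If the input has no grouping column, the fill options are simply dropped, for example (a sketch; all bars then default to black as noted above):

awk 'BEGIN{OFS="\t"}{print NR,$1}' survival.tsv | bar.R -p=barPlain.pdf -x='Patient ID' -y=Time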

  • hist.R
hist.R -h

Usage: hist.R -p=outputName.pdf <input.tsv

Option:

    Common:

    -p|pdf    FILE   The output figure in pdf[figure.pdf]

    -w|width  INT   The figure width

    -m|main  STR   The main title

    -mainS   DOU  The size of main title[22 for ggplot]

    -x|xlab    STR   The xlab[Binned Values]

    -y|ylab    STR   The ylab

Contents omitted…

    -annoTxt   STRs  The comma-separated texts to be annotated

    -annoTxtX  INTs  The comma-separated X positions of text

    -annoTxtY  INTs  The comma-separated Y positions of text

Skill:

    Legend title of alpha, color, etc can be set as the same to merge their guides

This script is used to draw a histogram.

hist.R -p=hist.pdf -x=Time -y='Patient Count' <survival.tsv

Only one column of values is needed to draw a histogram, so when survival.tsv is passed in as input, its first column is used by default.
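
Other common options such as the main title use the same syntax, for example (a sketch):

hist.R -p=histTitled.pdf -x=Time -y='Patient Count' -m='Survival Time Distribution' <survival.tsv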

4.2.7 Parsing of Third-party Tool Result

  • coverageBedParser.pl
coverageBedParser.pl -h

Usage: perl coverageBedParser.pl coverageBedOutput.tsv >OUTPUT.tsv

    If coverageBedOutput.tsv isn't specified, input from STDIN

Option:

    -b -bedFormat  INT  Bed format ([3], 6)

    -h --help            Print this help information

This script parses the output of 'bedtools coverage' and calculates the mean depth and covered fraction of each region.

coverageBedParser.pl coverage.bed >coverageParsed.bed
  • fastqcParser.pl
fastqcParser.pl -h

Usage: perl fastqcParser.pl OPTION fastqc_data.txt >OUTPUT.tsv

    If fastqc_data.txt isn't specified, input from STDIN

Options:

    -h --help  Print this help information

This script parses the FastQC result (typically fastqc_data.txt), extracts the read count, GC content, mean quality, etc., and outputs them as a table.

fastqcParser.pl fastqc_data.txt >fastqcDataParsed.tsv

4.2.8 Gene/Region Expression Quantification

  • geneRPKM.pl
geneRPKM.pl -h

Usage: perl geneRPKM.pl -g gene_structure.gpe -s 4 INPUT.BAM >RPKM.bed6+ 2>running.log

    If INPUT.BAM isn't specified, input is from STDIN

    Output to STDOUT in bed6 (gene in name column, RPKM in score column) plus longest transcript, readNO and transcript length

    This script chooses the LONGEST transcript of each gene as reference transcript to measure RPKM

Option:

    -g --gpe     FILE  A gpe file with comment or track line allowed

    -b --bin           With bin column

    -l --libType  STR  The library type, it can be

                      fr-unstranded: for Standard Illumina (default)

                      fr-firststrand: for dUTP, NSR, NNSR

                      fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol

    -s --slop     INT  Specify the slopping length from the exon-intron joint to intron[0]

    -u --uniq          Only use uniquely-mapped reads (NH=1)to compute RPKM

       --log     FILE  Record running log into FILE

    -h --help          Print this help information

As the help information describes, this script uses the longest transcript of each gene as the reference to quantify gene expression in RPKM. The output is stored in bed6+ format. A running example:

geneRPKM.pl -g refGene.gpe -b -s 4 final.bam >RPKM.bed6+ 2>geneRPKM.log

In the command, the -s option specifies the length by which reads may extend from an exon into an intron. This option handles the situation where a read extends only a few base pairs beyond an exon boundary and therefore cannot be split across the intron to the adjacent exon.

This script may consume a lot of memory (depending on the size of the input BAM file). If there isn't enough memory on the machine, geneRPKM_mem.pl, which consumes little memory but runs more slowly, is an alternative.
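
For example, to restrict quantification to uniquely mapped reads, or to switch to the low-memory variant (a sketch; it is assumed, but not stated above, that geneRPKM_mem.pl accepts the same options):

# Only uniquely mapped reads (NH=1)
geneRPKM.pl -g refGene.gpe -b -s 4 -u final.bam >RPKM.uniq.bed6+ 2>geneRPKM.log
# Low-memory alternative
geneRPKM_mem.pl -g refGene.gpe -b -s 4 final.bam >RPKM.bed6+ 2>geneRPKM_mem.log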

  • regionRPKM.pl
regionRPKM.pl -h

Usage: perl regionRPKM.pl -b region.bed INPUT.bam >RPKM.bed 2>running.log

    If INPUT.bam isn't specified, input from STDIN

    Output to STDOUT with bed columns plus reads count in region and its RPKM

Note: INPUT.bam should be indexed with samtools index

    This script is for handling bam file in normal size that can be entirely cached into memory.

    It's MEMORY-CONSUMED but low TIME-CONSUMED compared to its equivalent regionRPKM_mem.pl.

    Spliced reads are handled now. Those that include the whole region within intron aren't counted.

Option:

    -b|bedFile  FILE  Region file in bed4 or bed6 format. bed plus is allowed.

    -l|libType   STR  The library type, it can be

                     fr-unstranded: for Standard Illumina (default)

                     fr-firststrand: for dUTP, NSR, NNSR

                     fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol

    -h --help         Print this help information

This script calculates the RPKM of specific regions (specified by the --bedFile option). A running example:

regionRPKM.pl --bedFile exon.bed final.bam >RPKM.bed 2>regionRPKM.log

Similarly, if there isn't enough memory, use regionRPKM_mem.pl as an alternative.
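
For a stranded library, the library type can be given with -l, and remember that the BAM must be indexed (a sketch):

samtools index final.bam
regionRPKM.pl --bedFile exon.bed -l fr-firststrand final.bam >RPKM.stranded.bed 2>regionRPKM.log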

4.2.9 Alternative Splicing Identification and Quantification

Only identification and quantification of SE (skipping exon) events are implemented in the current version.

  • psiSE.pl
psiSE.pl -h

Usage: perl psiSE.pl INPUT.bam >OUTPUT.bed6+

    If INPUT.bam isn't specified, input from STDIN

Option:

    -b --bed        FILE  Gene models in bed12 format

    -l --libraryType  STR  The library type, it can be

                         fr-unstranded: for Standard Illumina (default)

                         fr-firststrand: for dUTP, NSR, NNSR

                         fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol

    -s --slop        INT  Maximal slope length for a read to be considered as exonic read[4]

    -r --minRead    INT  Minimal supporting reads count for an exclusion junction[2]

    -h --help             Print this help information

Output:

    The 4th column is the transcript name and the exon rank (in transcriptional direction) separated by a dot.

    The 5th column in OUTPUT.bed6+ is the PSI normalized into 0-1000.

    Additional columns are as follow:

        inclusion read count

        inclusion region length

        inclusion read density

        exclusion read counts separated by comma

        exclusion region lengths separated by comma

        exclusion read density

This script identifies SE events from the alignments and quantifies their PSI. A usage example:

psiSE.pl final.bam >SE.bed6+

Refer to the help information for the format of the output result.
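
Gene models and the library type can be supplied as well, for example (a sketch; gencode.bed12 is a hypothetical bed12 annotation file):

psiSE.pl -b gencode.bed12 -l fr-firststrand final.bam >SE.bed6+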

V. Get Help

You can send the author (Sky) any feedback about this toolkit, such as bug reports, typos, and suggestions for performance improvement.