Table of Contents

I. Introduction
II. Prerequisites
III. Download and Install
IV. Tutorial
  4.1 Common Rules
  4.2 Example
V. Get Help

I. Introduction

C & S's Toolkit (CSTK) is a package of Perl, R, Python and shell scripts. A single script, or several scripts combined into a pipeline, can meet general bioinformatics analysis requirements and handle miscellaneous tasks. The function modules of CSTK include, but are not limited to:

A. File format converting

B. Processing of standard format file

C. Table manipulation

D. Statistical testing

E. Survival analysis

F. Data visualization

G. Parsing of third-party tool result

H. Gene/region expression quantification

I. Alternative splicing identification and quantification

II. Prerequisites

CSTK mainly requires the following software/packages:

Software/Package Version Description
Perl >=5.010 The Perl language program.
BioPerl Not limited Needed only by fqConverter.pl and mafLiftOver.pl in current version.
R >=3.3.2 The R language program, mainly used for statistical analysis and data visualization.
ggplot2 2.x An R package for data visualization.
Python =2.7 The Python language program.
SAMTools >=1.5 The toolkit to manipulate BAM files.

The listed versions are suggestions rather than strict requirements; older versions may work, although we haven't tested them. Additional software/packages are script-specific: if a particular CSTK script requires extra software/packages, please install them according to the error message.

III. Download and Install

Please download the CSTK from the release page or clone it with git.

Installing CSTK is extremely easy, because all the source code is written in scripting languages and no compilation is needed. After installing the software/packages required in the "Prerequisites" section, CSTK is ready to use.

Before using CSTK, first add the path of CSTK to the environment variable $PATH, so that the scripts in CSTK can be used directly at the command line. Assuming CSTK is decompressed into /home/<yourUserName>/bin/CSTK, run the following commands:

PATH=/home/<yourUserName>/bin/CSTK:$PATH
chmod u+x /home/<yourUserName>/bin/CSTK/*.{pl,R,py,sh}

You can add the first command to your shell configuration file (typically /home/<yourUserName>/.bashrc), so that CSTK is available every time you log in:

cat <<'EOF' >>/home/<yourUserName>/.bashrc
PATH=/home/<yourUserName>/bin/CSTK:$PATH
EOF

Rather than downloading and installing manually, it's recommended to set up CSTK with Docker (https://hub.docker.com/r/zhangsjsky/cstk) or Singularity (https://cloud.sylabs.io/library/_container/5ed705dfd0ff9c878fea7e50). This way you don't need to install the dependencies yourself, as they are already installed in the container.

# For Docker
docker pull zhangsjsky/cstk:latest
# For Singularity
singularity pull library://zhangsjsky/default/cstk

IV. Tutorial

4.1 Common Rules

The following rules are commonly applied in CSTK:

A) For a script with only one input file, the input is read from STDIN (standard input) unless an input file is specified as an argument. For a script with only one output file, the output is printed to STDOUT (standard output), while STDERR (standard error) may be used for logging. For a script with two output files, the outputs may be printed to STDOUT and STDERR respectively, or to STDOUT and a file specified by an option. Preferring STDOUT/STDERR over option-specified files makes it convenient to chain analysis steps into a pipeline, which also speeds up the analysis and avoids writing intermediate files to disk (see the sketch after this list).

B) Help information for almost all scripts can be viewed with the -h, -help or --help option.

C) For options that take a value, Perl and shell scripts separate the option and its value with a space (e.g. -option value), while R scripts separate them with an equals sign (e.g. -option=value).
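
A minimal sketch of how these rules combine in practice (file names are placeholders, and whether a given script writes a log to STDERR varies by script):

# Rule A: read from STDIN, keep any log from STDERR, write to STDOUT, and chain steps with pipes
zcat input.fq.gz | fqConverter.pl 2>convert.log | gzip -c >output.fq.gz
# Rule C: Perl/shell scripts use '-option value', R scripts use '-option=value'
gpe2bed.pl -t 6 input.gpe >output.bed6
hist.R -p=hist.pdf <values.tsv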

4.2 Example

Only one or two scripts from each function module of CSTK are illustrated in the following examples; illustrations for more scripts will be added according to user feedback.

4.2.1 File Format Converting

  • fqConverter.pl
fqConverter.pl -h

Usage: perl fqConverter.pl input.fq >output.fq

    If input.fq isn't specified, input from STDIN

    If output.fq isn't specified, output to STDOUT

Option:

    -s --source  STR  The quality format of your source fastq file ([fastq-illumina], fastq-solexa, fastq)

    -t --target   STR  The quality format of your target fastq file (fastq-illumina, fastq-solexa, [fastq])

    -h --help          This help information screen

Since FASTQ is a common format in bioinformatics, the content of the input file is not shown here. For details of the FASTQ format, please visit the wiki.

Running example:

zcat myReads.illumina.fq.gz | fqConverter.pl >myReads.sanger.fq

The command converts a FASTQ file with Illumina quality encoding (the default of the -s option) to the widely used Sanger encoding (the default of the -t option).

A pipe is used in the command to illustrate the advantage of piping CSTK scripts. You can improve it further as:

zcat myReads.illumina.fq.gz | fqConverter.pl | gzip -c >myReads.sanger.fq.gz

In this way, the output FASTQ is stored as a compressed gz file. In this procedure, the CSTK script acts only as an adapter in the pipeline: it reads its input from the output of the previous step and prints its output to the next step as that step's input.

  • gpe2bed.pl
gpe2bed.pl -h

Usage: perl gpe2bed.pl INPUT.gpe >OUTPUT.bed

    If INPUT.gpe isn't specified, input from STDIN

    Output to STDOUT

Option:

    -b --bin            With bin column

    -t --bedType  INT  Bed type. It can be 3, 6, 9 or 12[12]

    -i --itemRgb  STR  RGB color[0,0,0]

    -g --gene          Output 'gene name' in INPUT.gpe as bed plus column

    -p --plus           Output bed plus when there are additional columns in gpe

    -h --help           Print this help information

Assuming there is an input file (example.gpe) with the following contents:

76 NM_015113 chr17 - 3907738 4046253 3910183 4046189 55 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, 0 ZZEF1 cmpl cmpl 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0,
147 NM_001308237 chr1 - 78028100 78149112 78031324 78105156 14 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78105133,78107068,78107206,78148946, 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78105287,78107131,78107340,78149112, 0 ZZZ3 cmpl cmpl 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1,
147 NM_015534 chr1 - 78028100 78148343 78031324 78099039 15 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269, 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343, 0 ZZZ3 cmpl cmpl 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1,-1,

As shown, there is a bin column (the first column) in the gpe file, so the -b option should be specified. For a description of the gpe format, see UCSC: http://genome.ucsc.edu/FAQ/FAQformat.html#format9.

Run the script:

gpe2bed.pl -b example.gpe >example.bed

The content of the output bed file:

chr17 3907738 4046253 NM_015113 0 - 3910183 4046189 0,0,0 55 2526,72,154,166,99,169,144,360,136,101,192,120,200,133,175,278,140,151,152,263,166,139,130,162,120,165,469,94,80,241,149,202,82,167,123,122,160,115,68,134,170,101,216,112,149,92,99,179,117,211,200,172,195,145,418 0,4438,5159,9004,9645,9902,11878,12926,13391,15224,16684,18264,20474,27681,28383,29570,37984,39779,45263,46336,47526,49612,51771,53549,54726,58308,59916,62002,62718,66239,68163,69705,70652,70818,72192,72423,73438,76931,77992,81225,82041,82989,84233,86274,91386,92164,97872,100188,101248,105208,108164,109854,112527,119462,138097
chr1 78028100 78149112 NM_001308237 0 - 78031324 78105156 0,0,0 14 3369,101,135,153,96,102,72,116,148,139,154,63,134,166 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,77033,78968,79106,120846
chr1 78028100 78148343 NM_015534 0 - 78031324 78099039 0,0,0 15 3369,101,135,153,96,102,72,116,148,139,1556,154,63,134,74 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,69434,77033,78968,79106,120169

The -t option defaults to 12, so the output bed file is in bed12 format by default. You can also try the -g or -p option to modify or add columns in the output.
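
For instance, to produce plain bed6 output or to append the gene name as a bed plus column, the options above can be combined as follows (a usage sketch based on the help text; output not shown):

gpe2bed.pl -b -t 6 example.gpe >example.bed6
gpe2bed.pl -b -g example.gpe >example.bed12+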

4.2.2 Processing of Standard Format File

  • gpeMerge.pl
gpeMerge.pl -h

Usage: perl gpeMerge.pl input.gpe >output.gpe

    If input.gpe not specified, input from STDIN

    Output to STDOUT

    -b --bin                  With bin column

    -l --locus                  Merge with locus (default: merge gene)

    -t --longTranscript         Overlap is against long transcript (default against short transcript)

    -n --name                Set the "gene name" column as "transcript name(s)" when the corresponding gene name unavailable

    -p --percent        DOU  Minimal overlap percent to merge transcript (default: 0)

    -h --help                  Print this help information

This script merges the different transcripts of the same gene (or of the same locus if the -l option is specified). The merging criterion is: for each site of a gene, if the site falls within an exon of any transcript, it is treated as an exonic site in the merged result; otherwise it is treated as an intronic site.

A diagram to intuitively illustrate the merging:

[Figure: Merging_Illustration, from C.]

In the figure, the top and middle lines are two transcripts of the same gene, and the bottom line is the result after merging.

Use the example.gpe file in the previous section as input to run this script:

gpeMerge.pl -b example.gpe >merged.gpe

The content of the output:

6 NM_015113 chr17 - 3907738 4046253 3910183 4046189 55 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, 0 ZZEF1 cmpl cmpl 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0,
147 NM_001308237,NM_015534 chr1 - 78028100 78149112 78028100 78149112 16 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269,78148946 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343,78149112 0 ZZZ3 unk unk -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,

As you can see, the two transcript records of the ZZZ3 gene have been merged into one record. The exon starts and ends are updated to the merged coordinates, and the other coordinate-related information is updated accordingly. The transcript name column is also updated to a comma-separated transcript list.
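
If you prefer to merge overlapping transcripts by locus rather than by gene, the -l flag from the help above can be added in the same way (a usage sketch; output not shown):

gpeMerge.pl -b -l example.gpe >mergedByLocus.gpe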

  • gpeFeature.pl
gpeFeature.pl -h

Usage: perl gpeFeature.pl OPTION INPUT.gpe >OUTPUT.bed

    If INPUT.gpe isn't specified, input from STDIN

    Example: perl gpeFeature.pl -b -g hg19.size --upstream 1000 hg19.refGene.gpe >hg19.refGene.bed

Option:

    -b --bin                With bin column

    -i --intron               Fetch introns in each transcript

    -e --exon               Fetch exons in each transcript

    -c --cds                 Fetch CDS in each transcript

    -u --utr                 Fetch UTRs in each transcript, 5'UTR then 3'UTR (or 3' first)

    -p --prime        INT   5 for 5'UTR, 3 for 3'UTR(force -u)

       --complete          Only fetch UTR for completed transcripts

       --upstream     INT  Fetch upstream INT intergenic regions(force -g)

       --downstream  INT  Fetch downstream INT intergenic regions(force -g)

    -g --chrSize       FILE  Tab-separated file with two columns: chr name and its length

    -s --single              Bundle all features into single line for each transcript

       --addIndex          Add exon/intron/CDS/UTR index as suffix of name in the 4th column

    -h --help               Print this help information

This script extracts specific features from a gpe file and outputs them in bed format.

For example, to extract exons:

gpeFeature.pl -b example.gpe -e >exon.bed6+
chr17 3907738 3910264 NM_015113 0 - NM_015113 2526 ZZEF1
chr17 3912176 3912248 NM_015113 0 - NM_015113 72 ZZEF1
chr17 3912897 3913051 NM_015113 0 - NM_015113 154 ZZEF1

Additional records of the output are omitted. The result is in bed6+ format, with each line representing an exon; the last two columns give the exon length and the gene name.
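
Other features can be fetched in the same way, for example (a sketch based on the options above; output omitted):

# Introns, with the intron index appended to the name in the 4th column
gpeFeature.pl -b -i --addIndex example.gpe >intron.bed6+
# CDS regions
gpeFeature.pl -b -c example.gpe >cds.bed6+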

4.2.3 Table Manipulation

  • tsvFilter.pl
tsvFilter.pl -h

Usage: perl tsvFilter.pl -o originFile.tsv -1 1,4 -m i|include targetFile.tsv >filtered.tsv

    If targetFile.tsv isn't specified, input is from STDIN

    Output to STDOUT

Option:

    -o --originFile    TSV  The original file containing fields (specified by --originFields) used to include or exclude lines in targetFile.tab

    -1 --originFields  STR  Comma-separated field list specifying which fields in the originFile.tab to be used to include or exclude, 1-based start [1]

                          The element of the list can be a single column number or a range with nonnumeric char as separator

                          To specify the last column left the range right margin blank

                          If continuous range specified like '1-3-6', the first range '1-3' will be output

                           e.g.:

                            -1 1,4     output columns 1,4

                            -1 1-4,6..8 output columns 1,2,3,4,6,7,8

                            -1 1,4,6-   output columns 1,4,6,7,... last column

                            -1 1-3-6   output columns 1,2,3

    -2 --targetFields  STR  Comma-separated field list specifying which fields in the targetFile.tab are used to include or exclude lines, 1-based start [1]

                          More description about --targetFields, see --originFields

    -m --mode       STR  To include or exclude lines in targetFile.tab, it can be i|include or e|exclude[e]

    -s --separator    STR  (Optional)A separator to join the fields specified, if necessary[Empty string]

    -h --help              Print this help information

This script is often used in table manipulation. Its function is to filter a table (the target table) according to one or more of its columns (the target columns), using one or more columns (the source columns) of another table (the source table). The filtering mode is whether the target columns are included in (or excluded from) the source columns.

Assume the first column of the source table (source.tsv) is the gene name and the second column is an up-/down-regulation mark:

Gene1 Up
Gene2 Down
Gene3 Up

The first column of the target table (target.tsv) is the gene name and the second column is the regulation fold-change:

Gene1 5
Gene2 10
Gene3 2

If you want to know the fold-change of the up-regulated genes, run the following command:

tsvFilter.pl -o <(awk '$2=="Up"' source.tsv) -m i target.tsv

The output is:

Gene1 5
Gene3 2

The command first fetches the up-regulated genes with awk and feeds them to the -o option to filter the target.tsv file. The -m option sets the filtering mode to "include". Because both the source column and the target column are the first column, the -1 and -2 options keep their default value of 1.
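
When the match involves more than one column, the column lists and a join separator can be given explicitly, for example (a sketch; keys.tsv and data.tsv are hypothetical files keyed by their first two columns):

tsvFilter.pl -o keys.tsv -1 1,2 -2 1,2 -s ':' -m i data.tsv >kept.tsv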

  • tsvJoin.sh
tsvJoin.sh

Usage: tsvJoin.sh OPTIONS [-1 field1][-2 fields2] input1.tsv [input2.tsv]

Note: input2.tsv can be omitted or be a "-" to input the data from STDIN.

      when input1.tsv is a "-" (i.e. from STDIN), input2.tsv must be specified.

Options:

-1|field1           STR  The field in input1.tsv used when joining[1]

                         Refer to the -1 option of linux join command

-2 field2           STR  The field in input2.tsv used when joining[1]

-i|inputDelimiter    STR  The delimiter of your input file[\t]

-j|joinDelimiter     STR  The delimiter used by the linux join command (i.e. the -t option of join command)[|]

-o|outputDelimiter  STR  The delimiter of the output[\t]

-a|unpairedFile      INT  Also print unpairable lines from file INT (1, 2)

Please specify at least one file

This script is also often used in table manipulation. Its function is to join two tables according to whether the values of one or more columns are equal. Unlike the join command built into Linux, this script can use the tab character as the field separator.

Take the previous source.tsv and target.tsv as examples, join them according to gene name:

tsvJoin.sh source.tsv target.tsv >join.tsv 2>/dev/null

The output join.tsv is:

Gene1 Up 5
Gene2 Down 10
Gene3 Up 2

In the command, '2>/dev/null' discards STDERR (which carries running information). Because the columns used for joining are the first column in both input tables, the -1 and -2 options keep their default value of 1.

In the output, the join column (i.e. the gene name column) is placed first, followed by the remaining columns of the first and second files (excluding the join column).
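
The delimiter and unpaired-line options can be combined in the same way, for example (a sketch; output not shown):

# Also print lines of source.tsv that have no match in target.tsv
tsvJoin.sh -a 1 source.tsv target.tsv >joinAll.tsv 2>/dev/null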

4.2.4 Statistical Testing

  • testT.R
testT.R -h

Usage: testT.R -option=value <input.lst|<input1.lst input2.lst|input1.lst input2.lst >pValue

Option:

    -p|pair      Pair

    -a|alt  STR  The alternative hypothesis: [two.sided], greater or less

    -h          Show help

This script is used to perform a t-test. '<input.lst|<input1.lst input2.lst|input1.lst input2.lst' means there are several input manners:

A) Input one file from STDIN;

B) Input from STDIN and file (specified by an argument), respectively;

C) Input from two files (specified by two arguments), respectively.

The commonly used manner is the last one. Assuming the content of the first file (value1.lst) is:

1
2
3
4
5

And that of the second file (value2.lst) is:

3
4
5
6
7
8

testT.R value1.lst value2.lst

0.0398030820241363

The output p-value indicates that there is a significant difference between the two lists of values.
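
Other input manners and alternative hypotheses follow the same pattern, for example (a sketch; p-values not shown):

# Input manner B: one list from STDIN, the other as an argument
testT.R value2.lst <value1.lst
# One-sided test (R-style option syntax, see rule C in "Common Rules")
testT.R -a=less value1.lst value2.lst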

  • testWilcoxon.R
testWilcoxon.R -h

Usage: testWilcoxon.R -option=value <input.lst|input.lst|input1.lst input2.lst >pValue

Options

    -a|alt    STR  The alternative hypothesis ([two.sided], greater, less)

                   You can specify just the initial letter

    -m|mu  DOU  A parameter used to form the null hypothesis for one-sample test[0]

    -h|help        Show help

testWilcoxon.R value1.lst value2.lst 2>/dev/null

0.06601543

The output p-value indicates that there isn't a significant difference between the two lists of values, a conclusion different from that of the t-test.
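
According to the help, -m sets the null-hypothesis value for a one-sample test; a sketch, assuming a single list from STDIN is treated as the one-sample case:

testWilcoxon.R -m=3 <value1.lst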

4.2.5 Survival Analysis

  • survival.R
survival.R -h

Usage: survival.R -option=value <input.tsv

Option:

    -p|pdf     PDF  The KM figure[KM.pdf]

    -w|width   INT  The figure width

       height  INT  The figure height

       header       With header

    -m|main   STR  The main title

    -x|xlab     STR  The xlab[Time]

    -y|ylab     STR  The ylab[Survival Probability]

    -h              Show help

Input (header isn't necessary):

  Example1:

    #time event
    1     TRUE
    2     TRUE
    2     TRUE
    8     FALSE
    5     TRUE
    10    FALSE

  Example2:

    #time event  group
    1     TRUE  male
    2     TRUE  male
    2     TRUE  male
    8     FALSE male
    5     TRUE  female
    10    FALSE female

As the help information shows, create an input file like Example2 (survival.tsv). Pass it to the script as input and specify the output PDF file:

survival.R -p=survival.pdf <survival.tsv >survival.log 2>survival.err

The survival curve (i.e. the Kaplan-Meier curve) and the log-rank p-value are drawn in the PDF; statistical measurements are stored in survival.log.
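
The figure options from the help can be added with the same R-style syntax, for example (a sketch):

survival.R -p=survival.pdf -m='Overall Survival' -x=Days <survival.tsv >survival.log 2>survival.err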

4.2.6 Data Visualization

  • bar.R
bar.R -h

Usage: bar.R -p=outputName.pdf <input.tsv

Option:

Common:

    -p|pdf    FILE  The output figure in pdf[figure.pdf]

    -w|width  INT  The figure width

    -height   INT  The figure height

    -m|main  STR  The main title

Contents omitted…

    -annoTxt   STRs  The comma-separated texts to be annotated

    -annoTxtX  INTs  The comma-separated X positions of text

    -annoTxtY  INTs  The comma-separated Y positions of text

    -annoTxtS  DOU  The annotated text size[5]

Skill:

    Legend title of alpha, color, etc can be set as the same to merge their guides

This script is used to draw a bar chart. Take the survival.tsv from the previous section as input:

awk 'BEGIN{OFS="\t"}{print NR,$1,$3}' survival.tsv | bar.R -p=bar.pdf -fillV=V3 -x='Patient ID' -y=Time -fillT=Gender

In the command, awk is used to process the input before passing the data to bar.R. The processed results are:

1 1 male
2 2 male
3 4 male
4 8 male
5 5 female
6 10 female

The patient ID is added as an extra first column and will be shown on the X axis. The second column is the survival time and will be shown on the Y axis. The third column is the gender and will be used to fill the bars with different colors. If the third column isn't present, all bars are filled in black by default.

Among the parameters of bar.R, -fillV=V3 tells it to use the third column (column name: V3) to color the bars; don't specify this option if there is no third column. -x and -y specify the labels of the X and Y axes, respectively. -fillT specifies the title of the fill legend.
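
If the input has no grouping column, the fill options are simply dropped, for example (a sketch; all bars then default to black as noted above):

awk 'BEGIN{OFS="\t"}{print NR,$1}' survival.tsv | bar.R -p=barPlain.pdf -x='Patient ID' -y=Time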

  • hist.R
hist.R -h

Usage: hist.R -p=outputName.pdf <input.tsv

Option:

    Common:

    -p|pdf    FILE   The output figure in pdf[figure.pdf]

    -w|width  INT   The figure width

    -m|main  STR   The main title

    -mainS   DOU  The size of main title[22 for ggplot]

    -x|xlab    STR   The xlab[Binned Values]

    -y|ylab    STR   The ylab

Contents omitted…

    -annoTxt   STRs  The comma-separated texts to be annotated

    -annoTxtX  INTs  The comma-separated X positions of text

    -annoTxtY  INTs  The comma-separated Y positions of text

Skill:

    Legend title of alpha, color, etc can be set as the same to merge their guides

This script is used to draw a histogram.

hist.R -p=hist.pdf -x=Time -y='Patient Count' <survival.tsv

Only one column of values is needed to draw a histogram, so when survival.tsv is passed in as input, its first column is used by default.
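
Other common options such as the main title use the same syntax, for example (a sketch):

hist.R -p=histTitled.pdf -x=Time -y='Patient Count' -m='Survival Time Distribution' <survival.tsv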

4.2.7 Parsing of Third-party Tool Result

  • coverageBedParser.pl
coverageBedParser.pl -h

Usage: perl coverageBedParser.pl coverageBedOutput.tsv >OUTPUT.tsv

    If coverageBedOutput.tsv isn't specified, input from STDIN

Option:

    -b -bedFormat  INT  Bed format ([3], 6)

    -h --help            Print this help information

This script parses the output of 'bedtools coverage' and calculates the mean depth and covered fraction of each region.

coverageBedParser.pl coverage.bed >coverageParsed.bed
  • fastqcParser.pl
fastqcParser.pl -h

Usage: perl fastqcParser.pl OPTION fastqc_data.txt >OUTPUT.tsv

    If fastqc_data.txt isn't specified, input from STDIN

Options:

    -h --help  Print this help information

This script parses the FastQC result (typically fastqc_data.txt), extracts the read count, GC content, mean quality, etc., and outputs them as a table.

fastqcParser.pl fastqc_data.txt >fastqcDataParsed.tsv

4.2.8 Gene/Region Expression Quantification

  • geneRPKM.pl
geneRPKM.pl -h

Usage: perl geneRPKM.pl -g gene_structure.gpe -s 4 INPUT.BAM >RPKM.bed6+ 2>running.log

    If INPUT.BAM isn't specified, input is from STDIN

    Output to STDOUT in bed6 (gene in name column, RPKM in score column) plus longest transcript, readNO and transcript length

    This script chooses the LONGEST transcript of each gene as reference transcript to measure RPKM

Option:

    -g --gpe     FILE  A gpe file with comment or track line allowed

    -b --bin           With bin column

    -l --libType  STR  The library type, it can be

                      fr-unstranded: for Standard Illumina (default)

                      fr-firststrand: for dUTP, NSR, NNSR

                      fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol

    -s --slop     INT  Specify the slopping length from the exon-intron joint to intron[0]

    -u --uniq          Only use uniquely-mapped reads (NH=1)to compute RPKM

       --log     FILE  Record running log into FILE

    -h --help          Print this help information

As the help information describes, this script uses the longest transcript of each gene as the reference to quantify gene expression in RPKM. The output is stored in bed6+ format. A running example:

geneRPKM.pl -g refGene.gpe -b -s 4 final.bam >RPKM.bed6+ 2>geneRPKM.log

In the command, the -s option specifies the length by which reads may extend from an exon into an intron. This option handles the situation where a read extends only a few base pairs beyond an exon boundary and therefore cannot be split across the intron to the adjacent exon.

This script may consume a lot of memory (depending on the size of the input BAM file). If there isn't enough memory on the machine, geneRPKM_mem.pl, which consumes little memory but runs more slowly, is an alternative.
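
For example, to restrict quantification to uniquely mapped reads, or to switch to the low-memory variant (a sketch; it is assumed, but not stated above, that geneRPKM_mem.pl accepts the same options):

# Only uniquely mapped reads (NH=1)
geneRPKM.pl -g refGene.gpe -b -s 4 -u final.bam >RPKM.uniq.bed6+ 2>geneRPKM.log
# Low-memory alternative
geneRPKM_mem.pl -g refGene.gpe -b -s 4 final.bam >RPKM.bed6+ 2>geneRPKM_mem.log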

  • regionRPKM.pl
regionRPKM.pl -h

Usage: perl regionRPKM.pl -b region.bed INPUT.bam >RPKM.bed 2>running.log

    If INPUT.bam isn't specified, input from STDIN

    Output to STDOUT with bed columns plus reads count in region and its RPKM

Note: INPUT.bam should be indexed with samtools index

    This script is for handling bam file in normal size that can be entirely cached into memory.

    It's MEMORY-CONSUMED but low TIME-CONSUMED compared to its equivalent regionRPKM_mem.pl.

    Spliced reads are handled now. Those that include the whole region within intron aren't counted.

Option:

    -b|bedFile  FILE  Region file in bed4 or bed6 format. bed plus is allowed.

    -l|libType   STR  The library type, it can be

                     fr-unstranded: for Standard Illumina (default)

                     fr-firststrand: for dUTP, NSR, NNSR

                     fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol

    -h --help         Print this help information

This script calculates the RPKM of specific regions (specified by the --bedFile option). A running example:

regionRPKM.pl --bedFile exon.bed final.bam >RPKM.bed 2>regionRPKM.log

Similarly, if there isn't enough memory, use regionRPKM_mem.pl as an alternative.
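
For a stranded library, the library type can be given with -l, and remember that the BAM must be indexed (a sketch):

samtools index final.bam
regionRPKM.pl --bedFile exon.bed -l fr-firststrand final.bam >RPKM.stranded.bed 2>regionRPKM.log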

4.2.9 Alternative Splicing Identification and Quantification

Only identification and quantification of SE (skipping exon) events are implemented in the current version.

  • psiSE.pl
psiSE.pl -h

Usage: perl psiSE.pl INPUT.bam >OUTPUT.bed6+

    If INPUT.bam isn't specified, input from STDIN

Option:

    -b --bed        FILE  Gene models in bed12 format

    -l --libraryType  STR  The library type, it can be

                         fr-unstranded: for Standard Illumina (default)

                         fr-firststrand: for dUTP, NSR, NNSR

                         fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol

    -s --slop        INT  Maximal slope length for a read to be considered as exonic read[4]

    -r --minRead    INT  Minimal supporting reads count for an exclusion junction[2]

    -h --help             Print this help information

Output:

    The 4th column is the transcript name and the exon rank (in transcriptional direction) separated by a dot.

    The 5th column in OUTPUT.bed6+ is the PSI normalized into 0-1000.

    Additional columns are as follow:

        inclusion read count

        inclusion region length

        inclusion read density

        exclusion read counts separated by comma

        exclusion region lengths separated by comma

        exclusion read density

This script identifies SE events from the alignments and quantifies their PSI. A usage example:

psiSE.pl final.bam >SE.bed6+

Refer to the help information for the format of the output result.
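
Gene models and the library type can be supplied as well, for example (a sketch; gencode.bed12 is a hypothetical bed12 annotation file):

psiSE.pl -b gencode.bed12 -l fr-firststrand final.bam >SE.bed6+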

V. Get Help

You can send the author (Sky) any feedback about this toolkit, such as bug reports, typos, and suggestions for performance improvement.