kpalin edited this page Oct 9, 2012 · 2 revisions

SLRP tutorial

This page gives some guidance on getting started with SLRP. The examples are based on the files in the "examples/" directory inside the SLRP folder.

The "examples/" directory contains the input files cM12genNoFamPop100.tped and cM12genNoFamPop100.vcf.gz, and a file cM12genNoFamPop100.haplotypes.vcf.gz containing the true phasing for the genotypes in the other files. The data is the same as used in Palin et al. Briefly, it is a random sample of chromosome 20 from 190 individuals in a panmictic population that starts from 100 HapMap CEU founders and grows exponentially to 17887 individuals in 12 generations, without mutations and with recombinations added according to a fine-scale map.

The following examples assume that the SLRP executable script is in your PATH and that the SLRP libraries are available to the Python interpreter (i.e. PYTHONPATH is set appropriately). More about installing SLRP can be found on the requirements page.

Example: Phasing

The following command phases the provided example file:

SLRP --vcfFile examples/cM12genNoFamPop100.vcf.gz \
     --fastPreProc \
     --ibdSegCalls SLRP.ibd \
     --outVCF cM12genNoFamPop100.SLRP.vcf  --verbose

There will be quite a lot of output on the screen, but after a successful run there will be three new files in the current directory: SLRP.ibd.aibd contains the preliminary IBD segments (probably not very useful), SLRP.ibd contains the final inferred IBD segments, and cM12genNoFamPop100.SLRP.vcf contains the phased haplotypes. You can compare the last file with examples/cM12genNoFamPop100.haplotypes.vcf.gz (e.g. with vcftools --diff-switch-error) and you should find about 3 switch errors per individual on average. (File name mixup for vcf files fixed in repository 9.10.2012)
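To make the "switch error" figure concrete, here is a minimal Python sketch of how switch errors can be counted for one individual. It is not part of SLRP and is not necessarily how vcftools computes its statistic. The function name count_switch_errors and the input format (lists of phased genotype strings such as "0|1", one per marker) are assumptions made for illustration.

```python
def count_switch_errors(truth, test):
    """Count switch errors between two phasings of one individual.

    Both arguments are lists of phased genotype strings such as "0|1",
    one per marker, in the same marker order (an assumed, simplified
    stand-in for the GT fields of two VCF files).
    """
    switches = 0
    prev_same = None  # orientation at the previous heterozygous site
    for t, s in zip(truth, test):
        a, b = t.split("|")
        if a == b:
            continue  # homozygous sites carry no phase information
        c, d = s.split("|")
        if {a, b} != {c, d}:
            continue  # genotypes disagree, so the site is skipped
        same = (a, b) == (c, d)
        # A switch error is a change of orientation between two
        # consecutive heterozygous sites.
        if prev_same is not None and same != prev_same:
            switches += 1
        prev_same = same
    return switches


truth = ["0|1", "0|1", "1|1", "1|0", "0|1"]
test = ["0|1", "1|0", "1|1", "0|1", "0|1"]
print(count_switch_errors(truth, test))
```

In this toy input the phase flips after the first marker and flips back before the last one, so the function reports two switch errors.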

Comments

You can also give the input as a plink tped/tfam fileset by replacing the --vcfFile option with --tpedFile. Still, I would recommend sticking with VCF files due to their many other benefits.

--fastPreProc is not strictly necessary, but it speeds things up and doesn't hurt the accuracy much.

The IBD segment files have 7 tab-separated columns. The first two columns are the IBD haplotypes, with number 2*k+p being the p:th haplotype of the k:th individual in the list of individuals (as in e.g. the header of the VCF file). The third column is the cM length of the segment (or 'a score' in the preliminary segment file). Columns 4 and 5 are the first and the last marker of the segment, and columns 6 and 7 are the first and the last basepair of the segment (i.e. the positions of the first and the last marker). This format is not very good and is likely to change at some point.
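As an illustration of the column layout described above, the following Python sketch parses one line of an IBD segment file and decodes a haplotype number back into an individual and a haplotype index. It is not part of SLRP. The names IbdSegment, parse_ibd_line and decode_haplotype are hypothetical, and the sketch assumes that k and p are counted from zero (so p is 0 or 1), that the file has no header line, and that marker fields can be kept as plain strings.

```python
from typing import NamedTuple, Tuple


class IbdSegment(NamedTuple):
    hap1: int          # column 1: first IBD haplotype, encoded as 2*k+p
    hap2: int          # column 2: second IBD haplotype, encoded as 2*k+p
    length_cm: float   # column 3: cM length (a score in the .aibd file)
    first_marker: str  # column 4: first marker of the segment
    last_marker: str   # column 5: last marker of the segment
    first_bp: int      # column 6: position of the first marker
    last_bp: int       # column 7: position of the last marker


def decode_haplotype(number: int) -> Tuple[int, int]:
    """Return (k, p) for a haplotype number 2*k+p.

    Assumes zero-based k and p; this is a guess about the convention.
    """
    return divmod(number, 2)


def parse_ibd_line(line: str) -> IbdSegment:
    """Parse one tab-separated line of an IBD segment file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 columns, got %d" % len(fields))
    return IbdSegment(
        int(fields[0]), int(fields[1]), float(fields[2]),
        fields[3], fields[4], int(fields[5]), int(fields[6]),
    )


# Made-up example line, not taken from real SLRP output.
seg = parse_ibd_line("5\t12\t3.7\t100\t250\t1000000\t2500000\n")
print(decode_haplotype(seg.hap1), decode_haplotype(seg.hap2))
```

To read a whole file such as SLRP.ibd, one could apply parse_ibd_line to each line of the opened file. Since the format is likely to change, treat the column positions as tied to this version of the output.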