Malformed file from `to-mr` #50

nickp60 · 2020-11-12T22:35:08Z

Hi, I am trying to troubleshoot an issue with using the to-mr executable to covert a bam file generated with bwa mem. I am using preseq in a docker container build from the following dockerfile:

FROM ubuntu:20.04
ARG VER="3.1.1"
RUN apt-get update && apt-get install wget autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev g++ -y
RUN wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar .bz2 && \
    tar -vxjf htslib-1.9.tar.bz2 && \  
    cd htslib-1.9 && \
    make && \
    make install
ENV LD_LIBRARY_PATH /htslib-1.9/
RUN wget https://github.com/smithlabcode/preseq/releases/download/v${VER}/pres\
eq-${VER}.tar.gz &&\
    tar xzf preseq-${VER}.tar.gz &&\
    cd preseq-${VER} &&\
    mkdir build && cd build &&\
    ../configure --enable-hts  &&\
    make &&\
    make install

I ran this on a file containing about 5M reads, and got the error below. for reproducing the error, I took a subset of the BAM, and it is attached.

I ran gc_extrap as follows and got the following error:

$  to-mr -o /data/sample.mr /data/sample.bam -L 10000 -v
[input file: /data/sample.bam]
[output file: /data/sample.mr]

$ preseq gc_extrap -o /data/sample.curve /data/sample.mr -v 
LOADING READS
MAPPED READ FORMAT
ERROR:	bad line in MappedRead file: (NULL)	0	0	X	0	+

Looking through the MR file, it seems that the following reads are causing the problem:

A00758:111:HL2FCDSXY:4:2369:32416:19554	131	chr1	16788	0	75M	=	17055	268	ACTAGGGAAGAAGGTGTG
TGACCAGGGAGGTCCCCGGCCCAGCTCCCATCCCAGAACCCAGCTCACCTACCTTGA	FFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF
FFFFFFFFFFFFFFFFF	NM:i:1	MD:Z:0T74	MC:Z:75M	AS:i:74	XS:i:74	XA:Z:chr1,+187310,75M,1;chr2,-113596574,74
M1S,1;chr16,+16472,1S74M,1;chr15,-101974098,75M,2;chr9,+16900,1S74M,1;
A00758:111:HL2FCDSXY:4:2369:32416:19554	67	chr1	17055	0	75M	=	16788	-268	CCTGCAAGATTAGGCAGG
GACATGTGAGAGGTGACAGGGACCTGCAGGGGCAGCCAACAAGACCTTGTGTGCACC	FF:FFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:75	MC:Z:75M	AS:i:75	XS:i:75	XA:Z:chrX,-156022938,75M,0;chr16,+16738,75M,0;chr2
,-113596307,75M,0;chr12,+17170,75M,0;chr9,+17166,75M,0;chr15,-101973831,75M,0;chr1,+187577,75M,0;chr12_GL877875v1_alt,+717
0,75M,0;

Running with bam2mr warns of several segment length issues, but results in an output file containing the passing reads:

~/bin/preseq_v2.0/bam2mr -o sample.old.mr sample.bam -v -seg_len 100000 2> old.log

Those files are in the attached zip as well.

Is there something wrong with my BAM file? I have been using preseq 2.0.3 for a while (and bam2mr), and am trying to update to 3.1.1. Converting the BAM to BED with bedtools's bedToBam and running preseq results in no errors.

Thanks in advance!

tmp.zip

The text was updated successfully, but these errors were encountered:

terencewtli · 2020-11-17T07:21:11Z

Hi,

Thanks for posting this issue! I'm able to reproduce your results, and it looks like there's an issue with our to-mr program not properly accounting for paired-end mates that are too far apart. I think your file is fine - as you mentioned, bam2mr and bedToBam don't result in issue with the NULL line, and we've actually experienced this issue with to-mr in the past. We recently fixed this issue in another one of repositories using another utility - I will let you know when we update the program. Sorry in advance!

nickp60 · 2020-11-19T15:10:22Z

Thanks for looking into it @terencewtli ! While we are at it, it would be great to get a rundown of what to-mr or bam2mr are actually doing. In our hands we have noticed that preseq's gc_extrap predictions are influenced by the number of input reads, but it is unclear how reads with different SAM flags are handled. One dataset for instance:

5805263 + 0 in total (QC-passed reads + QC-failed reads)
337398 + 0 secondary
11553 + 0 supplementary
0 + 0 duplicates
5792861 + 0 mapped (99.79% : N/A)
5456312 + 0 paired in sequencing
2728156 + 0 read1
2728156 + 0 read2
4863978 + 0 properly paired (89.14% : N/A)
5436892 + 0 with itself and mate mapped
7018 + 0 singletons (0.13% : N/A)
142394 + 0 with mate mapped to a different chr

# primary reads
$ samtools view -c  -F 256  5003-JW-3.md.n5m.000.bam 
5467865

Looking at the line count, I see around 3M, which makes ballpark sense as its paired end sequencing and the MR file should be a mix of mostly pairs and some singletons:

 $bam2mr -o /data/5003-JW-3.md.n5m.000.bam.mr /data/5003-JW-3.md.n5m.000.bam
 $ wc -l ./5003-JW-3.md.n5m.000.bam.mr
3023474 ./5003-JW-3.md.n5m.000.bam.mr

However, its unclear which reads are included in the MR file. This sequencing run has 151 bp reads, so I looked at the MR fragment length in relation to that, to approximate the number of singletons and pairs:

cat 5003-JW-3.md.n5m.000.bam.mr | awk '{$4=($3-$2)>160} {print $4}' | grep 1 | wc -l 
2247473

$cat 5003-JW-3.md.n5m.000.bam.mr | awk '{$4=($3-$2)>160} {print $4}' | grep 0 | wc -l 
776001

So (2247473*2) + 776001 is 5270947, which differs from 5467865. I know there are a lot of complexities with how the reads are represented in both the BAM/SAM formats and the MR format, but any insight you could give here would be great.

vmkalbskopf · 2022-02-11T12:58:59Z

I'm getting a very similar error with to-mr generating invalid mr files.

ERROR: bad line in MappedRead file: (NULL) 0 0 M03562:43:000000000-BLRTF:1:2116:8312:7359 0 -

I deleted that line (the first line) and still got an error:

ERROR: bad line in MappedRead file: PRELSG_01_v1 113 259 M03562:43:000000000-BLRTF:1:2116:8312:7359 0 + CGACGTCGCTTTTTGATCCTTCGATGTCGGCTCTTTCTTTCATTTTGAAGCAGAATTCACCAAGCGTTGGTTTGTTCACCCACTAATAGGGAACGTGTGCTGGGTTTAGACCGTCGTGAGACAGGTTAGTTTTACCCTACTGATGA

The mr file has other issues too. Here is one of the following lines:

PRELSG_01_v1 146 259 FRAG:M03562:43:000000000-BLRTF:1:2106:6581:1365 0 - TCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGATCCCTATTAGTGGGTGAACAATCCAACGCTTGGTGAATTCTGCTTCACAATGATAGGAA ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^A^A^A<8F><FF><FF><FF>^@^@^@^@^@^@^@^@TCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGATCCCTATTAGTGGGTGAACA

vmkalbskopf · 2022-02-11T13:01:00Z

I'd like to try using bam2mr, but I can't build an old version of preseq. I get a make error:

`fatal error:
'string' file not found

#include
^~~~~~~~`

Is there a way I can download a binary for bam2mr?

andrewdavidsmith · 2022-02-11T20:07:45Z

@vmkalbskopf I'll do my best to respond, but would it be possible to separate these issues? I would like to help you get things working asap, but I'd still like a record of the individual errors. Also, I can't promise a binary, but if you email me directly we might find a work-around.

vmkalbskopf · 2022-02-14T09:26:58Z

@andrewdavidsmith so kind of you to respond so quickly.

I will create a separate issue for the garbled mr file.
What's your email address?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Malformed file from `to-mr` #50

Malformed file from `to-mr` #50

nickp60 commented Nov 12, 2020

terencewtli commented Nov 17, 2020 •

edited

nickp60 commented Nov 19, 2020

vmkalbskopf commented Feb 11, 2022

vmkalbskopf commented Feb 11, 2022 •

edited

andrewdavidsmith commented Feb 11, 2022

vmkalbskopf commented Feb 14, 2022

Malformed file from to-mr #50

Malformed file from to-mr #50

Comments

nickp60 commented Nov 12, 2020

terencewtli commented Nov 17, 2020 • edited

nickp60 commented Nov 19, 2020

vmkalbskopf commented Feb 11, 2022

vmkalbskopf commented Feb 11, 2022 • edited

andrewdavidsmith commented Feb 11, 2022

vmkalbskopf commented Feb 14, 2022

Malformed file from `to-mr` #50

Malformed file from `to-mr` #50

terencewtli commented Nov 17, 2020 •

edited

vmkalbskopf commented Feb 11, 2022 •

edited