Bamtools unable to write into split files #135

asenabouth · 2016-10-28T01:03:28Z

I am trying to split a large BAM file into smaller bam files by tags that contain a cellular barcode. Bamtools is able to generate the files, but is unable to write the sequences into them.

I used the following command:
bamtools split -tag CB -in possorted_genome_bam_1.bam

It generated a bunch of empty bam files. This is an example of the filename it outputted:
possorted_genome_bam_1.TAG_CB_TTTGACTGCCCTCA-1.bam

It returned the following error message:

bamtool split ERROR: could not open possorted_merged_H32MNBGXY.TAG_CB_ACGATGACGATACC-1.bam for writing.

Is it possibly having trouble due to the characters in its file string? Thanks

coutuc · 2016-11-15T20:34:16Z

I am having exactly the same issue...following

Flomass · 2017-02-28T10:11:19Z

I'm also having exactly the same issue. Did you find a solution?
I found a really dumb way to do the same thing converting to sam files and using grep, but it surely takes much longer.
Any suggestion?
Thanks in advance

Hoohm · 2017-06-23T12:01:24Z

Having the same issue with a dropseq pipeline.

asenabouth · 2017-06-24T11:32:51Z

I still haven't figured out a solution, but I believe I have found a possible cause - it may be linked to the headers of the BAM files. I mainly have the issue with BAM files generated by 10x's Cell Ranger pipeline. I was able to split bam files from other sources.

Hoohm · 2017-07-02T18:05:17Z

I'm using dropseq data from the McCarroll pipeline and getting the same problem.
Haven't been able to figure out why either :(

relyanow · 2017-08-16T21:01:17Z

I was having this issue too (on 10X single cell data). I think that the issue is bamtools exceeding the limit on the number of open files (check 'ulimit -n'). Even if the expected number of cells is smaller than the file limit, the number of files created may be much larger due to sequencing errors creating many extra cellular barcodes.

marong0511 · 2018-01-01T18:59:34Z

I am having this issue on 10x single cell data too. Do you know how to split the bam file from other sources?

marong0511 · 2018-01-01T19:01:05Z

@asenabouth I have the same issue with you. Could you share the sources which you used to split the bam file? Thanks a lot!

asenabouth · 2018-01-12T06:00:38Z

Sorry @marong0511 - I meant that I was able to split BAM files that were not generated with Cell Ranger. I really do think it's an issue with bamtools exceeding the number of open files. I haven't come across a solution to reliably split BAM files produced by Cell Ranger; instead I've been working with the whole file and selecting alignments by regions I was interested in using Python. You can access the cell barcodes via the "CB" tag (refer to pysam http://pysam.readthedocs.io/en/latest/usage.html).

jvhaarst · 2018-02-16T13:12:45Z

To write a valid BAM file, bamtools opens a file for each barcode, writes the header, and then starts adding the reads that match the barcode.
The problem is that keeping a file open costs resources, so the operating systems limits the amount of files that can be open at the same time.
As the BAM file with 10X data can easily hold a million different barcodes, the above method fails fast.

Fixing this isn't easy, as for other users, the current way might work, and other ways to do this are probably a lot slower.

As a workaround, I have written a very small python script, which will split the single BAM file into ~1M separate files, one for each barcode.
As the filesystem doesn't like 1M files in a single folder, it creates subfolders :

import os
import fileinput
for line in fileinput.input():
    line=line.strip()
    tags = line.split()[11:]
    for tag in tags:
        tag_split=tag.split(':')
        if 'BX' in tag_split:
            barcode=tag_split[-1]
            directory = barcode[:5]
            outfile = directory+"/"+barcode
            try:
                f=open(outfile,"a+")
                print(line, file=f)
                f.close()
            except:
                os.mkdir(directory)
                f=open(outfile,"a+")
                print(line, file=f)
                f.close()

Usage :
samtools view BAM-file | python split_script.py
(this will take a while)
After that the header needs to be added to each file to be able to use it.

asenabouth · 2018-02-19T00:50:25Z

Hi @jvhaarst - nice work around, will give that a try when I have a chance.

fluffy-critter · 2019-03-14T20:09:17Z

Out of curiosity, instead of holding the files open for writing, could they be opened for appending only when they need to be written to, perhaps with a "connection pool" of sorts which only holds on to, say, the getrlimit(RLIMIT_NOFILE)/2 most recently accessed files? This would allow bamtools to still be much more efficient while not exceeding the system resource limits.

herrinca · 2019-10-09T04:54:56Z

Relatively quick option for spitting barcodes with <<1M unique barcodes. Script will split on any unique CB entry, but has only been tested with .bam files sorted with samtools on tag CB. To sort on CB tag:

samtools sort -t CB unsorted.bam > sorted_tags.bam

Script uses pysam to read bam file without need for conversion, details for conda install of pysam can be found here.

##### Code has not been tested on unsorted bam files, sort on barcode (CB):
##### samtools sort -t CB unsorted.bam > sorted_tags.bam
###
##### INPUT: .bam file to be sorted and output directory to place split BC
##### OUTPUT: .bam file for each unique barcode, best to make a new directory

### Python 3.6.8
import pysam

### Input varibles to set
# file to split on
unsplit_file = "/path/to/dir/of/data/sorted_tags.bam"
# where to place output files
out_dir = "/path/to/dir/of/out_data/"

# variable to hold barcode index
CB_hold = 'unset'
itr = 0
# read in upsplit file and loop reads by line
samfile = pysam.AlignmentFile( unsplit_file, "rb")
for read in samfile.fetch( until_eof=True):
    # barcode itr for current read
    CB_itr = read.get_tag( 'CB')
    # if change in barcode or first line; open new file  
    if( CB_itr!=CB_hold or itr==0):
        # close previous split file, only if not first read in file
        if( itr!=0):
            split_file.close()
        CB_hold = CB_itr
        itr+=1
        split_file = pysam.AlignmentFile( out_dir + "CB_{}.bam".format( itr), "wb", template=samfile)

    # write read with same barcode to file
    split_file.write( read)
split_file.close()
samfile.close()

After downloading script and changing name to split_script.py, change input file and output directory in script to desired locations and run:

python split_script.py

or if still using python2

python3 split_script.py

Runtime was about ~35sec for a test run ~1 million reads and ~5k unique barcodes.

ivanov-v-v · 2020-04-21T14:09:53Z

Relatively quick option for spitting barcodes with <<1M unique barcodes. Script will split on any unique CB entry, but has only been tested with .bam files sorted with samtools on tag CB. To sort on CB tag:

samtools sort -t CB unsorted.bam > sorted_tags.bam

Script uses pysam to read bam file without need for conversion, details for conda install of pysam can be found here.
##### Code has not been tested on unsorted bam files, sort on barcode (CB):
##### samtools sort -t CB unsorted.bam > sorted_tags.bam

Dear @herrinca, thank you for posting this solution! While your Python-code is correct, I've found out that samtools sort -t CB removes CB tag from file: CB_itr = read.get_tag( 'CB') line throws an error. This is very strange. Are you sure that this is the only command one needs to run to form the correct input for your script? I've tried to re-header the sorted file using the header from original BAM but with no success. I have samtools v. 1.9 and BAM files formed by CellRanger, a tool from 10x Genomics. CB tag is present in that BAM file

AidenSb · 2020-06-06T20:56:49Z

is is very strange. Are you sure that this is the only command one needs to run to form the correct input for your script? I've tried to re-header the sorted file using the header from original BAM but with no success. I have samtools v. 1.9 and BAM files formed by CellRanger, a tool f

Hi, @ivanov-v-v
I have the same problem, did you find out how to make it work?
Thanks

ak-watson · 2021-10-19T13:57:05Z

I was having this issue too (on 10X single cell data). I think that the issue is bamtools exceeding the limit on the number of open files (check 'ulimit -n'). Even if the expected number of cells is smaller than the file limit, the number of files created may be much larger due to sequencing errors creating many extra cellular barcodes.

This was the solution for me, thanks! I was able to get things working by adjusting my ulimit.

You can do that temporarily with:

ulimit -n X

Where x is your new limit.

duck-rong · 2022-10-14T02:08:20Z

Hi, about "samtools sort -t CB unsorted.bam > sorted_tags.bam"
I have the same problem, did you find out how to make it?
Thanks

xizhihui mentioned this issue Jun 28, 2021

Rewrite headers when write mode is append? mdshw5/simplesam#21

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bamtools unable to write into split files #135

Bamtools unable to write into split files #135

asenabouth commented Oct 28, 2016

coutuc commented Nov 15, 2016

Flomass commented Feb 28, 2017

Hoohm commented Jun 23, 2017 •

edited

asenabouth commented Jun 24, 2017

Hoohm commented Jul 2, 2017

relyanow commented Aug 16, 2017

marong0511 commented Jan 1, 2018

marong0511 commented Jan 1, 2018

asenabouth commented Jan 12, 2018

jvhaarst commented Feb 16, 2018

asenabouth commented Feb 19, 2018

fluffy-critter commented Mar 14, 2019

herrinca commented Oct 9, 2019 •

edited

ivanov-v-v commented Apr 21, 2020 •

edited

AidenSb commented Jun 6, 2020

ak-watson commented Oct 19, 2021 •

edited

duck-rong commented Oct 14, 2022

Bamtools unable to write into split files #135

Bamtools unable to write into split files #135

Comments

asenabouth commented Oct 28, 2016

coutuc commented Nov 15, 2016

Flomass commented Feb 28, 2017

Hoohm commented Jun 23, 2017 • edited

asenabouth commented Jun 24, 2017

Hoohm commented Jul 2, 2017

relyanow commented Aug 16, 2017

marong0511 commented Jan 1, 2018

marong0511 commented Jan 1, 2018

asenabouth commented Jan 12, 2018

jvhaarst commented Feb 16, 2018

asenabouth commented Feb 19, 2018

fluffy-critter commented Mar 14, 2019

herrinca commented Oct 9, 2019 • edited

ivanov-v-v commented Apr 21, 2020 • edited

AidenSb commented Jun 6, 2020

ak-watson commented Oct 19, 2021 • edited

duck-rong commented Oct 14, 2022

Hoohm commented Jun 23, 2017 •

edited

herrinca commented Oct 9, 2019 •

edited

ivanov-v-v commented Apr 21, 2020 •

edited

ak-watson commented Oct 19, 2021 •

edited