Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bamtools unable to write into split files #135

Open
asenabouth opened this issue Oct 28, 2016 · 17 comments
Open

Bamtools unable to write into split files #135

asenabouth opened this issue Oct 28, 2016 · 17 comments

Comments

@asenabouth
Copy link

I am trying to split a large BAM file into smaller bam files by tags that contain a cellular barcode. Bamtools is able to generate the files, but is unable to write the sequences into them.

I used the following command:
bamtools split -tag CB -in possorted_genome_bam_1.bam

It generated a bunch of empty bam files. This is an example of the filename it outputted:
possorted_genome_bam_1.TAG_CB_TTTGACTGCCCTCA-1.bam

It returned the following error message:

bamtool split ERROR: could not open possorted_merged_H32MNBGXY.TAG_CB_ACGATGACGATACC-1.bam for writing.

Is it possibly having trouble due to the characters in its file string? Thanks

@coutuc
Copy link

coutuc commented Nov 15, 2016

I am having exactly the same issue...following

@Flomass
Copy link

Flomass commented Feb 28, 2017

I'm also having exactly the same issue. Did you find a solution?
I found a really dumb way to do the same thing converting to sam files and using grep, but it surely takes much longer.
Any suggestion?
Thanks in advance

@Hoohm
Copy link

Hoohm commented Jun 23, 2017

Having the same issue with a dropseq pipeline.

@asenabouth
Copy link
Author

I still haven't figured out a solution, but I believe I have found a possible cause - it may be linked to the headers of the BAM files. I mainly have the issue with BAM files generated by 10x's Cell Ranger pipeline. I was able to split bam files from other sources.

@Hoohm
Copy link

Hoohm commented Jul 2, 2017

I'm using dropseq data from the McCarroll pipeline and getting the same problem.
Haven't been able to figure out why either :(

@relyanow
Copy link

I was having this issue too (on 10X single cell data). I think that the issue is bamtools exceeding the limit on the number of open files (check 'ulimit -n'). Even if the expected number of cells is smaller than the file limit, the number of files created may be much larger due to sequencing errors creating many extra cellular barcodes.

@marong0511
Copy link

I am having this issue on 10x single cell data too. Do you know how to split the bam file from other sources?

@marong0511
Copy link

@asenabouth I have the same issue with you. Could you share the sources which you used to split the bam file? Thanks a lot!

@asenabouth
Copy link
Author

Sorry @marong0511 - I meant that I was able to split BAM files that were not generated with Cell Ranger. I really do think it's an issue with bamtools exceeding the number of open files. I haven't come across a solution to reliably split BAM files produced by Cell Ranger; instead I've been working with the whole file and selecting alignments by regions I was interested in using Python. You can access the cell barcodes via the "CB" tag (refer to pysam http://pysam.readthedocs.io/en/latest/usage.html).

@jvhaarst
Copy link

To write a valid BAM file, bamtools opens a file for each barcode, writes the header, and then starts adding the reads that match the barcode.
The problem is that keeping a file open costs resources, so the operating systems limits the amount of files that can be open at the same time.
As the BAM file with 10X data can easily hold a million different barcodes, the above method fails fast.

Fixing this isn't easy, as for other users, the current way might work, and other ways to do this are probably a lot slower.

As a workaround, I have written a very small python script, which will split the single BAM file into ~1M separate files, one for each barcode.
As the filesystem doesn't like 1M files in a single folder, it creates subfolders :

import os
import fileinput
for line in fileinput.input():
    line=line.strip()
    tags = line.split()[11:]
    for tag in tags:
        tag_split=tag.split(':')
        if 'BX' in tag_split:
            barcode=tag_split[-1]
            directory = barcode[:5]
            outfile = directory+"/"+barcode
            try:
                f=open(outfile,"a+")
                print(line, file=f)
                f.close()
            except:
                os.mkdir(directory)
                f=open(outfile,"a+")
                print(line, file=f)
                f.close()

Usage :
samtools view BAM-file | python split_script.py
(this will take a while)
After that the header needs to be added to each file to be able to use it.

@asenabouth
Copy link
Author

Hi @jvhaarst - nice work around, will give that a try when I have a chance.

@fluffy-critter
Copy link

Out of curiosity, instead of holding the files open for writing, could they be opened for appending only when they need to be written to, perhaps with a "connection pool" of sorts which only holds on to, say, the getrlimit(RLIMIT_NOFILE)/2 most recently accessed files? This would allow bamtools to still be much more efficient while not exceeding the system resource limits.

@herrinca
Copy link

herrinca commented Oct 9, 2019

Relatively quick option for spitting barcodes with <<1M unique barcodes. Script will split on any unique CB entry, but has only been tested with .bam files sorted with samtools on tag CB. To sort on CB tag:

samtools sort -t CB unsorted.bam > sorted_tags.bam

Script uses pysam to read bam file without need for conversion, details for conda install of pysam can be found here.

##### Code has not been tested on unsorted bam files, sort on barcode (CB):
##### samtools sort -t CB unsorted.bam > sorted_tags.bam
###
##### INPUT: .bam file to be sorted and output directory to place split BC
##### OUTPUT: .bam file for each unique barcode, best to make a new directory

### Python 3.6.8
import pysam

### Input varibles to set
# file to split on
unsplit_file = "/path/to/dir/of/data/sorted_tags.bam"
# where to place output files
out_dir = "/path/to/dir/of/out_data/"

# variable to hold barcode index
CB_hold = 'unset'
itr = 0
# read in upsplit file and loop reads by line
samfile = pysam.AlignmentFile( unsplit_file, "rb")
for read in samfile.fetch( until_eof=True):
    # barcode itr for current read
    CB_itr = read.get_tag( 'CB')
    # if change in barcode or first line; open new file  
    if( CB_itr!=CB_hold or itr==0):
        # close previous split file, only if not first read in file
        if( itr!=0):
            split_file.close()
        CB_hold = CB_itr
        itr+=1
        split_file = pysam.AlignmentFile( out_dir + "CB_{}.bam".format( itr), "wb", template=samfile)

    # write read with same barcode to file
    split_file.write( read)
split_file.close()
samfile.close()

After downloading script and changing name to split_script.py, change input file and output directory in script to desired locations and run:

python split_script.py

or if still using python2

python3 split_script.py

Runtime was about ~35sec for a test run ~1 million reads and ~5k unique barcodes.

@ivanov-v-v
Copy link

ivanov-v-v commented Apr 21, 2020

Relatively quick option for spitting barcodes with <<1M unique barcodes. Script will split on any unique CB entry, but has only been tested with .bam files sorted with samtools on tag CB. To sort on CB tag:

samtools sort -t CB unsorted.bam > sorted_tags.bam

Script uses pysam to read bam file without need for conversion, details for conda install of pysam can be found here.

##### Code has not been tested on unsorted bam files, sort on barcode (CB):
##### samtools sort -t CB unsorted.bam > sorted_tags.bam

Dear @herrinca, thank you for posting this solution! While your Python-code is correct, I've found out that samtools sort -t CB removes CB tag from file: CB_itr = read.get_tag( 'CB') line throws an error. This is very strange. Are you sure that this is the only command one needs to run to form the correct input for your script? I've tried to re-header the sorted file using the header from original BAM but with no success. I have samtools v. 1.9 and BAM files formed by CellRanger, a tool from 10x Genomics. CB tag is present in that BAM file

image

@AidenSb
Copy link

AidenSb commented Jun 6, 2020

is is very strange. Are you sure that this is the only command one needs to run to form the correct input for your script? I've tried to re-header the sorted file using the header from original BAM but with no success. I have samtools v. 1.9 and BAM files formed by CellRanger, a tool f

Hi, @ivanov-v-v
I have the same problem, did you find out how to make it work?
Thanks

@ak-watson
Copy link

ak-watson commented Oct 19, 2021

I was having this issue too (on 10X single cell data). I think that the issue is bamtools exceeding the limit on the number of open files (check 'ulimit -n'). Even if the expected number of cells is smaller than the file limit, the number of files created may be much larger due to sequencing errors creating many extra cellular barcodes.

This was the solution for me, thanks! I was able to get things working by adjusting my ulimit.

You can do that temporarily with:

ulimit -n X

Where x is your new limit.

@duck-rong
Copy link

Hi, about "samtools sort -t CB unsorted.bam > sorted_tags.bam"
I have the same problem, did you find out how to make it?
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests