Bamtools unable to write into split files #135
Comments
I am having exactly the same issue. Following.
I'm also having exactly the same issue. Did you find a solution?
Having the same issue with a dropseq pipeline.
I still haven't figured out a solution, but I believe I have found a possible cause: it may be linked to the headers of the BAM files. I mainly have the issue with BAM files generated by 10x's Cell Ranger pipeline. I was able to split BAM files from other sources.
I'm using dropseq data from the McCarroll pipeline and getting the same problem.
I was having this issue too (on 10X single cell data). I think the issue is bamtools exceeding the limit on the number of open files (check 'ulimit -n'). Even if the expected number of cells is smaller than the file limit, the number of files created may be much larger, because sequencing errors create many extra cellular barcodes.
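To make that check concrete, here is a minimal sketch in Python that compares the number of distinct barcodes with the soft open-file limit of the current process. It assumes a Unix system (the standard-library `resource` module) and that you already have the CB values as an iterable of strings, for example collected with pysam. The function name `barcode_file_budget` and the `reserved` parameter are hypothetical, chosen only for this illustration.

```python
import resource


def barcode_file_budget(barcodes, reserved=16):
    """Compare the number of distinct barcodes with the open-file limit.

    `reserved` (hypothetical parameter) leaves room for stdin/stdout,
    the input file, and other handles the process already holds.
    Returns (distinct_barcodes, budget, fits); budget is None when the
    soft limit is unlimited.
    """
    distinct = len(set(barcodes))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return distinct, None, True
    budget = max(soft - reserved, 0)
    return distinct, budget, distinct <= budget


if __name__ == "__main__":
    # Toy input: two distinct barcodes, one of them seen twice.
    n, budget, fits = barcode_file_budget(["AAAC-1", "AAAC-1", "TTTG-1"])
    print(f"{n} distinct barcodes, budget {budget}, fits: {fits}")
```

If `fits` comes back false, one output file per barcode cannot be held open at the same time under the current limit.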
I am having this issue on 10x single cell data too. Do you know how to split the BAM file from other sources?
@asenabouth I have the same issue as you. Could you share the sources which you used to split the BAM file? Thanks a lot!
Sorry @marong0511, I meant that I was able to split BAM files that were not generated with Cell Ranger. I really do think it's an issue with bamtools exceeding the number of open files. I haven't come across a solution to reliably split BAM files produced by Cell Ranger; instead I've been working with the whole file and using Python to select alignments in the regions I was interested in. You can access the cell barcodes via the "CB" tag (refer to pysam http://pysam.readthedocs.io/en/latest/usage.html).
To write a valid BAM file, bamtools opens a file for each barcode, writes the header, and then starts adding the reads that match the barcode. Fixing this isn't easy: the current approach may work fine for other users, and the alternatives are probably a lot slower. As a workaround, I have written a very small Python script, which will split the single BAM file into ~1M separate files, one for each barcode.
Usage:
Hi @jvhaarst, nice workaround, will give that a try when I have a chance.
Out of curiosity, instead of holding the files open for writing, could they be opened for appending only when they need to be written to, perhaps with a "connection pool" of sorts which only holds on to, say, the |
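One way to picture the "connection pool" idea is a small least-recently-used cache of append-mode file handles. The sketch below is only an illustration under assumptions: it writes plain bytes, the class name `AppendHandlePool` and the `max_open` parameter are hypothetical, and it is not how bamtools works. A real BAM writer would also have to deal with the header and the compressed block format, which this sketch does not attempt.

```python
from collections import OrderedDict


class AppendHandlePool:
    """Keep at most `max_open` files open, reopening in append mode on demand."""

    def __init__(self, max_open=256):
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> open file, oldest first

    def write(self, path, data):
        handle = self._handles.pop(path, None)
        if handle is None:
            if len(self._handles) >= self.max_open:
                # Evict the least recently used handle to stay under the cap.
                _, oldest = self._handles.popitem(last=False)
                oldest.close()
            handle = open(path, "ab")
        # Re-insert at the end so this path counts as most recently used.
        self._handles[path] = handle
        handle.write(data)

    def open_count(self):
        return len(self._handles)

    def close(self):
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()
```

The trade-off is extra open/close calls whenever reads for many different barcodes are interleaved, which is why input sorted by barcode suits this kind of pool best.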
Relatively quick option for splitting barcodes with <<1M unique barcodes. The script will split on any unique CB entry, but has only been tested with .bam files sorted with samtools on tag CB. To sort on CB tag:
The script uses pysam to read the BAM file directly, without conversion; details for conda install of pysam can be found here.
After downloading the script and renaming it to split_script.py, change the input file and output directory in the script to the desired locations and run:
or, if still using Python 2:
Runtime was about 35 s for a test run with ~1 million reads and ~5k unique barcodes.
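The reason barcode-sorted input helps is that all reads for one barcode arrive together, so only one output file has to be open at any time. The sketch below shows that idea on plain (barcode, line) pairs and text files. It is not the script discussed in this thread; the function name `split_sorted_by_barcode` and the `suffix` parameter are hypothetical. With pysam, the pairs would presumably come from iterating a `pysam.AlignmentFile` and reading each read's CB tag with `get_tag("CB")`, and the output would be written through another `AlignmentFile` rather than a text file.

```python
import os
from itertools import groupby
from operator import itemgetter


def split_sorted_by_barcode(records, out_dir, suffix=".txt"):
    """Write each run of records sharing a barcode to its own file.

    `records` is an iterable of (barcode, line) pairs, ideally sorted by
    barcode. Only one output file is open at any moment. Returns a dict
    mapping barcode -> number of lines written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = {}
    for barcode, group in groupby(records, key=itemgetter(0)):
        path = os.path.join(out_dir, barcode + suffix)
        # Append mode: if the input is not fully sorted, a barcode seen
        # again later is added to its existing file instead of replacing it.
        with open(path, "a") as out:
            count = 0
            for _, line in group:
                out.write(line + "\n")
                count += 1
        written[barcode] = written.get(barcode, 0) + count
    return written
```

Because files are opened in append mode, the output directory should be empty before a run; otherwise lines from a previous run would be kept.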
Dear @herrinca, thank you for posting this solution! While your Python-code is correct, I've found out that |
Hi, @ivanov-v-v |
This was the solution for me, thanks! I was able to get things working by adjusting my ulimit. You can do that temporarily with:
where x is your new limit.
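For anyone driving the split from a Python wrapper, a similar temporary adjustment can be sketched with the standard-library `resource` module. This assumes a Unix system, and the function name `raise_open_file_limit` is hypothetical. The soft limit can only be raised as far as the hard limit without extra privileges, and the change applies only to this Python process and the child processes it starts, not to the shell that launched it.

```python
import resource


def raise_open_file_limit(target):
    """Raise the soft open-file limit toward `target`, capped at the hard limit.

    Never lowers the limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    # 4096 is an arbitrary example value, not a recommendation.
    print("soft limit is now", raise_open_file_limit(4096))
```

A bamtools run started from the same process afterwards (for example via `subprocess.run`) would inherit the raised limit.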
I am trying to split a large BAM file into smaller BAM files by a tag that contains the cellular barcode. Bamtools is able to create the files, but is unable to write the sequences into them.
I used the following command:
bamtools split -tag CB -in possorted_genome_bam_1.bam
It generated a bunch of empty BAM files. This is an example of an output filename:
possorted_genome_bam_1.TAG_CB_TTTGACTGCCCTCA-1.bam
It returned the following error message:
bamtool split ERROR: could not open possorted_merged_H32MNBGXY.TAG_CB_ACGATGACGATACC-1.bam for writing.
Could the characters in the file name be causing the problem? Thanks