Thread error running bamgineer #10

Open
pcgen1 opened this issue Apr 12, 2019 · 41 comments

@pcgen1

pcgen1 commented Apr 12, 2019

I am getting an error running the bamgineer tool. It seems to be related to the multiprocessing module. I also tried an older version of the multiprocessing module (0.70.4, as suggested on online forums for this Python error; it seems to be a common error). Still no luck getting bamgineer to work. Could you suggest a solution?
Please find the error log below:

___ generating phased bed ___
___ filtering bed file columns for amp4AABB47974300_tmp2.bed ___
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/DataDisk/NGS_tools/bamgineer/src/helpers/handlers.py", line 76, in receive
record = self.queue.get(True, self.polltime)
File "/usr/lib/python2.7/multiprocessing/queues.py", line 135, in get
res = self._recv()
TypeError: __init__() takes exactly 2 arguments (1 given)

@suluxan
Collaborator

suluxan commented Apr 12, 2019

Hey,
Could you post your config.cfg file? Are you working locally or on a cluster? There is a Dockerfile to help you get up and running.

@pcgen1
Author

pcgen1 commented Apr 12, 2019

Please find the config.cfg pasted below. I am working on a cloud instance. I am not currently using Docker for this tool, as it gave separate issues earlier (hard to describe them all here). I am running bamgineer locally on the cloud instance (all dependencies are installed locally).

[SOFTWARE]
java =/usr/bin/java
gatk =/mnt/DataDisk/Bamgineer/Jar/GenomeAnalysisTK.jar
java_path =/usr/bin/java
beagle_path =/mnt/DataDisk/Bamgineer/beagle/beagle.28Sep18.793.jar
samtools_path =/usr/local/bin/samtools
vcftools_path =/usr/local/bin/vcftools
bedtools_path =/usr/local/bin/bedtools
sambamba_path =/usr/local/bin/sambamba
picard_path =/mnt/DataDisk/Bamgineer/Jar/picard.jar

[REFERENCE]
reference_path =/mnt/DataDisk/Bamgineer/human_g1k_v37_decoy.chr.fasta
vcf_path =/mnt/DataDisk/VCFs/variants_haplotype_caller_C12878W.noIndels.vcf.recode.HET.noX_Y.phased.chr.vcf.gz
exons_path =/mnt/DataDisk/Resources/Beds/Regions.chr21.bed

[RESULTS]
results_path =/mnt/DataDisk/Bamgineer
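A quick way to rule out path problems in a config like the one above is to check that every configured file or directory exists. The following is only a sketch in Python 3 and is not part of bamgineer; the function name `missing_paths` is hypothetical, and it assumes every value in the config is a filesystem path.

```python
import configparser
import os


def missing_paths(cfg_text):
    """Return (section, key, path) for every configured path that does not exist."""
    # interpolation=None so that a literal '%' in a path is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            path = os.path.expanduser(value.strip())
            if not os.path.exists(path):
                missing.append((section, key, path))
    return missing


# Usage idea: read config.cfg into a string and print whatever is missing.
# with open("config.cfg") as handle:
#     for section, key, path in missing_paths(handle.read()):
#         print("[%s] %s -> %s not found" % (section, key, path))
```

This only confirms that the paths exist; it says nothing about whether the tools behind them are the expected versions.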

@suluxan
Collaborator

suluxan commented Apr 12, 2019

What version of the multiprocessing package gave you the error? The version of multiprocessing we have on our cluster is 0.70a1. From the documentation, it looks like the latest version (0.70.7) is a fork of 0.70a1 (https://pypi.org/project/multiprocess/0.70.7)

@suluxan
Collaborator

suluxan commented Apr 12, 2019

Hey, I've updated multiprocessing to now use multiprocess 0.70.7 (pip install multiprocess==0.70.7). Please pull from the latest version of bamgineer and let me know if you have any issues with it.

@pcgen1
Author

pcgen1 commented Apr 12, 2019

Sure, I will let you know. Thank you!

@pcgen1
Author

pcgen1 commented Apr 15, 2019

Hi suluxan, the program is running fine now, but it has been running for almost 3 hours with a small input bam containing only chr21 and chr22 regions, a 'splitbam' directory containing chr21.bam, chr21.byname.bam, chr22.bam and chr22.byname.bam, and a cnv file containing only 1 amp (cn=4) for 1 region of chr21. The script gets as far as creating a chr21_roiamp4AABB47974300.bam file under "tmpbams" but seems to be taking a long time to create the final simulated bam. Could you help check whether something is going wrong here? Please see the command line, bed and logs pasted below. The config file is the same as posted earlier in this thread. (Note: I give bamgineer a phased vcf containing only chr21 phased variants. The phased vcf was created by running the beagle tool before running bamgineer, because of some issues we faced earlier while running beagle as part of the bamgineer workflow; not necessary to discuss at the moment.)

Command line:
simulate.py -inbam ~/DataDisk/VCFs/C12878W.21_and_22.bam -outbam ~/DataDisk/Bamgineer/C12878W.21_and_22.bamgineer.bam -cnv_bed ~/DataDisk/VCFs/cnv_of_interest.bed -config ~/DataDisk/Bamgineer/config.cfg -splitbamdir ~/DataDisk/VCFs/splitBam > ~/DataDisk/Bamgineer/C12878W.21_22.bamgineer.log 2>&1 &

cnv bed file:
chr21 47974300 47974590 AABB 4

Logs:
a) C12878W.21_22.bamgineer.log
/mnt/DataDisk/Bamgineer
___ generating phased bed ___
___ filtering bed file columns for amp4AABB47974300_tmp2.bed ___

b)debug.log
pipeline started!
--- Initializing input files ---
--- initialization complete ---

Do you expect simulate.py to take this much time with such a small bam? If yes, is there a multithreading parameter that could make simulate.py run faster on a single instance? I did not see such a parameter in the "help" section.

@suluxan
Collaborator

suluxan commented Apr 15, 2019

Yeah, the previous steps of Bamgineer v1 for phasing were not that clear; it seems Beagle needs population data to phase correctly. Is that how you generated your VCF? I have been running Bamgineer v2 with properly phased VCFs (from 10x) and was working on a change to make it much faster (to only use "PASS" variants) but I am working on the benchmarking. I will push that change now and you can let me know if it helps.

It should not take that long to get the ROI bam, especially considering how small the cnv is. Although bamgineer v2 is capable of such focal alterations, I would recommend a region of a couple of kb in order to get a decent number of reads in the ROI bam.
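To put a number on "how small", the cnv bed line posted above can be parsed and its width computed. This is a rough Python 3 sketch, not bamgineer code: the column meanings (chromosome, start, end, allelic pattern, copy number) are inferred from this thread, the names `CnvRegion`, `parse_cnv_line` and `is_very_focal` are made up for illustration, and the 2000 bp threshold is just one reading of "a couple of kb".

```python
from collections import namedtuple

# Hypothetical record type; field names are assumptions based on the thread.
CnvRegion = namedtuple("CnvRegion", "chrom start end alleles copy_number")

MIN_RECOMMENDED_BP = 2000  # assumption: "a couple of kb" taken as roughly 2 kb


def parse_cnv_line(line):
    """Split one whitespace-separated cnv bed line into typed fields."""
    chrom, start, end, alleles, copy_number = line.split()
    return CnvRegion(chrom, int(start), int(end), alleles, int(copy_number))


def is_very_focal(region, min_bp=MIN_RECOMMENDED_BP):
    """True when the region is narrower than the assumed recommended width."""
    return (region.end - region.start) < min_bp


region = parse_cnv_line("chr21 47974300 47974590 AABB 4")
print(region.end - region.start, is_very_focal(region))  # 290 True
```

The region in this issue is only 290 bp wide, which fits with the ROI bam holding just a few hundred reads.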

@suluxan
Collaborator

suluxan commented Apr 15, 2019

Regarding the multithreading comment, once we update the pysam/samtools versions we will be able to take advantage of the multithreading. The current samtools version that we use (1.2) does not support multithreading.

Also, try pulling from the latest version now and let me know if you have the same problem.

@pcgen1
Author

pcgen1 commented Apr 15, 2019

suluxan, I am getting the ROI bam (in the tmpbam folder) with no problem, but I am not getting the final bam in the finalbam folder. I believe the finalbam folder should contain the bam simulated with the CNVs. Am I right?

@suluxan
Collaborator

suluxan commented Apr 15, 2019

What are the other files in the tmpbams directory?

@pcgen1
Author

pcgen1 commented Apr 15, 2019

Just one: chr21_roiamp4AABB47974300.bam

@pcgen1
Author

pcgen1 commented Apr 15, 2019

That tmp bam has around 282 reads

@suluxan
Collaborator

suluxan commented Apr 15, 2019

Try the latest version I just pushed, the ROI should generate much faster.

@pcgen1
Author

pcgen1 commented Apr 15, 2019

Suluxan, the ROI is indeed getting generated faster. The problem is that the python script is still running, and I see no final bam generated in the finalbam folder. I believe the final bam should be the one that actually contains the simulated cnv. Am I right?

@pcgen1
Author

pcgen1 commented Apr 15, 2019

Let me try the new version anyways..

@pcgen1
Author

pcgen1 commented Apr 15, 2019

Suluxan, the new version generates the ROI bam at the same speed as the previous version, but it brings back the multiprocessing module error which you already fixed last Friday. Moreover, the issue still remains: the script is still running and the final bam is not being generated.

See the traceback for the multiprocessing error below:
/mnt/DataDisk/Bamgineer
___ generating phased bed ___
___ filtering bed file columns for amp4AABB47974300_tmp2.bed ___
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/DataDisk/NGS_tools/bamgineer/src/helpers/handlers.py", line 76, in receive
record = self.queue.get(True, self.polltime)
File "/usr/local/lib/python2.7/dist-packages/multiprocess-0.70.7-py2.7-linux-x86_64.egg/multiprocess/queues.py", line 138, in get
res = self._recv()
File "/home/ubuntu/.local/lib/python2.7/site-packages/dill/dill.py", line 299, in loads
return load(file)
File "/home/ubuntu/.local/lib/python2.7/site-packages/dill/dill.py", line 288, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
TypeError: __init__() takes exactly 2 arguments (1 given)
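The traceback above ends inside pickle's load_reduce, at value = func(*args), while a record is being read back from the queue. One common way to hit this kind of TypeError, though it has not been confirmed as the cause in this thread, is an exception class whose constructor requires an argument that it never forwards to Exception, so the object is rebuilt with no arguments. The sketch below is Python 3 (where the message wording differs from Python 2.7) and uses hypothetical class names; it mimics the rebuild step rather than going through a real queue.

```python
class FragileError(Exception):
    """Required argument is not forwarded, so self.args ends up empty."""

    def __init__(self, code):
        super().__init__()  # args becomes (), which is what gets pickled
        self.code = code


class RobustError(Exception):
    """Forwarding the argument keeps it in self.args for the rebuild step."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def rebuilds_cleanly(exc):
    """Mimic pickle's load_reduce: call the class again with the saved args."""
    func, args = exc.__reduce__()[:2]
    try:
        func(*args)  # same shape of call as value = func(*args) in the traceback
        return True
    except TypeError:
        return False


print(rebuilds_cleanly(RobustError(3)))
print(rebuilds_cleanly(FragileError(3)))
```

If something like this is involved, changing the multiprocessing package version alone may not make the error go away, because the failing object is whatever was put on the queue, not the queue itself.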

@pcgen1
Author

pcgen1 commented Apr 15, 2019

One more point to add: the samtools version that bamgineer is using on my instance is 0.1.18. This was kept consistent with what you mentioned in the example config file. Do you think updating it to 1.2 might make a difference speed-wise (given that you already mentioned that 1.2 is slow)? Regardless, I think I should be consistent with version 1.2, just to compare apples to apples. Let me do that, not with the new version of bamgineer but with the old version (because of the multiprocessing issue that I mentioned a minute ago).

@pcgen1
Author

pcgen1 commented Apr 15, 2019

Ok, so I am rerunning bamgineer with samtools version 1.2. The ROI bam got generated in the "tmpbams" directory in a second. The script is still running. I will wait to see if it generates the final bam by the end of the day.
Please note that the input bam still contains only two chromosomes (21 and 22) and the cnv bed has only one region from chr21 in it (same as before). The phased vcf contains only chr21 variants, all of them passing (same as before). I have the -splitbamdir option activated (same as before). Note: the splitbam dir contains bams for chr21 and chr22; the naming of these bams matches what you specified in your manual.
Also, note that this is the version of bamgineer I pulled last Friday after you fixed the multiprocessing module error.

@suluxan
Collaborator

suluxan commented Apr 15, 2019

Okay, couple of things:

  • samtools/1.2 and pysam/0.8.4 (pip install pysam==0.8.4) are necessary for bamgineer and they go hand-in-hand, which could explain why the haplotype bams are not being generated in your runs ***
  • the finalbams folder only gets the final bam once the haplotype bams are generated and amplified/deleted (so there need to be a whole bunch of temporary bam files in tmpbams)
  • if your vcfs are phased beforehand, please omit the "-phase" option from your script ***
  • I noticed you don't have bamutils in your config file (please see bamgineer/docker-example/inputs/config.cfg for the updated config and the Dockerfile for a link to download) ***
  • the timings are the same because you requested a very tiny window, which should not take much time at all to generate; with a larger ROI, the non-passing SNPs are filtered out, which makes the ROI generation faster
  • see the Dockerfile (bamgineer/docker-example/Dockerfile) for exact versions; really only samtools and pysam are hard version requirements

I presume the tool stops running due to the starred points above.
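Since the Python package versions are called out as hard requirements here, a small comparison helper can make a mismatch obvious. This is a hedged Python 3 sketch and not something shipped with bamgineer: the `REQUIRED` table only repeats the versions quoted in this thread (pysam 0.8.4, multiprocess 0.70.7), the function names are hypothetical, and `installed_versions` relies on `importlib.metadata`, which needs Python 3.8 or newer. On the Python 2.7 setup discussed here, the `installed` dictionary would have to be filled in some other way. samtools is a separate binary and is not covered.

```python
REQUIRED = {"pysam": "0.8.4", "multiprocess": "0.70.7"}  # versions quoted in this thread


def version_mismatches(installed, required=REQUIRED):
    """Return {package: (wanted, found)} for packages that are missing or differ."""
    problems = {}
    for name, wanted in required.items():
        found = installed.get(name)
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


def installed_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    from importlib import metadata  # Python 3.8+

    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# Usage idea:
# print(version_mismatches(installed_versions(REQUIRED)))
```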

@suluxan
Collaborator

suluxan commented Apr 15, 2019

A lot of the dependency issues were supposed to be solved through the Dockerfile... any reasons why it initially failed? Considering you are on a cloud environment it would be the optimal route to go.

Also, I will get a bamgineer image to the dockerhub by tonight or tomorrow so it will be easy to just pull from there.

@pcgen1 pcgen1 closed this as completed Apr 16, 2019
@pcgen1 pcgen1 reopened this Apr 16, 2019
@pcgen1
Author

pcgen1 commented Apr 16, 2019

May I know which version of bamUtil you prefer?

@suluxan
Collaborator

suluxan commented Apr 17, 2019

Please check the dockerfile (bamgineer/docker-example/Dockerfile) for install instructions and versions. We have tested bamgineer with bamUtil/1.0.14. I am working on getting the image to a docker repo as well as updating the documentation and I will let you know when those are available. Thanks.

@pcgen1
Author

pcgen1 commented Apr 17, 2019

Ok, I was looking into config.cfg under the bamgineer/docker-example/inputs folder for versioning info. Thanks for correcting me. I am indeed using bamUtil/1.0.14. Glad to know you recommend the same. At this point, I have the right versions of all tools set up on my cloud instance. I will also test the docker container once you have it up on the docker repo. We use the singularity engine on our instance, so I would need to convert your docker container to singularity. That is what I did earlier too, but the issue I faced with the docker container seemed to be less related to its compatibility with singularity and more related to the internal (default) environment in the container itself. Singularity does not make significant changes to the default environment in docker containers (based on my experience converting docker containers to singularity ones and using them with the singularity engine). They usually run well through the singularity engine.
Let's see how the new docker container (the one you will be uploading soon to the docker repo) performs. Please let me know when it is ready.

@pcgen1
Author

pcgen1 commented Apr 17, 2019

FYI: I was using a docker container from this account earlier: https://hub.docker.com/r/virenar/bamgineer. It doesn't look like your account.

@pcgen1
Author

pcgen1 commented Apr 17, 2019

Suluxan, does bamgineer delete the tmp bams in the tmpbam dir after the execution completes? I see that the execution completed (no python script running in the "top" output), but there is no finalbam generated. Also, does the bedtool.log get deleted as well?
FYI, I updated my cnv.bed, exons.bed and phased vcfs to include entries for two chromosomes, i.e. 21 and 22, instead of just one (i.e. 21). The script execution ended with the following lines in the log file, but no final bam was created.

/mnt/DataDisk/Bamgineer
___ generating phased bed ___
___ filtering bed file columns for amp4AABB47974300_tmp2.bed ___
___ filtering bed file columns for gainAAB18300720_tmp2.bed ___

Please note that I am using the -splitbamdir option with the following files in my splitBam dir:
chr21.bam chr21.bam.bai chr21.byname.bam chr22.bam chr22.bam.bai chr22.byname.bam

@pcgen1
Author

pcgen1 commented Apr 17, 2019

Which versions of pathos and pandas do you recommend? It is not clear from the Dockerfile.
Also, is specifying -cancertype necessary? Could I set -cancertype to None?
Currently, I am not using the -cancertype argument on the command line at all. We focus on germline analysis at the moment.

@suluxan
Collaborator

suluxan commented Apr 18, 2019

That docker container was not from us. It is from a user.
I did not remove tmpbams for debugging purposes.
Your output should look something like this:
___ generating phased bed ___
___ filtering bed file columns for amp4AAAB30227447_tmp2.bed ___
___ extracting roi bams ___
___ splitting original bam into hap1 and hap2 ___
___ re-pairing hap1 bam reads ___
___ removing repaired duplicates ___
___ re-pairing hap2 bam reads ___
___ extracting non-roi bams ___
___ removing repaired duplicates ___
___ removing hap1 merged normal duplicates ___
___ removing hap2 merged normal duplicates ___
___ removing merged duplicates near breakpoints ___
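To see where a run stopped, a log can be compared against the stage messages listed above. The following Python 3 sketch is only an illustration: `EXPECTED_STAGES` is copied from the example output in this comment (with the repeated message listed once), `stages_not_seen` is a hypothetical helper, and it checks only whether each message appears, not the order.

```python
EXPECTED_STAGES = [
    "generating phased bed",
    "filtering bed file columns",
    "extracting roi bams",
    "splitting original bam into hap1 and hap2",
    "re-pairing hap1 bam reads",
    "removing repaired duplicates",
    "re-pairing hap2 bam reads",
    "extracting non-roi bams",
    "removing hap1 merged normal duplicates",
    "removing hap2 merged normal duplicates",
    "removing merged duplicates near breakpoints",
]


def stages_not_seen(log_text):
    """Return the expected stage messages that never appear in the log."""
    # Drop the surrounding "___ " and " ___" decoration from each line.
    lines = [line.strip(" _") for line in log_text.splitlines()]
    return [
        stage
        for stage in EXPECTED_STAGES
        if not any(line.startswith(stage) for line in lines)
    ]


reported_log = """/mnt/DataDisk/Bamgineer
___ generating phased bed ___
___ filtering bed file columns for amp4AABB47974300_tmp2.bed ___
___ filtering bed file columns for gainAAB18300720_tmp2.bed ___
"""
print(stages_not_seen(reported_log)[0])  # extracting roi bams
```

Run against the log posted earlier in this thread, the first stage that never shows up is "extracting roi bams".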

I am updating the documentation; pathos is no longer necessary since we have switched from multiprocessing to multiprocess. For pandas, I am on 0.20.2 but it should not matter. The image should solve all dependency issues.

The "-cancertype" option is not necessary; it just organizes the output bam directories into a cancer type directory.

@pcgen1
Author

pcgen1 commented Apr 18, 2019

Thanks so much suluxan. I would try out the container..:)

@pcgen1
Author

pcgen1 commented Apr 18, 2019

I couldn't find your docker image on dockerhub. Sorry, I thought you had already uploaded it. Any ETA that you could give would be great.

@suluxan
Collaborator

suluxan commented Apr 18, 2019

Ah sorry I have been working on other things. At the latest I will have it up for you by tomorrow. Will let you know as soon as I do; thanks!

@pcgen1
Author

pcgen1 commented Apr 18, 2019

Thanks!

@pcgen1
Author

pcgen1 commented Apr 18, 2019

Hi, suluxan, could you paste the command line from your most recent bamgineer run?
Thanks.

@suluxan
Collaborator

suluxan commented Apr 19, 2019

The docker image is available at suluxan/bamgineer. You can use singularity to build it with singularity build bamgineer.simg docker://suluxan/bamgineer:initial

The tools in the configfile in bamgineer/docker-example/inputs are linked to the image itself so they require no changes. Just mount or move your files into the container and point to them in the config file and the python script and run!

@pcgen1
Author

pcgen1 commented Apr 22, 2019

Sure, thanks suluxan!

@pcgen1
Author

pcgen1 commented Apr 23, 2019

It worked suluxan! Thank you
Note: Use --sandbox on the singularity build command line to be able to modify files inside the image/container directory (for example, the config that you are talking about). I always do that when I want to see what is inside a container and/or modify configs in it, like yours.

@pcgen1
Author

pcgen1 commented Apr 25, 2019

Hi suluxan, does the exon bed need to be 0-based or 1-based?

@suluxan
Collaborator

suluxan commented Apr 26, 2019

1-based, but each entry is a whole-chromosome start and end coordinate, e.g. chr21 1 48129895 for hg19. The exons.bed naming convention was kept from the previous version.
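Going by this reply, the exons bed can be generated from chromosome lengths rather than from an exon annotation. A minimal Python 3 sketch under that reading: `whole_chrom_bed` is a hypothetical helper, the only length used is the hg19 chr21 value quoted above, and tab-separated output is assumed because that is the usual BED delimiter.

```python
def whole_chrom_bed(lengths):
    """Build one 'chrom<TAB>1<TAB>length' line per chromosome (1-based start)."""
    return ["%s\t1\t%d" % (chrom, length) for chrom, length in lengths.items()]


# chr21 length for hg19, as quoted in the reply above
print("\n".join(whole_chrom_bed({"chr21": 48129895})))
```

Lengths for other chromosomes would need to come from the index of the reference actually in use.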

@pcgen1
Author

pcgen1 commented Apr 26, 2019

awesome..thanks!

@pcgen1
Author

pcgen1 commented May 9, 2019

Hi suluxan, it seems bamgineer requires the "chr" prefix for chromosome names in bams, beds, etc. For example, if chromosomes are named just "1", "2", "3", etc., bamgineer does not move forward. Could you fix that for us so that we do not have to worry about converting our bams to match the "chr" naming convention? We use the gatk-broad/ncbi reference genome in our pipeline as opposed to the ucsc ones, so we do not have "chr" prefixed to our chromosome names. Converting the bams later to match those chromosome names is a pain in the neck.

@suluxan
Collaborator

suluxan commented Jun 10, 2019

Hey,
Sorry for getting back so late, I've been away the past few weeks.
I believe this has to do with pysam because there aren't any restrictions in the bamgineer code to using the chr naming conventions. I will investigate some more and get back to you about this. Thanks.
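Until that is looked into, the text inputs at least are easy to convert. The sketch below is Python 3 and only a stopgap idea, not a bamgineer feature: `add_chr_prefix` is a hypothetical helper that handles data lines in BED-style files. It does not touch BAM headers, VCF contig header lines, or naming differences that go beyond the prefix (for example the mitochondrial contig, which is "MT" in b37 and "chrM" in UCSC hg19).

```python
def add_chr_prefix(line):
    """Prefix the first column of a BED-style data line with 'chr' if missing."""
    if not line.strip() or line.startswith("#"):
        return line  # leave blank lines and comment/header lines untouched
    chrom, sep, rest = line.partition("\t")
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return chrom + sep + rest


print(add_chr_prefix("21\t47974300\t47974590\tAABB\t4"))
```

Lines that already carry the prefix pass through unchanged, so running it twice is harmless.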

@pcgen1
Author

pcgen1 commented Jun 10, 2019

Thanks. Please let me know..
