postprocess_variants: Found multiple file patterns in input filename space #818

MiWitt opened this issue May 8, 2024 · 7 comments
@MiWitt

MiWitt commented May 8, 2024

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6.1/docs/FAQ.md:

Describe the issue:
The postprocess_variants step fails with the following error message:
ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

Setup

  • Operating system: CentOS Linux 7 (Core)
  • DeepVariant version: 1.6.1
  • Installation method (Docker, built from source, etc.): singularity
  • Type of data: PacBio Sequencing

Steps to reproduce:

  • Command:
  • Error trace:
    Traceback (most recent call last):
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1419, in <module>
    app.run(main)
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1300, in main
    sample_name = get_sample_name()
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1203, in get_sample_name
    _, record = get_cvo_paths_and_first_record()
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1179, in get_cvo_paths_and_first_record
    raise ValueError(
    ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

Does the quick start test work on your system?
Please test with https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-quick-start.md.
Is there any way to reproduce the issue by using the quick start?
???

Any additional context:
Yes. If I change the "--infile" parameter of the postprocess_variants.py call from "./call_variants_output.tfrecord.gz" to "./call_variants_output@1.tfrecord.gz", it works. However, the postprocess_variants.py call is auto-generated by "/opt/deepvariant/bin/run_deepvariant". The error does not occur for every sample ...

Directory content of intermediate_results_dir after the error occurred:
call_variants.log
call_variants_output-00000-of-00001.tfrecord.gz
gvcf.tfrecord-00000-of-00008.gz
gvcf.tfrecord-00001-of-00008.gz
gvcf.tfrecord-00002-of-00008.gz
gvcf.tfrecord-00003-of-00008.gz
gvcf.tfrecord-00004-of-00008.gz
gvcf.tfrecord-00005-of-00008.gz
gvcf.tfrecord-00006-of-00008.gz
gvcf.tfrecord-00007-of-00008.gz
make_examples.log
make_examples.tfrecord-00000-of-00008.gz
make_examples.tfrecord-00000-of-00008.gz.example_info.json
make_examples.tfrecord-00001-of-00008.gz
make_examples.tfrecord-00001-of-00008.gz.example_info.json
make_examples.tfrecord-00002-of-00008.gz
make_examples.tfrecord-00002-of-00008.gz.example_info.json
make_examples.tfrecord-00003-of-00008.gz
make_examples.tfrecord-00003-of-00008.gz.example_info.json
make_examples.tfrecord-00004-of-00008.gz
make_examples.tfrecord-00004-of-00008.gz.example_info.json
make_examples.tfrecord-00005-of-00008.gz
make_examples.tfrecord-00005-of-00008.gz.example_info.json
make_examples.tfrecord-00006-of-00008.gz
make_examples.tfrecord-00006-of-00008.gz.example_info.json
make_examples.tfrecord-00007-of-00008.gz
make_examples.tfrecord-00007-of-00008.gz.example_info.json
postprocess_variants.log
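
For reference, the "@N" notation appears to be shorthand for N shard files named "-0000i-of-0000N", as the listing above suggests (gvcf.tfrecord@8.gz versus gvcf.tfrecord-00000-of-00008.gz). Below is a minimal sketch of that expansion, assuming this naming convention holds. expand_sharded_spec is a hypothetical helper written for illustration, not part of DeepVariant.

```shell
# Sketch only: expand "prefix@N.suffix" into the shard file names it stands for.
# expand_sharded_spec is a hypothetical helper, not a DeepVariant command.
# Assumes N is written without leading zeros, as in the commands in this thread.
expand_sharded_spec() (
  spec=$1
  case $spec in
    *@[0-9]*)
      prefix=${spec%@*}        # text before the last "@"
      rest=${spec##*@}         # text after the last "@", e.g. "8.gz"
      n=${rest%%[!0-9]*}       # leading digits: the shard count
      suffix=${rest#"$n"}      # whatever follows the shard count
      i=0
      while [ "$i" -lt "$n" ]; do
        printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$n" "$suffix"
        i=$((i + 1))
      done
      ;;
    *)
      # No "@N" marker: the spec names a single, unsharded file.
      printf '%s\n' "$spec"
      ;;
  esac
)
```

With this, expand_sharded_spec "./call_variants_output@1.tfrecord.gz" prints the one file that is actually present in the listing, while the auto-generated "./call_variants_output.tfrecord.gz" has no file of that exact name.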

@kishwarshafin
Collaborator

@MiWitt, can you please send the full command for each step here? It looks like you have 8 files but are setting @1?

@MiWitt
Author

MiWitt commented May 10, 2024

I do not run it step by step. I run "run_deepvariant". This is my command:

 singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=PACBIO \
    --ref=${THEREF} \
    --reads="${ALIGNMENTNAME}.bam" \
    --sample_name=${SAMPLENAME} \
    --output_vcf="./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
    --output_gvcf="./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
    --intermediate_results_dir . \
    --num_shards=8 \
    --logging_dir=.

I have now added the following command, which is a workaround for the problem ...

    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref="${THEREF}" \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi

In the end, this workaround sets --infile to "./call_variants_output@1.tfrecord.gz" and --nonvariant_site_tfrecord_path to "./gvcf.tfrecord@8.gz" (see directory listing above).
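
Counting files with ls | wc -l gives the right number only if every shard is present and no files from another run match the glob. A slightly more defensive sketch reads the shard count from the "-of-0000N" part of the first shard's name and refuses to continue if more than one shard set is found. sharded_spec_from_dir is a hypothetical helper, and the naming convention is assumed from the directory listing above.

```shell
# Sketch only: rebuild the "@N" spec from the shard files that exist on disk.
# sharded_spec_from_dir is a hypothetical helper, not a DeepVariant command.
# Usage: sharded_spec_from_dir DIR STEM SUFFIX
sharded_spec_from_dir() (
  dir=$1 stem=$2 suffix=$3
  # Match only shard 00000 of each set, e.g. gvcf.tfrecord-00000-of-00008.gz.
  set -- "$dir/$stem"-00000-of-[0-9][0-9][0-9][0-9][0-9]"$suffix"
  # An unmatched glob stays literal, so also check that the file exists.
  if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
    echo "expected exactly one shard set for $stem in $dir" >&2
    exit 1
  fi
  total=${1%"$suffix"}               # drop the suffix
  total=${total##*-of-}              # keep the five-digit shard count
  total=${total#"${total%%[!0]*}"}   # strip leading zeros
  printf '%s/%s@%s%s\n' "$dir" "$stem" "${total:-0}" "$suffix"
)
```

In the workaround above, the two $(ls ... | wc -l) substitutions could then be replaced by "$(sharded_spec_from_dir . call_variants_output .tfrecord.gz)" and "$(sharded_spec_from_dir . gvcf.tfrecord .gz)".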

@MiWitt
Author

MiWitt commented May 10, 2024

I was able to extract the three commands (make_examples, call_variants and postprocess_variants) from the output. Here they are:

seq 0 7 | parallel -q --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "stdchroms.hg38.fa" --reads "SAMPLENAME.bam" --examples "./make_examples.tfrecord@8.gz" --add_hp_channel --alt_aligned_pileup "diff_channels" --gvcf "./gvcf.tfrecord@8.gz" --max_reads_per_partition "600" --min_mapping_quality "1" --parse_sam_aux_fields --partition_size "25000" --phase_reads --pileup_image_width "199" --norealign_reads --sample_name "SAMPLENAME" --sort_by_haplotypes --track_ref_reads --vsc_min_fraction_indels "0.12" --task {}

/opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"

/opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz" --cpus "8" --gvcf_outfile "./SAMPLENAME.deepVariant.g.vcf.gz" --nonvariant_site_tfrecord_path "./gvcf.tfrecord@8.gz" --sample_name "SAMPLENAME"

And here are the last two commands with their stdout ...

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"

/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
I0510 12:13:42.483308 47501039724352 call_variants.py:563] Total 1 writing processes started.
I0510 12:13:42.487790 47501039724352 dv_utils.py:370] From ./make_examples.tfrecord-00000-of-00008.gz.example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:42.487916 47501039724352 call_variants.py:588] Shape of input examples: [100, 199, 9]
I0510 12:13:42.488451 47501039724352 call_variants.py:592] Use saved model: True
I0510 12:13:52.162126 47501039724352 dv_utils.py:370] From /opt/models/pacbio/example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:52.163805 47501039724352 dv_utils.py:370] From ./make_examples.tfrecord-00000-of-00008.gz.example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:56.551032 47501039724352 call_variants.py:716] Predicted 982 examples in 1 batches [0.419 sec per 100].
I0510 12:13:57.403082 47501039724352 call_variants.py:779] Complete: call_variants.

real	0m21.581s
user	1m40.583s
sys	0m15.744s

***** Running the command:*****
time /opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz" --cpus "8" --gvcf_outfile "./SAMPLENAME.deepVariant.g.vcf.gz" --nonvariant_site_tfrecord_path "./gvcf.tfrecord@8.gz" --sample_name "SAMPLENAME"

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1419, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1300, in main
    sample_name = get_sample_name()
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1203, in get_sample_name
    _, record = get_cvo_paths_and_first_record()
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1179, in get_cvo_paths_and_first_record
    raise ValueError(
ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

real	0m4.925s
user	0m8.815s
sys	0m7.379s

@kishwarshafin
Collaborator

@MiWitt ,

You are using --intermediate_results_dir . which writes all intermediate files to your current directory, so running the same command multiple times there will create multiple file patterns. Can you please create a clean intermediate directory and pass it as --intermediate_results_dir /path/to/intermediate_dir? That should resolve the issue.
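
As a rough sketch of what "clean" could mean in a job script: create the directory if needed and stop if it already contains anything from an earlier run. prepare_intermediate_dir is a hypothetical helper name, not part of DeepVariant.

```shell
# Sketch only: make sure the intermediate directory exists and is empty.
# prepare_intermediate_dir is a hypothetical helper, not a DeepVariant command.
prepare_intermediate_dir() (
  dir=$1
  mkdir -p "$dir" || exit 1
  # Refuse to reuse a directory that still holds files from an earlier run.
  if [ -n "$(ls -A "$dir")" ]; then
    echo "intermediate dir $dir is not empty" >&2
    exit 1
  fi
  printf '%s\n' "$dir"
)
```

The printed path could be captured with INTERMEDIATE_DIR=$(prepare_intermediate_dir /path/to/intermediate_dir) || exit 1 and then passed to --intermediate_results_dir.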

@MiWitt
Author

MiWitt commented May 13, 2024

This cannot be the cause. I am working in a cluster environment using Slurm, and the directory "." is the job-specific scratch directory, which is located at "/scratch/SlurmTMP/JobSpecificFolder" (${TMPDIR}).


cd ${TMPDIR}
BIN_VERSION="1.6.1"
module load singularity/3.5.2


#####################################################################
# singularity pull docker://google/deepvariant:"${BIN_VERSION}"


ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150

#  --model_type=PACBIO \ ##Replace this string with exactly one of the following [WGS,WES,PACBIO,HYBRID_PACBIO_ILLUMINA]**
#  docker://google/deepvariant:"${BIN_VERSION}" \

if ! [ -f "${WORKINDIR}/${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
then
  cp "${THEREF}"* ./
  cp "${WORKINDIR}/${ALIGNMENTNAME}.bam"* .
  chmod 666 `basename "${THEREF}"`*
  chmod 666 "${ALIGNMENTNAME}.bam"*
  singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=PACBIO \
    --ref=`basename "${THEREF}"` \
    --reads="${ALIGNMENTNAME}.bam" \
    --sample_name=${SAMPLENAME} \
    --output_vcf="./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
    --output_gvcf="./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
    --intermediate_results_dir . \
    --num_shards=8 \
    --logging_dir=.
    
    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref=`basename "${THEREF}"` \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi
    cp *.log ${WORKINDIR}/
    cp "./${ALIGNMENTNAME}.deepVariant.vcf.gz"* ${WORKINDIR}/
else
 cp "${WORKINDIR}/${ALIGNMENTNAME}.deepVariant.vcf.gz"* .
fi

@kishwarshafin
Collaborator

@MiWitt ,

Can you use --intermediate_results_dir ./intermediate_results_${ALIGNMENTNAME}? I am unsure why you are running postprocessing separately, but something must be overwriting the files or generating multiple file patterns in the directory where you are saving everything. One way to debug this is to set --dry_run=true for each command and check whether the outputs match each other. Unfortunately, I don't have access to an HPC to replicate this issue. I tried running your script, but it has many missing variables.
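
To check whether more than one file pattern really ends up in the directory, a small summary of the shard sets on disk may help. The sketch below assumes the "-0000i-of-0000N" naming seen in the listing earlier in this thread and file names without spaces. list_shard_patterns is a hypothetical helper, not part of DeepVariant.

```shell
# Sketch only: summarize shard sets in a directory as "<spec> <file count>".
# list_shard_patterns is a hypothetical helper, not a DeepVariant command.
list_shard_patterns() (
  cd "$1" || exit 1
  for f in *-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9]*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    printf '%s\n' "$f"
  done |
    # Collapse "name-00003-of-00008.ext" to "name@8.ext", then count per spec.
    sed 's/-[0-9]\{5\}-of-0*\([0-9]\{1,\}\)/@\1/' |
    sort | uniq -c | awk '{print $2, $1}'
)
```

A shard set looks complete when the count equals the number after "@". The same stem showing up with two different "@N" values would point to leftovers from another run.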

@kishwarshafin
Collaborator

@MiWitt

Hi, do you have any updates on this issue?
