Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcf_to_zarr raises error instead of truncating INFO fields of length R #1210

Open
timothymillar opened this issue Mar 7, 2024 · 3 comments
Labels
bug Something isn't working IO Issues related to reading and writing common third-party file formats

Comments

@timothymillar
Copy link
Collaborator

In vcf_to_zarr, specifying max_alt_alleles as smaller than the actual maximum is meant to truncate the alleles dimension (with a warning). However, if there is an INFO field with length R (one value for each allele) then this results in an IndexError. Here is a minimal example:

example.vcf with 2 alternate alleles:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20240307
##source=None
##contig=<ID=chr1,length=21898217>
##INFO=<ID=AFP,Number=R,Type=Float,Description="Posterior mean allele frequencies">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE0
chr1    1       .       A       T,G     .       .       AFP=0.8,0.1,01  GT      0/1

convert.py:

from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    input="example.vcf",
    output="example.zarr",
    max_alt_alleles=1,  # truncate to one alternate allele
    fields=[
        "INFO/AFP", 
        "FORMAT/GT", 
    ]
)
traceback:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 vcf_to_zarr(
      2     input="example.vcf",
      3     output="example.zarr",
      4     max_alt_alleles=1,
      5     fields=[
      6         "INFO/AFP", 
      7         "FORMAT/GT", 
      8     ]
      9 )

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1002, in vcf_to_zarr(input, output, target_part_size, regions, chunk_length, chunk_width, compressor, encoding, temp_chunk_length, tempdir, tempdir_storage_options, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length, retain_temp_files)
    966 sequential_function = functools.partial(
    967     vcf_to_zarr_sequential,
    968     output=output,
   (...)
    980     field_defs=field_defs,
    981 )
    982 parallel_function = functools.partial(
    983     vcf_to_zarr_parallel,
    984     output=output,
   (...)
   1000     retain_temp_files=retain_temp_files,
   1001 )
-> 1002 process_vcfs(
   1003     input,
   1004     sequential_function,
   1005     parallel_function,
   1006     regions=regions,
   1007     target_part_size=target_part_size,
   1008 )
   1010 # Issue a warning if max_alt_alleles caused data to be dropped
   1011 ds = zarr.open(output)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1328, in process_vcfs(input, sequential_function, parallel_function, regions, target_part_size)
   1320         regions = [
   1321             partition_into_regions(input, target_part_size=target_part_size)
   1322             for input in inputs
   1323         ]
   1325 if (isinstance(input, str) or isinstance(input, Path)) and (
   1326     regions is None or isinstance(regions, str)
   1327 ):
-> 1328     return sequential_function(input=input, region=regions)
   1329 else:
   1330     return parallel_function(input=input, regions=regions)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:516, in vcf_to_zarr_sequential(input, output, region, chunk_length, chunk_width, compressor, encoding, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length)
    514         raise ValueError(f"Filter '{f}' is not defined in the header.")
    515     for field_handler in field_handlers:
--> 516         field_handler.add_variant(i, variant)
    518 # Truncate np arrays (if last chunk is smaller than read_chunk_length)
    519 if i + 1 < read_chunk_length:

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:293, in InfoAndFormatFieldHandler.add_variant(self, i, variant)
    291 try:
    292     for j, v in enumerate(val):
--> 293         self.array[i, j] = (
    294             v if v is not None else self.missing_value
    295         )
    296 except TypeError:  # val is a scalar
    297     self.array[i, 0] = val

IndexError: index 2 is out of bounds for axis 1 with size 2
@timothymillar timothymillar added bug Something isn't working IO Issues related to reading and writing common third-party file formats labels Mar 7, 2024
@timothymillar
Copy link
Collaborator Author

Possibly a similar bug with a FORMAT field which is length R. Getting some random crashes which may be due to out of bounds memory access in Numba code.

@jeromekelleher
Copy link
Collaborator

Possibly a similar bug with a FORMAT field which is length R. Getting some random crashes which may be due to out of bounds memory access in Numba code.

That's odd - the vcf parsing code doesn't use any numba AFAIK.

@timothymillar
Copy link
Collaborator Author

timothymillar commented Mar 7, 2024

With an updated dataset (slightly larger), including the length R FORMAT field resulted in ValueError: Codec does not support buffers of > 2147483647 bytes. This led me to zarr-developers/zarr-python#487 and with smaller chunk sizes I was able to convert the VCF. I'm not sure why it was crashing earlier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working IO Issues related to reading and writing common third-party file formats
Projects
None yet
Development

No branches or pull requests

2 participants