`vcf_to_zarr` raises error instead of truncating INFO fields of length `R` #1210

timothymillar · 2024-03-07T01:43:43Z

In vcf_to_zarr, specifying max_alt_alleles as smaller than the actual maximum is meant to truncate the alleles dimension (with a warning). However, if there is an INFO field with length R (one value for each allele) then this results in an IndexError. Here is a minimal example:

example.vcf with 2 alternate alleles:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20240307
##source=None
##contig=<ID=chr1,length=21898217>
##INFO=<ID=AFP,Number=R,Type=Float,Description="Posterior mean allele frequencies">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE0
chr1    1       .       A       T,G     .       .       AFP=0.8,0.1,01  GT      0/1

convert.py:

from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    input="example.vcf",
    output="example.zarr",
    max_alt_alleles=1,  # truncate to one alternate allele
    fields=[
        "INFO/AFP", 
        "FORMAT/GT", 
    ]
)

traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 vcf_to_zarr(
      2     input="example.vcf",
      3     output="example.zarr",
      4     max_alt_alleles=1,
      5     fields=[
      6         "INFO/AFP", 
      7         "FORMAT/GT", 
      8     ]
      9 )

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1002, in vcf_to_zarr(input, output, target_part_size, regions, chunk_length, chunk_width, compressor, encoding, temp_chunk_length, tempdir, tempdir_storage_options, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length, retain_temp_files)
    966 sequential_function = functools.partial(
    967     vcf_to_zarr_sequential,
    968     output=output,
   (...)
    980     field_defs=field_defs,
    981 )
    982 parallel_function = functools.partial(
    983     vcf_to_zarr_parallel,
    984     output=output,
   (...)
   1000     retain_temp_files=retain_temp_files,
   1001 )
-> 1002 process_vcfs(
   1003     input,
   1004     sequential_function,
   1005     parallel_function,
   1006     regions=regions,
   1007     target_part_size=target_part_size,
   1008 )
   1010 # Issue a warning if max_alt_alleles caused data to be dropped
   1011 ds = zarr.open(output)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1328, in process_vcfs(input, sequential_function, parallel_function, regions, target_part_size)
   1320         regions = [
   1321             partition_into_regions(input, target_part_size=target_part_size)
   1322             for input in inputs
   1323         ]
   1325 if (isinstance(input, str) or isinstance(input, Path)) and (
   1326     regions is None or isinstance(regions, str)
   1327 ):
-> 1328     return sequential_function(input=input, region=regions)
   1329 else:
   1330     return parallel_function(input=input, regions=regions)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:516, in vcf_to_zarr_sequential(input, output, region, chunk_length, chunk_width, compressor, encoding, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length)
    514         raise ValueError(f"Filter '{f}' is not defined in the header.")
    515     for field_handler in field_handlers:
--> 516         field_handler.add_variant(i, variant)
    518 # Truncate np arrays (if last chunk is smaller than read_chunk_length)
    519 if i + 1 < read_chunk_length:

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:293, in InfoAndFormatFieldHandler.add_variant(self, i, variant)
    291 try:
    292     for j, v in enumerate(val):
--> 293         self.array[i, j] = (
    294             v if v is not None else self.missing_value
    295         )
    296 except TypeError:  # val is a scalar
    297     self.array[i, 0] = val

IndexError: index 2 is out of bounds for axis 1 with size 2

The text was updated successfully, but these errors were encountered:

timothymillar · 2024-03-07T02:30:49Z

Possibly a similar bug with a FORMAT field which is length R. Getting some random crashes which may be due to out of bounds memory access in Numba code.

jeromekelleher · 2024-03-07T09:39:27Z

Possibly a similar bug with a FORMAT field which is length R. Getting some random crashes which may be due to out of bounds memory access in Numba code.

That's odd - the vcf parsing code doesn't use any numba AFAIK.

timothymillar · 2024-03-07T23:12:52Z

With an updated dataset (slightly larger), including the length R FORMAT field resulted in ValueError: Codec does not support buffers of > 2147483647 bytes. This led me to zarr-developers/zarr-python#487 and with smaller chunk sizes I was able to convert the VCF. I'm not sure why it was crashing earlier.

timothymillar added bug Something isn't working IO Issues related to reading and writing common third-party file formats labels Mar 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`vcf_to_zarr` raises error instead of truncating INFO fields of length `R` #1210

`vcf_to_zarr` raises error instead of truncating INFO fields of length `R` #1210

timothymillar commented Mar 7, 2024

timothymillar commented Mar 7, 2024

jeromekelleher commented Mar 7, 2024

timothymillar commented Mar 7, 2024 •

edited

vcf_to_zarr raises error instead of truncating INFO fields of length R #1210

vcf_to_zarr raises error instead of truncating INFO fields of length R #1210

Comments

timothymillar commented Mar 7, 2024

timothymillar commented Mar 7, 2024

jeromekelleher commented Mar 7, 2024

timothymillar commented Mar 7, 2024 • edited

`vcf_to_zarr` raises error instead of truncating INFO fields of length `R` #1210

`vcf_to_zarr` raises error instead of truncating INFO fields of length `R` #1210

timothymillar commented Mar 7, 2024 •

edited