IndexError: list index out of range #202

Open
weishwu opened this issue Feb 16, 2024 · 4 comments

Comments

@weishwu

weishwu commented Feb 16, 2024

My command lines:

# NanoSim v3.1.0

singularity run -B /nfs/ nanosim_v3.1.0.sif read_analysis.py genome \
   -i all.pass.filt.fastq \
   -rg 741_wzi_types.fa \
   -o training \
   --chimeric \
   -t 40

singularity run -B /nfs/ nanosim_v3.1.0.sif simulator.py genome \
   -rg 741_wzi_types.fa \
   -c training \
   -n 20000 \
   -dna_type linear \
   -max 400 \
   --basecaller guppy \
   --fastq \
   -t 40

log:

running the code with following parameters:

ref_g ../../ref_data/741_wzi_types.fa
model_prefix training
out simulated
number [20000]
perfect False
kmer_bias None
basecaller guppy
dna_type linear
strandness None
sd_len None
median_len None
max_len 400
min_len 50
fastq True
chimeric False
num_threads 40
2024-02-16 00:47:30: /usr/local/bin/simulator.py genome -rg ../../ref_data/741_wzi_types.fa -c training -n 20000 -dna_type linear -max 400 --basecaller guppy --fastq -t 40
2024-02-16 00:47:30: Read in reference 
2024-02-16 00:47:30: Read error profile
2024-02-16 00:47:30: Read KDF of unaligned reads
2024-02-16 00:47:31: Read KDF of aligned reads
2024-02-16 00:47:31: Read chimeric simulation information
2024-02-16 00:47:31: Start simulation of aligned reads
Process Process-26:: Number of reads simulated >> 10001
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bin/simulator.py", line 1294, in simulation_aligned_genome
    head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range
Process Process-10:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bin/simulator.py", line 1294, in simulation_aligned_genome
    head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range
Process Process-36:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bin/simulator.py", line 1294, in simulation_aligned_genome
    head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range

2024-02-16 00:47:40: Start simulation of random reads

2024-02-16 00:47:41: Finished!

Can this error ("IndexError: list index out of range") be safely ignored?

@SaberHQ
Member

SaberHQ commented Feb 22, 2024

Hey @weishwu

It looks like the IndexError happens when NanoSim tries to calculate the ratio of the head sequence length to the head+tail sequence length. This step is needed to add the head and tail regions to the generated reads.
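To make the ratio concrete, here is a minimal sketch of how such a ratio could split a total unaligned length into a head part and a tail part. The function name `split_head_tail` and the rounding rule are assumptions for illustration only; NanoSim's own code may do this differently.

```python
def split_head_tail(unaligned_length, head_vs_ht_ratio):
    """Split a total unaligned length into (head, tail).

    Assumption for illustration: head = round(total * ratio) and the
    tail takes whatever is left. This is not NanoSim's actual code.
    """
    if not 0.0 <= head_vs_ht_ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    head = int(round(unaligned_length * head_vs_ht_ratio))
    tail = unaligned_length - head
    return head, tail


head, tail = split_head_tail(100, 0.25)
print(head, tail)
```

A ratio outside [0, 1] would give a negative head or tail length, which is why such values have to be rejected before use.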

The "IndexError: list index out of range" was a bug related to setting the -min and -max parameters when simulating (#118). I believe it is fixed now, so I am surprised that you encountered a similar error with the latest version.

Can you try the latest committed version instead of the released version and see if it works? Also, I see you asked for 20k reads to be simulated. Can you confirm how many reads were actually simulated?

@waltergallegog

Hi @SaberHQ
I also got an IndexError:

Traceback (most recent call last):
  File "/home/wgallego/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/wgallego/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wgallego/mambaforge/envs/nanosim/bin/simulator.py", line 1294, in simulation_aligned_genome
    head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range

I installed NanoSim in a new env using mamba:

nanosim                   3.1.0                hdfd78af_0    bioconda

I'm using the latest model available for download on GitHub, and I requested 22519449 reads to be simulated.

Here is the input command:

simulator.py genome -rg ../germ_hap_1.fasta -c ./human_giab_hg002_sub1M_kitv14_dorado/hg002_nanosim_sub1M -t 15 -n 22519449

I have counted the generated reads and they seem to correspond (maybe missing one):

$ wc -l simulated_aligned_reads.fasta
44941612 simulated_aligned_reads.fasta

$ wc -l simulated_unaligned_reads.fasta
97284 simulated_unaligned_reads.fasta

# Obtained lines:
44941612 + 97284 = 45038896

# expected lines:
22519449 * 2 = 45038898
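
The same check as a short script, assuming each FASTA record takes exactly two lines (one header line and one sequence line), as the counts above imply. The variable names are only for illustration.

```python
# Line counts reported by `wc -l` above.
aligned_lines = 44941612
unaligned_lines = 97284
requested_reads = 22519449

# Assumption: two lines per FASTA record (header + one sequence line).
lines_per_read = 2

obtained_lines = aligned_lines + unaligned_lines
expected_lines = requested_reads * lines_per_read
missing_reads = (expected_lines - obtained_lines) // lines_per_read

print(obtained_lines, expected_lines, missing_reads)
```

Under that assumption, the two-line gap corresponds to a single missing read.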

Here is the full output:

simulator.py genome -rg ../germ_hap_1.fasta -c /mnt/trcanmed/wgallego/simul/nanosim_models/human_giab_hg002_sub1M_kitv14_dorado/hg002_nanosim_sub1M -t 15 -n 22519449

running the code with following parameters:

ref_g ../germ_hap_1.fasta
model_prefix /mnt/trcanmed/wgallego/simul/nanosim_models/human_giab_hg002_sub1M_kitv14_dorado/hg002_nanosim_sub1M
out simulated
number [22519449]
perfect False
kmer_bias None
basecaller None
dna_type linear
strandness None
sd_len None
median_len None
max_len inf
min_len 50
fastq False
chimeric False
num_threads 15
2024-06-03 18:12:19: /home/wgallego/mambaforge/envs/nanosim/bin/simulator.py genome -rg ../germ_hap_1.fasta -c /mnt/trcanmed/wgallego/simul/nanosim_models/human_giab_hg002_sub1M_kitv14_dorado/hg002_nanosim_sub1M -t 15 -n 22519449
2024-06-03 18:12:19: Read in reference
2024-06-03 18:12:48: Read error profile
2024-06-03 18:12:48: Read KDF of unaligned reads
/home/wgallego/mambaforge/envs/nanosim/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator KernelDensity from version 0.23.2 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
2024-06-03 18:12:49: Read KDF of aligned reads
2024-06-03 18:12:51: Read chimeric simulation information
2024-06-03 18:12:51: Start simulation of aligned reads
Process Process-10:: Number of reads simulated >> 22400001
Traceback (most recent call last):
  File "/home/wgallego/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/wgallego/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wgallego/mambaforge/envs/nanosim/bin/simulator.py", line 1294, in simulation_aligned_genome
    head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range
2024-06-03 23:47:13: Number of reads simulated >> 22470001
2024-06-04 03:31:30: Start simulation of random reads
2024-06-04 03:32:48: Number of reads simulated >> 22510001
2024-06-04 03:33:13: Finished!

@SaberHQ
Member

SaberHQ commented Jun 6, 2024

Hi @waltergallegog! Thanks for your interest in using NanoSim and for reporting this issue.

It is hard for me to trace the issue without running and testing it on my end and without access to the exact reference genome. However, according to the traceback, the error occurs at the line head_vs_ht_ratio = head_vs_ht_ratio_list[each_read], which is line 1309 of simulator.py (I am not sure why your report shows line 1294; please double-check that you are using the latest committed version of NanoSim from GitHub). That line handles the simulation of the head and tail lengths, i.e. the unaligned regions at both ends of an ONT read that do not align to the reference.

head_vs_ht_ratio_list is a list of ratios of head length to (head + tail) length. At line 1246 of simulator.py, NanoSim filters out ratios greater than 1, which Kernel Density Estimation (KDE) can generate.

My best guess is that the error is related to that filtering stage. However, I need to run some benchmarks to narrow it down and confirm that this is the case. Unfortunately, I am busy until the end of June, but I will have some time to take a look next month.
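
To illustrate how a filtering step could lead to this kind of error, here is a hypothetical sketch. It is not NanoSim's actual code: `sample_ratios`, `filter_valid`, and `sample_exactly` are made-up names, and a uniform draw stands in for KDE sampling. If n ratios are sampled and the invalid ones are dropped, the list ends up shorter than n, so indexing it once per read eventually runs past the end.

```python
import random


def sample_ratios(n, rng):
    # Hypothetical stand-in for KDE sampling; some draws exceed 1.
    return [rng.uniform(0.0, 1.2) for _ in range(n)]


def filter_valid(ratios):
    # Keep only values that make sense as head / (head + tail).
    return [r for r in ratios if 0.0 <= r <= 1.0]


def sample_exactly(n, rng):
    # Resample until n valid ratios exist, so indices 0..n-1 are safe.
    valid = []
    while len(valid) < n:
        valid.extend(filter_valid(sample_ratios(n - len(valid), rng)))
    return valid


rng = random.Random(0)
n_reads = 1000

# Sample once, filter, then index per read: the list is too short.
filtered = filter_valid(sample_ratios(n_reads, rng))
try:
    for each_read in range(n_reads):
        ratio = filtered[each_read]
    failed = False
except IndexError:
    failed = True

# Topping the list back up after filtering avoids the problem.
safe = sample_exactly(n_reads, rng)
print(failed, len(filtered), len(safe))
```

Whether this is what actually happens inside simulator.py still needs to be confirmed; the sketch only shows why a filter applied after sampling is a plausible source of an out-of-range index.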

I will keep you updated on this. In the meantime, please try simulating a small number of reads and see whether the error still occurs.

@waltergallegog

Hi @SaberHQ, thanks for the feedback.
I will keep using the tool and report back if I reproduce the issue under different conditions.

Regarding the version, I'm using the one installed with mamba (v3.1.0), which from what I can see is outdated with respect to the latest GitHub commit, so I will update NanoSim and test again.
