bed_to_gff3 error #919

aperreault · 2019-08-14T14:37:40Z

Problem description

Hey there! I'm trying to convert ChIP peaks (in BED format) to GFF3. Below is the command and the resulting error:

$ time gt bed_to_gff3 -force yes -o 0hr-CTCF_ChAsE-Ref_peak.gff3 bed-peaks/0hr-CTCF-ChIP-peaks.bed.txt
gt bed_to_gff3: error: strand '0.82' not one character long on line 1 in file 'bed-peaks/0hr-CTCF-ChIP-peaks.bed.txt'

I reformatted the BED file to see if that was the issue and got the following error:

$ time gt bed_to_gff3 -force yes -o 0hr-CTCF_ChAsE-Ref_peak.gff3 bed-peaks/0hr-CTCF-ChIP-peaks_v2.bed.txt
gt bed_to_gff3: error: file "bed-peaks/0hr-CTCF-ChIP-peaks_v2.bed.txt": line 147357: expected character '
', got '�'
Command exited with non-zero status 1

What GenomeTools version are you reporting an issue for (as output by `gt -version`)?

$ gt -version
gt (GenomeTools) 1.5.10

Did you compile GenomeTools from source? If so, please state the `make` parameters used.

Yes. I didn't add any make parameters

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

Ubuntu 18.04.3

satta · 2019-09-07T17:57:29Z

Can you share your bed file or part of it so we can use it to reproduce the issue? Thanks!
It looks like there is a problem with the formatting of your BED file, but we cannot confirm that without taking a closer look.
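
In the meantime, the two error messages above point at things that can be checked locally: the sixth (strand) column should be a single character, and there should be no unexpected bytes where a newline is expected. A minimal sketch using plain awk and grep follows; the function names `check_bed_strand` and `check_bed_bytes` are made up for illustration and are not part of GenomeTools.

```shell
# Hypothetical helpers, not part of GenomeTools.

# Report lines whose 6th (strand) column, if present, is not '+', '-' or '.'.
check_bed_strand() {
  awk -F'\t' 'NF >= 6 && $6 !~ /^[-+.]$/ {
    printf "line %d: strand column is \"%s\"\n", NR, $6
  }' "$1"
}

# Print (with line numbers) lines containing bytes other than printable
# ASCII, space or tab. This also flags stray carriage returns.
check_bed_bytes() {
  LC_ALL=C grep -n '[^[:print:][:blank:]]' "$1"
}
```

Running `check_bed_strand bed-peaks/0hr-CTCF-ChIP-peaks.bed.txt` and `check_bed_bytes bed-peaks/0hr-CTCF-ChIP-peaks_v2.bed.txt` should list the offending lines, if these guesses about the cause are right.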

SSPuliasis · 2021-12-20T15:56:52Z

Has this been solved? I'm having the same issue

satta · 2021-12-20T17:39:35Z

The issue remains, since I do not remember receiving an example file. @SSPuliasis can you provide a (minimal?) example file that would allow me to reproduce the issue?
Feel free to remove identifying information or redact data if that's an issue as long as the issue still occurs trying to process that file.

SSPuliasis · 2021-12-22T10:58:13Z

Here is a small sample of the file I tried it on (saved as .txt here because GitHub does not support .bed uploads). The error is:

gt bed_to_gff3: error: file "sample_bed_file.bed": line 1: expected character '
', got 'A'

Thanks

sample_bed_file.txt

satta · 2021-12-22T12:48:25Z

Thanks, I'll take a look.

satta · 2021-12-22T12:56:31Z

Looks like your BED file has 14 columns while the spec (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) only describes 12:

chr1	65418	71585	A0A2U3U0J3	0	+	65564	70007	255,0,0	3	15,54,2549	0,101,3618	A0A2U3U0J3	M1-K3,V4-F326
...

That's why the parser is confused about additional data where a newline should be.
I am curious: what tool created this file or what database is it from? It seems to deviate from the specification.
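
A quick way to confirm the column count for a whole file is to tally the number of tab-separated fields per line. This is only a sketch with standard awk and sort; the function name `bed_field_counts` is hypothetical.

```shell
# Hypothetical helper: print "<field count> <number of lines>" pairs,
# sorted by field count. The BED spec describes at most 12 fields.
bed_field_counts() {
  awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}
```

Any row in the output of `bed_field_counts sample_bed_file.txt` with a field count above 12 indicates lines that carry extra columns.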

satta · 2021-12-22T13:14:45Z

As a workaround, you can simply cut off the superfluous fields and then pass the result to gt bed_to_gff3:

$ cat sample_bed_file.txt | cut -f1-12 | gt bed_to_gff3
##gff-version 3
##sequence-region   chr1 65419 686673
chr1	.	BED_feature	65419	71585	0	+	.	ID=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_block	65419	65433	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_block	65520	65573	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_thick_feature	65565	70007	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_block	69037	71585	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
###
chr1	.	BED_feature	69055	70108	0	+	.	ID=BED_feature2;Name=Q8NH21
chr1	.	BED_block	69055	70108	0	+	.	Parent=BED_feature2;Name=Q8NH21
chr1	.	BED_thick_feature	69091	70007	0	+	.	Parent=BED_feature2;Name=Q8NH21
###
chr1	.	BED_feature	450740	451678	0	-	.	ID=BED_feature3;Name=A0A126GV92
chr1	.	BED_block	450740	451678	0	-	.	Parent=BED_feature3;Name=A0A126GV92
###
chr1	.	BED_feature	450740	451678	0	-	.	ID=BED_feature4;Name=Q6IEY1
chr1	.	BED_block	450740	451678	0	-	.	Parent=BED_feature4;Name=Q6IEY1
###
chr1	.	BED_feature	685679	686673	0	-	.	ID=BED_feature5;Name=A0A126GV92
chr1	.	BED_block	685679	686673	0	-	.	Parent=BED_feature5;Name=A0A126GV92
###
chr1	.	BED_feature	685679	686673	0	-	.	ID=BED_feature6;Name=Q6IEY1
chr1	.	BED_block	685679	686673	0	-	.	Parent=BED_feature6;Name=Q6IEY1
###

SSPuliasis · 2021-12-22T14:05:55Z

Thanks for your help! :)

The BED file is from the UniProt human database. Below is their description of each column:

".bed
A BED detail formatted tab delimited file containing

1. Chromosome name.
2. Annotation start coordinate on the chromosome.
3. Annotation end coordinate on the chromosome.
4. UniProtKB accession, BED line name.
5. Score set to 0 as default.
6. DNA strand +/- for forward or reverse.
7. Thick start coordinate on the chromosome.
8. Thick end coordinate on the chromosome.
9. Annotation color (RGB).
10. Number of blocks representing the annotation.
11. Block sizes, a comma separated list.
12. Block starts, a comma separated list of block offsets relative to the annotation start.
13. Annotation identifier. (accession in proteome file)
14. Annotation description, a semi-colon (;) separated list that can consist of:
    1. amino acid or amino acid range the UniProt annotation covers or amino acid change (variants only).
    2. annotation description
    3. disease name and OMIM identifier (variants only)
    4. PubMed literature evidence if available.

    This column has a maximum of 254 characters.

Missing values are represented by dots."

satta · 2021-12-22T20:03:51Z

I see... Weird that they simply extend the format and potentially break parsers... :/
I'm a fan of 'strict by default' parsers, but with the "be liberal with what you accept" concept in communications in mind I could imagine we can just eat (= discard) everything after the 12th column up to the newline character. That should be doable and might not change the general behaviour since the original spec does not define anything beyond the 12th column anyway.
I wonder if we can have a BED equivalent to the GFF3 -tidy parameter that tries to do "the right thing".
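
Until something like that exists in the parser, the "discard everything after the 12th column" behaviour can be approximated as a filter in front of `gt bed_to_gff3`. The following is only a sketch in awk, not GenomeTools code, and the function name `bed_trim_to_12` is made up; unlike a bare `cut -f1-12`, it reports on stderr how many lines were trimmed, so the data loss is not silent.

```shell
# Sketch only, not GenomeTools code: keep the first 12 tab-separated
# fields of each line and report on stderr how many lines were trimmed.
bed_trim_to_12() {
  awk -F'\t' -v OFS='\t' '
    NF > 12 {
      # Rebuild the line from fields 1..12, dropping the rest.
      line = $1
      for (i = 2; i <= 12; i++) line = line OFS $i
      $0 = line
      trimmed++
    }
    { print }
    END {
      if (trimmed)
        printf "trimmed %d line(s) to 12 columns\n", trimmed > "/dev/stderr"
    }
  ' "$1"
}
```

Usage would then look like `bed_trim_to_12 sample_bed_file.txt | gt bed_to_gff3`. Lines with 12 or fewer fields pass through unchanged.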

satta · 2021-12-25T20:51:52Z

Can we consider this done, btw?

satta · 2021-12-29T20:34:57Z

Tagging this as "enhancement" as there is now only a feature request to ignore BED columns beyond the 12th.

satta added the needs information and support labels Apr 23, 2020

satta self-assigned this Dec 22, 2021

satta removed the needs information label Dec 22, 2021

satta added the enhancement label Dec 29, 2021
