bed_to_gff3 error #919

Open
aperreault opened this issue Aug 14, 2019 · 11 comments
Comments

@aperreault

aperreault commented Aug 14, 2019

Problem description

Hey there! I'm trying to convert ChIP peaks (in a bed file format) to gff3. Below is the command and the resulting error:

$ time gt bed_to_gff3 -force yes -o 0hr-CTCF_ChAsE-Ref_peak.gff3 bed-peaks/0hr-CTCF-ChIP-peaks.bed.txt
gt bed_to_gff3: error: strand '0.82' not one character long on line 1 in file 'bed-peaks/0hr-CTCF-ChIP-peaks.bed.txt'

I reformatted the BED file to see if that was the issue and got the following error:

$ time gt bed_to_gff3 -force yes -o 0hr-CTCF_ChAsE-Ref_peak.gff3 bed-peaks/0hr-CTCF-ChIP-peaks_v2.bed.txt
gt bed_to_gff3: error: file "bed-peaks/0hr-CTCF-ChIP-peaks_v2.bed.txt": line 147357: expected character '
', got '�'
Command exited with non-zero status 1
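
The first error message suggests that column 6 of the file holds a number (0.82) where the BED format expects a strand ("+", "-" or "."). A minimal sketch for spotting such lines before conversion, assuming tab-separated input; `check_bed_strand` is a hypothetical helper name and not part of GenomeTools:

```shell
# Hypothetical helper (not part of GenomeTools): print the line number and
# 6th field of every data line whose strand column is not "+", "-" or ".".
# Assumes tab-separated input on stdin.
check_bed_strand() {
  awk -F'\t' '
    /^(track|browser|#)/ { next }          # skip header and comment lines
    NF >= 6 && $6 != "+" && $6 != "-" && $6 != "." {
      printf "line %d: unexpected strand \"%s\"\n", NR, $6
    }'
}
```

Running `check_bed_strand < bed-peaks/0hr-CTCF-ChIP-peaks.bed.txt` should print nothing when every strand field is valid. Lines with fewer than six fields are not reported, since they carry no strand column at all.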

What GenomeTools version are you reporting an issue for (as output by gt -version)?

$ gt -version
gt (GenomeTools) 1.5.10

Did you compile GenomeTools from source? If so, please state the make parameters used.

Yes. I didn't add any make parameters

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

Ubuntu 18.04.3

@satta
Member

satta commented Sep 7, 2019

Can you share your bed file or part of it so we can use it to reproduce the issue? Thanks!
It looks like there is a problem with the formatting of your BED file, but we cannot confirm that without taking a closer look.

@SSPuliasis

Has this been solved? I'm having the same issue

@satta
Member

satta commented Dec 20, 2021

The issue remains, since I do not remember receiving an example file. @SSPuliasis can you provide a (minimal?) example file that would allow me to reproduce the issue?
Feel free to remove identifying information or redact data if that's an issue as long as the issue still occurs trying to process that file.

@SSPuliasis

Here is a small sample of the file I tried it on (saved as .txt here because GitHub does not support .bed uploads). The error is:

"gt bed_to_gff3: error: file "sample_bed_file.bed": line 1: expected character '
', got 'A"

Thanks

sample_bed_file.txt

@satta
Member

satta commented Dec 22, 2021

Thanks, I'll take a look.

@satta
Member

satta commented Dec 22, 2021

Looks like your BED file has 14 columns while the spec (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) only describes 12:

chr1	65418	71585	A0A2U3U0J3	0	+	65564	70007	255,0,0	3	15,54,2549	0,101,3618	A0A2U3U0J3	M1-K3,V4-F326
...

That's why the parser is confused about additional data where a newline should be.
I am curious: what tool created this file or what database is it from? It seems to deviate from the specification.
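
A quick way to see how many columns a BED file actually has is to tabulate the field count per line. This is a sketch under the assumption that the file is tab-separated; `count_bed_columns` is a hypothetical helper name:

```shell
# Hypothetical helper: tabulate how many data lines have how many
# tab-separated fields. Each output line reads "<field count> <line count>".
count_bed_columns() {
  awk -F'\t' '
    /^(track|browser|#)/ { next }          # skip header and comment lines
    { n[NF]++ }
    END { for (k in n) print k, n[k] }' | sort -n
}
```

For example, `count_bed_columns < sample_bed_file.txt` reports a field count above 12 for any file that carries extra columns, and more than one output line means the file is not uniform.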

@satta satta self-assigned this Dec 22, 2021
@satta
Member

satta commented Dec 22, 2021

As a workaround, you can simply cut off the superfluous fields and then pass the result to gt bed_to_gff3:

$ cat sample_bed_file.txt | cut -f1-12 | gt bed_to_gff3
##gff-version 3
##sequence-region   chr1 65419 686673
chr1	.	BED_feature	65419	71585	0	+	.	ID=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_block	65419	65433	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_block	65520	65573	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_thick_feature	65565	70007	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
chr1	.	BED_block	69037	71585	0	+	.	Parent=BED_feature1;Name=A0A2U3U0J3
###
chr1	.	BED_feature	69055	70108	0	+	.	ID=BED_feature2;Name=Q8NH21
chr1	.	BED_block	69055	70108	0	+	.	Parent=BED_feature2;Name=Q8NH21
chr1	.	BED_thick_feature	69091	70007	0	+	.	Parent=BED_feature2;Name=Q8NH21
###
chr1	.	BED_feature	450740	451678	0	-	.	ID=BED_feature3;Name=A0A126GV92
chr1	.	BED_block	450740	451678	0	-	.	Parent=BED_feature3;Name=A0A126GV92
###
chr1	.	BED_feature	450740	451678	0	-	.	ID=BED_feature4;Name=Q6IEY1
chr1	.	BED_block	450740	451678	0	-	.	Parent=BED_feature4;Name=Q6IEY1
###
chr1	.	BED_feature	685679	686673	0	-	.	ID=BED_feature5;Name=A0A126GV92
chr1	.	BED_block	685679	686673	0	-	.	Parent=BED_feature5;Name=A0A126GV92
###
chr1	.	BED_feature	685679	686673	0	-	.	ID=BED_feature6;Name=Q6IEY1
chr1	.	BED_block	685679	686673	0	-	.	Parent=BED_feature6;Name=Q6IEY1
###

@SSPuliasis

Thanks for your help! :)

The bed file is from the UniProt human database, below is their description for each of their columns:

".bed
A BED detail formatted tab delimited file containing

  • Chromosome name.
  • Annotation start coordinate on the chromosome.
  • Annotation end coordinate on the chromosome.
  • UniProtKB accession, BED line name.
  • Score set to 0 as default.
  • DNA strand +/- for forward or reverse.
  • Thick start coordinate on the chromosome.
  • Thick end coordinate on the chromosome.
  • Annotation color (RGB).
  • Number of blocks representing the annotation.
  • Block sizes, a comma separated list.
  • Block starts, a comma separated list of block offsets relative to the
    annotation start.
  • Annotation identifier. (accession in proteome file)
  • Annotation description, a semi-colon (;) separated list that can consist of:
    1. amino acid or amino acid range the UniProt annotation covers or amino acid
      change (variants only).
    2. annotation description
    3. disease name and OMIM identifier (variants only)
    4. PubMed literature evidence
      if available. This column has a maximum of 254 characters.

Missing values are represented by dots."

@satta
Member

satta commented Dec 22, 2021

I see... Weird that they simply extend the format and potentially break parsers... :/
I'm a fan of 'strict by default' parsers, but with the "be liberal in what you accept" principle in mind, I could imagine that we just eat (= discard) everything after the 12th column up to the newline character. That should be doable and might not change the general behaviour, since the original spec does not define anything beyond the 12th column anyway.
I wonder if we can have a BED equivalent to the GFF3 -tidy parameter that tries to do "the right thing".
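
Until something like that exists in the parser, the lenient behaviour can be approximated outside of GenomeTools with a pre-filter. The sketch below is not a GenomeTools feature; `trim_bed_to_12` is a hypothetical helper that keeps at most the first 12 tab-separated fields of each line and reports on stderr how many lines it shortened:

```shell
# Hypothetical pre-filter (not a GenomeTools feature): keep at most the first
# 12 tab-separated fields of each input line and write a note to stderr
# saying how many lines were shortened.
trim_bed_to_12() {
  awk -F'\t' '
    {
      n = (NF > 12) ? 12 : NF
      if (NF > 12) trimmed++
      line = $1
      for (i = 2; i <= n; i++) line = line "\t" $i
      print line
    }
    END {
      if (trimmed)
        printf "trimmed %d line(s) to 12 columns\n", trimmed > "/dev/stderr"
    }'
}
```

Used as `trim_bed_to_12 < sample_bed_file.txt | gt bed_to_gff3`, it should feed the converter the same data as the `cut -f1-12` workaround, with the difference that dropped columns are no longer silent.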

@satta
Member

satta commented Dec 25, 2021

Can we consider this done, btw?

@satta
Member

satta commented Dec 29, 2021

Tagging this as "enhancement" as there is now only a feature request to ignore BED columns beyond the 12th.


3 participants