Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unannotated ORFs #161

Open
singhbhavya opened this issue May 1, 2024 · 6 comments
Open

Unannotated ORFs #161

singhbhavya opened this issue May 1, 2024 · 6 comments

Comments

@singhbhavya
Copy link

Hi there! I have a question about how RiboTricer handles unannotated sequences.

Purpose: I am trying to run RiboTricer after generating BAM files using SQUIRE (which does both gene and transposable element quantification), and I want to know which TE reads in my ribo-seq data are actively translating.

Problem: When I run ribotricer prepare-orfs, all the candidate ORFs correspond with gene names, which confuses me. I don't see many unannotated regions that don't have names.

Question: Does RiboTricer predict ALL candidate ORFs (including repeat elements)? If so, how does it name them? As long as all the predicted ORFs have a start end end site, that's all I need, since I have a TE annotation. I have a .txt repeatmasker file with specific TE sequences; can I somehow use that within RiboTricer?

@singhbhavya
Copy link
Author

singhbhavya commented May 2, 2024

Hi there! I re-ran RiboTricer with a TE annotation (gtf). All the samples ran up until this point. I'm having trouble making sense of this error message; please let me know if you could help! The "index" file warning is not an issue, because the same samples ran with the same warning when the gtf was different. It seems the program fails after "WARNING: no periodic read length found... using cutoff 0.418", but no other warnings or errors are reported. In addition, all the reported files and graphs (except for read_length_dist) are empty. The metagene_profile tables are all also exclusively populated with zeroes.

(ribotricer_env) $ ribotricer detect-orfs --bam results/squire/Aza3-A/Aza3-A.bam --ribotricer_index results/ribotricer/ribotricer_candidate_orfs_TE.tsv --prefix results/ribotricer/Aza3-A_TE/Aza3-A_ribotricer --phase_score_cutoff 0.418
May 02 11:28:16 ..... started ribotricer detect-orfs
May 02 11:28:16 ... started parsing ribotricer index file
May 02 11:28:16 ... started inferring experimental design
[W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
May 02 11:34:20 ... started reading bam file
[W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
  0%|                                                                                                                                                                                                         | 0/196983294 [00:00<?, ?reads/s][W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
May 02 11:44:47 ... started plotting read length distribution                                                                                                                                                                                  
May 02 11:44:48 ... started calculating metagene profiles. This may take a long time...
                                                                                                                                                                                                                                               
May 02 11:44:48 ... started plotting metagene profiles
May 02 11:44:48 ... started inferring P-site offsets
WARNING: no periodic read length found... using cutoff 0.418

@saketkc
Copy link
Collaborator

saketkc commented May 2, 2024

Hi @singhbhavya,

Problem: When I run ribotricer prepare-orfs, all the candidate ORFs correspond with gene names, which confuses me. I don't see many unannotated regions that don't have names.

How are your TE elements annotated in the gtf?

Question: Does RiboTricer predict ALL candidate ORFs (including repeat elements)? If so, how does it name them? As long as all the predicted ORFs have a start end end site, that's all I need, since I

Yes, but it is dependent on how you generate the index (and thereby what was your input gtf). It would be helpful to know what your annotation (gtf) looks like.

@singhbhavya
Copy link
Author

@saketkc
Copy link
Collaborator

saketkc commented May 6, 2024

hi @singhbhavya,

Wehn I generate the index using the gtf you shared, it looks okay to me:

ORF_ID  ORF_type        transcript_id   transcript_type gene_id gene_name       gene_type       chrom   strand  start_codon     coordinate
Lx2B2_8387999_8388064_66        novel   Lx2B2   assumed_protein_coding  Lx2B2   Lx2B2:TE        assumed_protein_coding  chr1    +       AAG     838799
9-8388064
Lx2B2_8388080_8388172_93        novel   Lx2B2   assumed_protein_coding  Lx2B2   Lx2B2:TE        assumed_protein_coding  chr1    +       GTG     838808
0-8388172

Have you tried learning the cutoff first?

@singhbhavya
Copy link
Author

My index looks the same as the one you generated; I don't think there's a problem with the index. I'm providing a few more details, in case it's helpful. This is what the bam_summary.txt looks like .

summary:
        total_reads: 196983294
        unique_mapped: 42718270
        qcfail: 0
        duplicate: 0
        secondary: 137879498
        unmapped:0
        multi:16385526

length dist:
        12: 7243
        13: 13751
        14: 18484
        15: 158495
        16: 3939702
        17: 955860
        18: 278908
        19: 326656
        20: 299653
        21: 223939
        22: 152389
        23: 151489
        24: 159219
        25: 267425
        26: 718551
        27: 1896587
        28: 2635753
        29: 3829454
        30: 10824078
        31: 3964525
        32: 2213332
        33: 2675884
        34: 1493565
        35: 1063974
        36: 618872
        37: 1505331
        38: 814373
        39: 595494
        40: 320907
        41: 162713
        42: 157413
        43: 107373
        44: 72974
        45: 37687
        46: 17676
        47: 9595
        48: 5303
        49: 2835
        50: 1379

What's super unusual is the ribotricer_protocol, which checks only 4 reads. For the other (gene) annotation, it checks far more than 4.

In total 4 reads checked:
        Number of reads explained by "++, --": 2 (0.5000)
        Number of reads explained by "+-, -+": 2 (0.5000)

I have RNA-seq data available to me for these samples, and I'll try learning the cutoff. Do you think the cutoff should be different for TE reads compared to gene?

Thank you so, so much for your help so far!! It is very appreciated.

@saketkc
Copy link
Collaborator

saketkc commented May 8, 2024

Hi @singhbhavya, the read distribution shows a high enrichment of 29-31nt reads which is great. The protocol inference seems a bit concerning (could be a bug). If you point me to the bam file (or a subset of it), I will be happy to take a deeper look. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants