Questions about the usage and processing of the expression profile #207

Open
joyceFunk123 opened this issue Apr 18, 2024 · 1 comment

Comments

@joyceFunk123

Hello,

I am simulating reads in transcriptome mode. For this purpose, I have created an expression profile in the specified format. The read counts generated per transcript differ from what I would expect based on the expression profile. Is there a way to get the read counts per transcript in the same proportion as the specified TPM values? Or is there a way to specify the exact number of reads per transcript?
I'm also wondering how the values in the expression profile are processed, and why these differences arise.

Thanks a lot!
Joyce

@SaberHQ
Member

SaberHQ commented Apr 25, 2024

Hey @joyceFunk123

Thanks for your interest in using NanoSim. NanoSim uses the quantification file to choose which transcripts to simulate reads from. This is an approximation, so the number of reads per transcript will not exactly match the expression profile.

That being said, our analysis showed a very high correlation between the estimated transcript abundance of the empirical dataset and that of the simulated dataset generated by Trans-NanoSim, indicating that the observed raw transcript expression level is well replicated by Trans-NanoSim (Figure 1.C in the Trans-NanoSim paper). I highly recommend you take a look at it.

The feature you asked for would also be interesting to implement, and it has been requested before. However, I am not sure whether I will have time to implement it in a future NanoSim release.

Currently, NanoSim takes the -n option, which sets the number of reads to simulate. It first selects a transcript based on the expression profile and then simulates a read from it based on the read profiles.
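To illustrate why per-transcript counts drift from the profile, here is a minimal sketch of the select-then-simulate idea. It is not NanoSim's actual code: the function name `sample_read_counts`, the transcript IDs, and the TPM values are all made up, and it assumes each read's transcript is drawn independently with probability proportional to its TPM.

```python
import random
from collections import Counter


def sample_read_counts(tpm, n_reads, seed=0):
    """Pick a transcript for each read with probability proportional to TPM.

    Hypothetical helper for illustration only; not part of NanoSim.
    """
    rng = random.Random(seed)
    names = list(tpm)
    weights = [tpm[name] for name in names]
    # One independent weighted draw per simulated read.
    picks = rng.choices(names, weights=weights, k=n_reads)
    return Counter(picks)


# Made-up three-transcript profile (TPM values sum to 1,000,000).
profile = {"tx_A": 700000.0, "tx_B": 250000.0, "tx_C": 50000.0}
n_reads = 1000

counts = sample_read_counts(profile, n_reads)
total_tpm = sum(profile.values())
for name, value in profile.items():
    expected = n_reads * value / total_tpm
    print(f"{name}: expected {expected:.1f}, sampled {counts[name]}")
```

Under this assumption the sampled counts scatter around the expected values rather than matching them exactly, and the relative scatter is largest for low-abundance transcripts.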

Since those expression levels are reported in TPM (transcripts per million), you can simulate 1 million reads so that the number of reads generated from each transcript is close to its TPM value. That should be the closest approximation currently available. To produce an exact number of reads per transcript, I would have to implement an option that relies only on the expression profile.
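The arithmetic behind the 1 million reads suggestion can be sketched as follows. This is only an illustration under the assumption that transcripts are drawn independently in proportion to TPM. The helper names and the example profile are hypothetical and not part of NanoSim.

```python
import math


def expected_read_counts(tpm, n_reads):
    """Expected reads per transcript when sampling proportionally to TPM."""
    total = sum(tpm.values())
    return {name: n_reads * value / total for name, value in tpm.items()}


def count_std_dev(tpm, n_reads):
    """Binomial standard deviation of each count, assuming independent draws."""
    total = sum(tpm.values())
    return {
        name: math.sqrt(n_reads * (value / total) * (1 - value / total))
        for name, value in tpm.items()
    }


# Made-up profile whose TPM values sum to 1,000,000.
profile = {"tx_A": 700000.0, "tx_B": 250000.0, "tx_C": 50000.0}

expected = expected_read_counts(profile, 1_000_000)
spread = count_std_dev(profile, 1_000_000)
for name in profile:
    print(f"{name}: TPM {profile[name]:.0f}, "
          f"expected reads {expected[name]:.0f}, std dev {spread[name]:.0f}")
```

When the TPM values sum to 1 million and 1 million reads are simulated, the expected count for each transcript equals its TPM value. Individual runs still deviate by roughly the standard deviation shown, which is why exact counts would need a dedicated option.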
