[DISCUSSION] Common misconceptions and mistakes for anvi'o beginners that we should clarify in a blog post #2230

Open
ivagljiva opened this issue Feb 26, 2024 · 5 comments

@ivagljiva
Contributor

@FlorianTrigodet and I have noticed several common issues emerging from recent Discord questions. These are not really bugs, but quirks of the anvi'o ecosystem that are not obvious to beginners. Many of them are even documented across our various tutorials and help pages, but clearly people aren't finding or understanding those pages because they keep making the same mistakes.

We thought this could be addressed with a new blog post, something like "Important things to learn when starting with anvi'o" or "FAQ and common issues for anvi'o beginners", in which we could have a section for each issue that 1) summarizes the convention, 2) explains why we do it this way, 3) provides links to related tutorials or documentation, and 4) explains what to do if you've already made the mistake. Then, when people ask these common questions, we'll be able to send them the URL to the appropriate section. We'll also be able to direct new users to read through the page first so that they can hopefully avoid some headaches.

Of course, we should also update the relevant anvi'o help pages associated with each issue (but it would still be useful to have these common issues described in one central location, IMO).

Here we are starting a list of these 'common issues', and once we collect several, we can put them together into a post. We welcome ideas and contributions from the community for this effort :)

To start the list, I scanned through the most recent Discord threads to identify the common themes (and I'm also drawing from my memory of things that I always find myself explaining to people in workshops and stuff).

Mismatch between reformatted contig headers and BAM files

A lot of people run into the following problem: they did metagenomic read recruitment to reference FASTA files whose contig headers are incompatible with anvi'o. They use anvi-script-reformat-fasta so that they can make the contigs database, but then run into issues later when trying to run anvi-profile because the contig names in their contigs db don't match those in their BAM files.

See the following Discord questions:

What we need to tell people is this:

  • contigs (and sample names) in anvi'o must be named only with alphanumeric characters plus underscores. No dashes, no colons, no spaces
  • why does anvi'o restrict which characters can be used for headers?
    • we believe the answer is that other 3rd-party programs occasionally used by anvi'o cannot handle certain special characters, so we put a blanket restriction on everyone from the very start to avoid hiccups later
    • but we could also make the point that it just makes everything easier for coding the backend if you can assume you are only dealing with alphanumeric characters plus underscore
  • general advice is to always run anvi-script-reformat-fasta --simplify-names before doing anything else in your workflow, and to always use the --report-file flag even though it's not required
  • what happens if you have BAM files with incompatible headers?
    • well, if you have the reformat report file, then you can use anvi-script-reformat-bam (currently in anvi'o v8-dev only) thanks to Andrea
    • but if not, you have to re-run your mapping with the reformatted FASTA file as a reference
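
A minimal Python sketch of how someone could spot this mismatch before running anvi-profile. It is not part of anvi'o: all function names are hypothetical, the "alphanumeric plus underscore" rule is taken from the list above, and the BAM header is assumed to have been exported as text first (for example with `samtools view -H`).

```python
import re

# Assumption taken from the rule above: only letters, digits, and underscores are safe.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def fasta_deflines(fasta_text):
    """Return every header line of a FASTA string, without the leading '>'."""
    return [line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")]


def unsafe_names(names):
    """Return the names that contain anything other than letters, digits, or '_'."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


def sam_reference_names(sam_header_text):
    """Return the SN: values of the @SQ lines in a SAM header given as text."""
    names = []
    for line in sam_header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def names_missing_from(reference_names, contig_names):
    """Return BAM reference names that have no identical name among the contigs."""
    known = set(contig_names)
    return [name for name in reference_names if name not in known]
```

If `names_missing_from` returns anything for the names in a BAM header versus the names in the reformatted FASTA, the BAM file was most likely produced against the original, un-reformatted reference.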

General confusion about profile databases vs single profiles and merging

The main problem here is that many people don't really think about what is going on behind the scenes with profiling and merging. They probably just see those steps in the metagenomic workflow and assume that they always have to run them.

Which leads to Discord questions like these:

I think we should direct new users to:

  • learn what the point of a profile db is and what kind of data it holds (with link to the profile db help page)
  • consider why we would need/want to combine mapping data that relates to the same reference
  • realize that we can only merge when 1) we have read mapping data (i.e., you cannot merge two contigs dbs), 2) we have more than one sample, and 3) the mapping reference matches across those samples.
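
The three merge conditions in the last point could be illustrated with a small Python sketch like the one below. `SingleProfile` and `reasons_not_to_merge` are hypothetical names made up for this example and do not exist in anvi'o; `reference_id` stands for anything that identifies the contigs db a sample was mapped to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleProfile:
    """Hypothetical stand-in for one profiled sample; not an anvi'o class."""
    sample_name: str
    reference_id: str  # any identifier of the contigs db the reads were mapped to


def reasons_not_to_merge(profiles):
    """Return the reasons merging does not apply; an empty list means it does."""
    reasons = []
    if not profiles:
        # Condition 1: merging needs read mapping data, i.e. single profiles.
        reasons.append("no read mapping data: there are no single profiles to merge")
    elif len(profiles) == 1:
        # Condition 2: merging needs more than one sample.
        reasons.append("only one sample: a single profile has nothing to merge with")
    if len({p.reference_id for p in profiles}) > 1:
        # Condition 3: every sample must have been mapped to the same reference.
        reasons.append("samples were mapped to different references")
    return reasons
```

For example, two samples mapped to the same reference give an empty list, while a single sample, or two samples mapped to different references, each give one reason back.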

If you want to add/remove genomes from a pangenome, you need to re-compute the pangenome

... with the caveat that if you are doing enrichment on the pangenome with categorical variables, you can exclude genomes from the enrichment analysis without removing them from the pangenome (as discussed here). Here are the related Discord questions:

General confusion about importing misc data and data orders into databases

(or sometimes people aren't even aware that this is a thing they can do).

What information does anvi-summarize provide in each input case?

I often find myself recommending that people use anvi-summarize to get the data they want, but I always forget what output files it gives you, so it is hard to determine if it is the appropriate solution for someone's question. Even when people find and use anvi-summarize by themselves, they sometimes have questions about what each data type means. For example:

This doesn't necessarily need its own section in the FAQ post, but is more of a note that we should update the help page for anvi-summarize to describe the output files you get when you run it on a contigs + profile db vs a pangenome, etc.

General confusion about external vs internal genomes

... and making people aware that you CAN combine multiple genomes into the same contigs DB for combined analysis by reformatting the contig headers with --prefix and importing a collections txt.
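
To make the idea concrete, here is a rough Python sketch of what "prefix the contig headers and keep a collections txt" amounts to. It does not use anvi'o itself: `combine_genomes` is a hypothetical helper, the renaming scheme is made up for illustration, and the two-column, TAB-delimited layout (contig name, then genome/bin name) is an assumption about what the collection file should look like.

```python
def combine_genomes(genomes):
    """Merge several genomes into one FASTA string plus a contig-to-genome table.

    `genomes` maps a genome name to that genome's FASTA text. Each contig is
    renamed to '<genome>_<running number>' so names stay unique after merging.
    """
    fasta_lines = []
    collection_lines = []
    for genome_name, fasta_text in genomes.items():
        contig_number = 0
        for line in fasta_text.splitlines():
            if line.startswith(">"):
                contig_number += 1
                new_name = f"{genome_name}_{contig_number:06d}"
                fasta_lines.append(f">{new_name}")
                # One row per contig: contig name, TAB, genome (bin) name.
                collection_lines.append(f"{new_name}\t{genome_name}")
            elif line.strip():
                fasta_lines.append(line.strip())
    return "\n".join(fasta_lines) + "\n", "\n".join(collection_lines) + "\n"
```

The genome names used as prefixes should themselves follow the naming rule discussed earlier (letters, digits, and underscores only), since they become part of every contig name.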

@meren
Member

meren commented Feb 26, 2024

This is a great point and a welcome attempt to ameliorate. I had hoped our help pages would address these issues, but I guess they are not enough by themselves as you point out.

Of course, we should also update the relevant anvi'o help pages associated with each issue (but it would still be useful to have these common issues described in one central location, IMO).

But I couldn't agree more with this statement above.

The funny thing is, we're using Discord so that the answers accumulate over time, so we don't have to respond to the same questions over and over again. But then, we realize we do that still, and now we are trying to put together an F.A.Q. by going through Discord :p Kind of funny and sad at the same time.

@ivagljiva
Contributor Author

Yeah, it is a little bit frustrating. I think one reason this doesn't work:

we're using Discord so that the answers accumulate over time, so we don't have to respond to the same questions over and over again.

is that the search function for posts is really bad (much worse than in Slack). From my experience, it seems like the search function only looks through the titles of posts, not the content of each thread. The titles of posts are generally very, very poorly written, so of course people don't find anything. And sometimes people post questions within other threads that are only marginally related, so they get lost that way.

And more likely than not, a lot of people just don't bother to read what was posted before, or to search the help pages at all. But I'm not sure how to discourage this behavior without refusing to answer people who haven't done their due diligence first, which feels wrong, especially since it is not always clear whether someone tried to look through previous posts or help pages unless they explicitly say so.

Hopefully this effort will yield improvements to the most commonly-needed help pages so that we have multiple links to throw at people with these specific issues 😞

@FlorianTrigodet
Contributor

is that the search function for posts is really bad (much worse than in Slack). From my experience, it seems like the search function only looks through the titles of posts, not the content of each thread.

There are (unfortunately) two search bars in Discord. There is the big one that is very inviting but only searches for terms in post titles. And there is a second, smaller one in the top right corner that is a proper search bar and works nicely.

[Image: Screenshot 2024-02-27 at 10 44 56]

I can modify that screenshot and we could add it to the Discord's rule-and-guidelines channel.

@meren
Member

meren commented Feb 27, 2024

And more likely than not, a lot of people just don't bother to read what was posted before, or to search the help pages at all. But I'm not sure how to discourage this behavior without refusing to answer people who haven't done their due diligence first, which feels wrong.

This highlights so well the dilemma inflicted upon people whose goal is to develop solutions that try to match the sophistication of the questions they aim to address.

While we don't want to alienate or push away those who don't have the time or interest to read even the clearest error message, one that already explains the problem and the solution to them, we are taking more and more time from our primary tasks to help them.

The more I think about it, the more I realize that we need a revolution rather than yet another solution that will not go beyond what we have already been doing: trying to help those who will have time to read things (and who often don't need our help).

So what would be the revolution in this context? Well, probably developing a language model that periodically processes all our code, documentation, and Discord material to give access to that nebula of wisdom through a chatbot. In an ideal world, the precious time of those who are genuinely thinking of the future of this community would be better spent on investigating available technologies to establish such a long-term solution than on a blog post. But I know we do not live in an ideal world, and we are just trying to put out fires most of the time. Which is also admirable and needed, and this is what that blog post will do. So I am not saying let's stop doing this and do the other thing. But I just wanted to share my 2 cents in case it turns on a light bulb in someone else's mind.

@Ge0rges
Collaborator

Ge0rges commented Mar 15, 2024

Perhaps an online anvi'o forum that would get indexed by Google would help with this in the long term? For example, a hosted Discourse.

4 participants