Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improvement: use wc to count bp instead of summing contig header info for abundance #124

Open
franciscozorrilla opened this issue Apr 12, 2023 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@franciscozorrilla
Copy link
Owner

For abundance calculation we need to get bp of genome, currently we do this based on the contig headers info, although this can become problematic for genomes assembled by different tools (e.g. shovill renames them iirc). Replace this with a simple wc command to count bases in fasta

metaGEM/workflow/Snakefile

Lines 1118 to 1124 in a4daea0

# Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)
if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]] || [[ $bin == *.s.fa ]] || [[ $bin == *.p.fa ]];then
less $bin |grep ">"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
else
less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
fi

@franciscozorrilla franciscozorrilla self-assigned this Apr 12, 2023
@franciscozorrilla franciscozorrilla added the enhancement New feature or request label Apr 12, 2023
@franciscozorrilla
Copy link
Owner Author

Temporary workaround to deal with megahit, shovil, and metaGEM outputs

# Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)
            if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]] || [[ $bin == *.s.fa ]] || [[ $bin == *.p.fa ]];then
                less $bin |grep ">"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            elif [[ $bin == *.shovill.fa ]];then
                less $bin |grep ">"|cut -d ' ' -f2|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            elif [[ $bin == *megahit.fa ]];then
                less $bin |grep ">"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            else
                less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
            fi

Rather than keep extending this for each possible case, just use wc to count bp after removing newlines

genome.fa|grep -v '>'|tr -d '\n'|wc -m

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant