Melon

Melon: metagenomic long-read-based taxonomic identification and quantification

Quick start

Installation

Create a new conda environment and install:

conda create -n melon -c conda-forge -c bioconda melon
conda activate melon

Database setup

Download either the NCBI or the GTDB database:

## GTDB
# wget -qN --show-progress https://zenodo.org/records/11076519/files/database.tar.gz

## NCBI
wget -qN --show-progress https://zenodo.org/records/11100615/files/database.tar.gz
tar -zxvf database.tar.gz

Index the files:

## If you encounter memory issue please consider manually lowering cpu_count or simply set cpu_count=1
cpu_count=$(python -c 'import os; print(os.cpu_count())')

diamond makedb --in database/prot.fa --db database/prot --quiet
ls database/nucl.*.fa | sort | xargs -P $cpu_count -I {} bash -c '
    filename=${1%.fa*};
    filename=${filename##*/};
    minimap2 -x map-ont -d database/$filename.mmi ${1} 2> /dev/null;
    echo "Indexed <database/$filename.fa>.";' - {}

## remove unnecessary files to save space
rm -rf database/*.fa

Run Melon

Note

Melon takes quality-controlled and decontaminated long reads as input. We suggest to remove low-quality raw reads before running Melon with e.g., nanoq -q 10 -l 1000 (minimal quality score 10; minimal read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or known eukaryotes/viruses, please consider removing them via proper mapping. If the origin of contamination is unknown, or if you want to estimate the mean genome size of prokaryotes, you may consider enabling the simple pre-filtering module. See Run Melon with pre-filtering of non-prokaryotic reads for more details.

We provide an example file comprising 10,000 quality-controlled (processed with Porechop and nanoq), prokaryotic reads (fungal and other reads removed with minimap2) randomly selected from the R10.3 mock sample of Loman Lab Mock Community Experiments.

wget -q --show-progress https://figshare.com/ndownloader/files/42847672/example.fa.gz
melon example.fa.gz -d database -o .

You should see:

INFO: Estimating genome copies ...
INFO: ... found 27.375 copies of genomes (bacteria: 27.375; archaea: 0).
INFO: Assigning taxonomy ...
INFO: Reassigning taxonomy ...
INFO: ... found 8 unique species (bacteria: 8; archaea: 0).
INFO: Done.

The output file *.tsv contains the estimated genome copies for individual species, their corresponding relative abundances and gap-compressed/gap-uncompressed ANI (average nucleotide identity between marker-gene-containing reads and reference genome clusters) values:

...    species                               copy     abundance       identity
...    287|Pseudomonas aeruginosa            2.125    7.762557e-02    0.9570/0.9473
...    96241|Bacillus spizizenii             2.875    1.050228e-01    0.9617/0.9531
...    1351|Enterococcus faecalis            3.000    1.095890e-01    0.9616/0.9534
...    28901|Salmonella enterica             3.125    1.141553e-01    0.9525/0.9433
...    562|Escherichia coli                  3.500    1.278539e-01    0.9588/0.9504
...    1639|Listeria monocytogenes           3.750    1.369863e-01    0.9627/0.9546
...    1280|Staphylococcus aureus            3.875    1.415525e-01    0.9598/0.9517
...    1613|Limosilactobacillus fermentum    5.125    1.872146e-01    0.9654/0.9574

The output file *.json contains the lineage and remark of each processed reads.

{
    "002617ff-697a-4cd5-8a97-1e136a792228": {
        "remark": "marker-gene-containing",
        "lineage": "2|Bacteria;1239|Bacillota;91061|Bacilli;186826|Lactobacillales;81852|Enterococcaceae;1350|Enterococcus;1351|Enterococcus faecalis"
    },
    ...
    "ffe73d61-55eb-4ad8-9519-e38c364fc11d": {
        "remark": "marker-gene-containing",
        "lineage": "2|Bacteria;1224|Pseudomonadota;1236|Gammaproteobacteria;91347|Enterobacterales;543|Enterobacteriaceae;590|Salmonella;28901|Salmonella enterica"
    }
}

Run Melon with pre-filtering of non-prokaryotic reads

To enable the pre-filtering module, you need to download a database of Kraken that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:

## https://benlangmead.github.io/aws-indexes/k2
mkdir database_kraken
wget -q --show-progress https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08gb_20230605.tar.gz -O database_kraken/db.tar.gz
tar -zxvf database_kraken/db.tar.gz -C database_kraken

## remove temporary files to save space
rm -rf database_kraken/db.tar.gz

Run with argument -k/-db-kraken:

melon *.fa -d database -o . -k database_kraken

Miscellaneous

Statistics of influent/effluent samples with/without pre-filter

Both samples were collected from the Shatin wastewater treatment plant, sequenced with ONT SQK-NBD114 & R10.4.1 on PromethION, basecalled with Guppy v6.5.7.

		no pre-filter	PlusPF-8	PlusPF-16	PlusPF
influent read: 1,091,772 bp: 7,815,976,467 n50: 9,500 mean quality: 18.3	genome copy	1,809	1,801	1,800	1,797
	species richness	2,101	2,098	2,099	2,097
	number of filtered reads	-	8,312	10,212	14,988
	mean genome size	4.322	4.304	4.300	4.292
	ARG abudnance	0.551	0.553	0.553	0.554
	real time (sec)	2,056	2,271	2,279	-
	peak resident set size (GB)	10.629	10.665	17.294	-
effluent read: 1,118,502 bp: 5,157,996,045 n50: 6,577 mean quality: 18.2	genome copy	1,348	1,336	1,331	1,315
	species richness	1,704	1,697	1,700	1,696
	number of filtered reads	-	16,602	29,300	54,774
	mean genome size	3.826	3.789	3.757	3.715
	ARG abundance	0.507	0.511	0.512	0.519
	real time (sec)	1,341	1,496	1,495	-
	peak resident set size (GB)	10.893	10.902	17.067	-

Tested on MacBook Pro 2021, Apple M1 Max, 64 GB memory, macOS Sonoma. Melon v0.1.0, NCBI database (ver. 2023-07-31) and Kraken database (ver. 2023-06-05). Mean genome size is in unit of Mb. ARG abundance is in unit of copies per cell (exlcuding multidrug ARGs). Real time and peak resident set size are measured with time.

Citation

Chen, Xi, Xiaole Yin, Xianghui Shi, Weifu Yan, Yu Yang, Lei Liu, and Tong Zhang. "Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes." bioRxiv (2023).

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
src/melon		src/melon
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src/melon

src/melon

CHANGELOG.md

CHANGELOG.md

LICENSE

LICENSE

README.md

README.md

setup.cfg

setup.cfg

setup.py

setup.py

Repository files navigation

Melon

Quick start

Installation

Database setup

Run Melon

Run Melon with pre-filtering of non-prokaryotic reads

Miscellaneous

Statistics of influent/effluent samples with/without pre-filter

Citation

About

Releases 8

Packages

Languages

License

xinehc/melon

Folders and files

Latest commit

History

Repository files navigation

Melon

Quick start

Installation

Database setup

Run Melon

Run Melon with pre-filtering of non-prokaryotic reads

Miscellaneous

Statistics of influent/effluent samples with/without pre-filter

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages