
Segmentation fault (core dumped) on the derep step with Kinnex full-length reads #1959

Closed
sbedwell27 opened this issue May 18, 2024 · 6 comments


@sbedwell27

Hi Ben,

I am using PacBio Kinnex full-length 16S reads and am having a hard time running them through the dada2 pipeline. I ran this code successfully as a test on two samples and it produced output without problems; however, now that I am processing 150 samples I am getting errors I have never seen. The run kept failing on one particular sample, either with an out-of-memory error or a segmentation fault. However, when I removed that sample from the dataset, I still got fatal segfaults on a different sample. Have you seen this before? Do you have any suggestions? I attached a screenshot of the error and the code I am using.

Thanks!

(Screenshot attached: 2024-05-18 at 16:11:53)

.libPaths(c("/home/sierrab4/Rlibs", .libPaths()))
.libPaths()

#load packages
library(dada2);packageVersion("dada2")
library(Biostrings); packageVersion("Biostrings")
library(ShortRead); packageVersion("ShortRead")
library(ggplot2); packageVersion("ggplot2")
library(reshape2); packageVersion("reshape2")
library(gridExtra); packageVersion("gridExtra")
library(phyloseq); packageVersion("phyloseq")
library(Rcpp); packageVersion("Rcpp")

path <- "/projects/sib/labs/kheath/kinnex_2024/first_150"
list.files(path)
path.out <- "Figures/"
path.rds <- "Rdata_and_RDS/"
fnseqs <- list.files(path, pattern="fastq.gz", full.names=TRUE)
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())

# Commented out all of this 5/17 to try to get it to dereplicate faster without running into errors

nops <- file.path(path, "noprimers", basename(fnseqs))
#prim <- removePrimers(fnseqs, nops, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)

#note: there is another way to remove primers that Chris Fields sent from someone else
#I am just using Ben Callahan's version

#Inspect length distribution.

#pdf("histogram.pdf")

#lens.fn <- lapply(nops, function(fn) nchar(getSequences(fn)))
#lens <- do.call(c, lens.fn)
#hist(lens, 100)

#dev.off()

#Look for peaks around 1450, this is the length of the 16S sequence

#Filter

filts <- file.path(path, "noprimers", "filtered", basename(fnseqs))
track <- filterAndTrim(nops, filts, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2)
track

#run DADA2

#Dereplicate

drp <- derepFastq(filts, verbose=TRUE)

@sbedwell27
Author

From some more educated guessing, I suspect it has to do with not enough memory (I was only using 30 cores for this, which may not be enough). I am trying a smaller dataset with more cores at the moment.

@benjjneb
Owner

Your issue is probably due to memory. Long-read datasets require more memory per read than short-read datasets.

The current dada2 recommended workflow (see the dada2 tutorial) does not load all samples into memory at the same time, as your derepFastq command does. If you follow that workflow, your maximum memory requirements will be much lower.

@sbedwell27
Author

Thank you for your response! I was using your dada2 tutorial for PacBio Kinnex reads (https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html). What is the advantage of using this over the standard Illumina dada2 pipeline?

@sbedwell27
Author

Hi, just want to follow up on this. Should I be using the current PacBio workflow that you linked, or the PacBio Kinnex tutorial for fecal reads?

@benjjneb
Owner

The current dada2 tutorial is the best place to start. The reproducible analyses associated with the initial DADA2+PacBio manuscript that you linked above are also very useful. The key difference relevant to your analysis is that current dada2 does not recommend calling derepFastq explicitly; instead, pass the files directly into the learnErrors and dada functions. Those functions now perform dereplication on the fly per sample, avoiding loading all samples into memory at once, which is probably what is causing your seg-fault error.
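
A minimal sketch of that file-based approach, assuming `filts` is the vector of filtered fastq paths produced by the filterAndTrim step in the code above (the `PacBioErrfun` error model and `BAND_SIZE=32` setting are the values recommended in the dada2 PacBio documentation; adjust for your setup):

```r
library(dada2)

# Vector of filtered fastq paths from the filterAndTrim step
filts <- list.files(file.path(path, "noprimers", "filtered"),
                    pattern = "fastq.gz", full.names = TRUE)

# Passing file paths (not derep objects) lets learnErrors and dada
# dereplicate each sample on the fly, one sample in memory at a time.
err <- learnErrors(filts, errorEstimationFunction = PacBioErrfun,
                   BAND_SIZE = 32, multithread = TRUE)

dd <- dada(filts, err = err, BAND_SIZE = 32, multithread = TRUE)

seqtab <- makeSequenceTable(dd)
```

The explicit `drp <- derepFastq(filts)` call is simply dropped; the rest of the downstream workflow (chimera removal, taxonomy) proceeds from `seqtab` as usual.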

@sbedwell27
Author

That makes sense, thank you! I tried this and it seems to be working so far!
