persistent temp folder not getting removed #11

Open
kristyhoran opened this issue Apr 4, 2018 · 11 comments

Comments

@kristyhoran

kristyhoran commented Apr 4, 2018

Hi @jacarrico,
I've been trying to use chewBBACA in a Nextflow pipeline, with a schema generated by PrepExternalSchema (separately from the Nextflow pipeline). When I run AlleleCall, a temp folder is created in my schemaDirectory, and this seems to cause an error in subsequent runs with this schema. It raises a ValueError:
ValueError: '/listeria_db/lmo1074.fasta' is not in list
Note that lmo1074.fasta is not an actual file in the listeria_db directory. So I am unsure (a) where this file name even comes from and (b) why the temp folder is persisting. I am using chewBBACA version 2.0.8.
Thanks in advance for your time, I appreciate any help that you can give me.
Regards
Kristy Horan

@tseemann

tseemann commented Apr 4, 2018

I suggest using https://docs.python.org/3/library/tempfile.html which will automatically clean up for you at the end of the process.

I note that assemblerflow manually attempts to clean it up with rm -r blah/temp, but this should be done by standalone chewie. You should consider the database folder to be READ ONLY, as that is how it will be on many systems once schemes are locked down.
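
A minimal sketch of that suggestion, not chewBBACA's actual code: the scratch directory is created with `tempfile.TemporaryDirectory` inside the output folder, so the schema folder is never written to. The helper name `run_with_scratch`, the `chewie_tmp_` prefix, and the `work` callback are hypothetical and only there for illustration. Note that the directory is removed on error as well as on success.

```python
import tempfile
from pathlib import Path


def run_with_scratch(output_dir, work):
    """Run work(scratch_path) with a scratch directory created inside output_dir.

    The directory is removed when the block exits, whether or not work() raised.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # dir= places the scratch space under the output folder, so the
    # schema/database folder can stay read-only.
    with tempfile.TemporaryDirectory(prefix="chewie_tmp_", dir=out) as scratch:
        return work(Path(scratch))
```

Each pipeline process would pass its own output folder, so concurrent runs never share a temp folder.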

@mickaelsilva
Member

@kristyhoran The temp folder is persisting because an error occurred at some point. Standard chewBBACA behavior is to keep the temp folder when an error occurs: if the error happens in the middle of an allele call, the user can resume that allele call instead of starting over. Using Python's tempfile would therefore remove this feature. Do you agree, @tseemann?

@kristyhoran Regarding your error, are you giving chewBBACA -g a list of files or the folder? I would suggest using the folder input, as chewBBACA should scan it for the FASTA files to use in the allele call. Also be aware that if the temp folder is not removed, the program will ask whether you want to continue or not; use --fr or --fc to force a reset or a continue (respectively) without any prompt.

@kristyhoran
Author

@mickaelsilva Thanks for your reply. The -g input is the directory, not a list of files. I am trying to use chewBBACA in a Nextflow pipeline, which runs multiple isolates in parallel, using a symlink to the schema directory. I have tried --fr, which raises the error temp folder not found, and --fc, which raises the error mentioned above.

This is not a problem if I run chewBBACA as a standalone command; it is only a problem within a Nextflow pipeline, since multiple processes access the same schemaDir. I have also tried removing the temp folder manually at the start of the pipeline; however, since other processes may be accessing the folder at the same time, this also raises an error. One possible solution would be to copy the schema dir into the process working dir, but this seems redundant and also very time consuming when working with a large number of isolates.

I understand the attractiveness of being able to resume an allele call instead of starting over, and this is probably not a problem when using chewBBACA alone, but it does raise the issue of running it within a pipeline. I am not sure what the optimal solution would be, but I appreciate any suggestions or help you may have.

Regards
Kristy

@tseemann

tseemann commented Apr 5, 2018

The key questions are:

  1. Does your script work when TWO concurrent processes are running using the SAME allele database?

  2. Databases will usually be read-only and stored elsewhere, so this currently will not work. Your temp folder belongs in the output folder, not the database folder. It's like expecting to be able to write to the --db /bio/db/refseq_proteins folder when using blastp.

  3. Chewie only takes 1 minute (?) per sample, so supporting "checkpointing" might be overkill?

@jacarrico
Collaborator

Just a quick note that we are working on this; the issue is not forgotten.
There is always the question of allele calling and the database that is used in the allele call for a given batch. We are decoupling allele calling and allele naming to test the feasibility. Nevertheless, you always need a single step where you name the alleles and update the db for future allele calls. In other words, we always need a batch step and, under the current paradigm of using files as the allele database (names + sequences), this does not seem to be possible.
We are finalising this version and will do testing in the following weeks.

@LordRust

LordRust commented Jun 1, 2018

Good to hear that there is work being done on this. If it is indeed being released soon, then I can skip the patches I was working on, which use a .lockfile to ensure that only one chewBBACA instance is using the database at any one time. But a question arises regarding how you are implementing this. Often when batch analysing data there will be ~identical isolates from a cluster. Suppose two processes are running concurrently: the first one finds a new allele and commits it to the file database in a final batch step, and the second process, just seconds behind the first, reaches the same "new" allele. The second process will then assign a new allele number to the same allele, because process 1 hasn't had time to write it to the files yet. Or do you have a check for this somehow? There is a beauty in the simple file-based approach, but I fear data corruption is looming. Or maybe I misunderstood, and you are abandoning the files for a database instead?
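
For illustration, a lock file of this kind could look roughly like the sketch below. This is not the actual patch: the helper name `schema_lock` and its `timeout` and `poll` parameters are assumptions. It relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, so only one process at a time can create the lock.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def schema_lock(schema_dir, timeout=60.0, poll=0.1):
    """Hold an exclusive lock on schema_dir by atomically creating a lock file."""
    lock_path = os.path.join(schema_dir, ".lockfile")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL makes creation fail if the file already exists,
            # so only one process can succeed in creating it.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        # Record the owner's PID to help diagnose a stale lock by hand.
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A run would wrap its database access in `with schema_lock(schema_dir): ...`. Two caveats: a process that is killed leaves a stale lock behind, and the approach needs write access to the schema folder, so it does not help when that folder is read-only.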

@jacarrico
Collaborator

Hello Jonas,
that is indeed the problem. Allele classification needs to be queued, and our efforts have gone into first parallelising the allele sequence extraction and then creating a queue for the allele naming. That way multiple processes can run the allele sequence extraction in parallel, but the allele ID attribution still needs to be done in batch, following a queue. We don't yet plan to abandon the file-based approach, and we think that this 2-step approach will avoid data corruption. We are also working on a centralised nomenclature server where authorized users can sync new alleles. We have the first part done (the allele sequence extraction) but we still need to implement the second part. It may still take a few weeks, since we have other things in the queue first.
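
The 2-step idea can be sketched as follows. This is a toy illustration under stated assumptions, not chewBBACA's implementation: a "genome" is just a dict of locus to sequence, the schema is an in-memory dict, and the names `extract_alleles`, `name_alleles`, and `allele_call` are hypothetical. Extraction runs in parallel and never touches the schema; naming runs serially, so a novel allele seen in two genomes gets a single ID.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def extract_alleles(genome):
    """Step 1 (parallel, read-only): return the allele sequences of one genome.

    Placeholder: a "genome" here is simply a dict of locus -> sequence.
    """
    return dict(genome)


def name_alleles(schema, extracted_batches):
    """Step 2 (serial): assign allele IDs one genome at a time."""
    calls = []
    for batch in extracted_batches:
        call = {}
        for locus, seq in batch.items():
            known = schema.setdefault(locus, {})
            key = hashlib.sha256(seq.encode()).hexdigest()
            if key not in known:
                known[key] = len(known) + 1  # next free allele ID for this locus
            call[locus] = known[key]
        calls.append(call)
    return calls


def allele_call(schema, genomes, workers=4):
    # Extraction is parallel; only the serial naming step updates the schema.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        extracted = list(pool.map(extract_alleles, genomes))
    return name_alleles(schema, extracted)
```

Because only `name_alleles` modifies the schema, the race between two near-identical isolates described above cannot occur within a batch.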

@LordRust

LordRust commented Jun 1, 2018

OK, patching with a lock file for now then. Thank you for the update.

@jacarrico
Collaborator

We will close this issue when we have the new version implemented.

@ramirma
Member

ramirma commented Sep 6, 2021

A long overdue update on this. Although we are still working on decoupling allele identification from the commitment of new alleles to the local database (more for the purpose of allowing users to run test batches or sets of genomes of uncertain provenance and quality without "polluting" their database), I would also like to call your attention to chewie-NS. The idea here is that every user will have their own local instance of chewBBACA that can be synced with a public or private instance of chewie-NS when necessary or required. This means that even within the same institution, different users can work independently and only adopt a common nomenclature through synchronization with chewie-NS when needed. This approach addresses the issue raised by @tseemann, although a future version of chewBBACA will indeed be able to perform allele calling without changes to the local database, as suggested above.

@ramirma
Member

ramirma commented Aug 14, 2023

Another quick update on this issue. For those of you interested in this, please explore the option of running chewBBACA without identifying or storing novel alleles. You can do this with the --no-inferred or the --mode flags (more info in chewBBACA's readthedocs).

6 participants