RefSeq regular import of publications from WormBase #175

Open
MagdalenaZZ opened this issue Jun 21, 2021 · 2 comments

@MagdalenaZZ

RefSeq would like to set up a regular import of publications from WormBase that can be used to better link gene and publication records.
We already have a process in place with MGI, RGD, and ZFIN that pulls in publication links off their respective FTP sites.
Is there a way we could do that from WormBase to get gene:publication pairs, with publication identifiers that we can convert to PMIDs?
We looked around in your downloads, and didn't find anything, but may not have been looking in the right spot.
I do see that references can be downloaded for individual genes but was looking for a bulk download option.

@MagdalenaZZ

  1. How often do you want/need updates?
    I’d say whatever seems appropriate to you. One option would be to dump a file with your regular release, maybe under REPORTS or MULTI_SPECIES? That seems like a logical frequency.
  2. Would you like to get the data from a flatfile, or would e.g. an API work too?
    I think either would work. Most of our other dataflows are via flatfiles, but we use APIs for some. Do you already have an API suitable for this? We’d probably wind up downloading everything once a month. Would that cause any performance issues?
  3. Would you like gene:publication pairs from just C. elegans, or from all species we have data for (95% of it is C. elegans, but some might be other nematodes)?
    Everything sounds good.
  4. Just to confirm, your preferred IDs are the WormBase gene ID (e.g. WBGene00023082) and the publication PMID?
    Yep, that would work fine.
  5. Just to confirm that you want all the gene-paper associations?
    Our usual model is to try to report gene-paper links where the gene(s) are the principal focus of the paper. So not a list of 400 for a single paper. Our typical rule of thumb (what the MeSH group will link) is 4 genes for a paper. What kind of slices can you do with your data model? I’d love to hear more details of how you’ve set it up; it’s an area that we want to improve here, and your experiences would be very useful to hear!
  6. How soon do you need it?
    Sooner is always better, but whatever works into your development schedule. A test file (if we wind up going the file route) would be ideal, and then starting with whichever WormBase release is appropriate.
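
The "4 genes for a paper" rule of thumb from the answer above could be applied to a list of gene:publication pairs roughly as follows. This is only a sketch: the function name `filter_focused_pairs`, the `(gene_id, pmid)` tuple layout, and the reading of the rule as "at most 4 genes per paper" are assumptions made for illustration, not something either RefSeq or WormBase has specified.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def filter_focused_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_genes_per_paper: int = 4,  # assumed reading of the "4 genes" rule of thumb
) -> List[Tuple[str, str]]:
    """Keep (gene_id, pmid) pairs only for papers linked to few enough genes."""
    # Group the distinct genes linked to each paper.
    genes_by_paper: Dict[str, Set[str]] = defaultdict(set)
    for gene_id, pmid in pairs:
        genes_by_paper[pmid].add(gene_id)

    # Drop every pair belonging to a paper that exceeds the threshold.
    return sorted(
        (gene_id, pmid)
        for pmid, genes in genes_by_paper.items()
        if len(genes) <= max_genes_per_paper
        for gene_id in genes
    )
```

A paper that lists hundreds of genes is dropped entirely, while a paper with one to four genes keeps all of its links. The threshold is a parameter so that other slices of the data can be tried.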

@MagdalenaZZ

I think I've found a fairly easy way: go to the WormBase FTP site and get this file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS280/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS280.reuters_citation_index.xml.gz

For C. elegans you can get the organism, PubMed ID and gene like this (it is an XML file, so simple XML parsing in a script should also work):

zcat wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.*.reuters_citation_index.xml.gz | grep -e '' -e 'citation pubmed_id' -e 'record record_id='
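
As an alternative to grepping, the XML parsing mentioned above might look like the following sketch. The element and attribute names (`record` with `record_id`, `citation` with `pubmed_id`) are taken from the grep patterns in the command. Everything else is an assumption: that each `citation` is nested somewhere inside its `record`, that `record_id` carries the gene identifier, and that the file uses no XML namespace. The function names are hypothetical, and the real file should be checked before relying on this.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import IO, Iterator, Tuple


def iter_gene_pmid_pairs(xml_stream: IO[bytes]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, pubmed_id) for every citation found inside a record.

    Assumes <citation pubmed_id="..."> elements are descendants of
    <record record_id="..."> elements (not confirmed against the real file).
    """
    # iterparse streams the document, so the whole file is never held in memory.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "record":
            continue
        record_id = elem.get("record_id")
        for citation in elem.iter("citation"):
            pmid = citation.get("pubmed_id")
            if record_id and pmid:
                yield record_id, pmid
        elem.clear()  # release the finished record's children


def read_pairs_from_gzip(path: str) -> Iterator[Tuple[str, str]]:
    """Open a gzipped citation index (local path) and stream its pairs."""
    with gzip.open(path, "rb") as handle:
        yield from iter_gene_pmid_pairs(handle)
```

Citations without a `pubmed_id` attribute are skipped, which matches the stated preference for PMIDs as the publication identifier.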

At the moment we only have this file for C. elegans, but we might be able to extend it to the other species too.

For each gene, you will have listed a number of publications associated with that gene. Would you be happy with this data?

@MagdalenaZZ MagdalenaZZ self-assigned this Jun 21, 2021