RefSeq regular import of publications from WormBase #175

MagdalenaZZ · 2021-06-21T19:08:04Z

RefSeq would like to set up a regular import of publications from WormBase that can be used to better link gene and publication records.
We already have a process in place with MGI, RGD, and ZFIN that pulls in publication links off their respective FTP sites.
Is there a way we could do that from WormBase to get gene:publication pairs, with publication identifiers that we can convert to PMIDs?
We looked around in your downloads, and didn't find anything, but may not have been looking in the right spot.
I do see that references can be downloaded for individual genes but was looking for a bulk download option.

MagdalenaZZ · 2021-06-21T19:09:02Z

How often do you want/need updates?
I’d say whatever seems appropriate to you. One option would be to dump a file with your regular release, maybe under REPORTS or MULTI_SPECIES? That seems like a logical frequency.

Would you like to get the data from a flatfile, or would e.g. an API work too?
I think either would work. Most of our other dataflows are via flatfiles, but we use APIs for some. Do you already have an API suitable for this? We’d probably wind up downloading everything once a month – would that cause any performance issues?

Would you like gene:publication pairs from just C. elegans, or from all species we have data for (95% of it is C. elegans, but some might be other nematodes)?
Everything sounds good.

Just to confirm that your preferred IDs are: WormBase gene ID, e.g. WBGene00023082, and publication PMID?
Yep, that would work fine.

Just to confirm that you want all the gene-paper associations?
Our usual model is to try to report gene-paper links where the gene(s) are the principal focus of the paper, so not a list of 400 genes for a single paper. Our typical rule of thumb (what the MeSH group will link) is 4 genes for a paper. What kind of slices can you do with your data model? I’d love to hear more details of how you’ve set it up. It’s an area that we want to improve here, and your experiences would be very useful to hear!
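
To make the rule of thumb concrete, here is a minimal Python sketch of one possible filter. It assumes the links are already available as (gene ID, PMID) pairs; the function name `focused_pairs` and the parameter `max_genes_per_paper` are hypothetical, and the default of 4 simply mirrors the rule of thumb mentioned above. It is not a description of how either group actually selects links.

```python
from collections import defaultdict


def focused_pairs(pairs, max_genes_per_paper=4):
    """Keep (gene_id, pmid) pairs only for papers linked to few genes.

    `pairs` is a list of (gene_id, pmid) tuples. The threshold is an
    assumption taken from the "4 genes for a paper" rule of thumb.
    """
    # Count the distinct genes linked to each paper.
    genes_by_paper = defaultdict(set)
    for gene_id, pmid in pairs:
        genes_by_paper[pmid].add(gene_id)

    # Drop every pair that belongs to a paper over the threshold.
    return [
        (gene_id, pmid)
        for gene_id, pmid in pairs
        if len(genes_by_paper[pmid]) <= max_genes_per_paper
    ]
```

A count-based cut like this is only a rough proxy for "principal focus": a paper that genuinely centres on five genes would be dropped, so the threshold would probably need tuning against real data.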

How soon do you need it?
Sooner is always better, but whatever works into your development schedule. A test file (if we wind up going the file route) would be ideal, and then starting with whichever WormBase release is appropriate.

MagdalenaZZ · 2021-06-21T19:35:03Z

I think I've found a fairly easy way: go to the WormBase FTP site and get this file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS280/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS280.reuters_citation_index.xml.gz

For C. elegans you can get the organism, PubMed ID and gene like this (it is an XML file, so simple XML parsing in a script should work fine):

zcat wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.*.reuters_citation_index.xml.gz | grep -e '' -e 'citation pubmed_id' -e 'record record_id='
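
If a script is preferable to grep, the same extraction could be sketched in Python roughly as below. This is only a sketch under assumptions inferred from the grep patterns above: that each `record` element carries a `record_id` attribute holding the gene ID, that `citation` elements with a `pubmed_id` attribute are nested inside it, and that the tags have no XML namespace. The function name `gene_pmid_pairs` is hypothetical, and the real file layout should be checked before relying on this.

```python
import gzip
import xml.etree.ElementTree as ET


def gene_pmid_pairs(source):
    """Yield (record_id, pubmed_id) pairs from the citation index XML.

    `source` is a binary file object or a path to uncompressed XML.
    Assumes <record record_id="..."> elements that contain
    <citation pubmed_id="..."> elements (layout not confirmed).
    """
    record_id = None
    # Stream the file so the whole document is never held in memory.
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "record":
            record_id = elem.get("record_id")
        elif event == "end" and elem.tag == "citation":
            pmid = elem.get("pubmed_id")
            # Skip citations without a PMID or outside any record.
            if record_id and pmid:
                yield record_id, pmid
        elif event == "end" and elem.tag == "record":
            elem.clear()  # free the finished record's children
            record_id = None
```

With a locally downloaded copy of the file (the local path is up to you), it could be used like this to write a tab-separated gene/PMID list:

```python
with gzip.open("c_elegans.PRJNA13758.WS280.reuters_citation_index.xml.gz", "rb") as fh:
    for gene_id, pmid in gene_pmid_pairs(fh):
        print(gene_id, pmid, sep="\t")
```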

At the moment we only have this file for C. elegans, but we might be able to extend it to the other species too.

For each gene, the file lists the publications associated with that gene. Would you be happy with this data?

MagdalenaZZ self-assigned this Jun 21, 2021
