jordibc/entrez

Entrez - Call the NCBI E-utilities from Python

A simple Python interface to query the biological databases kept at the NCBI.

It uses the Entrez Programming Utilities (E-utilities), nine server-side programs that access the Entrez query and database system at the National Center for Biotechnology Information (NCBI). They provide a structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.

The main function (and the only essential one) is:

  • query(tool[, ...]) - yields the response of a query with the given tool

The function select makes a selection of elements on the server that can be referenced in later queries, instead of downloading a long list of ids and then sending it back to the server. The function apply then runs a tool on a previous selection of elements (query can do this too, but apply has a simpler syntax):

  • select(tool, db[, ...]) - returns a dict that references the elements selected with tool over database db
  • apply(tool, db, selections[, retmax, ...]) - yields the response of applying a tool on db for the selected elements

Finally, on_search is a convenience function that runs a select and passes its result to an apply, which is a very common case:

  • on_search(term, db, tool[, dbfrom, ...]) - yields the response of applying a tool over the results of a search query (of term in database db)
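For context, a select followed by an apply corresponds to using the History server of the E-utilities: a search run with usehistory=y stores its results on the server and returns a WebEnv string and a query_key, and later calls pass those two values instead of a list of ids. The sketch below only builds the corresponding request URLs with the standard library, without contacting the NCBI. The helper build_url, the constant BASE and the placeholder WebEnv value are illustrative assumptions, not part of entrez.py, which may build its requests differently.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities, as documented by the NCBI.
BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_url(tool, **params):
    """Return the request URL for an E-utility such as 'search' or 'fetch'.

    Hypothetical helper, for illustration only.
    """
    return f'{BASE}e{tool}.fcgi?{urlencode(params)}'


# Step 1: a search that keeps its results on the History server.
search_url = build_url('search', db='nucleotide',
                       term='chimpanzee[orgn] AND biomol mrna[prop]',
                       usehistory='y')

# Step 2: reuse the stored selection. The real WebEnv and query_key
# would come from the response to step 1; these are placeholders.
fetch_url = build_url('fetch', db='nucleotide', rettype='fasta',
                      WebEnv='PLACEHOLDER_WEBENV', query_key='1',
                      retmax=500)

print(search_url)
print(fetch_url)
```

With this interface, both steps are hidden behind select and apply (or a single on_search call), so the WebEnv and query_key never need to be handled by hand.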

The data often comes as XML. For convenience, the function read_xml converts it to a Python object that closely resembles the original structure of the data.
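To give an idea of what such a conversion involves, here is a minimal sketch that turns XML into nested dicts, lists and strings using only the standard library. It is not the implementation of read_xml, whose output may be structured differently; the function name xml_to_obj and the sample document (shaped loosely like a search response) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_obj(element):
    """Convert an ElementTree element into nested dicts, lists and strings.

    Illustrative only: read_xml in entrez.py may return something else.
    """
    children = list(element)
    if not children:
        return (element.text or '').strip()  # leaf node: keep its text

    result = {}
    for child in children:
        value = xml_to_obj(child)
        if child.tag in result:
            # Repeated tag: collect all its values in a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Made-up sample, not an actual NCBI response.
sample = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

parsed = xml_to_obj(ET.fromstring(sample))
print(parsed)
```

This simple version ignores attributes and mixed content, which a more complete converter would have to handle.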

Installation

You can download this repository and run from its directory without installing anything, or simply put entrez.py in a place where your Python interpreter can find it (for example, you can add its directory to your PYTHONPATH).

It is that easy, really; everything is in just one file. There is no need to pip install anything (but if you want, you can also run pip install -e . from its directory to add entrez to your virtual environment, conda environment, etc.).

Examples

Fetch information for SNP with id 3000, as in the example at https://www.ncbi.nlm.nih.gov/projects/SNP/SNPeutils.htm:

import entrez as ez

for line in ez.query(tool='fetch', db='snp', id='3000'):
    print(line)  # or:  print(ez.read_xml(line))  for nicer output

Get a summary of nucleotides related to accession numbers NC_010611.1 and EU477409.1:

import entrez as ez

for line in ez.on_search(term='NC_010611.1[accn] OR EU477409.1[accn]',
                         db='nucleotide', tool='summary'):
    print(line)

Download all chimpanzee mRNA sequences in FASTA format to the file chimp.fna (our version of sample application 3 of the E-utilities):

import entrez as ez

with open('chimp.fna', 'w') as fout:
    for line in ez.on_search(term='chimpanzee[orgn] AND biomol mrna[prop]',
                             db='nucleotide', tool='fetch', rettype='fasta'):
        fout.write(line + '\n')

In the examples directory, there is a program sample_applications.py that shows how the sample applications of the E-utilities would look with this interface.

There are also two small programs: acc2gi.py uses the library to convert accession numbers into GIs, and sra2runacc.py uses entrez to get all the run accession numbers for a given SRA study.

Email and API keys

The NCBI now asks that all requests include an email parameter with the email address of the user making the request (see their "General Usage Guidelines", for example).

You can pass it to any of the functions in this module as an argument (for example, query(..., email='me@here.edu')) or, more conveniently, set it once at the module level with:

import entrez as ez
ez.EMAIL = 'me@here.edu'

and from that point on, all the queries will have the email automatically incorporated.

Similarly, an API key can be passed to any of the functions as an argument (query(..., api_key='ABCD123')), or initialized and incorporated automatically from that moment with:

import entrez as ez
ez.API_KEY = 'ABCD123'
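The idea of module-level defaults can be sketched as follows. This is a guess at the general pattern, not the code in entrez.py: the helper with_defaults is hypothetical, and the rule that an explicit argument takes precedence over the module-level value is an assumption.

```python
EMAIL = None    # module-level defaults, in the spirit of ez.EMAIL
API_KEY = None  # and ez.API_KEY


def with_defaults(params):
    """Return a copy of params with email and api_key filled in from the
    module-level defaults, unless they were given explicitly.

    Hypothetical helper, for illustration only.
    """
    merged = dict(params)
    if EMAIL is not None:
        merged.setdefault('email', EMAIL)
    if API_KEY is not None:
        merged.setdefault('api_key', API_KEY)
    return merged


EMAIL = 'me@here.edu'  # set once...

# ...and every later request picks it up, unless it is overridden.
print(with_defaults({'db': 'snp', 'id': '3000'}))
print(with_defaults({'db': 'snp', 'id': '3000', 'email': 'other@there.org'}))
```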

Web blast

There is also a small tool to perform web searches with BLAST (Basic Local Alignment Search Tool) at the NCBI, called web_blast.py.

For example, if you want to perform a blast search on the "non-redundant" database for the protein sequences that you have in a file named sequences.fasta, you can write:

./web_blast.py --program blastp --database nr --format Tabular sequences.fasta

Tests

You can run the tests in the tests directory with:

pytest

which will run all the functions that start with test_ in the test_*.py files.

Extra documentation

There is some more information in the wiki.

License

This program is licensed under the GPL v3. See the project license for further details.

Alternatives

When I initially wrote this module (circa 2016) there were no Python alternatives (that I could find). That also explains why I chose to name it simply "entrez". Thanks to a more recent module, easy-entrez, here is a collection of alternatives:
