A simple Python interface to query the biological databases kept at the NCBI.
It uses the Entrez Programming Utilities (E-utilities), nine server-side programs that access the Entrez query and database system at the National Center for Biotechnology Information (NCBI). They provide a structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
The main function (and the only essential one) is:
query(tool[, ...])
- yields the response of a query with the given tool
The function select makes a selection of elements on the server that can be referenced in later queries (instead of downloading a long list of ids that we would then have to send back to the server). The function apply can then run a tool on a previous selection of elements (which can also be done with query, but apply has a simpler syntax):
select(tool, db[, ...])
- returns a dict that references the elements selected with tool over database db
apply(tool, db, selections[, retmax, ...])
- yields the response of applying a tool on db for the selected elements
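To make the idea of a server-side selection more concrete, here is a minimal sketch of the kind of requests involved, written directly against the E-utilities rather than with this module. It relies on the History server parameters documented by the NCBI (usehistory, WebEnv, query_key). The helper names build_url, search_url and fetch_url are hypothetical and are not part of entrez.py, the keys of the selection dict are an assumption, and the WebEnv value is a made-up placeholder. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL of the E-utilities, as documented by the NCBI.
BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_url(program, **params):
    """Return the URL for an E-utility program (e.g. 'esearch') with params."""
    return BASE + program + '.fcgi?' + urlencode(params)


def search_url(db, term):
    """URL for a search whose results are kept on the History server."""
    return build_url('esearch', db=db, term=term, usehistory='y')


def fetch_url(db, selection, **extra):
    """URL that fetches a previous selection instead of a list of ids."""
    return build_url('efetch', db=db,
                     WebEnv=selection['WebEnv'],
                     query_key=selection['query_key'], **extra)


# The reply to the first request would contain a WebEnv and a query_key.
# These values are placeholders, not a real server reply.
selection = {'WebEnv': 'PLACEHOLDER_WEBENV', 'query_key': '1'}

first = search_url('nucleotide', 'chimpanzee[orgn] AND biomol mrna[prop]')
second = fetch_url('nucleotide', selection, rettype='fasta', retmax=500)
print(first)
print(second)
```

The second URL carries only the two short reference values, however many elements the search matched, which is the point of selecting on the server.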
Finally, on_search is a convenience function that chains a select and an apply, passing the selection made by the first to the second, which is a very common case.
on_search(term, db, tool[, dbfrom, ...])
- yields the response of applying a tool over the results of a search query (of term in database db)
The data often comes as XML. For convenience, there is also the function read_xml, which converts it to a Python object closely resembling the original structure of the data.
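As an illustration of what "a Python object resembling the structure of the data" can mean, here is a minimal sketch using the standard library's xml.etree.ElementTree. This is not the implementation of read_xml: the function name xml_to_obj, the mapping rules (leaves become strings, repeated tags become lists, attributes are ignored) and the sample document are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def xml_to_obj(element):
    """Convert an element into nested dicts, lists and strings.

    Attributes are ignored in this sketch.
    """
    children = list(element)
    if not children:
        return (element.text or '').strip()  # leaf: just its text
    obj = {}
    for child in children:
        value = xml_to_obj(child)
        if child.tag in obj:  # repeated tag: collect the values in a list
            if not isinstance(obj[child.tag], list):
                obj[child.tag] = [obj[child.tag]]
            obj[child.tag].append(value)
        else:
            obj[child.tag] = value
    return obj


# A made-up document, only loosely shaped like an E-utilities reply.
sample = """
<Result>
  <Count>2</Count>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</Result>
"""
root = ET.fromstring(sample)
parsed = {root.tag: xml_to_obj(root)}
print(parsed)
```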
You can download this repository and run from its directory without installing anything, or simply put entrez.py in a place where your Python interpreter can find it (for example, you can add its directory to your PYTHONPATH).
It is that easy, really; everything is in just one file. There is no need to pip install anything (but if you want, you can also run pip install -e . from its directory to add entrez to your virtual environment, conda environment, etc.).
Fetch information for SNP with id 3000, as in the example at https://www.ncbi.nlm.nih.gov/projects/SNP/SNPeutils.htm:
import entrez as ez
for line in ez.query(tool='fetch', db='snp', id='3000'):
    print(line)  # or: print(ez.read_xml(line)) for nicer output
Get a summary of nucleotides related to accession numbers NC_010611.1 and EU477409.1:
import entrez as ez
for line in ez.on_search(term='NC_010611.1[accn] OR EU477409.1[accn]',
                         db='nucleotide', tool='summary'):
    print(line)
Download all chimpanzee mRNA sequences in FASTA format to the file chimp.fna (our version of sample application 3):
import entrez as ez
with open('chimp.fna', 'w') as fout:
    for line in ez.on_search(term='chimpanzee[orgn] AND biomol mrna[prop]',
                             db='nucleotide', tool='fetch', rettype='fasta'):
        fout.write(line + '\n')
In the examples directory, there is a program sample_applications.py that shows what the sample applications of the E-utilities would look like with this interface.
There are also some little programs: acc2gi.py uses the library to convert accession numbers into GIs, and sra2runacc.py uses entrez to get all the run accession numbers for a given SRA study.
The NCBI now asks for all requests to include email as a parameter, with the email address of the user making the request (see their "General Usage Guidelines" for example).
You can pass it to any of the functions in this module as an argument (for example, query(..., email='me@here.edu')), or more comfortably it can be initialized at the module level with:
import entrez as ez
ez.EMAIL = 'me@here.edu'
and from that point on, all the queries will have the email automatically incorporated.
Similarly, an API key can be passed to any of the functions as an argument (query(..., api_key='ABCD123')), or initialized and incorporated automatically from that moment with:
import entrez as ez
ez.API_KEY = 'ABCD123'
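For readers curious about how module-level defaults of this kind can work, here is a minimal, self-contained sketch of the pattern. It only mirrors the names EMAIL and API_KEY used above; the helper with_defaults is hypothetical, and the rule that an explicitly passed value wins over the module-level one is an assumption of the sketch, not something confirmed about entrez.py.

```python
# Module-level defaults, mirroring the names EMAIL and API_KEY used above.
EMAIL = None
API_KEY = None


def with_defaults(params):
    """Return a copy of params with email and api_key filled in if set.

    Assumption of this sketch: explicitly passed values take precedence.
    """
    merged = dict(params)
    if EMAIL is not None:
        merged.setdefault('email', EMAIL)
    if API_KEY is not None:
        merged.setdefault('api_key', API_KEY)
    return merged


# Set the defaults once...
EMAIL = 'me@here.edu'
API_KEY = 'ABCD123'

# ...and every later set of parameters picks them up.
print(with_defaults({'db': 'snp', 'id': '3000'}))
print(with_defaults({'db': 'snp', 'email': 'other@there.org'}))
```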
There is also a small tool to perform web searches with BLAST (Basic Local Alignment Search Tool) at the NCBI, called web_blast.py.
For example, if you want to perform a BLAST search on the "non-redundant" database for the protein sequences that you have in a file named sequences.fasta, you can write:
./web_blast.py --program blastp --database nr --format Tabular sequences.fasta
You can run the tests in the tests directory with:
pytest
which will run all the functions that start with test_ in the test_*.py files.
There is some more information in the wiki.
This program is licensed under the GPL v3. See the project license for further details.
When I initially wrote this module (circa 2016) there were no Python alternatives (that I could find). That also explains why I chose to name it simply "entrez". Thanks to a more recent module, easy-entrez, here is a collection of alternatives: