merge baseurl and pattern for scraper clients (#7077) #7227

Draft
wants to merge 8 commits into base: main
8 changes: 8 additions & 0 deletions changelog/7077.breaking.rst
@@ -0,0 +1,8 @@
Dataretriever / "Scraper" clients no longer require a regex-formatted ``baseurl`` and a parse-formatted ``pattern`` variable; instead they take a single, full ``pattern`` variable written in the ``parse`` format.
Documentation on how to write the new patterns and on the internal Scraper algorithm has been added to the topic guide on adding new sources to Fido.
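
An illustrative sketch of the change (with a made-up URL, not taken from any real client): a client that previously defined

.. code-block:: python

    baseurl = r'https://data.example.org/archive/%Y/%m/file_%Y%m%d\.fits'
    pattern = '{}/archive/{year:4d}/{month:2d}/file_{:8d}{}'

now defines a single attribute

.. code-block:: python

    pattern = ('https://data.example.org/archive/{{year:4d}}/{{month:2d}}/'
               'file_{{year:4d}}{{month:2d}}{{day:2d}}.fits')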

The :meth:`~sunpy.net.scraper.Scraper.extract_files_meta` function no longer requires an extractor pattern.

A new submodule, ``sunpy.net.scraper_utils``, has been created; Scraper helper functions like ``date_floor()``, ``extract_timestep()``, ``check_timerange()`` and ``get_timerange_from_exdict()`` can be accessed directly from there.

*All* the extracted timeranges now have a millisecond subtracted from the end date, i.e. they end at 23:59:59.999 of the preceding day, instead of the previous inconsistent behaviour where some could end at 00:00:00 of the end date, which led to undesirable cases like January 1, 2015 data also showing up in the 2014 year-long timerange.
2 changes: 2 additions & 0 deletions docs/reference/net.rst
@@ -90,3 +90,5 @@ for `sunpy.net.Fido`.
.. automodapi:: sunpy.net.attr

.. automodapi:: sunpy.net.scraper

.. automodapi:: sunpy.net.scraper_utils
70 changes: 56 additions & 14 deletions docs/topic_guide/extending_fido.rst
@@ -17,30 +17,72 @@ The main place this is done is when constructing a `~.UnifiedResponse` object, w

.. _sunpy-topic-guide-new-source-for-fido-add-new-scraper-client:

Writing a new "scraper" client
==============================
A brief explanation of how "scraper" clients work
=================================================

A "scraper" Fido client (also sometimes referred to as a "data retriever" client) is a Fido client which uses the URL `~sunpy.net.scraper.Scraper` to find files on remote servers.
If the data provider you want to integrate does not provide a tree of files with predictable URLs then a "full" client is more likely to provide the functionality you need.

A new "scraper" client inherits from `~sunpy.net.dataretriever.client.GenericClient` and requires a minimum of these three components:
A new "scraper" client inherits from `~sunpy.net.dataretriever.client.GenericClient` and requires a minimum of these two components:

* A class method :meth:`~sunpy.net.base_client.BaseClient.register_values`; this registers the "attrs" that are supported by the client.
It returns a dictionary where keys are the supported attrs and values are lists of tuples.
Each `tuple` contains the "attr" value and its description.
* A class attribute ``baseurl``; this is a regular expression which is used to match the URLs supported by the client.
* A class attribute ``pattern``; this is a template used to extract the metadata from URLs matched by ``baseurl``.
The extraction uses the `~sunpy.extern.parse.parse` format.
* A class attribute ``pattern``; this is a string used to match the URLs supported by the client and extract metadata from the matched URLs.
The time and other metadata attributes to be extracted are written using the `~sunpy.extern.parse.parse` format, in double curly-brackets to differentiate them from ``kwargs`` parameters, which are written in single curly-brackets.
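
As a minimal illustration of the two bracket styles (the URL is hypothetical, not a real data source), single-braced names are substituted by keyword arguments while the double-braced ``parse`` fields survive the substitution:

.. code-block:: python

    >>> pattern = 'https://data.example.org/{instrument}/{{year:4d}}/{{month:2d}}/img.fits'
    >>> # str.format fills {instrument} and collapses the doubled braces into parse fields
    >>> pattern.format(instrument='xyz')
    'https://data.example.org/xyz/{year:4d}/{month:2d}/img.fits'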

Each such client relies on `~sunpy.net.scraper.Scraper` to query for files using the :meth:`~sunpy.net.scraper.Scraper.filelist` method. The general algorithm `~sunpy.net.scraper.Scraper` uses to do this is:

1. It takes as input a generalised ``pattern`` describing what a desired filepath looks like, following the ``parse`` format. A version of the pattern following the datetime format, called the ``timepattern``, is also generated.

.. code-block:: python

>>> from sunpy.net import Scraper
>>> pattern = ('http://proba2.oma.be/{instrument}/data/bsd/{{year:4d}}/{{month:2d}}/{{day:2d}}/'
... '{instrument}_lv1_{{year:4d}}{{month:2d}}{{day:2d}}_{{hour:2d}}{{minute:2d}}{{second:2d}}.fits')
>>> s = Scraper(pattern, instrument='swap')

2. The smallest unit of time, i.e. the time-step, for that directory pattern (the full timepattern minus the filename at the end) is then detected using :func:`~sunpy.net.scraper_utils.extract_timestep`.

.. code-block:: python

>>> from sunpy.net.scraper_utils import extract_timestep
>>> extract_timestep("http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits") # timepattern = 'http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits'
relativedelta(seconds=+1)

3. `~sunpy.net.scraper.Scraper.range` is then called on the pattern: for each time between start and stop, in units of the timestep, the time is "floored" according to the pattern via :func:`~sunpy.net.scraper_utils.date_floor` and the directory pattern is filled in with it.

.. code-block:: python

>>> from sunpy.time import TimeRange
>>> timerange = TimeRange('2015-01-01T00:08:00','2015-01-03T00:00:00')
>>> s.range(timerange)
['http://proba2.oma.be/swap/data/bsd/2015/01/01/',
'http://proba2.oma.be/swap/data/bsd/2015/01/02/',
'http://proba2.oma.be/swap/data/bsd/2015/01/03/']

4. The location given by each filled directory pattern is visited and a list of the files there is obtained. This is handled differently in the :meth:`~sunpy.net.scraper.Scraper.filelist` method depending on whether the pattern is a web URL, a ``file://`` path or an ``ftp://`` path.
5. The name of each file found is then examined to determine whether it matches the remaining portion of the pattern, using :func:`~sunpy.extern.parse.parse` (a sketch of this matching step is shown after the example below).
6. Each matching file is then checked for lying within the intended timerange using :func:`~sunpy.net.scraper_utils.check_timerange`, which in turn uses :func:`~sunpy.net.scraper_utils.get_timerange_from_exdict` to get the timerange covered by each file. The files that satisfy these conditions are added to the output.

.. code-block:: python

>>> s.filelist(timerange) # doctest: +REMOTE_DATA
['http://proba2.oma.be/swap/data/bsd/2015/01/01/swap_lv1_20150101_000857.fits',
'http://proba2.oma.be/swap/data/bsd/2015/01/01/swap_lv1_20150101_001027.fits',
'...',
'http://proba2.oma.be/swap/data/bsd/2015/01/01/swap_lv1_20150101_235947.fits']
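
The matching in step 5 can be sketched with the bundled ``parse`` function; the filename below is taken from the listing above, and splitting off the filename part of the pattern by hand is only for illustration:

.. code-block:: python

    >>> from sunpy.extern.parse import parse
    >>> filepattern = 'swap_lv1_{year:4d}{month:2d}{day:2d}_{hour:2d}{minute:2d}{second:2d}.fits'
    >>> extracted = parse(filepattern, 'swap_lv1_20150101_000857.fits')
    >>> extracted['year'], extracted['month'], extracted['day'], extracted['hour'], extracted['minute'], extracted['second']
    (2015, 1, 1, 0, 8, 57)

For step 6, the dictionary of extracted values is turned into the timerange covered by the file, whose end is one millisecond before the start of the next day; the call below is a sketch assuming ``get_timerange_from_exdict`` accepts a plain dictionary of integer values:

.. code-block:: python

    >>> from sunpy.net.scraper_utils import get_timerange_from_exdict
    >>> tr = get_timerange_from_exdict({'year': 2015, 'month': 1, 'day': 1})
    >>> tr.start.isot, tr.end.isot
    ('2015-01-01T00:00:00.000', '2015-01-01T23:59:59.999')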

Writing a new "scraper" client
==============================
`~sunpy.net.scraper.Scraper` thus allows us to write Fido clients for a variety of sources. For a simple example of a scraper client, we can look at the implementation of `sunpy.net.dataretriever.sources.eve.EVEClient` in sunpy.

For a simple example of a scraper client, we can look at the implementation of `sunpy.net.dataretriever.sources.eve.EVEClient` in sunpy.
A version without documentation strings is reproduced below:

.. code-block:: python

class EVEClient(GenericClient):
baseurl = (r'http://lasp.colorado.edu/eve/data_access/evewebdata/quicklook/'
r'L0CS/SpWx/%Y/%Y%m%d_EVE_L0CS_DIODES_1m.txt')
pattern = '{}/SpWx/{:4d}/{year:4d}{month:2d}{day:2d}_EVE_L{Level:1d}{}'
pattern = ('http://lasp.colorado.edu/eve/data_access/evewebdata/quicklook/L0CS/SpWx/'
'{{year:4d}}/{{year:4d}}{{month:2d}}{{day:2d}}_EVE_L{{Level:1d}}CS_DIODES_1m.txt')

@classmethod
def register_values(cls):
@@ -53,10 +95,10 @@
return adict

This client scrapes all the URLs available under the base url ``http://lasp.colorado.edu/eve/data_access/evewebdata/quicklook/L0CS/SpWx/``.
`~sunpy.net.scraper.Scraper` is primarily focused on URL parsing based on time ranges, so the rest of the ``baseurl`` pattern specifies where in the pattern the time information is located, using `strptime <https://strftime.org/>`__ notation.
The ``pattern`` attribute is used to populate the results table from the URLs matched with the ``baseurl``.
`~sunpy.net.scraper.Scraper` is primarily focused on URL parsing based on time ranges, so the rest of the ``pattern`` specifies where in the URL the time information is located, using `parse <https://github.com/r1chardj0n3s/parse/>`__ notation.
The ``pattern`` attribute is first filled in with the calculated time-based values, and then used to populate the results table from the URLs matched with the ``pattern``.
It includes some of the time definitions, as well as names of attrs (in this case "Level").
The supported time keys are: 'year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond'.
The supported time keys are: '{year:4d}', '{year:2d}', '{month:2d}', '{month_name:l}', '{month_name_abbr:l}', '{day:2d}', '{day_of_year:3d}', '{hour:2d}', '{minute:2d}', '{second:2d}', '{microsecond:6d}', '{millisecond:3d}' and '{week_number:2d}'.
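
For instance, a hypothetical archive organised by abbreviated month name (again, the URL is illustrative) could be described with a pattern such as:

.. code-block:: python

    >>> from sunpy.net import Scraper
    >>> s = Scraper('https://archive.example.org/{{year:4d}}/{{month_name_abbr:l}}/'
    ...             'img_{{year:4d}}{{month:2d}}{{day:2d}}.fits')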

The attrs returned in the ``register_values()`` method are used to match your client to a search, as well as adding their values to the attr.
This means that after this client has been imported, running ``print(a.Provider)`` will show that the ``EVEClient`` has registered a provider value of ``LASP``.
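
A minimal sketch of how such a client registers its values is shown below; the client name, attrs and descriptions are illustrative, not those of a real client:

.. code-block:: python

    from sunpy.net import attrs
    from sunpy.net.dataretriever import GenericClient

    class ExampleClient(GenericClient):
        # pattern omitted here; see the EVEClient example above.

        @classmethod
        def register_values(cls):
            # Keys are attr types; values are lists of (value, description) tuples.
            adict = {
                attrs.Instrument: [('EXAMPLE', 'A hypothetical instrument.')],
                attrs.Provider: [('EXAMPLEPROVIDER', 'A hypothetical data provider.')],
                attrs.Level: [('1', 'Level 1 data.')],
            }
            return adict
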
20 changes: 9 additions & 11 deletions sunpy/net/dataretriever/client.py
@@ -8,7 +8,8 @@
from sunpy.net import attrs as a
from sunpy.net.attr import SimpleAttr
from sunpy.net.base_client import BaseClient, QueryResponseRow, QueryResponseTable
from sunpy.net.scraper import Scraper, get_timerange_from_exdict
from sunpy.net.scraper import Scraper
from sunpy.net.scraper_utils import get_timerange_from_exdict
from sunpy.time import TimeRange
from sunpy.util.parfive_helpers import Downloader

@@ -57,9 +58,7 @@
:meth:`~sunpy.net.dataretriever.GenericClient.post_search_hook`.
They help to translate the attrs for scraper before and after the search respectively.
"""
baseurl = None
# A regex string that can match all urls supported by the client.
# A string which is used to extract the desired metadata from urls correctly,
# A string which is used to match all files and extract the desired metadata from urls correctly,
# using ``sunpy.extern.parse.parse``.
pattern = None
# Set of required 'attrs' for client to handle the query.
@@ -115,12 +114,12 @@
@classmethod
def pre_search_hook(cls, *args, **kwargs):
"""
Helper function to return the baseurl, pattern and matchdict
for the client required by :func:`~sunpy.net.dataretriever.GenericClient.search`
Helper function to return the pattern and matchdict for the
client required by :func:`~sunpy.net.dataretriever.GenericClient.search`
before using the scraper.
"""
matchdict = cls._get_match_dict(*args, **kwargs)
return cls.baseurl, cls.pattern, matchdict
return cls.pattern, matchdict

@classmethod
def _can_handle_query(cls, *query):
@@ -233,11 +232,10 @@
-------
A `QueryResponse` instance containing the query result.
"""
baseurl, pattern, matchdict = self.pre_search_hook(*args, **kwargs)
scraper = Scraper(baseurl, regex=True)
pattern, matchdict = self.pre_search_hook(*args, **kwargs)
scraper = Scraper(pattern)

tr = TimeRange(matchdict['Start Time'], matchdict['End Time'])
filesmeta = scraper._extract_files_meta(tr, extractor=pattern,
matcher=matchdict)
filesmeta = scraper._extract_files_meta(tr, matcher=matchdict)

filesmeta = sorted(filesmeta, key=lambda k: k['url'])
metalist = []
for i in filesmeta:
5 changes: 2 additions & 3 deletions sunpy/net/dataretriever/sources/eve.py
@@ -35,9 +35,8 @@ class EVEClient(GenericClient):
<BLANKLINE>

"""
baseurl = (r'https://lasp.colorado.edu/eve/data_access/eve_data/quicklook/'
r'L0CS/SpWx/%Y/%Y%m%d_EVE_L0CS_DIODES_1m.txt')
pattern = '{}/SpWx/{:4d}/{year:4d}{month:2d}{day:2d}_EVE_L{Level:1d}{}'
pattern = ('https://lasp.colorado.edu/eve/data_access/eve_data/quicklook/L0CS/SpWx/'
'{{year:4d}}/{{year:4d}}{{month:2d}}{{day:2d}}_EVE_L{{Level:1d}}CS_DIODES_1m.txt')

@property
def info_url(self):
4 changes: 2 additions & 2 deletions sunpy/net/dataretriever/sources/fermi_gbm.py
@@ -49,8 +49,8 @@ class GBMClient(GenericClient):

"""

baseurl = r'https://heasarc.gsfc.nasa.gov/FTP/fermi/data/gbm/daily/%Y/%m/%d/current/glg_(\w){5}_(\w){2}_%y%m%d_.*\.pha'
pattern = '{}/daily/{year:4d}/{month:2d}/{day:2d}/current/glg_{Resolution:5}_{Detector:2}_{:6d}{}'
pattern = ('https://heasarc.gsfc.nasa.gov/FTP/fermi/data/gbm/daily/{{year:4d}}/{{month:2d}}/{{day:2d}}/current/'
'glg_{{Resolution:5}}_{{Detector:2}}_{{year:2d}}{{month:2d}}{{day:2d}}_v00.pha')

@property
def info_url(self):