merge baseurl and pattern for scraper clients (#7077) #7227

Draft
wants to merge 8 commits into base: main
8 changes: 8 additions & 0 deletions changelog/7077.breaking.rst
@@ -0,0 +1,8 @@
Dataretriever / "Scraper" clients no longer require a regex-formatted ``baseurl`` and a parse-formatted ``pattern`` variable; instead they take a single, full ``pattern`` variable written in the ``parse`` format.
Documentation on how to write the new patterns and on the internal Scraper algorithm has been added to the topic guide on adding new sources to Fido.
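
An illustrative sketch of the change (with a made-up URL, not taken from any real client): a client that previously defined

.. code-block:: python

    baseurl = r'https://data.example.org/archive/%Y/%m/file_%Y%m%d\.fits'
    pattern = '{}/archive/{year:4d}/{month:2d}/file_{:8d}{}'

now defines a single attribute

.. code-block:: python

    pattern = ('https://data.example.org/archive/{{year:4d}}/{{month:2d}}/'
               'file_{{year:4d}}{{month:2d}}{{day:2d}}.fits')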

The :meth:`~sunpy.net.scraper.Scraper.extract_files_meta` function no longer requires an extractor pattern.

A new submodule, ``sunpy.net.scraper_utils``, has been created; Scraper helper functions like ``date_floor()``, ``extract_timestep()``, ``check_timerange()`` and ``get_timerange_from_exdict()`` can be accessed directly from there.

*All* the extracted timeranges now have a millisecond subtracted from the end date, i.e. they end at 23:59:59.999 of the preceding day, instead of the previous inconsistent behaviour where some could end at 00:00:00 of the end date, which led to undesirable cases like January 1, 2015 data also showing up in the 2014 year-long timerange.
2 changes: 2 additions & 0 deletions docs/reference/net.rst
@@ -90,3 +90,5 @@ for `sunpy.net.Fido`.
.. automodapi:: sunpy.net.attr

.. automodapi:: sunpy.net.scraper

.. automodapi:: sunpy.net.scraper_utils
70 changes: 56 additions & 14 deletions docs/topic_guide/extending_fido.rst
@@ -17,30 +17,72 @@ The main place this is done is when constructing a `~.UnifiedResponse` object, w

.. _sunpy-topic-guide-new-source-for-fido-add-new-scraper-client:

Writing a new "scraper" client
==============================
A brief explanation of how "scraper" clients work
=================================================

A "scraper" Fido client (also sometimes referred to as a "data retriever" client) is a Fido client which uses the URL `~sunpy.net.scraper.Scraper` to find files on remote servers.
If the data provider you want to integrate does not provide a tree of files with predictable URLs then a "full" client is more likely to provide the functionality you need.

A new "scraper" client inherits from `~sunpy.net.dataretriever.client.GenericClient` and requires a minimum of these three components:
A new "scraper" client inherits from `~sunpy.net.dataretriever.client.GenericClient` and requires a minimum of these two components:

* A class method :meth:`~sunpy.net.base_client.BaseClient.register_values`; this registers the "attrs" that are supported by the client.
It returns a dictionary where keys are the supported attrs and values are lists of tuples.
Each `tuple` contains the "attr" value and its description.
* A class attribute ``baseurl``; this is a regular expression which is used to match the URLs supported by the client.
* A class attribute ``pattern``; this is a template used to extract the metadata from URLs matched by ``baseurl``.
The extraction uses the `~sunpy.extern.parse.parse` format.
* A class attribute ``pattern``; this is a string used to match the URLs supported by the client and extract metadata from the matched URLs.
The time and other metadata attributes to be extracted are written using the `~sunpy.extern.parse.parse` format, in double curly-brackets to differentiate them from ``kwargs`` parameters, which are written in single curly-brackets.
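
As a minimal illustration of the two bracket styles (the URL is hypothetical, not a real data source), single-braced names are substituted by keyword arguments while the double-braced ``parse`` fields survive the substitution:

.. code-block:: python

    >>> pattern = 'https://data.example.org/{instrument}/{{year:4d}}/{{month:2d}}/img.fits'
    >>> # str.format fills {instrument} and collapses the doubled braces into parse fields
    >>> pattern.format(instrument='xyz')
    'https://data.example.org/xyz/{year:4d}/{month:2d}/img.fits'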

Each such client relies on `~sunpy.net.scraper.Scraper` to query for files using the :meth:`~sunpy.net.scraper.Scraper.filelist` method. The general algorithm `~sunpy.net.scraper.Scraper` uses to do this is:

1. It takes as input a generalised ``pattern`` describing what a desired filepath looks like, following the ``parse`` format. A version of the pattern following the datetime format, called the ``timepattern``, is also generated.

.. code-block:: python

>>> from sunpy.net import Scraper
>>> pattern = ('http://proba2.oma.be/{instrument}/data/bsd/{{year:4d}}/{{month:2d}}/{{day:2d}}/'
... '{instrument}_lv1_{{year:4d}}{{month:2d}}{{day:2d}}_{{hour:2d}}{{minute:2d}}{{second:2d}}.fits')
>>> s = Scraper(pattern, instrument='swap')

2. The smallest unit of time, i.e. the time-step, for that directory pattern (the full timepattern minus the filename at the end) is then detected using :func:`~sunpy.net.scraper_utils.extract_timestep`.

.. code-block:: python

>>> from sunpy.net.scraper_utils import extract_timestep
>>> extract_timestep("http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits") # timepattern = 'http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits'
relativedelta(seconds=+1)

3. `~sunpy.net.scraper.Scraper.range` is then called on the pattern: for each time between start and stop, in units of the timestep, the time is "floored" according to the pattern via :func:`~sunpy.net.scraper_utils.date_floor` and the directory pattern is filled in with it.

.. code-block:: python

>>> from sunpy.time import TimeRange
>>> timerange = TimeRange('2015-01-01T00:08:00','2015-01-03T00:00:00')
>>> s.range(timerange)
['http://proba2.oma.be/swap/data/bsd/2015/01/01/',
'http://proba2.oma.be/swap/data/bsd/2015/01/02/',
'http://proba2.oma.be/swap/data/bsd/2015/01/03/']

4. The location given by each filled directory pattern is visited and a list of the files there is obtained. This is handled differently in the :meth:`~sunpy.net.scraper.Scraper.filelist` method depending on whether the pattern is a web URL, a ``file://`` path or an ``ftp://`` path.
5. The name of each file found is then examined to determine whether it matches the remaining portion of the pattern, using :func:`~sunpy.extern.parse.parse` (a sketch of this matching step is shown after the example below).
6. Each matching file is then checked for lying within the intended timerange using :func:`~sunpy.net.scraper_utils.check_timerange`, which in turn uses :func:`~sunpy.net.scraper_utils.get_timerange_from_exdict` to get the timerange covered by each file. The files that satisfy these conditions are added to the output.

.. code-block:: python

>>> s.filelist(timerange) # doctest: +REMOTE_DATA
['http://proba2.oma.be/swap/data/bsd/2015/01/01/swap_lv1_20150101_000857.fits',
'http://proba2.oma.be/swap/data/bsd/2015/01/01/swap_lv1_20150101_001027.fits',
'...',
'http://proba2.oma.be/swap/data/bsd/2015/01/01/swap_lv1_20150101_235947.fits']
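
The matching in step 5 can be sketched with the bundled ``parse`` function; the filename below is taken from the listing above, and splitting off the filename part of the pattern by hand is only for illustration:

.. code-block:: python

    >>> from sunpy.extern.parse import parse
    >>> filepattern = 'swap_lv1_{year:4d}{month:2d}{day:2d}_{hour:2d}{minute:2d}{second:2d}.fits'
    >>> extracted = parse(filepattern, 'swap_lv1_20150101_000857.fits')
    >>> extracted['year'], extracted['month'], extracted['day'], extracted['hour'], extracted['minute'], extracted['second']
    (2015, 1, 1, 0, 8, 57)

For step 6, the dictionary of extracted values is turned into the timerange covered by the file, whose end is one millisecond before the start of the next day; the call below is a sketch assuming ``get_timerange_from_exdict`` accepts a plain dictionary of integer values:

.. code-block:: python

    >>> from sunpy.net.scraper_utils import get_timerange_from_exdict
    >>> tr = get_timerange_from_exdict({'year': 2015, 'month': 1, 'day': 1})
    >>> tr.start.isot, tr.end.isot
    ('2015-01-01T00:00:00.000', '2015-01-01T23:59:59.999')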

Writing a new "scraper" client
==============================
`~sunpy.net.scraper.Scraper` thus allows us to write Fido clients for a variety of sources. For a simple example of a scraper client, we can look at the implementation of `sunpy.net.dataretriever.sources.eve.EVEClient` in sunpy.

For a simple example of a scraper client, we can look at the implementation of `sunpy.net.dataretriever.sources.eve.EVEClient` in sunpy.
A version without documentation strings is reproduced below:

.. code-block:: python

class EVEClient(GenericClient):
baseurl = (r'http://lasp.colorado.edu/eve/data_access/evewebdata/quicklook/'
r'L0CS/SpWx/%Y/%Y%m%d_EVE_L0CS_DIODES_1m.txt')
pattern = '{}/SpWx/{:4d}/{year:4d}{month:2d}{day:2d}_EVE_L{Level:1d}{}'
pattern = ('http://lasp.colorado.edu/eve/data_access/evewebdata/quicklook/L0CS/SpWx/'
'{{year:4d}}/{{year:4d}}{{month:2d}}{{day:2d}}_EVE_L{{Level:1d}}CS_DIODES_1m.txt')

@classmethod
def register_values(cls):
@@ -53,10 +95,10 @@
return adict

This client scrapes all the URLs available under the base url ``http://lasp.colorado.edu/eve/data_access/evewebdata/quicklook/L0CS/SpWx/``.
`~sunpy.net.scraper.Scraper` is primarily focused on URL parsing based on time ranges, so the rest of the ``baseurl`` pattern specifies where in the pattern the time information is located, using `strptime <https://strftime.org/>`__ notation.
The ``pattern`` attribute is used to populate the results table from the URLs matched with the ``baseurl``.
`~sunpy.net.scraper.Scraper` is primarily focused on URL parsing based on time ranges, so the rest of the ``pattern`` specifies where in the URL the time information is located, using `parse <https://github.com/r1chardj0n3s/parse/>`__ notation.
The ``pattern`` attribute is first filled in with the calculated time-based values, and then used to populate the results table from the URLs matched with the ``pattern``.
It includes some of the time definitions, as well as names of attrs (in this case "Level").
The supported time keys are: 'year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond'.
The supported time keys are: '{year:4d}', '{year:2d}', '{month:2d}', '{month_name:l}', '{month_name_abbr:l}', '{day:2d}', '{day_of_year:3d}', '{hour:2d}', '{minute:2d}', '{second:2d}', '{microsecond:6d}', '{millisecond:3d}' and '{week_number:2d}'.
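
For instance, a hypothetical archive organised by abbreviated month name (again, the URL is illustrative) could be described with a pattern such as:

.. code-block:: python

    >>> from sunpy.net import Scraper
    >>> s = Scraper('https://archive.example.org/{{year:4d}}/{{month_name_abbr:l}}/'
    ...             'img_{{year:4d}}{{month:2d}}{{day:2d}}.fits')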

The attrs returned in the ``register_values()`` method are used to match your client to a search, as well as adding their values to the attr.
This means that after this client has been imported, running ``print(a.Provider)`` will show that the ``EVEClient`` has registered a provider value of ``LASP``.
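
A minimal sketch of how such a client registers its values is shown below; the client name, attrs and descriptions are illustrative, not those of a real client:

.. code-block:: python

    from sunpy.net import attrs
    from sunpy.net.dataretriever import GenericClient

    class ExampleClient(GenericClient):
        # pattern omitted here; see the EVEClient example above.

        @classmethod
        def register_values(cls):
            # Keys are attr types; values are lists of (value, description) tuples.
            adict = {
                attrs.Instrument: [('EXAMPLE', 'A hypothetical instrument.')],
                attrs.Provider: [('EXAMPLEPROVIDER', 'A hypothetical data provider.')],
                attrs.Level: [('1', 'Level 1 data.')],
            }
            return adict
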
20 changes: 9 additions & 11 deletions sunpy/net/dataretriever/client.py
@@ -8,7 +8,8 @@
from sunpy.net import attrs as a
from sunpy.net.attr import SimpleAttr
from sunpy.net.base_client import BaseClient, QueryResponseRow, QueryResponseTable
from sunpy.net.scraper import Scraper, get_timerange_from_exdict
from sunpy.net.scraper import Scraper
from sunpy.net.scraper_utils import get_timerange_from_exdict
from sunpy.time import TimeRange
from sunpy.util.parfive_helpers import Downloader

@@ -57,9 +58,7 @@
:meth:`~sunpy.net.dataretriever.GenericClient.post_search_hook`.
They help to translate the attrs for scraper before and after the search respectively.
"""
baseurl = None
# A regex string that can match all urls supported by the client.
# A string which is used to extract the desired metadata from urls correctly,
# A string which is used to match all files and extract the desired metadata from urls correctly,
# using ``sunpy.extern.parse.parse``.
pattern = None
# Set of required 'attrs' for client to handle the query.
@@ -115,12 +114,12 @@
@classmethod
def pre_search_hook(cls, *args, **kwargs):
"""
Helper function to return the baseurl, pattern and matchdict
for the client required by :func:`~sunpy.net.dataretriever.GenericClient.search`
Helper function to return the pattern and matchdict for the
client required by :func:`~sunpy.net.dataretriever.GenericClient.search`
before using the scraper.
"""
matchdict = cls._get_match_dict(*args, **kwargs)
return cls.baseurl, cls.pattern, matchdict
return cls.pattern, matchdict

@classmethod
def _can_handle_query(cls, *query):
@@ -233,11 +232,10 @@
-------
A `QueryResponse` instance containing the query result.
"""
baseurl, pattern, matchdict = self.pre_search_hook(*args, **kwargs)
scraper = Scraper(baseurl, regex=True)
pattern, matchdict = self.pre_search_hook(*args, **kwargs)
scraper = Scraper(pattern)

tr = TimeRange(matchdict['Start Time'], matchdict['End Time'])
filesmeta = scraper._extract_files_meta(tr, extractor=pattern,
matcher=matchdict)
filesmeta = scraper._extract_files_meta(tr, matcher=matchdict)

filesmeta = sorted(filesmeta, key=lambda k: k['url'])
metalist = []
for i in filesmeta:
5 changes: 2 additions & 3 deletions sunpy/net/dataretriever/sources/eve.py
@@ -35,9 +35,8 @@ class EVEClient(GenericClient):
<BLANKLINE>

"""
baseurl = (r'https://lasp.colorado.edu/eve/data_access/eve_data/quicklook/'
r'L0CS/SpWx/%Y/%Y%m%d_EVE_L0CS_DIODES_1m.txt')
pattern = '{}/SpWx/{:4d}/{year:4d}{month:2d}{day:2d}_EVE_L{Level:1d}{}'
pattern = ('https://lasp.colorado.edu/eve/data_access/eve_data/quicklook/L0CS/SpWx/'
'{{year:4d}}/{{year:4d}}{{month:2d}}{{day:2d}}_EVE_L{{Level:1d}}CS_DIODES_1m.txt')

@property
def info_url(self):
4 changes: 2 additions & 2 deletions sunpy/net/dataretriever/sources/fermi_gbm.py
@@ -49,8 +49,8 @@ class GBMClient(GenericClient):

"""

baseurl = r'https://heasarc.gsfc.nasa.gov/FTP/fermi/data/gbm/daily/%Y/%m/%d/current/glg_(\w){5}_(\w){2}_%y%m%d_.*\.pha'
pattern = '{}/daily/{year:4d}/{month:2d}/{day:2d}/current/glg_{Resolution:5}_{Detector:2}_{:6d}{}'
pattern = ('https://heasarc.gsfc.nasa.gov/FTP/fermi/data/gbm/daily/{{year:4d}}/{{month:2d}}/{{day:2d}}/current/'
'glg_{{Resolution:5}}_{{Detector:2}}_{{year:2d}}{{month:2d}}{{day:2d}}_v00.pha')

@property
def info_url(self):