fdsn client get_availability? #3002

Draft
wants to merge 8 commits into
base: master
from


@tcths tcths commented Mar 22, 2022

Hello,

I am wondering if there is any interest in including the fdsn availability service in the retrieval capabilities of the fdsn client. I notice that the earthworm client has a get_availability method, and that get_availability and get_availability_extent methods are included in the API for obspy.clients.filesystem.tsindex.Client.

I have in mind something similar here, although I notice that the JSON retrieval capability (for example, https://service.iris.edu/fdsnws/availability/1/query?network=IU&station=ANMO&channel=BHZ&format=json) is already very handy. The get_availability methods mentioned above seem to prefer a list(tuple) return value.
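For illustration, the request URL quoted above can be assembled with the standard library alone. This is a minimal sketch, not ObsPy code: `build_availability_url` is a hypothetical helper, and the `/fdsnws/availability/1/` path layout is simply taken from the example URL.

```python
from urllib.parse import urlencode


# Hypothetical helper for illustration only; not part of ObsPy.
def build_availability_url(base_url, endpoint="query", **params):
    """Return an fdsnws-availability URL for the given endpoint."""
    # Drop parameters that were not set so they do not appear in the URL.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/fdsnws/availability/1/{endpoint}?{query}"


url = build_availability_url(
    "https://service.iris.edu",
    network="IU", station="ANMO", channel="BHZ", format="json")
print(url)
```

Passing `endpoint="extent"` would target the other endpoint of the same service.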

I notice also some previous discussion here and here and here, but am not certain of the relationship between that and this.

PR Checklist

  • Correct base branch selected? master for new features, maintenance_... for bug fixes
  • This PR is not directly related to an existing issue (which has no PR yet).
  • If the PR is making changes to documentation, docs pages can be built automatically.
    Just add the "build_docs" tag to this PR.
    Docs will be served at docs.obspy.org/pr/{branch_name} (do not use master branch).
    Please post a link to the relevant piece of documentation.
  • If all tests including network modules (e.g. clients.fdsn) should be tested for the PR,
    just add the "test_network" tag to this PR.
  • All tests still pass.
  • Any new features or fixed regressions are covered via new tests.
  • Any new or changed features are fully documented.
  • Significant changes have been added to CHANGELOG.txt .
  • First time contributors have added your name to CONTRIBUTORS.txt .
  • If the changes affect any plotting functions you have checked that the plots
    from all the CI builds look correct. Add the "upload_plots" tag so that plotting
    outputs are attached as artifacts.
  • Add the "ready for review" tag when you are ready for the PR to be reviewed.

@filefolder
Contributor

this is definitely something i would like to see; actually i thought this already existed

@tcths
Author

tcths commented Mar 10, 2022

I'd like to try to do this. There is a beginning here: master...tcths:fdsn-client-getavailability

@megies
Member

megies commented Mar 10, 2022

👍 Feel free to open a PR right away, so people can give feedback

@megies megies added this to the 1.4.0 milestone Mar 10, 2022
@tcths
Author

tcths commented Mar 22, 2022

PR opened; comments welcome

@tcths tcths marked this pull request as draft March 22, 2022 17:24
@trichter trichter added the test_network label (tell github actions to also run network tests for this PR) Mar 22, 2022

availability = self._download(url, return_string=True).decode()
lines = [line.split() for line in availability.strip().split('\n')[1:]] # skip header
extents = [(line[0], line[1], line[2], line[3], UTCDateTime(line[6]), UTCDateTime(line[7])) for line in lines]
Member


I think what the return type looks like is one of the most important things to decide. There might not be a more sophisticated way to return this, though, so this simple structure might be fine.

Author

@tcths tcths Mar 24, 2022


yes. I am looking at this and also at the API specification here as examples.

Relatedly, I notice that there are optional parameters in the specification ('merge', 'show') that change the number of fields included in the results from FDSN.
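Because the number of fields can change with the request options, one way to avoid hard-coding column positions is to read the field names from the header line of the text response. The following is a minimal sketch under that assumption; `parse_availability_text` is a hypothetical name, and it assumes every field in a row is non-empty and whitespace-separated, as in the sample output shown in this thread.

```python
# Hypothetical helper for illustration only; not part of ObsPy.
def parse_availability_text(text):
    """Parse a text response into a list of dicts keyed by header names."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    # The first line is the header, e.g. "#Network Station Location ...".
    names = [name.lower() for name in lines[0].lstrip("#").split()]
    rows = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(names):
            raise ValueError(f"unexpected number of fields: {line!r}")
        rows.append(dict(zip(names, values)))
    return rows


sample = """\
#Network Station Location Channel Quality SampleRate Earliest Latest Updated TimeSpans Restriction
IU ANMO 00 BHZ M 0.0 2002-08-28T18:17:51.000000Z 2008-05-23T23:09:24.000000Z 2017-12-06T03:42:35Z 9051 OPEN
IU ANMO 00 BHZ M 40.0 2018-07-09T20:46:40.594538Z 2022-03-28T23:59:59.994538Z 2022-03-29T06:56:15Z 24 OPEN
"""
rows = parse_availability_text(sample)
print(rows[0]["earliest"], rows[0]["latest"])
```

All values stay strings here; converting timestamps (for example to `UTCDateTime`) would be a separate step.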

@megies
Member

megies commented Mar 23, 2022

I have two comments, above. Otherwise this looks pretty much ready, besides the many PEP8 failures, the missing changelog entry, etc.

@tcths
Author

tcths commented Mar 24, 2022

a list of unimplemented functionality possibly tbd:

  • writing to file is not implemented
  • optional parameters (merge, show) that change resultset length are not implemented
    • should they be implemented?
    • does this require a more 'sophisticated' return type?
  • extent endpoint request is not implemented
    • note: 'show' and 'mergegaps' optional params are defined only for query endpoint
  • optional queryauth and extentauth endpoint requests are not implemented
    • probably include_restricted FDSNWS parameter is only relevant if these are implemented?

@megies
Member

megies commented Mar 30, 2022

a list of unimplemented functionality possibly tbd:

* writing to file is not implemented

Do we even need that? It's just tuples of strings for the most part, and it's easy for people to write it out should that need ever arise. If you are thinking of e.g. get_waveforms() having that option, there is a more profound reason there: it avoids reading and rewriting the MSEED data, which loses quite a bit of low-level information at the miniSEED block level.

* optional parameters (merge, show) that change resultset length are not implemented
  
  * should they be implemented?

So, I had a look at those optional parameters, and I saw that the availability service actually has two endpoints, "query" and "extent". It looks like you only use "query", so it would be good to discuss how we treat this part. It looks like "extent" is just for a general overview, basically more or less the same as what the "station" webservice would return in textual form? If that's the case we can probably ignore the "extent" endpoint and keep this as is, only using the "query" endpoint?

  * does this require a more 'sophisticated' return type?

Personally, for me it would be OK to have the return type vary depending on what options are specified, e.g. having more fields in the tuples when asking for the "last update time".
On the other hand, I have thought before that it would be nice to have something like a UTCTimeSpan object, with a start and end time and functionality for merging etc. But that might be a much bigger endeavor and would need careful thinking through.

* extent endpoint request is not implemented

Ah, should've read to the end first.. see above. To be honest, I'm quite confused by this "extent" thing. It seems to me it should be the same as "station" query with "matchtimeseries=true"? To me this "extent" thing seems like an obscure middle ground between going full detail with "availability/query" or going pure metadata with "station/query" (and what is that timespan with sampling rate 0.0??). I don't know why I would wanna use it, but if anybody sees reason for it, other opinions would be good to hear. Seems like bad design by FDSN working group having multiple ways to try and do the same..

  • station/query w/ and w/o includeavailability and/or matchtimeseries
  • availability/extent
  • availability/query

http://service.iris.edu/fdsnws/availability/1/extent?network=IU&station=ANMO&channel=BHZ&location=00

#Network Station Location Channel Quality SampleRate Earliest Latest Updated TimeSpans Restriction
IU ANMO 00 BHZ M 0.0 2002-08-28T18:17:51.000000Z 2008-05-23T23:09:24.000000Z 2017-12-06T03:42:35Z 9051 OPEN
IU ANMO 00 BHZ M 20.0 1998-10-26T20:35:58.310050Z 2018-07-09T20:45:47.369538Z 2021-09-10T07:10:56Z 2238 OPEN
IU ANMO 00 BHZ M 40.0 2018-07-09T20:46:40.594538Z 2022-03-28T23:59:59.994538Z 2022-03-29T06:56:15Z 24 OPEN

http://service.iris.edu/fdsnws/station/1/query?network=IU&station=ANMO&channel=BHZ&location=00&level=channel&format=text&matchtimeseries=true

#Network | Station | Location | Channel | Latitude | Longitude | Elevation | Depth | Azimuth | Dip | SensorDescription | Scale | ScaleFreq | ScaleUnits | SampleRate | StartTime | EndTime
IU|ANMO|00|BHZ|34.9459|-106.4572|1700.0|150.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|8.64679E8|0.02|m/s|20.0|1998-10-26T20:00:00.0000|2000-10-19T16:00:00.0000
IU|ANMO|00|BHZ|34.9502|-106.4602|1743.0|96.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|8.64679E8|0.02|m/s|20.0|2000-10-19T16:00:00.0000|2002-11-19T21:07:00.0000
IU|ANMO|00|BHZ|34.945981|-106.457133|1671.0|145.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|8.11548E8|0.02|m/s|20.0|2002-11-19T21:07:00.0000|2008-06-30T00:00:00.0000
IU|ANMO|00|BHZ|34.945981|-106.457133|1671.0|145.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|8.1872E8|0.02|m/s|20.0|2008-06-30T00:00:00.0000|2008-06-30T20:00:00.0000
IU|ANMO|00|BHZ|34.945981|-106.457133|1671.0|145.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|3.27511E9|0.02|m/s|20.0|2008-06-30T20:00:00.0000|2011-02-18T19:11:00.0000
IU|ANMO|00|BHZ|34.945981|-106.457133|1671.0|145.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|3.27511E9|0.02|m/s|20.0|2011-02-18T19:11:00.0000|2012-03-12T20:28:00.0000
IU|ANMO|00|BHZ|34.945981|-106.457133|1671.0|145.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|3.27511E9|0.02|m/s|20.0|2012-03-12T20:28:00.0000|2014-12-17T18:40:00.0000
IU|ANMO|00|BHZ|34.94591|-106.4572|1671.0|145.0|0.0|-90.0|Geotech KS-54000 Borehole Seismometer|3.40413E9|0.02|m/s|20.0|2014-12-17T18:40:00.0000|2018-07-09T20:45:00.0000
IU|ANMO|00|BHZ|34.94591|-106.4572|1632.7|188.0|0.0|-90.0|Streckeisen STS-6A VBB Seismometer|1.98475E9|0.02|m/s|40.0|2018-07-09T20:45:00.0000|

  * note: 'show' and 'mergegaps' optional params are defined only for query endpoint

* optional queryauth and extentauth endpoint requests are not implemented
  
  * probably include_restricted FDSNWS parameter is only relevant if these are implemented?

Probably.. why would there be authenticated requests if restricted stations already showed up in unauthenticated requests? But on the server side they sure could have two layers of obscurity, i.e. have some restricted stations show info on unauthenticated requests but others not.

queryauth should be trivial to add, tbh. All the authentication is already handled.

1446     def _build_url(self, service, resource_type, parameters={}):
1447         """
1448         Builds the correct URL.
1449 
1450         Replaces "query" with "queryauth" if client has authentication
1451         information.
1452         """
1453         # authenticated dataselect queries have different target URL
1454         if self.user is not None:
1455             if service == "dataselect" and resource_type == "query":
1456                 resource_type = "queryauth"
1457         return build_url(self.base_url, service, self.major_versions[service],
1458                          resource_type, parameters,
1459                          service_mappings=self._service_mappings,
1460                          subpath=self.url_subpath)

Just have to change line 1455

1455             if service in ("dataselect", "availability") and resource_type == "query":
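The effect of that one-line change can be sketched in isolation. This is a standalone illustration, not the ObsPy method itself: `resolve_resource_type` and `AUTH_SERVICES` are hypothetical names, and only the "query" to "queryauth" switch discussed above is modeled.

```python
# Services whose "query" endpoint has an authenticated "queryauth" twin.
# Including "availability" here mirrors the change proposed above.
AUTH_SERVICES = ("dataselect", "availability")


# Hypothetical helper for illustration only; not part of ObsPy.
def resolve_resource_type(service, resource_type, user=None):
    """Return "queryauth" instead of "query" when credentials are set."""
    if (user is not None and service in AUTH_SERVICES
            and resource_type == "query"):
        return "queryauth"
    return resource_type


print(resolve_resource_type("availability", "query", user="someone"))
print(resolve_resource_type("availability", "query"))
```

An "extent" to "extentauth" switch is not covered by this sketch and would need its own rule if the extent endpoint is implemented.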

@tcths
Author

tcths commented Apr 19, 2022

One possible use case for the availability/extent endpoint is to obtain a list of streams for which availability information exists. I suppose this could be inferred from the station/query endpoint, but availability/extent seems to provide the information a little more cleanly and closer to the source of truth. On the other hand, relying on the availability/query endpoint for this information could be very inefficient, for example when a client wants to present a list of stations and let a user select from them for full availability information. This is just what I have observed; I do not speculate concerning the intent :)

Regarding writing to file, I agree that if there is not a compelling reason to include that in the get_* method then it is nicer to exclude it.

So the list of things to do:

  1. (possibly) implement extent endpoint request (discussion still open)
  2. implement optional parameters
    1. query endpoint
      1. show
      2. merge
    2. extent endpoint
      1. merge
    3. return types (discussion still open)
      1. option: a sequence of variable length sequences (e.g. a list of tuples)
      2. option: a sequence of UTCTimeSpan objects
  3. implement auth endpoints (should be trivial, as noted above)

We could go ahead and implement all this with list of variable length tuples as the return type and then see if there is time available to think about the UTCTimeSpan objects? Even if we eventually decided to move forward with the UTCTimeSpan objects there would not be a lot of wasted effort, a few unit tests maybe.

@megies
Member

megies commented Apr 25, 2022

  1. (possibly) implement extent endpoint request (discussion still open)

I had another look, how about the following.. add a kwarg details=True and use endpoint query by default and fall back to extent endpoint when setting details=False. That's the impression I get from these endpoints semantically, in any case.

It seems that all extent does is merge timespans that have the same SEED ID (but have some instrument changes or sampling rate changes etc.) and it adds three fields "Updated TimeSpans Restriction".

  1. implement optional parameters

I think this can be unified like above, e.g. we could just raise an exception if orderby gets one of these values only specified for extent endpoint when requesting details=True. Same with mergegaps and show, just raise an exception if these are used with details=False.
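The endpoint selection and parameter checks described here could look roughly like the sketch below. `select_endpoint` is a hypothetical name, and the set of `orderby` values treated as extent-only is an assumption that should be checked against the fdsnws-availability specification.

```python
# Parameters that, per the discussion above, only the "query" endpoint takes.
QUERY_ONLY_PARAMS = ("mergegaps", "show")
# Assumption: orderby values that are only defined for the "extent" endpoint.
EXTENT_ONLY_ORDERBY = ("timespancount", "timespancount_desc")


# Hypothetical helper for illustration only; not part of ObsPy.
def select_endpoint(details=True, **params):
    """Pick "query" or "extent" and reject parameters the endpoint lacks."""
    if details:
        if params.get("orderby") in EXTENT_ONLY_ORDERBY:
            raise ValueError(
                f"orderby={params['orderby']!r} requires details=False")
        return "query"
    used = [name for name in QUERY_ONLY_PARAMS
            if params.get(name) is not None]
    if used:
        raise ValueError(f"{', '.join(used)} require details=True")
    return "extent"


print(select_endpoint(details=True, show="latestupdate"))
print(select_endpoint(details=False))
```

Raising early keeps the error on the client side instead of relying on the server to reject an invalid combination.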

iii. return types (discussion still open)

I see three options..

  • plain tuples.
    • positive: KISS
    • negative: variable length return types are kinda ugly
  • namedtuple.
    • positive: simple, but fields can be used by their name. additional fields from extent / "details=False" could be handled the same for both result types (with just defaults of None in the one case)
    • negative: if we decide at some later point to have a time span object, we cannot replace the result type of these methods without breaking people's code
  • create a TimeSpan object right now
    • positive: most canonical, additional functionality (like merging time spans etc.) could be added later without breaking people's code
    • negative: if we overlook any basic concepts needed for this new object, we might set it in stone in a way that we need to change later on (breaking people's code), but if we start out really simple we should be able to expand on it later
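import collections

from obspy import UTCDateTime

# The last three fields (updated, time_spans, restriction) are the extras
# from extent / "details=False"; they default to None otherwise.
Availability = collections.namedtuple(
    'Availability',
    ['network', 'station', 'location', 'channel', 'quality', 'sampling_rate',
     'earliest', 'latest', 'updated', 'time_spans', 'restriction'],
    defaults=(None, None, None))

# One line of a text response without the three extra fields
line = 'IU ANMO 00 BHZ M 20.0 1998-10-26T20:35:58.310050Z 1998-10-26T20:37:31.610050Z'
items = line.split()
# Convert the sampling rate and both timestamps from strings
items[5] = float(items[5])
items[6] = UTCDateTime(items[6])
items[7] = UTCDateTime(items[7])

x = Availability(*items)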

I'm kinda thinking maybe we should just add a UTCTimeSpan object right now and just keep it as simple as possible for now so we can keep working on it later on..?
We could have the results be a list of such objects and even later if we decide we want more functionality on top we could replace it with some object type UTCTimeSpans(list) without breaking things.

class UTCTimeSpan(object):
    def __init__(self, start=None, end=None):
        self.start = start
        self.end = end
        
    @property
    def start(self):
        return self._start

    @start.setter
    def start(self, value):
        if value is None:
            self._start = None
        else:
            self._start = UTCDateTime(value)

    @property
    def end(self):
        return self._end

    @end.setter
    def end(self, value):
        if value is None:
            self._end = None
        else:
            self._end = UTCDateTime(value)

We could a) use that directly and just set fields as needed for the return types or b) go all out and define some class Availability(UTCTimeSpan) right away, defining all the fields in it. Either way would be fine I think and if we start simple (a) we could expand to option (b) later on without breaking things, most likely.
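As an example of the kind of functionality that could be layered on later, merging overlapping time spans is a short routine. This is only a sketch: it uses plain `(start, end)` tuples of standard-library `datetime` objects rather than the proposed UTCTimeSpan class, and `merge_spans` is a hypothetical name.

```python
from datetime import datetime


# Hypothetical helper for illustration only; not part of ObsPy.
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans, sorted by start."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


spans = [
    (datetime(2020, 1, 3), datetime(2020, 1, 5)),
    (datetime(2020, 1, 1), datetime(2020, 1, 2)),
    (datetime(2020, 1, 4), datetime(2020, 1, 8)),
]
merged = merge_spans(spans)
print(merged)
```

A tolerance for small gaps between spans is left out here; it could be added as an extra argument.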

Other opinions? Maybe @trichter?

@megies
Member

megies commented Nov 15, 2022

Tempted to bump this to the next version, to speed up getting 1.4.0 out.. I think there were still some things unclear about how to do some of the implementation? What do you think @tcths? Anybody wanna offer opinions on the above implementation discussion?

@iandkelly

Any updates on this? Following with interest!

@obspy-bot
Contributor

This pull request has been mentioned on ObsPy Forum. There might be relevant details there:

https://discourse.obspy.org/t/determining-station-data-availability-from-inventory/1895/1

Labels
.clients.fdsn, test_network (tell github actions to also run network tests for this PR)
6 participants