Data Aggregation

Often, when querying OSM history data, one is interested in getting multiple results at once, each referring to a certain subset of the queried data. For example, a query for multiple timestamps should typically return one result per timestamp.

The OSHDB API provides a flexible and powerful way to produce aggregated results that are calculated for arbitrary subsets of the data. This aggregateBy functionality also supports combining multiple grouping functions by chaining them one after another.

When any of the aggregateBy methods listed below is executed, the query's MapReducer is transformed into a MapAggregator object, which is (mostly) functionally equivalent to a MapReducer. The difference is that calling a reduce method returns an associative list of multiple values instead of a single result value: the result contains one entry for each requested grouping.

aggregateBy

This is the most generic grouping method. It produces aggregated results that refer to arbitrary subsets of the input data. The aggregateBy method accepts a function that must return an “index” value by which the respective result should be grouped. For example, to group results by OSM type, the function passed to aggregateBy should simply return the OSM type value, as in the following example using the OSHDB snapshot view:

Map<OSMType, Integer> countBuildingsByType = OSMEntitySnapshotView.on(…)
    .areaOfInterest(…)
    .timestamps(…)
    .osmTag("building")
    .aggregateBy(snapshot -> snapshot.getEntity().getType())
    .count();

Optionally, aggregateBy accepts a collection of groups that are expected to be present in the result. If no matching OSM entities are found for a particular group, the result will still contain this key, filled with a “zero” value (e.g. an empty list [] when results are collected).

For example, if the count reducer is used in a query, entries for which no results were found contain the integer value 0. If the collect reduce method is used instead, no-data entries are filled with empty lists. The following call requests an entry for every OSM type:

    .aggregateBy(
        snapshot -> snapshot.getEntity().getType(),
        EnumSet.allOf(OSMType.class)
    )

aggregateByTimestamp

This is a specialized method for grouping results by timestamp. Depending on the view used, aggregating by timestamp has slightly different meanings: in the OSMEntitySnapshotView, the snapshots' timestamps are used directly to group results. In the OSMContributionView, however, the timestamps of the respective modifications are matched to the corresponding time intervals defined in the OSHDB query.

For example, if a query sets the three timestamps 2014-01-01, 2015-01-01 and 2016-01-01, then a contribution made on 2015-03-14 is assigned to the time interval between 2015-01-01 and 2016-01-01 (represented in the output by the interval's start time, 2015-01-01).

There are two variants of this method: aggregateByTimestamp without arguments tries to fetch the timestamps automatically from the queried data (i.e. the snapshot or contribution objects), while the second variant takes a callback function that returns an arbitrary timestamp value. The second variant has to be used where objects cannot be matched to their timestamps automatically, for example when using the groupByEntity option in a query, or when using multiple aggregateBys in a query.
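
The matching of a contribution to its time interval can be sketched in plain Java. This is only an illustration of the idea, not how OSHDB implements it: the `intervalStart` helper is hypothetical, and treating the intervals as half-open (start inclusive, end exclusive) and returning `null` for dates outside the queried range are assumptions of this sketch.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class TimestampIntervalSketch {
    /**
     * Returns the start of the interval that contains the given date, i.e.
     * the latest query timestamp that is not after the date. Returns null
     * if the date lies outside the range spanned by the query timestamps
     * (assumption of this sketch).
     */
    static LocalDate intervalStart(NavigableSet<LocalDate> queryTimestamps, LocalDate date) {
        if (queryTimestamps.isEmpty()
                || date.isBefore(queryTimestamps.first())
                || !date.isBefore(queryTimestamps.last())) {
            return null;
        }
        return queryTimestamps.floor(date);
    }

    public static void main(String[] args) {
        NavigableSet<LocalDate> timestamps = new TreeSet<>(List.of(
            LocalDate.parse("2014-01-01"),
            LocalDate.parse("2015-01-01"),
            LocalDate.parse("2016-01-01")));
        LocalDate contribution = LocalDate.parse("2015-03-14");
        System.out.println(intervalStart(timestamps, contribution)); // prints 2015-01-01
    }
}
```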

aggregateByGeometry

Calculating results for multiple sub-regions of an area of interest at once is possible through aggregateByGeometry. It accepts an associative list of polygonal geometries with corresponding index values. The result will then use these index values to represent the individual sub-region results.

When using the aggregateByGeometry functionality, any OSM entity geometry that is contained in multiple sub-regions will be split and clipped to the respective geometries.

The given grouping geometries are allowed to overlap each other, but they should exactly match (i.e. fully cover and not protrude out of) the areaOfInterest of the query.
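
To make the split-and-clip behaviour more concrete, here is a simplified plain-Java sketch. It is not OSHDB code: it assumes axis-aligned rectangles stored as `{minX, minY, maxX, maxY}` arrays instead of real polygon geometries, and the `clippedArea` and `areaByRegion` helpers are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class GeometryClipSketch {
    /** Area of the intersection of two boxes given as {minX, minY, maxX, maxY}. */
    static double clippedArea(double[] a, double[] b) {
        double width = Math.min(a[2], b[2]) - Math.max(a[0], b[0]);
        double height = Math.min(a[3], b[3]) - Math.max(a[1], b[1]);
        return (width > 0 && height > 0) ? width * height : 0.0;
    }

    /** Sums, per sub-region key, the area of every feature clipped to that sub-region. */
    static SortedMap<String, Double> areaByRegion(
            Map<String, double[]> regions, List<double[]> features) {
        SortedMap<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, double[]> region : regions.entrySet()) {
            double sum = 0.0;
            for (double[] feature : features) {
                // Only the part of the feature inside this region is counted.
                sum += clippedArea(feature, region.getValue());
            }
            result.put(region.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two adjacent sub-regions that together cover the area of interest.
        Map<String, double[]> regions = new LinkedHashMap<>();
        regions.put("west", new double[] {0, 0, 10, 10});
        regions.put("east", new double[] {10, 0, 20, 10});
        // One feature that straddles the border between the two sub-regions.
        List<double[]> features = List.of(new double[] {8, 2, 13, 4});
        System.out.println(areaByRegion(regions, features)); // prints {east=6.0, west=4.0}
    }
}
```

The feature has a total area of 10, of which 4 falls into the western and 6 into the eastern sub-region. If the sub-regions overlapped, the overlapping part of a feature would be counted once in each of them.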

Combining multiple aggregateBy

When writing an OSHDB query, it is possible to chain several of the aggregateBy operations mentioned above. For example, a query can return results that are aggregated by timestamp and by OSM type. In this case, the final result contains one entry for each possible combination of the specified groupings. These combined indices are encoded as OSHDBCombinedIndex objects in the final result map.

Map<OSHDBCombinedIndex<OSHDBTimestamp, OSMType>, Integer> countBuildingsByTimeAndType = OSMEntitySnapshotView.on(…)
    .areaOfInterest(…)
    .timestamps(…)
    .osmTag("building")
    .aggregateByTimestamp()
    .aggregateBy(snapshot -> snapshot.getEntity().getType())
    .count();

This query returns its result as a long, flat list of entries with a complex key. Sometimes, however, it is easier to work with data in a more structured, nested form. The OSHDB API provides a helper method that converts result data from the long format into the nested format:

SortedMap<OSHDBCombinedIndex<OSHDBTimestamp, OSMType>, Integer> flatCountBuildingsByTimeAndType = …;
SortedMap<OSHDBTimestamp, SortedMap<OSMType, Integer>> nestedCountBuildingsByTimeAndType = OSHDBCombinedIndex.nest(flatCountBuildingsByTimeAndType);
System.out.println(
    "building count at timestamp1 for ways: "
    + nestedCountBuildingsByTimeAndType.get(timestamp1).get(OSMType.WAY)
);
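
The conversion from the long format to the nested format can be pictured with the following plain-Java sketch. It is not the implementation of OSHDBCombinedIndex.nest: it uses `Map.Entry` pairs as a hypothetical stand-in for the combined index and plain strings for the timestamps and OSM types, and the counts are made-up sample values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class NestSketch {
    /** Turns a flat map keyed by (first, second) pairs into a map of maps. */
    static <A extends Comparable<A>, B extends Comparable<B>, V>
            SortedMap<A, SortedMap<B, V>> nest(Map<Map.Entry<A, B>, V> flat) {
        SortedMap<A, SortedMap<B, V>> nested = new TreeMap<>();
        for (Map.Entry<Map.Entry<A, B>, V> entry : flat.entrySet()) {
            A first = entry.getKey().getKey();
            B second = entry.getKey().getValue();
            // Create the inner map on first use, then store the value in it.
            nested.computeIfAbsent(first, k -> new TreeMap<>())
                  .put(second, entry.getValue());
        }
        return nested;
    }

    public static void main(String[] args) {
        // Flat ("long") format: one entry per (timestamp, type) combination.
        Map<Map.Entry<String, String>, Integer> flat = new HashMap<>();
        flat.put(Map.entry("2014-01-01", "WAY"), 10);
        flat.put(Map.entry("2014-01-01", "RELATION"), 1);
        flat.put(Map.entry("2015-01-01", "WAY"), 12);

        SortedMap<String, SortedMap<String, Integer>> nested = nest(flat);
        System.out.println(nested.get("2015-01-01").get("WAY")); // prints 12
    }
}
```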

Chaining together more than two aggregateBy methods is also possible, which results in nested combined indices:

OSHDBCombinedIndex<OSHDBCombinedIndex<IndexType1, IndexType2>, IndexType3>