osm-analytics for OC Africa #79

Closed
tyrasd opened this issue May 11, 2018 · 5 comments
tyrasd commented May 11, 2018

work plan part 1 – priorities and limitations

This describes the possible directions for further development of the osm-analytics tool, as well as the limitations of the existing prototype's software stack.

A longer list of features is found in this spreadsheet.

OSMA main principles

osm-analytics has turned out to be a useful general-purpose tool, for example for showcasing OSM mapping activity to the general public, amongst other use cases. Some high-level principles made this possible and should, if possible, be kept in future versions of osm-analytics:

  • global coverage
  • frequent / regular updates
  • "fast" and responsive user interface

priorities for new features

The overall goal of osm-analytics for Open Cities Africa is to help establish mapping plans (e.g. find out which features/attributes to map in a specific region). This can be split up into different tasks:

  • identify gaps (by comparing OSM with external datasets, e.g. population density data)
  • explore community dynamics (mapping productivity trends, new mappers, core contributors)
  • expand osm-analytics coverage (feature types, attributes)

linkage to existing work

work packages

a) gap identification

This is essentially about exploring OSM's data (feature or attribute) completeness. Different methods exist to perform this for a number of specific subsets of OSM features. Extrinsic methods also depend on how the external reference data set is structured (e.g. aggregated data, non-georeferenced data, …) and in general require a data normalization step (e.g. associating attributes to OSM tags).
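
As a rough illustration of such a normalization step, the sketch below maps categories of a hypothetical reference data set onto the OSM tag combinations they would be compared against; all category names and tag choices are made up for illustration only.

```js
// Hypothetical normalization table: categories of an external reference data
// set mapped onto the OSM tag filters they should be compared against.
// None of these names come from the actual osm-analytics code base.
const referenceToOsm = {
  residential_structure: ['building=residential', 'building=house', 'building=yes'],
  health_facility:       ['amenity=hospital', 'amenity=clinic', 'healthcare=*'],
  school:                ['amenity=school', 'building=school']
};
```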

We need to narrow down and define the scope (or small number of scopes) that we actually want to cover:

  • extrinsic and/or intrinsic method
  • kind of osm feature
  • kind of reference dataset

Currently, this is not covered by the osm-analytics framework at all, so both a user interface and a technical implementation have to be worked out.

b) community dynamics

osm-analytics already allows limited insight into OSM contribution dynamics by displaying OSM objects' latest edit time and "mapper experience" for a given region.

These features can be expanded to better identify mapping productivity patterns, find out about mapper recruitment and retention frequencies, etc.

c) expanding osm-analytics "coverage"

Improving coverage and accuracy of results generated by osm-analytics is important since the original osma prototype turned out to be limited in a few aspects:

  • temporal: osm-analytics uses coarse (yearly) osm-qa-tiles snapshots for historic development; often more fine-grained results are needed (the full OSM history could be used instead).
  • osm objects can be contained in osm-qa-tiles multiple times (if they lie on the border of a tile), which can inflate counts in the results.
  • some osm object types are not contained in osm-qa-tiles (e.g. multipolygon relations or other relation types like turn restrictions).
  • OSMA layers are limited to certain pre-defined osm features and cannot easily be filtered by additional properties.
  • the cruncher is somewhat resource-hungry (because of the daily reprocessing), and results are still produced relatively slowly (considering the delay between OSM contributions and their visibility in OSMA).

It might be worth rethinking parts of the system design to make osm-analytics future proof.

@cgiovando

To expand on @tyrasd's draft workplan above, I'm going to provide some more details about the first two parts we're working on, specifically gap identification and expanding OSMA coverage.

This work is supported by GFDRR as part of the Open Cities Africa project, a new phase of the Open Cities initiative, set to kick off in mid-June 2018 across selected cities in Africa.

We envision extending OSMA functionalities to provide project teams in each city the ability to easily analyze map data and the local OSM community dynamics. For example, by adding a gap analysis tool to OSMA, a team can easily identify areas of the city which are missing in OSM, and decide to start a new mapping project.

Working prototypes of the gap analysis tool and expanded OSM types will be available in time for the kickoff meeting in Kampala, the week of June 11-15, 2018. Other functionalities will be added later, focusing first on understanding community dynamics, then on other analytical needs as listed in this spreadsheet.

OSM Gap Analysis

Determining completeness of OSM data is challenging and normally requires some external reference dataset to compare to (the extrinsic method mentioned by @tyrasd above). From #43 to the recent osma-health implementation of Azavea's OSM vs. WorldPop, the idea has been to use population estimates as a proxy for where OSM buildings should exist, and in what amounts.

Stepping back to just identifying gaps, or simply being able to compare selected OSM features with any external dataset for the same area, we propose an approach based on the current OSMA stack, and in-browser computations between two or more vector-tile values.

This tool should be generic enough to allow inputting any available dataset (global or local) in raster/grid format (or already in vector tile format), and selecting specific OSM feature/tag combinations to compare against. In-browser (client-side) computations, such as a simple value difference between vector tiles, can be achieved with JavaScript libraries (e.g. http://turfjs.org).
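
To make the idea concrete, here is a minimal sketch of such an in-browser comparison, assuming two GeoJSON feature collections decoded from vector tiles: one carrying an OSM building count per grid cell, the other the reference built-up value. The property names and the shared cell id used to join them are assumptions, not the actual osm-analytics tile schema.

```js
// A minimal sketch of the in-browser comparison, assuming two GeoJSON feature
// collections decoded from vector tiles. Property names (`building_count`,
// `built_up`) and the shared `id` join key are illustrative, not the actual
// osm-analytics tile schema.
const turf = require('@turf/turf');

function gapScores(osmCells, referenceCells) {
  // index the reference cells by their (hypothetical) shared cell id
  const refById = new Map(referenceCells.features.map(f => [f.properties.id, f]));

  return osmCells.features
    .map(cell => {
      const ref = refById.get(cell.properties.id);
      if (!ref) return null;
      const areaKm2 = turf.area(cell) / 1e6; // turf.area returns square meters
      // normalize both values by cell area and take the difference:
      // strongly negative scores hint at potential mapping gaps
      const osmDensity = cell.properties.building_count / areaKm2;
      const refDensity = ref.properties.built_up / areaKm2;
      return { id: cell.properties.id, score: osmDensity - refDensity };
    })
    .filter(Boolean);
}
```

The resulting per-cell scores could then drive the color-scaled rendering mentioned below.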

To prototype this tool we plan the following:

  • Use the Global Human Settlement Layer (GHSL) global built-up presence, derived from Landsat image collections by JRC, as the external reference dataset;
  • Convert the GHSL layer to vector tile format, and aggregate values at each lower zoom level using the same osma-cruncher workflow;
  • Add a function/tool in the OSMA interface that allows for comparison of OSM building density with the GHSL layer, and returns a color-scaled rendering of the results (dynamically).

Extend OSM Data Coverage

At the moment OSMA only provides total counts and density visualization of buildings, and the length of roads and rivers. To support the analytical needs of the Open Cities Africa projects, and in general to make it useful to anyone wanting to explore the distribution of other OSM features (see #14, #30, #52), we plan to partially refactor the OSMA cruncher and frontend to allow (authorized) users to request any OSM tag combination. For example, in addition to the current building geometry (building=yes) density, we would also like to display only those buildings that have an associated address (addr:street=* plus addr:housenumber=*).
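
As a rough illustration only (not the actual cruncher code), such a tag combination could be evaluated as a simple predicate over a feature's tags; the function name and feature shape below are assumptions.

```js
// Hypothetical predicate for the "buildings with an address" combination.
// Assumes OSM tags are exposed as feature properties, as in osm-qa-tiles.
function isBuildingWithAddress(feature) {
  const tags = feature.properties;
  return tags.building !== undefined &&
         tags['addr:street'] !== undefined &&
         tags['addr:housenumber'] !== undefined;
}
```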

There are limitations and challenges to the planned approach:

  • Layers and new tag combinations cannot be requested interactively by frontend users, as all data in OSMA has to be processed first (daily) by the cruncher;
  • The number of possible OSM tag combinations is potentially unlimited;
  • Management of layers in the OSMA frontend drop-down menu can become problematic with more than 10-20 layers. This could be improved with thematic groups and expandable sub-menus, or with a search-and-add box;
  • If more than one layer can be displayed at a time, some cartographic precautions have to be taken to ensure overlapping density color scales can be distinguished;
  • Not all metadata (e.g. creation time) and elements (e.g. relations) are available in OSM QA Tiles, the source dataset for OSMA.

To mitigate these issues we plan to set up OSMA with a simple YAML configuration file that will automatically instruct the cruncher about what features to process, and then rebuild the frontend drop-down menu dynamically to show those additional layers.

The YAML file will sit in the public OSMA GitHub repository, where users can send pull requests specifying additional OSM tags to be processed, and an OSMA instance maintainer will process requests as needed. This workflow is intended as a temporary solution until a more structured osma-admin module, with a proper user interface and authentication, is developed.
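
As a sketch of what such a layer definition could look like (the keys and file layout below are purely illustrative, not the final configuration schema):

```yaml
# Hypothetical layer definitions for the cruncher; key names are illustrative.
layers:
  - name: buildings
    filter: building=*
    aggregation: count
  - name: buildings-with-address
    filter: building=* and addr:street=* and addr:housenumber=*
    aggregation: count
  - name: highways
    filter: highway=*
    aggregation: length
```

The frontend could read the same file to rebuild its layer drop-down menu dynamically.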

Adding more features to the cruncher's daily routine will increase processing time, but assuming that buildings and roads (already in the cruncher process) are by far the most frequently used tags in OSM, we don't expect processing time to increase linearly.

@tyrasd please review and add any technical information that can better explain this work - thanks!


smit1678 commented May 25, 2018

Thanks for the summary @tyrasd @cgiovando. A couple of follow-up questions:

Layers and new tag combinations cannot be requested interactively from frontend-users, as all data in OSMA has to be first processed (daily) by the cruncher;

Would it make sense to start positioning batch-based processing so we can move from one-off daily crunching to more dynamic, as-needed crunching? This way we could build on the direction that was started with the osm-health-workers to reduce cost and processing time. Right now that is based on plain JSON files that can be added via a PR, which then kicks off a new analysis. Extending this to YAML and other tagging would be great. It might also help tackle the increased processing time and let us invest in more cost-effective methods of providing these analytics layers more dynamically.

Add a function/tool in the OSMA interface that allows for comparison of OSM building density with the GHSL layer, and return a color scaled rendering of the results (dynamically).

Sounds good and would pair well with the other analysis.

Adding more features to the cruncher daily routine will increase processing time

What's the tipping point for whether we invest further in the lower stack (improving QA tiles and other OSMA foundations) to reduce processing time or re-architect the processing entirely?


awright commented May 27, 2018 via email


tyrasd commented Jul 3, 2018

quick status update:

OSM Gap Analysis

A prototype is running here: http://129.206.7.145:7778/#/gaps – the reference data set used is http://ghsl.jrc.ec.europa.eu/ghs_bu.php

Extend OSM Data Coverage

Cruncher refactoring done, see pull request hotosm/osm-analytics-cruncher#18 – Front-end part is still work in progress.

//edit:

I've also started work-in-progress documentation about the overall architecture of the osm-analytics components here: #82


tyrasd commented Nov 7, 2018

osm-analytics for OCA

final report 2018-11-07

1. Features

1. a) Gap Detection

The newly introduced "gap detection" analysis tab in osm-analytics allows one to assess certain aspects of the completeness of OpenStreetMap data. In particular, it can be used to detect areas where there exist gaps in the coverage of certain feature types.

This is achieved by comparing one of the "standard" osm-analytics feature layers (e.g. the building layer) with an external reference data set after it was converted into a vector-tiles format compatible with the osm-analytics feature layer schema. This is an extrinsic data quality measure.

On osm-analytics.org, the gap-detection tab offers one layer which compares buildings as mapped in OpenStreetMap with the "built-up" layer from the Global Human Settlement Layer project. More layers can be added in the future.

This analysis mode also offers the possibility to switch between different background maps (e.g. a basic road map, different aerial imagery layers, etc.), which can be useful to verify whether detected gaps are actually missing features in OSM or false positives in the reference data set (note, for example, that in the screenshot above the tool shows large "gap" areas in the oceans, where obviously no buildings exist in reality).

1. b) new layer: amenities

Improvements in the cruncher and the osm-analytics frontend make it possible to process and render point data as well, such as the centroids of OSM objects tagged as amenity.
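
As a minimal sketch of deriving such point representations (roughly in the spirit of what the cruncher does; the helper name and property handling are assumptions):

```js
// Derive point features for amenity objects by taking their centroids.
// Feature shape and property handling are illustrative only.
const turf = require('@turf/turf');

function amenityPoints(features) {
  return features
    .filter(f => f.properties.amenity !== undefined)
    .map(f => {
      const point = turf.centroid(f);   // centroid of the (poly)geometry
      point.properties = { amenity: f.properties.amenity };
      return point;
    });
}
```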

1. c) new "distinct tags" analysis

This analysis provides in-depth information about the finer-grained mapping structure of OSM objects, specifically the distribution of used tag values. It can be used to get an idea of how uniform or diverse the tagging of a specific feature type is in the given analysis area.
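
A small sketch of what such a distinct-tags aggregation could look like, assuming an array of GeoJSON features whose properties carry OSM tags (the function itself is illustrative, not the actual implementation):

```js
// Count how often each distinct value of a given tag key occurs.
function distinctTagValues(features, key) {
  const counts = new Map();
  for (const f of features) {
    const value = f.properties[key];
    if (value === undefined) continue;
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return counts; // e.g. Map { 'school' => 120, 'hospital' => 8, ... }
}

// usage: distinctTagValues(amenityFeatures, 'amenity')
```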

1. d) low zoom estimates for number of contributors

Because of the nature of the osm-analytics vector tiles, it is not possible to determine exact values for the number of contributors in a given area. This is due to the lack of full edit history in the base data used by osm-analytics, but also because, at low zoom levels, osm-analytics uses a sampling approach to keep the data stored in its vector tiles manageable. Working around these limitations – but still providing insight into the contributor base of a given area of interest – osm-analytics now displays a lower bound for the number of contributors in low-zoom analysis situations as well (e.g. a large city).
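
The lower-bound idea can be sketched as counting distinct user identifiers among the sampled features in the visible tiles; because the tiles only contain a sample, this can underestimate but never overestimate the true contributor count. The `_user` property name below is an assumption.

```js
// Count distinct user identifiers among sampled features: a lower bound on
// the number of contributors in the area. The `_user` property is illustrative.
function contributorLowerBound(sampledFeatures) {
  const users = new Set();
  for (const f of sampledFeatures) {
    if (f.properties._user !== undefined) users.add(f.properties._user);
  }
  return users.size; // "at least N contributors"
}
```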


1. e) load analytics boundary from geojson URL

To improve the reusability of the osm-analytics framework for a larger spectrum of applications, it now features a new mode for loading areas of interest, which reads the geometry data in the standardized GeoJSON format from an open external data source (e.g. a GitHub gist). This makes it easier to link from external applications to osm-analytics, thus improving interoperability.
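
A minimal sketch of how such a boundary could be fetched from an external GeoJSON URL (the validation shown is illustrative):

```js
// Load an analysis boundary from an external GeoJSON URL,
// e.g. the raw URL of a GitHub gist.
async function loadBoundary(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`could not load boundary: ${response.status}`);
  const geojson = await response.json();
  if (geojson.type !== 'FeatureCollection' && geojson.type !== 'Feature') {
    throw new Error('expected a GeoJSON Feature or FeatureCollection');
  }
  return geojson; // used as the area of interest for the analysis
}
```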

2. code refactoring

2. a) integration with HOT tasking manager v3 API

osm-analytics now consumes data directly from the (relatively) new HOT Tasking Manager version 3, via its public API. Previously, osm-analytics relied on custom osma-specific pre-processing of the HOT Tasking Manager v2 API data, which was cumbersome and likely to fail. The new v3 API is more reliable, dynamic and can be accessed publicly from any web application. This finally made it possible to get rid of old ballast in the code.

2. b) cruncher reworked from ground up

The osm-analytics backend cruncher was improved in several ways. One is that the data processing flow was altered so that, instead of one full processing job per osm-analytics feature layer, all requested feature layers can now be generated at once. This saves a lot of computation power, which is then available to offer more feature layers (such as the amenities layer described above).

2. c) separate job definition files

The cruncher now also reads a job definition file that can be maintained separately from the cruncher code itself. This makes it easier to add, remove or modify the list of feature layers. The osm-analytics client also consumes this list in order to dynamically update the list of layers presented to the end user.

https://github.com/hotosm/osm-analytics-config

2. d) filled gaps and fixed wrong data in historic snapshots

In the past, because of oversights and/or lack of maintenance, the historic snapshots of the feature layers (used in the compare time periods analysis mode) were not updated, had gaps and occasionally even contained wrong or incomplete data. This was fixed by re-calculating the whole set of historic snapshots for all current feature layers. This corrected and complete set of historic snapshots is now live on osm-analytics.org and finally produces accurate results.

3. prototypes, upcoming

3. a) integration with external OSM history statistics providers

Some aspects of OpenStreetMap data analysis are either hard or even impossible to perform when exclusively using the base data osm-analytics currently works on. Most notable is the lack of full historic data coverage in the osm-qa-tiles consumed by the osm-analytics cruncher, but similar issues arise when one tries to analyse more complex OSM features like large (multi)polygons or other types of complex relations.

By requesting data from external services and/or consuming additional data sets, some of these deficiencies can be overcome. For example, the services hosted at ohsome.org allow one to inspect the historic development of OpenStreetMap data in much more depth than what osm-analytics can usually do by itself. For instance, they can be used to produce a much finer-grained temporal resolution for the development of the number of features over time. The following shows a prototype example where the output from ohsome.org's API (solid line) is compared with the "normal" output of osm-analytics.org (dashed line):

Similarly, the ohsome framework can also be used to get more precise information about the number of contributors in a given area. Another use-case that requires full-history OSM data is to produce more useful graphs of edit activity over time (which could replace the currently shown recency of edits graph on osm-analytics.org).
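
As an illustration of such an integration, a query against the ohsome API for the historic number of buildings in a bounding box could look roughly like the sketch below. The endpoint and parameter names follow the publicly documented ohsome API, but treat them as assumptions to be verified against the current documentation.

```js
// Query the ohsome API for monthly building counts within a bounding box.
// Endpoint and parameters are assumed from the public ohsome API docs.
async function buildingCountHistory(bbox) {
  const params = new URLSearchParams({
    bboxes: bbox,                      // e.g. '32.54,0.27,32.66,0.38'
    time: '2008-01-01/2018-11-01/P1M', // monthly snapshots
    filter: 'building=* and geometry:polygon'
  });
  const response = await fetch(`https://api.ohsome.org/v1/elements/count?${params}`);
  const json = await response.json();
  return json.result;                  // array of { timestamp, value } entries
}
```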

tyrasd closed this as completed Nov 7, 2018