GNIP 100: Assets #12124

Open
1 of 5 tasks
etj opened this issue Apr 2, 2024 · 10 comments · May be fixed by #12179
Labels
gnip A GeoNodeImprovementProcess Issue

Comments

@etj
Contributor

etj commented Apr 2, 2024

GNIP 100 - Assets

Overview

We need a way to identify files (local, remote, in the cloud...) per se.
At the moment there is no way to identify data files by themselves: they are only referenced by the field `ResourceBase.files`.

Also, the StorageManager is pluggable, but only allows for a single storage backend at once.
By having different subclasses of Asset (e.g. LocalAsset, S3Asset, ...) we may have a GeoNode instance handling datafiles on different data store backends.

Proposed By

  • @etj - Emanuele Tajariol

Assigned to Release

This proposal is for GeoNode 4.3 (?)

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

  • Data may be stored on different backends
  • Data is not coupled with a single ResourceBase
  • May simplify the handling of basic data (images, etc.) in GeoStories, which at the moment use full-featured (and heavy) ResourceBase metadata even for simple files.
  • Allows the possibility to link a single ResourceBase with multiple data files (think for instance about a Document having multiple PDF files for different languages).
  • Allows the definition of a directory hierarchy as a single data asset, making it possible to publish complex data.
  • Simplifies/improves handling of authorization for data access.

Proposal

We introduce the concept of Asset as generic data, that may be linked to a ResourceBase.
A LocalAsset represents data stored in the filesystem (either a single file or a directory tree).

The Asset class will replace and augment the information stored at the moment in the ResourceBase.files field.

An Asset is associated with a Resource through a Link, which also tells the URL through which the Asset will be available to the GeoNode users.

  • For a LocalAsset, the URL may be a service that checks for authorization before returning the data.
  • For a RemoteAsset (we're not discussing its implementation in this proposal), it could be a replication of the RemoteAsset URL, or also a GeoNode local endpoint that proxies a remote content, maybe providing some authentication to the remote service -- this is just an example, we don't have any real use case at the moment.
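
A minimal sketch of how these pieces could relate, written with plain Python dataclasses rather than Django models so that it runs on its own. Only the names Asset, LocalAsset, Link and ResourceBase come from the proposal; the fields, the `link_asset` helper and the `/assets/<id>/download` URL scheme are assumptions made for illustration, and the actual GeoNode models may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Asset:
    """Generic data, independent of any ResourceBase."""
    id: int
    title: str


@dataclass
class LocalAsset(Asset):
    """Data stored on the filesystem: a single file or a directory tree."""
    location: List[str] = field(default_factory=list)  # hypothetical field


@dataclass
class Link:
    """What clients see; the asset is optional (e.g. links to WMS services)."""
    url: str
    asset: Optional[Asset] = None


@dataclass
class ResourceBase:
    title: str
    links: List[Link] = field(default_factory=list)


def link_asset(resource: ResourceBase, asset: Asset) -> Link:
    # Hypothetical URL scheme: a local endpoint that could check
    # authorization before returning the data.
    link = Link(url=f"/assets/{asset.id}/download", asset=asset)
    resource.links.append(link)
    return link


# One Document with two PDF files in different languages.
doc = ResourceBase(title="Report")
pdf_en = LocalAsset(id=1, title="Report (EN)", location=["/data/report_en.pdf"])
pdf_it = LocalAsset(id=2, title="Report (IT)", location=["/data/report_it.pdf"])
link_asset(doc, pdf_en)
link_asset(doc, pdf_it)
```

The Link carries the public URL while the Asset only knows where the data lives, so a different Asset subclass could be exposed through the same Link structure.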

Other usages for assets

Since the Asset object is quite simple, we could use it for other purposes as well; for instance, at the moment we use "unadvertised" ResourceBase instances for providing simple data to GeoStories (images, PDFs, ...). Instead of using such a heavy object, we could just use LocalAssets for this purpose.

Also, more than one Asset may be associated with an existing ResourceBase; this behavior replicates what GeoNetwork is already doing, that is, having multiple data resources pointed to by a single metadata record.

Permissions

In the future there could be different permissions for a Resource and its linked Assets. For the sake of simplicity, however, as a first step we may grant the Asset the very same permissions as the linked ResourceBases.

In the case we want to associate an Asset to more than one Resource, the Asset will be available if the user has download privileges on at least one of the associated Resources.

Implementation

[Figure: GeoNode asset diagram]

Model:

  • Add Asset class, and its subclass LocalAsset
  • Refactor ResourceBase and Link classes

Logic:

  • Replace the usage of files with Asset instances

DB migration:

  • Documents:
    • migration is straightforward, since there's a single file
  • Datasets:
    • Single file --> Asset
    • Shapefile (shp+dbf+shx+prj?) or other multifile formats--> ?
      • Multiple options according to Asset type:
        • create a zipfile ?
        • publish as the root entry of a directory? (There are cases where we want an Asset to be a directory with subdir)
        • Every Assethandler should be able to create its own Asset content and related link url
          • shp file should create a zip file, since such data is only used for download
      • What if there is an SLD (or any other satellite file)? Is it a separate Asset?
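
As an illustration of the zipfile option for shapefiles, here is a minimal sketch using only the Python standard library. The function name `zip_shapefile` and the list of sidecar extensions are assumptions taken from the example above (shp+dbf+shx+prj); real shapefiles may carry other sidecar files, and this is not the handler code of the actual implementation.

```python
import zipfile
from pathlib import Path
from typing import List

# Extensions from the example in this proposal; other sidecars may exist.
SHAPEFILE_EXTENSIONS = (".shp", ".dbf", ".shx", ".prj")


def zip_shapefile(shp_path: Path, zip_path: Path) -> List[str]:
    """Bundle a .shp file and its existing sidecar files into one zip archive.

    Returns the names of the files that were added to the archive.
    """
    added: List[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for ext in SHAPEFILE_EXTENSIONS:
            part = shp_path.with_suffix(ext)
            if part.exists():  # the .prj file is optional
                archive.write(part, arcname=part.name)
                added.append(part.name)
    return added
```

The resulting zip file could then be stored as a single LocalAsset and exposed through one download link.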

API:

  • Make sure existing API still returns file info as before

Authorization

A user has access to an Asset's data iff that Asset is associated with at least one ResourceBase for which the user has download permissions.
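
The rule can be sketched as a simple set intersection. The function name and the two lookup dictionaries below are hypothetical stand-ins for the database queries and permission checks GeoNode would actually perform.

```python
from typing import Dict, Set


def can_access_asset(
    user: str,
    asset_id: int,
    asset_resources: Dict[int, Set[int]],
    download_perms: Dict[str, Set[int]],
) -> bool:
    """True iff the asset is linked to at least one resource the user may download."""
    linked = asset_resources.get(asset_id, set())   # resources the asset is attached to
    allowed = download_perms.get(user, set())       # resources the user can download
    return not linked.isdisjoint(allowed)


# Asset 10 is attached to resources 1 and 2; asset 11 is not attached to anything.
asset_resources = {10: {1, 2}, 11: set()}
download_perms = {"alice": {2}, "bob": {3}}

alice_ok = can_access_asset("alice", 10, asset_resources, download_perms)
bob_ok = can_access_asset("bob", 10, asset_resources, download_perms)
orphan_ok = can_access_asset("alice", 11, asset_resources, download_perms)
```

Under this rule an Asset that is not associated with any ResourceBase is not accessible to anyone, which is why `orphan_ok` is false in the example.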

Backwards Compatibility

  • API: old files array can be preserved in output

Future evolution

Decoupled uploads

A user may upload an Asset without having to associate it to a Resource.
Unassociated Assets may be used to automatically create ResourceBases and attach the asset to them.

Deprecate Documents

Once Assets gain their characterization, the Document object will not have much of a meaning, also considering that users upload as a Document any object that is not published as a Layer.
This means that we will be able to remove the Document class, and convert its instances into ResourceBases with an Asset handling the former document's data.

Feedback

Update this section with relevant feedback, if any.

Voting

Project Steering Committee:

  • Alessio Fabiani:
  • Francesco Bartoli:
  • Giovanni Allegri: +1
  • Toni Schoenbuchner: +1
  • Florian Hoedt:

Links

Remove unused links below.

@etj etj added the gnip A GeoNodeImprovementProcess Issue label Apr 2, 2024
@etj
Contributor Author

etj commented Apr 2, 2024

Refactoring data upload procedure

The initial Asset implementation could be completely hidden from the GeoNode user, since the changes are only applied to the backend logic.

When a user uploads some data, the original data will be saved as an Asset.

Then, some heuristic will find the type of the uploaded data:

  • If data is related to geographical data, a Dataset will be automatically created (so the implementation changes will be invisible to the end user)
  • If data is a doc, png, image etc, a Document will be created

When the Resource is created, the Asset pointing to the uploaded data is linked to the ResourceBase via a Link (the new nullable asset foreign key shall be added).
Examples:

  • If it's geographic data (shp, tiff, ...) it will be added as an "original data" link (saving such original data could be enabled or disabled via settings)
  • Documents should already be stored in a similar way

We may need to split the logic into two steps:

  1. Upload data and create an Asset
  2. Create a Resource from an existing Asset

In this way, once we handle unassociated Assets, we may be able to run the creation of the related Resources in unattended commands.
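
A minimal sketch of the two-step split described above, in plain Python. The heuristic for choosing the resource type is not specified in this proposal, so the extension set, the function names and the dictionary layout of the resource are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Any, Dict, List

# Hypothetical set of extensions treated as geographic data.
GEO_EXTENSIONS = {".shp", ".tif", ".tiff", ".gpkg", ".geojson"}


@dataclass
class Asset:
    files: List[str]


def create_asset(files: List[str]) -> Asset:
    """Step 1: store the uploaded data as an (unassociated) Asset."""
    return Asset(files=list(files))


def guess_resource_type(asset: Asset) -> str:
    """Return 'dataset' for geographic data, 'document' for everything else."""
    suffixes = {PurePath(name).suffix.lower() for name in asset.files}
    return "dataset" if suffixes & GEO_EXTENSIONS else "document"


def create_resource(asset: Asset) -> Dict[str, Any]:
    """Step 2: create a resource from an existing Asset and link the two."""
    return {
        "type": guess_resource_type(asset),
        "links": [{"link_type": "original", "asset": asset}],
    }


shapefile_asset = create_asset(["roads.shp", "roads.dbf", "roads.shx"])
pdf_asset = create_asset(["manual.pdf"])
dataset = create_resource(shapefile_asset)
document = create_resource(pdf_asset)
```

Because `create_resource` only needs an existing Asset, it could be called from an unattended command as well as from the upload flow.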

@etj
Contributor Author

etj commented Apr 2, 2024

Authorization

An improvement comes for free with the Asset refactoring. At the moment downloadable files are public: if the URL of a data resource is leaked by someone who has access to the Resource, anybody can use that URL to download the data file.

By checking the authorizations for the URL accessing the Assets' data, we'll add protection to the published data, allowing the download only to users having access to the Resource.

@gannebamm
Contributor

I generally like the idea of assets and forming a resource out of multiple assets. This was also discussed beforehand in a research data infrastructure group and we thought about using the Research Object Crate (RO-Crate) concept or the Annotated Research Context (ARC) concept as our 'assets'. It was just a short discussion and we have yet to do anything in terms of how to incorporate it into the GeoNode architecture. But these parts of the mentioned Motivation are of particular interest to us as research institutes:

Motivation
[...]

  • Allows the possibility to link a single ResourceBase with multiple data files (think for instance about a Document having multiple PDF files for different languages).
  • Allows the definition of a directory hierarchy as a single data asset, making it possible to publish complex data.

Here is an excerpt of the brief discussion the research infrastructure group had regarding RO-Crates. It is a bit outdated, but I think you get the gist of it:


Looking at other data portals like CKAN or OpenAgrar (based on MyCORE framework), you can describe a dataset which consists of multiple files/resources. Here are two examples:

https://demo.ckan.org/dataset/sample-dataset-1

https://www.openagrar.de/receive/openagrar_mods_00054877?lang=en

The latter example on a GeoNode instance:

https://atlas.thuenen.de/layers/geonode_data_ingest:geonode:bze_lw_standorte_verschleiert

Where the additional files are linked as documents:
[image]

There is a GeoNode developer workshop creating a so-called GeoCollection object to link multiple GeoNode ResourceBase objects together: https://docs.geonode.org/en/master/devel/workshops/index.html#create-your-own-django-app

My idea is to build on top of this concept and try to implement RO-Crate as a collection object: https://www.researchobject.org/ro-crate/

RO-Crates do use a metadata JSON to describe the Crate: https://www.researchobject.org/ro-crate/1.1/root-data-entity.html
In this JSON, datasets can be defined as web resources: https://www.researchobject.org/ro-crate/1.1/data-entities.html#web-based-data-entities

Most (all?) of the listed attributes of those datasets can be read by the GeoNode API for the bundled resources. Therefore, you only need to describe the ROCrate bundle itself.


I do not propose using RO-Crates as base implementation for assets! I just wanted to make clear that the underlying motivation is interesting for a part of the GeoNode community.

@gannebamm
Contributor

@etj in the implementation ERD diagram, the link from ResourceBase (through Link) to Asset is shown as 0..1. Shouldn't this be 0..n, since multiple Assets can form one ResourceBase?

In a settings file in which the storing of original data is disabled, there will be no Asset for a Dataset. In our workflow, datasets are often ingested via PostgreSQL directly and then registered with the updatelayers command. For those, there is also no Asset per se. Do you think this will be an issue? It is marked as 0..n, so it should be ok from a database model standpoint, but is it ok from a user's perspective?

@etj
Contributor Author

etj commented Apr 4, 2024

@gannebamm, about cardinalities:
An Asset instance is an internal representation of files or data. Each Asset is presented to the external clients as a Link.
So

  • a Link is associated to 0..1 Asset. For instance, Links to WMS resources do not have associated Assets.
  • a ResourceBase is associated to 0..N Links (as it already is now)

Datasets or other ResourceBase can have no associated Assets at all, as in the case of Datasets only related to GeoServer layers.

@ridoo
Contributor

ridoo commented Apr 12, 2024

@etj I like the idea of making things more flexible here. I took some time to think about the GNIP and want to make some comments, also by sprinkling in questions and personal opinions. However, I cannot foresee which components and workflows (e.g. geonode-importer) will have to be touched in the end.

Technical questions

By having different subclasses of Asset (e.g. LocalAsset, S3Asset, ...) we may have a GeoNode instance handling datafiles on different data store backends.

Does this mean that each asset has its own StorageManager/-Handler where actual download is being delegated to? Does this complement or even rescind the changes you did recently to the DownloadHandler?

What if there is a SLD (or any other satellite file)? is it a separate Asset?

To me, this is definitely an asset of its own, which could also be applied to multiple resources. However, what about differentiating XML files that serve as an asset from XML files that are to be interpreted as metadata files?

Backwards Compatibility
API: old files array can be preserved in output

I could not find a files field in the resource API. As far as I can see, ResourceBase.files includes the local paths to the files uploaded originally. Right now, it is unclear to me whether these are used anywhere (besides extracting some metadata, e.g. exif, during the import process).

This means that we will be able to remove the Document class, and convert its instances into ResourceBases with an Asset handling the former document's data.

So ResourceBase is going to become a first class citizen and serves as a logical brace for simple assets, right?

From the end user perspective

  • Is an asset available from the catalogue? From your UML an asset can live without any resource. How are assets managed in GeoNode?
  • How does "download" as we know it so far relate to "downloading" an Asset? Can we make the terms concise here? Should we differentiate "exporting" things (resources) from "downloading" things (assets)?
  • Taking your example of a Document referring to multiple translated PDF files: What does the user actually download when "downloading" such a document (ResourceBase)? A zipped file containing all related assets? Maybe the user should be able to select the actual asset(s) of interest.
  • How do you separate assets which naturally belong to the resource from those linked to it later? There might be many linked assets over time, both those of interest and those not of interest at all. How do we enable the user to reason about which may be of interest for her use case?

Opportunities

Besides those opportunities you mentioned already, I see the following:

  • As @gannebamm mentioned, the Asset concept could be leveraged to assemble download packages
  • Make the link ResourceBase ➡️ Asset semantically pluggable. This could be even extendable by uploaded ontologies.

@giohappy
Contributor

@ridoo thanks for your comments. You touched on several points that we also included in our discussion. Many of them will probably come in the future, since the concept of Assets could bring Copernican changes to GeoNode in many ways...

Let me explain the current scope of this proposal first.
Assets will provide the foundation for many use cases. The ones we're facing now are:

  • publishing of 3D Tiles assets. We want to be able to "attach" a 3D tiles folder to a resource and serve it to the client, which already supports 3D tiles visualization
  • porting of an existing Geonetwork catalog, which also contains collections of files in single records

We're not going to cover the management of Assets from the GeoNode UI in this initial implementation. For the moment we only want to prepare the models to support present and future functionalities. We assume that the resource has been configured with assets in some way (DB operations, Django Admin, whatever).

  • an Asset could be selected as "primary" (maybe a better naming?) to indicate that it is the source this ResourceBase represents (also for previews, etc.). We could also have resources without any "primary". In that case, the resource will only represent a collection of assets, and only the metadata detail page will be available. @etj could you provide more details about this aspect?
  • an Asset can live without a resource, yes in theory. BTW the initial implementation will transform the upload of a resource into the following steps, so the "free asset" won't be visible to end users:
    • a file is uploaded
    • an asset is created from this file (this is the status where the asset lives on its own)
    • a new resource (dataset, document) is created and the asset assigned to it
  • Assets will be downloaded from links that will be visible in a new tab of the info panel. Download of the resource itself will follow the current approach, for the moment (WPS for datasets, etc.), but we will have the option to restore the possibility to also download the original "primary" asset. This is one of the reasons the resource downloads API field was made an array ;)

@etj can you please confirm, correct, extend the points above?
I also agree with @ridoo that the point about the files field in the API should be clarified.

@ridoo
Contributor

ridoo commented Apr 17, 2024

@giohappy @etj looking forward to some details.

You mentioned "copernican changes". To me, this sounds bigger than 'just' introducing an additional concept (here Asset). Just to stay curious: Is there more you have in mind?

@giohappy
Contributor

giohappy commented Apr 17, 2024

You mentioned "copernican changes". To me, this sounds bigger than 'just' introducing an additional concept (here Asset). Just to stay curious: Is there more you have in mind?

Let's say that this is the first step that could lead to more important changes in the future. We don't have a roadmap, actually, but making the relation between Catalog resource and Data source could:

  • lighten the Resource model and its polymorphic derivatives
  • simplify the publishing of new real or virtual data sources

Regarding the details on the points discussed in your comment, we need to wait for @etj, who is working hard these days to connect the dots and prepare more information to share :)

etj added 9 commits that referenced this issue between Apr 22 and Apr 24, 2024
@giohappy
Contributor

@ridoo after reviewing with @etj the status of this PR (which is ready for review), we confirm that it neither changes nor adds features to GeoNode for the moment. In terms of public APIs and functionality it behaves exactly the same as before, with files and the single local storage manager replaced by Assets and their specific storage and download managers.

The next steps will be the implementation of the "primary" asset concept and the multiplicity of assets that can be assigned/downloaded to/from a resource. It will come with a new GNIP.
