GNIP 100: Assets #12124

etj · 2024-04-02T14:39:24Z

GNIP 100 - Assets

Overview

We need a way to identify files (local, remote, in the cloud...) per se.
At the moment there is no way to identify data files by themselves: they are only referenced through the field `ResourceBase.files`.

Also, the StorageManager is pluggable, but only allows for a single storage backend at once.
By having different subclasses of Asset (e.g. LocalAsset, S3Asset, ...) we may have a GeoNode instance handling datafiles on different data store backends.
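
As a rough illustration of the idea above, the sketch below dispatches on the concrete Asset subclass so that several backends can coexist in one instance. It is not GeoNode code: the field names, the handler classes and the `HANDLERS` registry are all hypothetical, and only the names LocalAsset and S3Asset come from the proposal.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical base class: generic data, independent of where it lives."""
    title: str


@dataclass
class LocalAsset(Asset):
    location: str  # path on the local filesystem (assumed field)


@dataclass
class S3Asset(Asset):
    bucket: str  # assumed field
    key: str     # assumed field


class LocalHandler:
    def describe(self, asset: LocalAsset) -> str:
        return f"file://{asset.location}"


class S3Handler:
    def describe(self, asset: S3Asset) -> str:
        return f"s3://{asset.bucket}/{asset.key}"


# Registry keyed by Asset subclass: one handler per backend, all active at once.
HANDLERS = {LocalAsset: LocalHandler(), S3Asset: S3Handler()}


def handler_for(asset: Asset):
    """Return the handler registered for the asset's concrete class."""
    try:
        return HANDLERS[type(asset)]
    except KeyError:
        raise LookupError(f"no handler for {type(asset).__name__}") from None
```

With this layout, supporting a new backend would mean adding one Asset subclass and one registry entry, without touching the existing ones.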

Proposed By

@etj - Emanuele Tajariol

Assigned to Release

This proposal is for GeoNode 4.3 (?)

State

Motivation

Data may be stored on different backends
Data is not coupled with a single ResourceBase
May simplify the handling of basic data (images, etc.) in GeoStories, which at the moment use full-featured (and heavy) ResourceBase metadata even for simple files.
Allows the possibility to link a single ResourceBase with multiple data files (think for instance about a Document having multiple PDF files for different languages).
Allows the definition of a directory hierarchy as a single data asset, making it possible to publish complex data.
Simplifies/improves handling of authorization for data access.

Proposal

We introduce the concept of Asset as generic data, that may be linked to a ResourceBase.
A LocalAsset represents data stored in the filesystem (either a single file or a directory tree).

The Asset class will replace and augment the information stored at the moment in the ResourceBase.files field.

An Asset is associated with a Resource through a Link, which also tells the URL through which the Asset will be available to the GeoNode users.

For a LocalAsset, the URL may be a service that checks for authorization before returning the data.
For a RemoteAsset (whose implementation is not discussed in this proposal), it could be a copy of the RemoteAsset URL, or a local GeoNode endpoint that proxies the remote content, possibly providing some authentication to the remote service. This is just an example; we don't have any real use case at the moment.

Other usages for assets

Since the Asset object is quite simple, we could use it for other purposes as well; for instance, at the moment we use "unadvertised" ResourceBase instances for providing simple data to GeoStories (images, PDFs, ...). Instead of using such a heavy object, we could just use LocalAssets for this purpose.

Also, multiple Assets may be associated with an existing ResourceBase; this replicates what GeoNetwork already does, namely having multiple data resources pointed to by a single metadata record.

Permissions

In the future there could be different permissions for a Resource and its linked Assets. For the sake of simplicity, as a first step we may grant on the Asset the very same permissions as on the linked ResourceBases.

In the case we want to associate an Asset to more than one Resource, the Asset will be available if the user has download privileges on at least one of the associated Resources.

Implementation

Model:

Add Asset class, and its subclass LocalAsset
Refactor ResourceBase and Link classes

Logic:

Replace the usage of files with Asset instances

DB migration:

Documents:
- migration is straightforward, since there's a single file
Datasets:
- Single file --> Asset
- Shapefile (shp+dbf+shx+prj?) or other multifile formats--> ?
  - Multiple options according to Asset type:
    - create a zipfile ?
    - publish as the root entry of a directory? (There are cases where we want an Asset to be a directory with subdir)
    - Every Assethandler should be able to create its own Asset content and related link url
      - shp file should create a zip file, since such data is only used for download
  - What if there is a SLD (or any other satellite file)? is it a separate Asset?
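
For the shapefile case, the "create a zipfile" option could look roughly like the sketch below. It uses only the standard library; the function name `zip_shapefile` is hypothetical, and the list of sidecar extensions simply mirrors the one mentioned above (shp, dbf, shx, prj).

```python
import zipfile
from pathlib import Path

# Component extensions taken from the list above; .prj is treated as optional.
SHAPEFILE_PARTS = (".shp", ".shx", ".dbf", ".prj")


def zip_shapefile(shp_path: Path, target_dir: Path) -> Path:
    """Bundle a shapefile and its existing sidecar files into one zip archive."""
    archive = target_dir / f"{shp_path.stem}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for ext in SHAPEFILE_PARTS:
            part = shp_path.with_suffix(ext)
            if part.exists():  # skip components that were not uploaded
                zf.write(part, arcname=part.name)
    return archive
```

The resulting archive would then be the single file behind the Asset and its download link; whether an SLD belongs in the same archive is left open, as in the question above.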

API:

Make sure existing API still returns file info as before

Authorization

A user has access to an Asset's data iff that Asset is associated with at least one ResourceBase for which the user has download permissions.
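
The rule can be expressed as a small predicate. The sketch below is only a model of the logic: `Resource.downloaders` is a hypothetical stand-in for GeoNode's real permission checks, and users are plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Stand-in for a ResourceBase; `downloaders` is a hypothetical permission set."""
    title: str
    downloaders: set[str] = field(default_factory=set)


@dataclass
class Asset:
    title: str
    resources: list[Resource] = field(default_factory=list)


def can_access(user: str, asset: Asset) -> bool:
    """True iff the user may download at least one Resource linked to the asset."""
    return any(user in res.downloaders for res in asset.resources)
```

Note that, read literally, the rule makes an Asset with no associated Resource inaccessible to everybody, since `any()` over an empty list is false.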

Backwards Compatibility

API: old files array can be preserved in output
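
One way to preserve the old output would be to derive the flat list from the linked Assets at serialization time. This is a sketch under assumptions: the `location` field holding a list of paths, the `assets` attribute and the `serialize` function are hypothetical and do not describe the actual GeoNode serializers.

```python
from dataclasses import dataclass, field


@dataclass
class LocalAsset:
    location: list[str]  # assumed: paths of the files backing the asset


@dataclass
class Resource:
    title: str
    assets: list[LocalAsset] = field(default_factory=list)


def serialize(resource: Resource) -> dict:
    """Emit a legacy-style `files` array rebuilt from the resource's assets."""
    files = [path for asset in resource.assets for path in asset.location]
    return {"title": resource.title, "files": files}
```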

Future evolution

Decoupled uploads

A user may upload an Asset without having to associate it to a Resource.
Unassociated Assets may be used to automatically create ResourceBases and attach the asset to them.

Deprecate Documents

Once Assets gain their characterization, the Document object will no longer have much meaning, also considering that users upload as a Document any object that is not published as a Layer.
This means that we will be able to remove the Document class, and convert its instances into ResourceBases with an Asset handling the former document's data.

Feedback

Update this section with relevant feedback, if any.

Voting

Project Steering Committee:

Alessio Fabiani:
Francesco Bartoli:
Giovanni Allegri: +1
Toni Schoenbuchner: +1
Florian Hoedt:

Links

Remove unused links below.


etj · 2024-04-02T15:33:14Z

Refactoring data upload procedure

The initial Asset implementation could be completely hidden from the GeoNode user, since the changes are only applied to the backend logic.

When a user uploads some data, the original data will be saved as an Asset.

Then, some heuristic will find the type of the uploaded data:

If data is related to geographical data, a Dataset will be automatically created (so the implementation changes will be invisible to the end user)
If data is a doc, png, image etc, a Document will be created

When the Resource is created, the Asset pointing to the uploaded data is linked to the ResourceBase via a Link (the new nullable asset foreign key shall be added).
Examples:

If it is geographic data (shp, tiff, ...), it will be added as an "original data" link (saving such original data could be enabled or disabled via settings)
Documents should already be stored in a similar way

We may need to split the logic

Upload data and create an Asset
Create a Resource from an existing Asset
In this way, once we handle unassociated Assets, we may be able to run the creation of the related Resources in unattended commands.
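
The two-step split could be sketched as follows. Everything here is illustrative: the extension set, the `kind` strings and the function names are assumptions, and the real heuristic would likely be more involved than a suffix check.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Assumed list of geographic formats, for illustration only.
GEO_EXTENSIONS = {".shp", ".tif", ".tiff"}


@dataclass
class Asset:
    location: str


@dataclass
class Resource:
    kind: str  # "dataset" or "document"
    asset: Asset


def create_asset(uploaded_path: str) -> Asset:
    """Step 1: store the uploaded data as a standalone Asset."""
    return Asset(location=uploaded_path)


def guess_kind(asset: Asset) -> str:
    """Heuristic: geographic formats become Datasets, anything else a Document."""
    suffix = PurePath(asset.location).suffix.lower()
    return "dataset" if suffix in GEO_EXTENSIONS else "document"


def create_resource(asset: Asset) -> Resource:
    """Step 2: create a Resource from an existing Asset."""
    return Resource(kind=guess_kind(asset), asset=asset)
```

Because step 2 only needs an existing Asset, it could also be run later from an unattended command over Assets that have no Resource yet.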

etj · 2024-04-02T15:39:09Z

Authorization

An improvement comes for free with the Asset refactoring. At the moment downloadable files are public: if the URL of a data resource leaks from someone who has access to the Resource, anybody can use that URL to download the data file.

By checking the authorizations for the URL accessing the Assets' data, we'll add protection to the published data, allowing the download only to users having access to the Resource.

gannebamm · 2024-04-04T10:37:55Z

I generally like the idea of assets and forming a resource out of multiple assets. This was also discussed beforehand in a research data infrastructure group and we thought about using the Research Object Crate (RO-Crate) concept or the Annotated Research Object (ARC) concept as our 'assets'. It was just a short discussion and we have yet to do anything in terms of how to incorporate it into the GeoNode architecture. But these parts of the mentioned Motivation are of particular interest to us as research institutes:

Motivation
[...]

Allows the possibility to link a single ResourceBase with multiple data files (think for instance about a Document having multiple PDF files for different languages).

Allows the definition of a directory hierarchy as a single data asset, making it possible to publish complex data.

Here is an excerpt of the brief discussion the research infrastructure group had regarding RO-Crates. It is a bit outdated, but I think you get the gist of it:

Looking at other data portals like CKAN or OpenAgrar (based on MyCORE framework), you can describe a dataset which consists of multiple files/resources. Here are two examples:

https://demo.ckan.org/dataset/sample-dataset-1

https://www.openagrar.de/receive/openagrar_mods_00054877?lang=en

The latter example on a GeoNode instance:

https://atlas.thuenen.de/layers/geonode_data_ingest:geonode:bze_lw_standorte_verschleiert

Where the additional files are linked as documents:

There is a GeoNode developer workshop creating a so-called GeoCollection object to link multiple GeoNode ResourceBase objects together: https://docs.geonode.org/en/master/devel/workshops/index.html#create-your-own-django-app

My idea is to build on top of this concept and try to implement RO-Crate as a collection object: https://www.researchobject.org/ro-crate/

RO-Crates do use a metadata JSON to describe the Crate: https://www.researchobject.org/ro-crate/1.1/root-data-entity.html
In this JSON, datasets can be defined as web resources: https://www.researchobject.org/ro-crate/1.1/data-entities.html#web-based-data-entities

Most (all?) of the listed attributes of those datasets can be read by the GeoNode API for the bundled resources. Therefore, you only need to describe the ROCrate bundle itself.
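
To make the shape of such a bundle concrete, here is a minimal sketch that builds an RO-Crate 1.1 metadata document whose parts are web-based data entities. The function name and the example URLs are hypothetical, and the root entity is deliberately incomplete: the 1.1 specification requires further root properties (description, datePublished and license among them) that are omitted here for brevity.

```python
import json


def build_ro_crate(name: str, resource_urls: list[str]) -> dict:
    """Build a minimal RO-Crate 1.1 metadata document for web-based entities."""
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "hasPart": [{"@id": url} for url in resource_urls],
    }
    # Web-based data entities use their absolute URL as @id.
    entities = [{"@id": url, "@type": "File"} for url in resource_urls]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *entities],
    }


if __name__ == "__main__":
    # Placeholder URL; a real bundle would point at GeoNode resource links.
    print(json.dumps(build_ro_crate("bundle", ["https://example.org/a.csv"]), indent=2))
```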

I do not propose using RO-Crates as base implementation for assets! I just wanted to make clear that the underlying motivation is interesting for a part of the GeoNode community.

gannebamm · 2024-04-04T10:50:54Z

@etj in the implementation ERD diagram, the link from ResourceBase (through Link) to Asset is shown as 0..1. Shouldn't this be 0..n, since multiple Assets can form one ResourceBase?

In a settings file in which the storing of original data is disabled, there will be no Asset for a Dataset. In our workflow, datasets are often ingested via PostgreSQL directly and then registered with the updatelayers command. For those, there is also no Asset per se. Do you think this will be an issue? It is marked as 0..n, so it should be ok from a database model standpoint, but is it ok from a user's perspective?

etj · 2024-04-04T11:06:13Z

@gannebamm, about cardinalities:
An Asset instance is an internal representation of files or data. Each Asset is presented to the external clients as a Link.
So

a Link is associated to 0..1 Asset. For instance, Links to WMS resources do not have associated Assets.
a ResourceBase is associated to 0..N Links (as already it is now)

Datasets or other ResourceBase can have no associated Assets at all, as in the case of Datasets only related to GeoServer layers.
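
These cardinalities can be modelled with plain dataclasses, as in the sketch below. It is not the Django model: field names and the example URLs are made up, and the `assets()` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Asset:
    title: str


@dataclass
class Link:
    url: str
    asset: Optional[Asset] = None  # 0..1: e.g. a WMS link carries no Asset


@dataclass
class ResourceBase:
    title: str
    links: list[Link] = field(default_factory=list)  # 0..N links

    def assets(self) -> list[Asset]:
        """Assets reachable from this resource through its links."""
        return [link.asset for link in self.links if link.asset is not None]
```

A resource with several asset-bearing links therefore reaches several Assets, while a Dataset that only has GeoServer links reaches none.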

ridoo · 2024-04-12T14:15:48Z

@etj I like the idea of making things more flexible here. I took some time to think about the GNIP and want to make some comments, also by sprinkling in questions and personal opinions. However, I cannot foresee which components and workflows (e.g. geonode-importer) will have to be touched in the end.

Technical questions

By having different subclasses of Asset (e.g. LocalAsset, S3Asset, ...) we may have a GeoNode instance handling datafiles on different data store backends.

Does this mean that each asset has its own StorageManager/-Handler where actual download is being delegated to? Does this complement or even rescind the changes you did recently to the DownloadHandler?

What if there is a SLD (or any other satellite file)? is it a separate Asset?

To me, this is definitely an asset on its own, which could also be applied to multiple resources. However, what about differentiating XML files which shall serve as an asset from those XML files to be interpreted as metadata files?

Backwards Compatibility
API: old files array can be preserved in output

I could not find a files field in the resource API. As far as I can see, ResourceBase.files includes the local paths to the files uploaded originally. Right now, it is unclear to me whether these are used anywhere (besides extracting some metadata, e.g. exif, during the import process).

This means that we will be able to remove the Document class, and convert its instances into ResourceBases with an Asset handling the former document's data.

So ResourceBase is going to become a first class citizen and serves as a logical brace for simple assets, right?

From the end user perspective

Is an asset available from the catalogue? From your UML an asset can live without any resource. How are assets managed in GeoNode?
How does "download" as we know it so far relate to "downloading" an Asset? Can we make the terms concise here? Should we differentiate "exporting" things (resources) from "downloading" things (assets)?
Taking your example of a Document referring to multiple translated PDF files: what does the user actually download when "downloading" such a document (ResourceBase)? A zipped file containing all related assets? Maybe the user should be able to select the actual asset(s) of interest.
How do you separate assets which naturally belong to the resource from those linked to it later? There might be many linked assets over time, both those of interest and those not of interest at all. How do we enable the user to reason about which may be of interest for her use case?

Opportunities

Besides those opportunities you mentioned already, I see the following:

As @gannebamm mentioned, the Asset concept could be leveraged to assemble download packages
Make the link ResourceBase ➡️ Asset semantically pluggable. This could be even extendable by uploaded ontologies.

giohappy · 2024-04-15T15:43:42Z

@ridoo thanks for your comments. You touched on several points that we also included in our discussion. Many of them will probably come in the future, since the concept of Assets could bring Copernican changes to GeoNode in many ways...

Let me explain the current scope of this proposal first.
Assets will provide the foundation for many use cases. The ones we're facing now are:

publishing of 3D Tiles assets. We want to be able to "attach" a 3D tiles folder to a resource and serve it to the client, which already supports 3D tiles visualization
porting of an existing GeoNetwork catalog, which also contains collections of files in single records

We're not going to cover the management of Assets from the GeoNode UI in this initial implementation. For the moment we only want to prepare the models to support present and future functionalities. We assume that the resource has been configured with assets in some way (DB operations, Django Admin, whatever).

an Asset could be selected as "primary" (maybe there is a better name?) to indicate that it is the source this ResourceBase represents (also for previews, etc.). We could also have resources without any "primary". In that case, the resource will only represent a collection of assets, and only the metadata detail page will be available. @etj could you provide more details about this aspect?
an Asset can live without a resource, yes in theory. BTW the initial implementation will transform the upload of a resource into the following steps, so the "free asset" won't be visible to end users:
- a file is uploaded
- an asset is created from this file (this is the status where the asset lives on its own)
- a new resource (dataset, document) is created and the asset assigned to it
Assets will be downloaded from links that will be visible in a new tab of the info panel. Download of the resource itself will follow the current approach for the moment (WPS for datasets, etc.), but we will have the option to restore the possibility to also download the original "primary" asset. This is one of the reasons the resource downloads API field was made an array ;)

@etj can you please confirm, correct, extend the points above?
I also agree with @ridoo that the point about the files field in the API should be clarified.

ridoo · 2024-04-17T11:07:10Z

@giohappy @etj looking forward to some details.

You mentioned "copernican changes". To me, this sounds bigger than 'just' introducing an additional concept (here Asset). Just to stay curious: Is there more you have in mind?

giohappy · 2024-04-17T13:08:30Z

You mentioned "copernican changes". To me, this sounds bigger than 'just' introducing an additional concept (here Asset). Just to stay curious: Is there more you have in mind?

Let's say that this is the first step that could lead to more important changes in the future. We don't actually have a roadmap, but making the relation between Catalog resource and Data source explicit could:

lighten the Resource model and its polymorphic derivatives
simplify the publishing of new real or virtual data sources

Regarding the details on the points discussed in your comment, we need to wait for @etj, who is working hard these days to connect the dots and prepare more information to share :)

giohappy · 2024-04-29T12:23:13Z

@ridoo after reviewing with @etj the status of this PR (which is ready for review), we confirm that it neither changes nor adds features to GeoNode for the moment. In terms of public APIs and functionality it behaves exactly the same as before, with files and the single local storage manager replaced by Assets and their specific storage and download managers.

The next steps will be the implementation of the "primary" asset concept and the multiplicity of assets that can be assigned/downloaded to/from a resource. It will come with a new GNIP.

etj assigned etj and mattiagiupponi Apr 2, 2024

etj added the gnip A GeoNodeImprovementProcess Issue label Apr 2, 2024

mattiagiupponi added a commit that referenced this issue Apr 10, 2024

[Fixes #12124] Implementation of assets

a6bbe39

mattiagiupponi added a commit that referenced this issue Apr 11, 2024

[Fixes #12124] Implementation of assets

a56d39c

etj added a commit that referenced this issue Apr 22, 2024

[Fixes #12124] GNIP 100: Assets

ad98179

etj added a commit that referenced this issue Apr 22, 2024

[Fixes #12124] GNIP 100: Assets

03205b1

etj added a commit that referenced this issue Apr 23, 2024

[Fixes #12124] GNIP 100: Assets

ef0d851

etj added a commit that referenced this issue Apr 23, 2024

[Fixes #12124] GNIP 100: Assets

c04e13a

etj added a commit that referenced this issue Apr 23, 2024

[Fixes #12124] GNIP 100: Assets

7d0f19c

etj added a commit that referenced this issue Apr 23, 2024

[Fixes #12124] GNIP 100: Assets

36f0de3

etj added a commit that referenced this issue Apr 24, 2024

[Fixes #12124] GNIP 100: Assets

ce319e1

etj added a commit that referenced this issue Apr 24, 2024

[Fixes #12124] GNIP 100: Assets

49d489b

etj added a commit that referenced this issue Apr 24, 2024

[Fixes #12124] GNIP 100: Assets

535d580

mattiagiupponi added a commit that referenced this issue Apr 29, 2024

[Fixes #12124] Fix broken tests

32327a9

mattiagiupponi added a commit that referenced this issue Apr 30, 2024

[Fixes #12124] Fix broken tests

a6cf9f9

mattiagiupponi added a commit that referenced this issue Apr 30, 2024

[Fixes #12124] Fix broken tests

7737447

mattiagiupponi added a commit that referenced this issue Apr 30, 2024

[Fixes #12124] Fix broken tests

47c6c13

mattiagiupponi added a commit that referenced this issue Apr 30, 2024

[Fixes #12124] Fix broken tests

80311a5

mattiagiupponi added a commit that referenced this issue Apr 30, 2024

[Fixes #12124] Fix broken tests

1ec7613

etj added a commit that referenced this issue May 6, 2024

[Fixes #12124] GNIP 100: Assets - improve model

9d1697b

etj added a commit that referenced this issue May 7, 2024

[Fixes #12124] GNIP 100: Assets - fix resource manager field list

e7c8519

etj added a commit that referenced this issue May 7, 2024

[Fixes #12124] GNIP 100: Assets - minor test "in" fix

c1028fb

etj added a commit that referenced this issue May 7, 2024

#12124 Remove keepdb from test.sh

1d023cc

etj added a commit that referenced this issue May 7, 2024

#12124 Remove keepdb from test.sh

bb6c54e

etj added a commit that referenced this issue May 7, 2024

[Fixes #12124] GNIP 100: Assets - improvements in clone and delete

5806872

etj added a commit that referenced this issue May 9, 2024

[Fixes #12124] GNIP 100: Assets - add tests and (un)managed files

291f93e

etj added a commit that referenced this issue May 9, 2024

[Fixes #12124] GNIP 100: Assets - tests

64b4dee

etj added a commit that referenced this issue May 9, 2024

[Fixes #12124] GNIP 100: Assets - improve file handling

1e3b401

etj added a commit that referenced this issue May 10, 2024

[Fixes #12124] GNIP 100: Assets - cleanup tmp doc file

da90a2a

etj linked a pull request May 10, 2024 that will close this issue

[Fixes #12124] GNIP 100: Assets #12179

Draft

12 tasks

etj added a commit that referenced this issue May 13, 2024

[Fixes #12124] GNIP 100: Assets

8e9ed53

etj mentioned this issue May 13, 2024

Assets #12225

Draft

12 tasks
