Editing distribution inside a dataset creates new distribution #4054

Open
stefan-korn opened this issue Nov 9, 2023 · 8 comments

@stefan-korn
Contributor

Describe the bug

Editing an embedded distribution inside a dataset (embedding distributions in the dataset is the default in DKAN) creates a new distribution node every time something in the distribution is changed. Is this intended behavior? I would not think so.

As far as I can see, the reason is that the function checkExistingReference() in the Referencer class searches by distribution title, and that title is a hash of all the values of the distribution. If the values have changed, the hash is no longer the same, so checkExistingReference() finds no match and a new distribution is created.
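
The mechanism described above can be sketched as follows. This is a minimal Python illustration, not DKAN's PHP code: `nodes_by_title`, `value_hash()` and `reference()` are hypothetical names, and the MD5-over-JSON hash is an assumption standing in for whatever hash the Referencer actually computes.

```python
import hashlib
import json
import uuid

# Hypothetical in-memory stand-in for referenced nodes: title (hash) -> UUID.
nodes_by_title = {}


def value_hash(values):
    """Hash all values of a referenced item (assumed: MD5 over sorted JSON)."""
    canonical = json.dumps(values, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def reference(values):
    """Return the UUID of a node whose title matches the hash, else create one."""
    title = value_hash(values)
    if title in nodes_by_title:
        return nodes_by_title[title]
    new_uuid = str(uuid.uuid4())
    nodes_by_title[title] = new_uuid
    return new_uuid


original = {"title": "CSV export", "downloadURL": "https://example.com/data.csv"}
edited = dict(original, title="CSV export (2023)")

first = reference(original)   # creates a node
same = reference(original)    # unchanged values: the existing node is found
second = reference(edited)    # any changed value: lookup misses, new node
```

Because the lookup key is derived from the content, an edit is indistinguishable from a brand-new item, which matches the behavior reported here.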

This does not only affect distributions; it would affect any other referenced item that can be edited inside another item.

Sorry if these are stupid questions:
Why is this hash of the data used as the title of a referenced node?
Couldn't the UUID be used as the title for referenced nodes? The reference is linked by UUID as identifier anyway.
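
A minimal sketch of the UUID-based lookup asked about here, in illustrative Python rather than DKAN's PHP. `nodes_by_uuid` and `save_reference()` are hypothetical names; the sketch assumes the embedded item carries its identifier when the dataset is saved.

```python
import uuid

# Hypothetical in-memory stand-in for referenced nodes: UUID -> values.
nodes_by_uuid = {}


def save_reference(values, identifier=None):
    """Update the node with a known UUID in place, otherwise create a new node."""
    if identifier is not None and identifier in nodes_by_uuid:
        nodes_by_uuid[identifier] = dict(values)
        return identifier
    new_id = str(uuid.uuid4())
    nodes_by_uuid[new_id] = dict(values)
    return new_id


dist_id = save_reference({"title": "CSV export"})
edited_id = save_reference({"title": "CSV export (2023)"}, identifier=dist_id)
```

With this lookup an edit keeps the same node, but the previous values are overwritten, so anything that still points at the UUID sees only the latest values.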

Steps To Reproduce

Create a dataset with a distribution.
Check admin/dkan/datasets for distributions and see the distribution appear once.
Edit the created dataset and change something in the distribution.
Check admin/dkan/datasets for distributions again and see that another distribution has been created.

Expected behavior

Editing a distribution inside a dataset should not create a new distribution.

@github-actions github-actions bot added this to Incoming/Triage in DKAN 2 Issue Triage Nov 9, 2023
stefan-korn added a commit to stefan-korn/dkan that referenced this issue Nov 9, 2023
@stefan-korn
Contributor Author

Please see PR #4055 for an idea of how to handle this differently, without creating a new distribution (or other embedded element) every time.

stefan-korn added a commit to stefan-korn/dkan that referenced this issue Nov 10, 2023
stefan-korn added a commit to stefan-korn/dkan that referenced this issue Nov 29, 2023
@dafeder
Member

dafeder commented Dec 9, 2023

@stefan-korn for now it is expected behavior, although we are aware it is counter-intuitive. Because distributions are linked to datasets through a reference, and the reference system as it exists now only uses node UUIDs and not version IDs, there is no way to see previous versions of the dataset if we don't know which version of the distribution to load inside of it. So for now, datasets are versioned but referenced items are not, and are simply saved as new nodes so that both the old and new revisions of the dataset will be dereferenced to show the intended values for the distributions.
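
The dereferencing argument can be illustrated with a small Python model. It is a sketch under assumptions, not DKAN code: the dictionaries, the `vid` field and `dereference()` are hypothetical, and references are modeled as bare UUID strings as described above.

```python
def dereference(revision, distributions):
    """Swap each distribution UUID in a dataset revision for its stored values."""
    return [distributions[ref] for ref in revision["distribution"]]


# Current behavior: each edit saves a new distribution node.
distributions = {
    "dist-a": {"title": "CSV export"},
    "dist-b": {"title": "CSV export (2023)"},
}
revisions = [
    {"vid": 1, "distribution": ["dist-a"]},
    {"vid": 2, "distribution": ["dist-b"]},
]
old_view = dereference(revisions[0], distributions)

# Alternative: update one node in place, so both revisions share a UUID.
in_place = {"dist-a": {"title": "CSV export (2023)"}}
in_place_revisions = [
    {"vid": 1, "distribution": ["dist-a"]},
    {"vid": 2, "distribution": ["dist-a"]},
]
broken_old_view = dereference(in_place_revisions[0], in_place)
```

In the first case revision 1 still resolves to its original values; in the second, revision 1 silently shows the edited title because a UUID alone cannot say which version of the distribution it meant.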

I think your PR would suffer from the same issue. We are looking into better solutions for this, since the way it is now does work but doesn't follow expected patterns in Drupal and creates an impractical number of distributions in a lot of cases. One question is, should distributions be referenced at all? Maybe they don't need to be, and we could just store the array of distributions within the dataset node...

If they do need to be stored, we should probably figure out a way to revision them so that we don't keep making new ones. But then we need to rethink references to include the version ID, otherwise we could be showing incorrect data for old revisions of the dataset.
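
One way to picture references that include the version ID is the Python sketch below. The data layout is an assumption made for illustration (the `uuid`/`vid` pair, the dictionary keys and `dereference()` are hypothetical); it is not a description of how DKAN or Drupal store revisions.

```python
# Hypothetical store of distribution revisions, keyed by (UUID, version ID).
distribution_revisions = {
    ("dist-a", 1): {"title": "CSV export"},
    ("dist-a", 2): {"title": "CSV export (2023)"},
}

# Each dataset revision records which distribution revision it was saved with.
dataset_revisions = {
    1: {"distribution": [{"uuid": "dist-a", "vid": 1}]},
    2: {"distribution": [{"uuid": "dist-a", "vid": 2}]},
}


def dereference(dataset_vid):
    """Resolve a dataset revision's references to the exact distribution revisions."""
    refs = dataset_revisions[dataset_vid]["distribution"]
    return [distribution_revisions[(ref["uuid"], ref["vid"])] for ref in refs]


# Only one distribution entity exists, however many times it was edited.
distinct_nodes = {node_uuid for node_uuid, _ in distribution_revisions}
```

Here the number of distribution entities stays at one, while each dataset revision still resolves to the values it was saved with.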

@stefan-korn
Contributor Author

stefan-korn commented Dec 11, 2023

@dafeder : Thanks for the explanation.

What do you mean by

"One question is, should distributions be referenced at all? Maybe they don't need to be, and we could just store the array of distributions within the dataset node"

No node for the distributions, saving them together with the dataset? Technically, doesn't this option already exist, by unchecking this metastore setting?

"Dataset properties to be stored as separate entities"

Though this probably won't work because of some special handling of distributions. More generally, I suppose the problem of creating new entities for references also applies to other properties that are stored as separate entities and edited inside the dataset.
One thing that looks a bit difficult to me is that the hash of the property's values decides whether a new entity is created. If only a few values of the property can be edited inside the dataset, while the full range of values can be edited only when the property is edited separately, this may cause trouble (though I am not sure whether this is considered valid practice now).

@dafeder
Member

dafeder commented Dec 11, 2023

"No node for the distributions, saving them together with dataset? Technically there is this option now with unchecking this metastore setting?"

There is. I think the datasets would save but most likely other things would break. Certainly datastore would not work, and the frontend would have to be refactored. But I think in general we are not particularly well-served by having distributions be saved separately. The important thing is the dataset-to-resource relationship and the distribution ID/reference complicates it with no real benefits as far as I can see. So factoring out the distribution-specific code and just finding a way to signal to DKAN where resource URLs can be found in the dataset object would I think resolve a lot of these problems and also open the door to more diverse schema structures we could support.
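
As a rough illustration of "signaling where resource URLs can be found in the dataset object", the Python sketch below reads URLs straight from distributions embedded in the dataset. It is only one possible shape: `resource_urls()` and its `list_key`/`url_key` parameters are hypothetical, and the `distribution`/`downloadURL` property names are assumed from DCAT-US style metadata.

```python
def resource_urls(dataset, list_key="distribution", url_key="downloadURL"):
    """Collect resource URLs from items embedded directly in the dataset object."""
    items = dataset.get(list_key, [])
    return [item[url_key] for item in items if url_key in item]


dataset = {
    "title": "Example dataset",
    "distribution": [
        {"title": "CSV export", "downloadURL": "https://example.com/data.csv"},
        {"title": "Landing page", "accessURL": "https://example.com/about"},
    ],
}

urls = resource_urls(dataset)
```

Making the two keys configurable is what would let other schema structures point at their own resource locations without distribution-specific code.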

@dafeder
Member

dafeder commented Dec 11, 2023

Another reason for having distributions decoupled from datasets was that in theory you could have datasets that are published with distributions that are not. But I think few people are doing this and the same thing could be accomplished by having a published version of the dataset w/o the distribution and an unpublished draft that has it.

@stefan-korn
Contributor Author

@dafeder : Coming back to your explanations about distributions, I would like to know whether you would prefer (in the long term) to get rid of any "Dataset properties to be stored as separate entities" in the metastore (/admin/dkan/properties). Or do you still see this as a valid approach?

I am currently thinking about integrating publishers into Search API by providing a SearchApiDatasource and ComplexDataFacade, analogous to how it is done for the dataset. This would rely on publishers being saved as separate entities. So if DKAN is moving away from the concept of saving dataset properties as separate entities, I might not want to go this way with the Search API integration.

@dafeder
Member

dafeder commented Apr 5, 2024

@stefan-korn it is a bit of an open question still, to be honest. I think having distributions as separate entities is maybe more trouble than it's worth, but there are a lot of situations where publishers really need to exist in the system somehow independently of the datasets. I would love to hear more about your use case and experience, maybe we can find a way to connect outside of GitHub? :)

@dafeder dafeder self-assigned this Apr 5, 2024
@stefan-korn
Contributor Author

@dafeder : Thanks for your reply. We are now relying in some cases on "Dataset properties to be stored as separate entities", so hopefully you will not remove this feature entirely in the future.

I still remember your offer to connect outside GitHub. We have not managed to schedule anything on our end yet, but I hope to come back to your offer in the near future.
