Move datasets to delete first in line #261

Open · wants to merge 2 commits into master
Conversation

jbrown-xentity

We have reports at data.gov of datasets that get re-harvested with an extra `1` in the URL, and we have confirmed these reports. It seems the harvester does its best to determine whether a record is a new dataset or not, but it still fails in some circumstances. This change probably won't fix the underlying bug, but it will mitigate it: by running the dataset removals first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, the name of the new dataset won't collide with the one that is marked for deletion but still in the system. This keeps the URL the same and doesn't break as many workflows.
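To make the ordering argument concrete, here is a minimal, self-contained sketch (the function and variable names are illustrative only, not the actual harvester code) of why processing removals first avoids the `1` suffix:

```python
# Hypothetical model: when a dataset's old record still holds its name,
# CKAN-style collision handling appends a counter; running removals first
# frees the name before the "add" happens.

def next_free_name(name, taken):
    """Mimic name-collision handling: append a counter until free."""
    candidate, i = name, 1
    while candidate in taken:
        candidate = f"{name}{i}"
        i += 1
    return candidate

def run_harvest(jobs, taken, deletions_first):
    """Process ('delete'|'add', name) jobs; order decides collisions."""
    if deletions_first:
        # Stable sort: all deletes move ahead of adds.
        jobs = sorted(jobs, key=lambda j: j[0] != "delete")
    created = []
    for action, name in jobs:
        if action == "delete":
            taken.discard(name)
        else:
            new_name = next_free_name(name, taken)
            taken.add(new_name)
            created.append(new_name)
    return created

# The same "delete and add" of my-dataset, in both orders:
print(run_harvest([("add", "my-dataset"), ("delete", "my-dataset")],
                  {"my-dataset"}, deletions_first=False))  # ['my-dataset1']
print(run_harvest([("add", "my-dataset"), ("delete", "my-dataset")],
                  {"my-dataset"}, deletions_first=True))   # ['my-dataset']
```

With removals last, the add sees the name as taken and the URL gains a `1`; with removals first, the name is free again by the time the add runs.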

@amercader
Member

@jbrown-xentity It's been a long time since I worked on this, but IIRC the harvesters call `package_delete` to delete a dataset, which marks it as deleted but leaves it in the database (as opposed to a `package_purge` call), which means the dataset name can't be reused when creating a new one. Can you expand on why changing the order in which "to delete" harvest objects are created helps in this case? (I'm sure the changes help, I just want to understand better.)
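To illustrate the soft-delete vs. purge distinction in a self-contained way (a toy model, not CKAN's actual implementation), a soft delete keeps the row and therefore the name, while a purge frees it:

```python
# Toy catalog: package_delete-style soft delete keeps the name reserved;
# a purge removes the row entirely so the name can be reused.

class Catalog:
    def __init__(self):
        self.packages = {}  # name -> state

    def create(self, name):
        if name in self.packages:
            raise ValueError(f"name {name!r} already in use")
        self.packages[name] = "active"

    def delete(self, name):
        self.packages[name] = "deleted"  # soft delete: row stays

    def purge(self, name):
        del self.packages[name]  # hard delete: name is free again

c = Catalog()
c.create("coastal-data")
c.delete("coastal-data")
try:
    c.create("coastal-data")       # still collides after a soft delete
except ValueError:
    print("soft-deleted name still taken")
c.purge("coastal-data")
c.create("coastal-data")           # succeeds after a purge
```

This is why reordering alone only narrows the window: as long as the old record is merely soft-deleted, its name remains reserved.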

If the harvest is managing the datasets in CKAN, it seems that the harvest source should be the "source of truth". If that is the case, we shouldn't need the ability to "revive" soft-removed packages/datasets in CKAN; I propose to actually purge the dataset within CKAN.

Since it's difficult, if not impossible, to track these files without a unique id, the harvester will sometimes delete and create a new item if the WAF or files change in any way. Purging would keep that behind the scenes and allow the end user to reach the same dataset at the old URL.
@jbrown-xentity
Author

@amercader no, I believe you're right: we would need to purge the dataset. I forgot about that functionality. I believe we actually should be purging; I don't see a likely scenario where a user would want to keep or "revive" a dataset that was harvested and has since been removed from the source. I updated the PR to use the "purge" command instead of "delete".

@ccancellieri
Contributor

I'm experiencing a problem after having purged a harvested dataset. On the next loop it will not be harvested anymore, since the HarvestObject is still there tracking the date of last modification. As a result you have to go into the DB (as I'm doing) and remove the harvest object by GUID.

I think the purge should eventually take care of the harvest object as well, or (since core can't depend on an extension) we have to provide a purge for the harvest object table.
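For reference, the manual DB workaround described above might look something like the following. This is a sketch against the schema used by ckanext-harvest; the dependent-table names are an assumption, so verify them against your own database before running anything:

```sql
-- Hypothetical cleanup: drop the stale tracking row so the source record
-- is treated as new on the next harvest run. Dependent rows (for example
-- in harvest_object_error or harvest_object_extra) may need deleting
-- first if foreign keys block this.
DELETE FROM harvest_object WHERE guid = '<guid-of-purged-dataset>';
```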
