Make harvester more defensive against bad data #4136

Open
2 tasks
swirtSJW opened this issue Feb 23, 2024 · 3 comments

swirtSJW commented Feb 23, 2024

User Story

As a person monitoring dataset harvesting, it would be nice if the harvester did some additional pre-harvest validation to prevent bad data from being imported.

Acceptance Criteria

  • Prevent harvesting if the data is not valid UTF-8
  • Prevent harvesting if a BOM character is present (or maybe silently remove it?)
    • This likely needs to be an option that is off by default, as apparently not all data streams care.

There may be some additional checks needed, but I need to determine whether they are generic enough to apply to all, or whether they need to be optional.
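
The two checks above could look roughly like the following. This is a minimal, language-neutral sketch in Python, not DKAN code: `check_payload`, `BadHarvestData`, and the `bom_policy` parameter are hypothetical names chosen for illustration. The BOM handling defaults to "ignore" to match the "off by default" idea, with "reject" and "strip" as opt-in behaviors.

```python
import codecs


class BadHarvestData(Exception):
    """Hypothetical error raised when a payload fails pre-harvest validation."""


def check_payload(raw: bytes, bom_policy: str = "ignore") -> str:
    """Validate raw harvest bytes and return the decoded text.

    bom_policy (hypothetical option):
      "ignore" - leave a leading BOM untouched (default, i.e. check is off)
      "reject" - refuse the payload if it starts with a UTF-8 BOM
      "strip"  - silently drop a leading UTF-8 BOM
    """
    if bom_policy not in ("ignore", "reject", "strip"):
        raise ValueError(f"unknown bom_policy: {bom_policy!r}")

    if raw.startswith(codecs.BOM_UTF8):
        if bom_policy == "reject":
            raise BadHarvestData("payload starts with a UTF-8 byte order mark")
        if bom_policy == "strip":
            raw = raw[len(codecs.BOM_UTF8):]

    try:
        # Strict decoding fails on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise BadHarvestData(f"payload is not valid UTF-8: {err}") from err
```

A caller would run this on the raw response body before any parsing, and treat `BadHarvestData` as the signal to stop the harvest.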
swirtSJW self-assigned this Feb 23, 2024
github-actions bot added this to Incoming/Triage in DKAN 2 Issue Triage Feb 23, 2024
swirtSJW (Author) commented

I am still sorting out what is specific to PDC and what is generic enough for broader use.

swirtSJW changed the title from "Make harvester more defensive bad data" to "Make harvester more defensive against bad data" Feb 23, 2024
dafeder (Member) commented Apr 5, 2024

At minimum, I think it makes sense to document how to implement this kind of "kill switch" in a Harvest Extract class.
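
To make the "kill switch" idea concrete, here is a rough sketch of the pattern in Python. It is not the DKAN Harvest Extract API: `GuardedExtract`, `HarvestAborted`, and the `fetch` callable are hypothetical, and the sketch assumes the source is a JSON catalog with a top-level `dataset` list. The point is only that the extract step validates everything first and raises before returning any items, so nothing downstream runs on bad data.

```python
import json


class HarvestAborted(Exception):
    """Hypothetical error that stops a harvest run before any import happens."""


class GuardedExtract:
    """Hypothetical extract step with a kill switch.

    All validation happens before anything is returned, so a failure
    means the transform and load steps never see the data.
    """

    def __init__(self, fetch):
        # fetch: any zero-argument callable returning the raw bytes
        # of the catalog (an assumption made for this sketch).
        self.fetch = fetch

    def run(self):
        raw = self.fetch()

        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            raise HarvestAborted("source is not valid UTF-8") from err

        try:
            catalog = json.loads(text)
        except json.JSONDecodeError as err:
            raise HarvestAborted("source is not valid JSON") from err

        datasets = catalog.get("dataset") if isinstance(catalog, dict) else None
        if not isinstance(datasets, list):
            raise HarvestAborted("catalog has no 'dataset' list")

        return datasets
```

The code that drives the harvest would catch `HarvestAborted`, log the reason, and leave previously harvested datasets untouched.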

swirtSJW (Author) commented Apr 5, 2024

I will be working on this in the next sprint.
