Replies: 10 comments 1 reply
-
Reposting WASM target from the qsv repo:
IMHO, this will really modernize CKAN's UI/UX experience, and greatly reduce data publishing friction. It can even incentivize usage and encourage users to populate their catalogs, even if they're not publishing the data and the datasets are private - with CKAN operating more as an Enterprise Data Management System/Data Exchange - as they get:
-
Beyond guaranteed data type inferences by qsv, the following can be done by WASM qsv on the client side:
Currently, these are all done by Datapusher+ in a performant manner, but as @wardi pointed out, this is still hampered by the legacy workflow. Doing these pre-processing activities on the client side makes more sense. It also saves on bandwidth. Additional pre-processing tasks that can be done on the client side with WASM qsv are:
-
As mentioned in the dev call, we've been using a small extension, Schema-Field-Groups (https://github.com/derilinx/ckanext-schema_field_groups/), on a few of our sites to provide a way to segment the metadata for a dataset. This provides a way to hide the less commonly used portions of the metadata while surfacing the common stuff on the first few tabs. In terms of resource-level metadata, I'd really like the data dictionary to be more available. We're currently using it in 'dataset at a time' sorts of things, but it would really be useful if it were pushed into Solr so that it could be used effectively in dataset listings and searches.
-
I've mentioned one of my projects that doesn't allow sharing the original implementation (at least, for now), but I've managed to extract a part of it into a separate plugin. This functionality allows fragment creation for the dataset/resource. I've mentioned use cases a bit similar to "resource-first dataset creation" and "multistep dataset form" in the "Where you can use it?" section of the plugin's readme. As for independent file entities, their implementation is a bit more low-level. I'll try to extract and share it in a few weeks. Or, if it stays as hacky as it is right now, I'll just start porting it into CKAN core.
-
The main drive for a resource-first approach (from my client) was to support quotas [and optionally billing]. The goals (so far as I remember them years later):
IIRC the idea to support the last two in ckanext-cloudstorage was to add
-
Follow-up and some further thinking on this: @jqnatividad has found some limitations to running
It might still be worth exploring developing an uploader in webassembly that can identify the column headings and compute a hash while uploading. That would be light enough to run on a single thread and could allow entering some column details while the data is loaded on the server side in the background. It could also prevent unnecessary updates (notify the user that the file hasn't changed before they submit, and prevent a re-upload).

So, we can't avoid some server-side processing of uploaded data, and it would be much nicer for that to happen separately from the resource form in some cases. If files can be uploaded and linked to the user by default then we could trigger background jobs, and with @EricSoroos's ckanext-schema_field_groups some view modifications could be used to present the multi-stage dataset creation form after the files are uploaded, without major changes to core.

But building a whole new set of views to handle new types of dataset creation would let users save their changes as they move through any kind of desired workflow. I believe this is what @smotornyuk means by using ckanext-flakes for a multistep dataset form. This would have to handle editing as well, and sync validation rules from ckanext-scheming (if we're still using that to define the schema).

I wonder though, with "real" separate forms for each stage of a multistep dataset form, would API users be willing to accept having their fields grouped in the same way in the metadata? I expect that moving fields around between the steps will be a common client requirement, so having the fields decoupled from the form in which they appear, to maintain API stability, would be desirable.

Having actual datasets as the "final product" and a separate form workflow for creating or editing them seems really nice, but there will be duplication of logic, and it will be tricky to share custom validators and form snippets if we're using ckanext-scheming to define the schemas and forms.
ckanext-scheming was really built with all the IDatasetForm and CKAN validation assumptions baked in. We might be better off building the feature into ckanext-scheming (i.e. extending the yaml to describe multiple forms and the default workflow) or using a completely different way to define the forms and validation rules.
-
On client-side schema inferring: if WASM is still not a viable option (I don't think it is as a default option for all users), we can definitely leverage previous work on ckanext-validation that used JS Frictionless Data components to infer the schema on the client. It might not be as accurate as
Everything in this demo is horrible looking and has been replaced with more modern alternatives (it's still CKAN 2.8), but it works as a proof of concept to get an idea of how it would work (code is here).
ckanext-validation.mp4
Perhaps we can use that as a first implementation, and once the WASM option becomes more widely supported we refactor the schema-inferring logic.
-
@amercader your approach seems to be the most practical at the moment. And yes, as you pointed out, we can still do some server-side inferencing with
As for WASM, qsv does have a sniff command that's lighter-weight than the
It doesn't offer the same inferencing guarantees as
IMHO, "perfect is the enemy of good", and an ideal upload workflow will necessarily involve informed experiments. And yes @wardi, it seems ckanext-scheming is the logical fulcrum point to really have "better dataset creation", as users can add the workflow/business-rule components to the yaml file.
-
@amercader that looks really nice. It should be straightforward to repurpose the code to pre-populate data dictionary values instead of ckanext-validation's JSON schema field, or maybe just to offer a preview of the data before it's uploaded right in the form, so that users know they chose the right file, i.e. select the file in the form and the columns and details appear immediately below. This could be the start of a generally improved resource uploader. If uploads are separated from the resource model we could even start uploading immediately in the background, with a slick little progress bar going while users enter other resource metadata.
-
@jqnatividad yes, I'm leaning that way too. It would be really nice to have the pages of metadata entry described in scheming yaml files right with the field definitions. This might mean extending IDatasetForm to allow defining a list of pages for dataset and resource metadata entry and implementing metadata page editing in core with selected fields updated using |
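A minimal sketch of what declaring such pages might look like. Note that `get_form_pages` and `fields_for_page` are hypothetical helpers invented for illustration, not existing IDatasetForm methods; the real extension point would need to be designed:

```python
# Hypothetical sketch: grouping schema fields into ordered metadata-entry
# pages, as an extended IDatasetForm might declare them. None of these names
# exist in CKAN today.

def get_form_pages():
    """Return ordered pages of dataset metadata entry, each listing the
    scheming field names it displays."""
    return [
        {"title": "Basics", "fields": ["title", "notes", "tags"]},
        {"title": "Publishing", "fields": ["license_id", "private"]},
    ]

def fields_for_page(pages, title):
    """Look up which fields a given page edits, so a per-page save can
    update only those fields."""
    for page in pages:
        if page["title"] == title:
            return page["fields"]
    raise KeyError(title)
```

Keeping the field-to-page mapping in data like this (rather than in templates) would let sites move fields between steps without touching the API representation of the dataset.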
-
Better Dataset Creation
It should be easier to adapt dataset creation for client-preferred workflows such as:
Problem 1. Upload legacy
Uploads are currently tied to our resource form, resource-metadata-editing APIs and the IUploader interface. This limits our workflows and mixes things that should be fast (edit resource metadata, return validation errors) with something that's always slow (synchronously upload a large file).
Decoupling uploads from resource metadata-editing at the action and IUploader interface layer will let our fast things be fast so we can build nicer interfaces. It would also allow one or more uploads to be attached to Groups, Organizations, Datasets and Users by only changing their schemas.
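The decoupling might look roughly like the following in-memory sketch. `upload_create` and `resource_update` here are hypothetical functions standing in for actions, and the dicts stand in for storage; none of this is CKAN's actual action API:

```python
# Hypothetical sketch: uploads stored independently of resources, then
# linked by id. Dicts stand in for real storage backends.
import uuid

UPLOADS = {}    # upload_id -> stored-file metadata
RESOURCES = {}  # resource_id -> resource metadata

def upload_create(owner_id, filename, size):
    """Accept a file independently of any resource form. The slow transfer
    happens here, detached from metadata editing."""
    upload_id = str(uuid.uuid4())
    UPLOADS[upload_id] = {"owner": owner_id, "filename": filename, "size": size}
    return upload_id

def resource_update(resource_id, metadata, upload_id=None):
    """Fast metadata edit; optionally link an already-completed upload
    by id instead of carrying the file through the form."""
    if upload_id is not None and upload_id not in UPLOADS:
        raise ValueError("unknown upload")
    rec = RESOURCES.setdefault(resource_id, {})
    rec.update(metadata)
    if upload_id is not None:
        rec["upload_id"] = upload_id
    return rec
```

Because the link is just an id in the schema, the same upload records could be attached to Groups, Organizations, Datasets or Users by changing their schemas alone, as suggested above.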
Problem 2. Slow table analysis
After an upload completes a background job or service is triggered to start processing the file. That service needs to put the job in a queue, download the whole file, analyze it, decide if it has changed, decide how to convert it into datastore columns and then load the data.
If we want to generate metadata from a file upload we have to wait for all of those steps to complete before the user can continue editing metadata.
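The steps above can be sketched as one pipeline. This is an illustration only, not xloader/datapusher internals; the naive float-based type check stands in for the real analysis:

```python
# Illustrative sketch of the server-side pipeline: hash the download,
# skip everything if the file is unchanged, otherwise analyze columns.
import csv
import hashlib
import io

def analyze_csv(data: bytes):
    """Infer column names and rough types from an uploaded CSV.
    A deliberately naive stand-in for the real analysis step."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    header, body = rows[0], rows[1:]

    def col_type(i):
        try:
            for row in body:
                float(row[i])
            return "numeric"
        except ValueError:
            return "text"

    return {name: col_type(i) for i, name in enumerate(header)}

def process_upload(data: bytes, previous_hash=None):
    """Download -> hash -> compare -> analyze; the expensive analysis only
    runs when the content hash has actually changed."""
    digest = hashlib.sha256(data).hexdigest()
    if digest == previous_hash:
        return {"changed": False, "hash": digest}
    return {"changed": True, "hash": digest, "columns": analyze_csv(data)}
```

The point of the sketch is the ordering: the user is blocked until `analyze_csv` finishes, which is exactly the wait the client-side approach below tries to remove.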
datapusher-plus improves this process by using qsv for the analysis to reliably determine column types very quickly, but it still follows this model, the same as xloader and datapusher.
Instead, can we analyze upload data on the client side before uploading it? If qsv were run as webassembly on the client side, the client's browser would know all the column details very quickly. This would allow one page on the front-end to lead immediately to metadata editing based on the file, before the file upload even starts.
If we calculate the hash on the client at the same time we could prevent duplicate uploads and verify upload integrity on the server after it's complete as well.
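A minimal sketch of that server-side check, assuming the client sends along the hash it computed while uploading. The chunked hashing mirrors how a WASM uploader could hash incrementally as bytes stream out:

```python
# Sketch: verify upload integrity by comparing a client-computed hash
# against one computed incrementally on the server as chunks arrive.
import hashlib

def sha256_stream(chunks):
    """Hash an upload chunk by chunk, without buffering the whole file."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

def verify_upload(chunks, client_hash):
    """Accept the upload only if the server's hash matches the client's;
    a mismatch means corruption in transit (or a stale client hash)."""
    return sha256_stream(chunks) == client_hash
```

The same hash, stored with the upload, is what lets the server answer "has this file changed?" without re-downloading or re-analyzing anything.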
Problem 3. Fixed dataset and resource forms
CKAN has one way of creating datasets:
IDatasetForm and ckanext-scheming let us customize the fields on these screens and the validation rules, but not change the workflow. Also, if two people have the same form loaded and both click "save", the changes from one user will be completely lost.
Instead of these two fixed forms we could have any number of forms that display part of a dataset's metadata and use `package_revise` to modify only the fields that are changed. We could experiment with saving changes on blur (when the user leaves a form field) instead of with a Save button, and refresh the form with any new validation errors immediately displayed. The original value of a field could be passed to `package_revise` so that one user can't accidentally overwrite another user's changes.

We would need a way to solve errors across forms (conditional validation across forms or required fields in unfilled forms), perhaps by saving field changes on the client and flipping between forms single-page-application-style. This would require a `dry_run` parameter on `package_revise` and a Save button that would apply all changes.

This type of dataset editing would enable custom workflows without any dirty hacks and could offer a smoother user experience.
References
Previous discussion: #6689
@smotornyuk @EricSoroos @TkTech would you be able to add something about the upload and metadata-editing work that you've done for your clients, and any lessons learned?