Replies: 10 comments 1 reply
-
Reposting WASM target from the qsv repo:
IMHO, this will really modernize CKAN's UI/UX experience, and greatly reduce data publishing friction. It can even incentivize usage and encourage users to populate their catalogs, even if they're not publishing the data and the datasets are private - with CKAN operating more as an Enterprise Data Management System/Data Exchange - as they get:
-
Beyond guaranteed data type inferences by qsv, the following can be done by WASM qsv on the client side:
Currently, these are all done by Datapusher+ in a performant manner, but as @wardi pointed out, this is still hampered by the legacy workflow. Doing these pre-processing activities on the client side makes more sense. It also saves on bandwidth. Additional pre-processing tasks that can be done on the client side with WASM qsv are:
-
As mentioned in the dev call, we've been using a small extension, Schema-Field-Groups (https://github.com/derilinx/ckanext-schema_field_groups/), on a few of our sites to provide a way to segment the metadata for a dataset. This provides a way to hide the less commonly used portions of the metadata while surfacing the common stuff on the first few tabs. In terms of resource-level metadata, I'd really like the data dictionary to be more available. We're currently using it in 'dataset at a time' sorts of things, but it would really be useful if it were pushed into Solr so that it could be used effectively in dataset listings and searches.
-
I've mentioned one of my projects that doesn't allow sharing the original implementation (at least, for now), but I've managed to extract a part of it into a separate plugin. This functionality allows fragment creation for the dataset/resource. I've mentioned use cases a bit similar to "resource-first dataset creation" and "multistep dataset form" in the "Where you can use it?" section of the plugin's readme. As for independent file entities, their implementation is a bit more low-level. I'll try to extract and share it in a few weeks. Or, if it stays as hacky as it is right now, I'll just start porting it into CKAN core.
-
The main drive for a resource-first approach (from my client) was to support quotas [and optionally billing]. The goals (so far as I remember them years later):
IIRC the idea to support the last two in ckanext-cloudstorage was to add
-
Follow-up and some further thinking on this: @jqnatividad has found some limitations to running
It might still be worth exploring developing an uploader in webassembly that can identify the column headings and compute a hash while uploading. That would be light enough to run on a single thread and could allow entering some column details while the data is loaded on the server side in the background. It could also prevent unnecessary updates (notify the user that the file hasn't changed before they submit, and prevent a re-upload).

So, we can't avoid some server-side processing of uploaded data, and it would be much nicer for that to happen separately from the resource form in some cases. If files can be uploaded and linked to the user by default then we could trigger background jobs, and with @EricSoroos's ckanext-schema_field_groups some view modifications could be used to present the multi-stage dataset creation form after the files are uploaded, without major changes to core.

But building a whole new set of views to handle new types of dataset creation would let users save their changes as they move through any kind of desired workflow. I believe this is what @smotornyuk means by using ckanext-flakes for a multistep dataset form. This would have to handle editing as well, and sync validation rules from ckanext-scheming (if we're still using that to define the schema).

I wonder though, with "real" separate forms for each stage of a multistep dataset form, would API users be willing to accept having their fields grouped in the same way in the metadata? I expect that moving fields around between the steps will be a common client requirement, so having the fields decoupled from the form in which they appear, to maintain API stability, would be desirable.

Having actual datasets as the "final product" and a separate form workflow for creating or editing them seems really nice, but there will be duplication of logic, and it will be tricky to share custom validators and form snippets if we're using ckanext-scheming to define the schemas and forms.
ckanext-scheming was really built with all the IDatasetForm and CKAN validation assumptions baked in. We might be better off building the feature into ckanext-scheming (i.e. extending the yaml to describe multiple forms and the default workflow) or using a completely different way to define the forms and validation rules.
-
On client-side schema inferring: if WASM is still not a viable option (I don't think it is as a default option for all users), we can definitely leverage previous work on ckanext-validation that used JS Frictionless Data components to infer the schema on the client. It might not be as accurate as
Everything in this demo is horrible looking and has been replaced with more modern alternatives (it's still CKAN 2.8), but it works as a proof of concept to get an idea of how it would work (code is here).
ckanext-validation.mp4
Perhaps we can use that as a first implementation, and once the WASM option becomes more widely supported we refactor the schema-inferring logic.
-
@amercader your approach seems to be the most practical at the moment. And yes, as you pointed out, we can still do some server-side inferencing with
As for WASM, qsv does have a sniff command that's lighter-weight than the
It doesn't offer the same inferencing guarantees as
IMHO, "perfect is the enemy of good", and an ideal upload workflow will necessarily involve informed experiments. And yes @wardi, it seems ckanext-scheming is the logical fulcrum point to really have "better dataset creation", as users can add the workflow/business-rule components to the yaml file.
-
@amercader that looks really nice. It should be straightforward to repurpose the code to pre-populate data dictionary values instead of ckanext-validation's JSON schema field, or maybe just to offer a preview of the data before it's uploaded right in the form, so that users know they chose the right file, i.e. select the file in the form and the columns and details appear immediately below. This could be the start of a generally improved resource uploader. If uploads are separated from the resource model we could even start uploading immediately in the background, with a slick little progress bar going while users enter other resource metadata.
-
@jqnatividad yes, I'm leaning that way too. It would be really nice to have the pages of metadata entry described in scheming yaml files right with the field definitions. This might mean extending IDatasetForm to allow defining a list of pages for dataset and resource metadata entry and implementing metadata page editing in core with selected fields updated using |
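A minimal sketch of what declaring such pages might look like. Note that `get_form_pages` and `fields_for_page` are hypothetical helpers invented for illustration, not existing IDatasetForm methods; the real extension point would need to be designed:

```python
# Hypothetical sketch: grouping schema fields into ordered metadata-entry
# pages, as an extended IDatasetForm might declare them. None of these names
# exist in CKAN today.

def get_form_pages():
    """Return ordered pages of dataset metadata entry, each listing the
    scheming field names it displays."""
    return [
        {"title": "Basics", "fields": ["title", "notes", "tags"]},
        {"title": "Publishing", "fields": ["license_id", "private"]},
    ]

def fields_for_page(pages, title):
    """Look up which fields a given page edits, so a per-page save can
    update only those fields."""
    for page in pages:
        if page["title"] == title:
            return page["fields"]
    raise KeyError(title)
```

Keeping the field-to-page mapping in data like this (rather than in templates) would let sites move fields between steps without touching the API representation of the dataset.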
-
Better Dataset Creation
It should be easier to adapt dataset creation for client-preferred workflows such as:
Problem 1. Upload legacy
Uploads are currently tied to our resource form, resource-metadata-editing APIs and the IUploader interface. This limits our workflows and mixes things that should be fast (edit resource metadata, return validation errors) with something that's always slow (synchronously upload a large file).
Decoupling uploads from resource metadata-editing at the action and IUploader interface layer will let our fast things be fast so we can build nicer interfaces. It would also allow one or more uploads to be attached to Groups, Organizations, Datasets and Users by only changing their schemas.
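The decoupling might look roughly like the following in-memory sketch. `upload_create` and `resource_update` here are hypothetical functions standing in for actions, and the dicts stand in for storage; none of this is CKAN's actual action API:

```python
# Hypothetical sketch: uploads stored independently of resources, then
# linked by id. Dicts stand in for real storage backends.
import uuid

UPLOADS = {}    # upload_id -> stored-file metadata
RESOURCES = {}  # resource_id -> resource metadata

def upload_create(owner_id, filename, size):
    """Accept a file independently of any resource form. The slow transfer
    happens here, detached from metadata editing."""
    upload_id = str(uuid.uuid4())
    UPLOADS[upload_id] = {"owner": owner_id, "filename": filename, "size": size}
    return upload_id

def resource_update(resource_id, metadata, upload_id=None):
    """Fast metadata edit; optionally link an already-completed upload
    by id instead of carrying the file through the form."""
    if upload_id is not None and upload_id not in UPLOADS:
        raise ValueError("unknown upload")
    rec = RESOURCES.setdefault(resource_id, {})
    rec.update(metadata)
    if upload_id is not None:
        rec["upload_id"] = upload_id
    return rec
```

Because the link is just an id in the schema, the same upload records could be attached to Groups, Organizations, Datasets or Users by changing their schemas alone, as suggested above.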
Problem 2. Slow table analysis
After an upload completes a background job or service is triggered to start processing the file. That service needs to put the job in a queue, download the whole file, analyze it, decide if it has changed, decide how to convert it into datastore columns and then load the data.
If we want to generate metadata from a file upload we have to wait for all of those steps to complete before the user can continue editing metadata.
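The steps above can be sketched as one pipeline. This is an illustration only, not xloader/datapusher internals; the naive float-based type check stands in for the real analysis:

```python
# Illustrative sketch of the server-side pipeline: hash the download,
# skip everything if the file is unchanged, otherwise analyze columns.
import csv
import hashlib
import io

def analyze_csv(data: bytes):
    """Infer column names and rough types from an uploaded CSV.
    A deliberately naive stand-in for the real analysis step."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    header, body = rows[0], rows[1:]

    def col_type(i):
        try:
            for row in body:
                float(row[i])
            return "numeric"
        except ValueError:
            return "text"

    return {name: col_type(i) for i, name in enumerate(header)}

def process_upload(data: bytes, previous_hash=None):
    """Download -> hash -> compare -> analyze; the expensive analysis only
    runs when the content hash has actually changed."""
    digest = hashlib.sha256(data).hexdigest()
    if digest == previous_hash:
        return {"changed": False, "hash": digest}
    return {"changed": True, "hash": digest, "columns": analyze_csv(data)}
```

The point of the sketch is the ordering: the user is blocked until `analyze_csv` finishes, which is exactly the wait the client-side approach below tries to remove.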
datapusher-plus improves this process by using qsv for the analysis to reliably determine column types very quickly, but it still follows this model, the same as xloader and datapusher.
Instead, can we analyze upload data on the client side before uploading it? If qsv were run as webassembly on the client side, the client's browser would know all the column details very quickly. This would allow one page on the front-end to lead immediately to metadata editing based on the file, before the file upload even starts.
If we calculate the hash on the client at the same time we could prevent duplicate uploads and verify upload integrity on the server after it's complete as well.
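A minimal sketch of that server-side check, assuming the client sends along the hash it computed while uploading. The chunked hashing mirrors how a WASM uploader could hash incrementally as bytes stream out:

```python
# Sketch: verify upload integrity by comparing a client-computed hash
# against one computed incrementally on the server as chunks arrive.
import hashlib

def sha256_stream(chunks):
    """Hash an upload chunk by chunk, without buffering the whole file."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

def verify_upload(chunks, client_hash):
    """Accept the upload only if the server's hash matches the client's;
    a mismatch means corruption in transit (or a stale client hash)."""
    return sha256_stream(chunks) == client_hash
```

The same hash, stored with the upload, is what lets the server answer "has this file changed?" without re-downloading or re-analyzing anything.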
Problem 3. Fixed dataset and resource forms
CKAN has one way of creating datasets:
IDatasetForm and ckanext-scheming let us customize the fields on these screens and the validation rules, but not change the workflow. Also, if two people have the same form loaded and both click "save", the changes from one user will be completely lost.
Instead of these two fixed forms we could have any number of forms that display part of a dataset's metadata and use `package_revise` to modify only the fields that are changed. We could experiment with saving changes on blur (when the user leaves a form field) instead of with a Save button, and refresh the form with any new validation errors immediately displayed. The original value of a field could be passed to `package_revise` so that one user can't accidentally overwrite another user's changes.

We would need a way to solve errors across forms (conditional validation across forms or required fields in unfilled forms), perhaps by saving field changes on the client and flipping between forms single-page-application-style. This would require a `dry_run` parameter on `package_revise` and a Save button that would apply all changes.

This type of dataset editing would enable custom workflows without any dirty hacks and could offer a smoother user experience.
References
Previous discussion: #6689
@smotornyuk @EricSoroos @TkTech would you be able to add something about the upload and metadata-editing work that you've done for your clients, and any lessons learned?