New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tighten validation for id fields (followup) #8069
base: master
Are you sure you want to change the base?
Tighten validation for id fields (followup) #8069
Conversation
…b.com/shashigharti/ckan into shashigharti-7604/tighten-validation-for-id-fields
Following the pattern in the IDatasetForm, consolidate all schemas around three variants: * `create_group_schema` (base) * `update_group_schema` * `show_group_schema` All of them with defaults in `ckan/logic/schema.py`. The old methods have been kept for backwards compatibility.
Because we can not use different schemas for resources for update and create, this validators prevents resource updates for normal users.
Update on the current status:
|
raise Invalid(_('Invalid length for resource id')) | ||
def uuid_validator(value: Any) -> Any: | ||
try: | ||
uuid.UUID(value, version=4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUIDs are case-insensitive. How about returning str(uuid.UUID(value, version=4))
to normalize the value? In this way, we'll avoid creating entries with the same UUID written in different cases (we have a similar problem for user emails)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 this. A decade ago we imported an organization list with uppercase UUIDs and now they're mixed with new orgs created with lowercase UUIDs. So untidy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, but if we change the value returned we might change the case of some existing upper case UUIDs when editing objects, wouldn't that cause potential issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add to this, this would only be an issue for resources and extras (both use the same schema for creates and updates for reasons explained above). The rest of entities don't use uuid_validator
on updates. So if an old system created resource ids like 13857D04-A35B-4834-93F1-5DD98D401D44
, when updating the resource in the new CKAN version it would change to 13857d04-a35b-4834-93f1-5dd98d401d44
. Would that be an issue assuming it is mentioned in the changelog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed, Bad Things™ happen when running this code in existing upper case resource ids:
ckanapi action package_create name=test_resource owner_org=test
ckanapi action resource_create package_id=test_resource id=C246F35C-7EAE-4F05-B971-99ED3A878B07
# Switch to code running with the uuid_validator version normalizing the value:
ckanapi action resource_update id=C246F35C-7EAE-4F05-B971-99ED3A878B07 description=test
ERROR [ckan.logic] Resource C246F35C-7EAE-4F05-B971-99ED3A878B07 exists but it is not found in the package it should belong to.
[...]
File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/action/update.py", line 119, in resource_update
resource = _get_action('resource_show')(context, {'id': id})
File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/__init__.py", line 578, in wrapped
result = _action(context, data_dict, **kw)
File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/action/get.py", line 1098, in resource_show
raise NotFound(_('Resource was not found.'))
ckan.logic.NotFound: Resource was not found.
Looking at the DB, the existing upper case id resource was deleted and a new resource was created for the lower case version. There's a resource_show
call at the end of resource_update
which fails because it tries to search for the old upper case id (which is now deleted) in the dataset dict and fails.
So all in all, I don't think we can introduce this change with an extra database migration that lower cases all ids, and even then we might break external systems that rely on the existing upper case ids (although this might be fine if we document it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're not going ahead with normalizing UUIDs for this iteration, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if we want to normalize UUIDs we need to introduce a migration that changes all existing UUIDS to lowercase, or at least the resource ones to prevent the error I described, and clearly document it. I don't think its worth doing it given the potential for breakages. But what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, that clean up can be part of a future release. Maybe migrate our columns to the proper UUID type at the same time for better performance.
@wardi @smotornyuk any more comments on this? |
Not sure why a ton of tests started failing after merging master 😭 I'll investigate |
@smotornyuk these typing errors are related to the same changes we discussed in https://github.com/ckan/ckan/pull/7976/files#r1597518095 But in this case the methods like |
All is green again, this is ready to go |
@@ -0,0 +1,3 @@ | |||
* Only sysadmins can now set the ``id`` field of Datasets, Groups, Organizations, Users, Resource Views and Extras | |||
* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can uextend the relevant schema to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can uextend the relevant schema to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead. | |
* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can extend the relevant schema or validate method to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead. |
def form_to_db_schema_options(self, | ||
options: dict[str, Any]) -> dict[str, Any]: | ||
''' This allows us to select different schemas for different | ||
purpose eg via the web interface or via the api or creation vs | ||
''' [Deprecated] This allows us to select different schemas for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only deprecated, these won't be called at all any more even as a fall-back, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(looks like ckanext-scheming uses the validate
method of IGroupForm
it won't be affected by the change)
|
||
schema = default_extras_schema() | ||
|
||
schema["id"] = [ignore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this strip the ids out of the extras on show? If so I love it.
|
||
result = session.query(model.ResourceView).get(value) | ||
if result: | ||
raise Invalid(_('ResourceView id already exists')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that these don't completely protect us. If a user is sending multiple requests to create something with the same id it's possible for one to overwrite the other unless we pass a flag through to the save code to have postgres only create the record, on commit we'll reliably get an error if there was a conflict on one of the id fields. That's going to be too late for adding an error to the validation but at least we could prevent bad things from happening.
OTOH maybe all that's not worth the effort if the only users that can pass ids are sysadmins.
Fixes #7604
Followup to the great work by @shashigharti in #7868
This opened several cans of worms, here's a summary of the work so far (still WIP)
empty_if_not_sysadmin
{entity}_id_does_not_exist
id_validator
(Valid UUID)[1] model_save uses the package_id/group_id + extra key to create / update extras rows, besides there's one
single schema for dataset and group extras so it's not easy to check for existing ids in the relevant table
[2] Activity ids can not be set by any user via the API
[3] Tag ids can not be set by any users via the API
Two things to discuss
resource_id_validator
which was introduced fairly recently:Do we want to stick with this for 2.11 or enforce true UUIDs going forward? Maybe we are more lenient (current validator) in some entities where people might have used some custom ones?
Decision: let's move forward with enforcing UUIDs. Users relying on custom ids can override their schemas to customize the validators used
empty_if_not_sysadmin
is not going to work. There is aninsanereally confusing number of methods used to get the schemas, probably inherited from an ancient CKAN time (form_to_db_schema_options()
,form_to_db_schema()
,form_to_db_schema_api_create()
...). Sean already simplified the dataset schemas a very long time ago, and it looks like we could do the same here. I'll keep digging.Decision: Aim to simplify the methods on the group plugins and align them with the simplified dataset version. Check how that affects scheming. Need to check what's possible for users.