Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tighten validation for id fields (followup) #8069

Open
wants to merge 36 commits into
base: master
Choose a base branch
from

Conversation

amercader
Copy link
Member

@amercader amercader commented Feb 9, 2024

Fixes #7604

Followup to the great work by @shashigharti in #7868

This opened several cans of worms, here's a summary of the work so far (still WIP)

Entity empty_if_not_sysadmin {entity}_id_does_not_exist id_validator (Valid UUID)
Dataset existing existing added
Resource n/a (see #8069 (comment)) existing existing
Resource View added added added
Groups (org/groups) added added added
User added added added
Extras added n/a [1] added
Activity n/a [2] n/a n/a
Tag n/a [3] n/a n/a

[1] model_save uses the package_id/group_id + extra key to create / update extras rows, besides there's one
single schema for dataset and group extras so it's not easy to check for existing ids in the relevant table

[2] Activity ids can not be set by any user via the API

[3] Tag ids can not be set by any users via the API

Two things to discuss

  1. Id validator. The only version of this we have right now is the resource_id_validator which was introduced fairly recently:
def resource_id_validator(value):
    pattern = re.compile("[^0-9a-zA-Z _-]")
    if pattern.search(value):
        raise Invalid(_('Invalid characters in resource id'))
    if len(value) < 7 or len(value) > 100:
        raise Invalid(_('Invalid length for resource id'))
    return value

Do we want to stick with this for 2.11 or enforce true UUIDs going forward? Maybe we are more lenient (current validator) in some entities where people might have used some custom ones?

from uuid import UUID

def valid_uuid(value)
    try:
        UUID(value, version=4)
    except ValueError:
        raise Invalid(_("Invalid id provided"))

    return value

Decision: let's move forward with enforcing UUIDs. Users relying on custom ids can override their schemas to customize the validators used

  1. Group and User schemas. The thing is a mess but basically the problem is that groups use the same schema for creation and updates so using a validator like empty_if_not_sysadmin is not going to work. There is an insane really confusing number of methods used to get the schemas, probably inherited from an ancient CKAN time (form_to_db_schema_options(), form_to_db_schema(), form_to_db_schema_api_create()...). Sean already simplified the dataset schemas a very long time ago, and it looks like we could do the same here. I'll keep digging.

Decision: Aim to simplify the methods on the group plugins and align them with the simplified dataset version. Check how that affects scheming. Need to check what's possible for users.

@amercader amercader marked this pull request as draft February 9, 2024 10:56
Following the pattern in the IDatasetForm, consolidate all schemas
around three variants:

* `create_group_schema` (base)
* `update_group_schema`
* `show_group_schema`

All of them with defaults in `ckan/logic/schema.py`. The old methods
have been kept for backwards compatibility.
Because we can not use different schemas for resources for update and
create, this validators prevents resource updates for normal users.
@amercader
Copy link
Member Author

Update on the current status:

  • UUIDs are now enforced in id fields (main commit is c48bc71, with some followups just to fix tests)
  • Simplified schemas handling in IGroupForm (8922684)
  • User schemas just required minor tweaks, I'll leave for a future PR simplifying them and implementing proper IUserForm support
  • Resources need a special case, because of the way resources are created and updated (several action paths that all end up in package_update). Package updates will use the update_package_schema but it is difficult (impossible?) to use separate schemas for newly added resources (i.e. a create schema) and for existing ones (i.e. an update one), which might even be present in the same input data_dict. So we can not use the pattern of using a create schema with a "id can not exist" / "empty if not sysadmin" validator and an update schema with "id must exist" validator. So I think that for resources alone, we will need to remove the empty_if_not_sysadmin validator. I think that the safeguards provided by resource_id_does_not_exist should be safe enough (This validator should actually be called resource_id_does_not_exist_in_another_dataset)

@amercader amercader marked this pull request as ready for review February 20, 2024 12:57
@amercader amercader added this to the CKAN 2.11 milestone Feb 20, 2024
raise Invalid(_('Invalid length for resource id'))
def uuid_validator(value: Any) -> Any:
try:
uuid.UUID(value, version=4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUIDs are case-insensitive. How about returning str(uuid.UUID(value, version=4)) to normalize the value? In this way, we'll avoid creating entries with the same UUID written in different cases (we have a similar problem for user emails)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 this. A decade ago we imported an organization list with uppercase UUIDs and now they're mixed with new orgs created with lowercase UUIDs. So untidy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, but if we change the value returned we might change the case of some existing upper case UUIDs when editing objects, wouldn't that cause potential issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add to this, this would only be an issue for resources and extras (both use the same schema for creates and updates for reasons explained above). The rest of entities don't use uuid_validator on updates. So if an old system created resource ids like 13857D04-A35B-4834-93F1-5DD98D401D44, when updating the resource in the new CKAN version it would change to 13857d04-a35b-4834-93f1-5dd98d401d44. Would that be an issue assuming it is mentioned in the changelog?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed, Bad Things™ happen when running this code in existing upper case resource ids:

ckanapi action package_create name=test_resource owner_org=test

ckanapi action resource_create package_id=test_resource id=C246F35C-7EAE-4F05-B971-99ED3A878B07

# Switch to code running with the uuid_validator version normalizing the value:

ckanapi action resource_update id=C246F35C-7EAE-4F05-B971-99ED3A878B07 description=test

ERROR [ckan.logic] Resource C246F35C-7EAE-4F05-B971-99ED3A878B07 exists but it is not found in the package it should belong to.
[...]
  File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/action/update.py", line 119, in resource_update                                 
    resource = _get_action('resource_show')(context, {'id': id})     
  File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/__init__.py", line 578, in wrapped                                              
    result = _action(context, data_dict, **kw)                       
  File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/action/get.py", line 1098, in resource_show                                     
    raise NotFound(_('Resource was not found.'))                     
ckan.logic.NotFound: Resource was not found.  

Looking at the DB, the existing upper case id resource was deleted and a new resource was created for the lower case version. There's a resource_show call at the end of resource_update which fails because it tries to search for the old upper case id (which is now deleted) in the dataset dict and fails.

So all in all, I don't think we can introduce this change with an extra database migration that lower cases all ids, and even then we might break external systems that rely on the existing upper case ids (although this might be fine if we document it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're not going ahead with normalizing UUIDs for this iteration, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if we want to normalize UUIDs we need to introduce a migration that changes all existing UUIDS to lowercase, or at least the resource ones to prevent the error I described, and clearly document it. I don't think its worth doing it given the potential for breakages. But what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that clean up can be part of a future release. Maybe migrate our columns to the proper UUID type at the same time for better performance.

@amercader amercader mentioned this pull request Apr 4, 2024
4 tasks
@amercader
Copy link
Member Author

@wardi @smotornyuk any more comments on this?

@amercader
Copy link
Member Author

Not sure why a ton of tests started failing after merging master 😭 I'll investigate

@amercader
Copy link
Member Author

@smotornyuk these typing errors are related to the same changes we discussed in https://github.com/ckan/ckan/pull/7976/files#r1597518095

But in this case the methods like create_group_schema() are expected to be present. Do we need to add these methods to IGroupForm (right now they come from the DefaultGroupForm)? or just type: ignore?

@amercader
Copy link
Member Author

All is green again, this is ready to go

@@ -0,0 +1,3 @@
* Only sysadmins can now set the ``id`` field of Datasets, Groups, Organizations, Users, Resource Views and Extras
* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can uextend the relevant schema to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead.
Copy link
Contributor

@wardi wardi May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can uextend the relevant schema to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead.
* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can extend the relevant schema or validate method to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead.

def form_to_db_schema_options(self,
options: dict[str, Any]) -> dict[str, Any]:
''' This allows us to select different schemas for different
purpose eg via the web interface or via the api or creation vs
''' [Deprecated] This allows us to select different schemas for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only deprecated, these won't be called at all any more even as a fall-back, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(looks like ckanext-scheming uses the validate method of IGroupForm it won't be affected by the change)


schema = default_extras_schema()

schema["id"] = [ignore]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this strip the ids out of the extras on show? If so I love it.


result = session.query(model.ResourceView).get(value)
if result:
raise Invalid(_('ResourceView id already exists'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that these don't completely protect us. If a user is sending multiple requests to create something with the same id it's possible for one to overwrite the other unless we pass a flag through to the save code to have postgres only create the record, on commit we'll reliably get an error if there was a conflict on one of the id fields. That's going to be too late for adding an error to the validation but at least we could prevent bad things from happening.

OTOH maybe all that's not worth the effort if the only users that can pass ids are sysadmins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tighten validation for id fields
5 participants