Tighten validation for id fields (followup) #8069

amercader · 2024-02-09T10:55:13Z

Followup to the great work by @shashigharti in #7868

This opened several cans of worms. Here's a summary of the work so far (still WIP):

| Entity | `empty_if_not_sysadmin` | `{entity}_id_does_not_exist` | `id_validator` (Valid UUID) |
| --- | --- | --- | --- |
| Dataset | existing | existing | added |
| Resource | n/a (see #8069 (comment)) | existing | existing |
| Resource View | added | added | added |
| Groups (org/groups) | added | added | added |
| User | added | added | added |
| Extras | added | n/a [1] | added |
| Activity | n/a [2] | n/a | n/a |
| Tag | n/a [3] | n/a | n/a |

[1] model_save uses the package_id/group_id + extra key to create / update extras rows. Besides, there is a single schema for dataset and group extras, so it's not easy to check for existing ids in the relevant table.

[2] Activity ids can not be set by any user via the API

[3] Tag ids can not be set by any user via the API

Two things to discuss

Id validator. The only version of this we have right now is the resource_id_validator which was introduced fairly recently:

def resource_id_validator(value):
    pattern = re.compile("[^0-9a-zA-Z _-]")
    if pattern.search(value):
        raise Invalid(_('Invalid characters in resource id'))
    if len(value) < 7 or len(value) > 100:
        raise Invalid(_('Invalid length for resource id'))
    return value

Do we want to stick with this for 2.11 or enforce true UUIDs going forward? Maybe we are more lenient (current validator) in some entities where people might have used some custom ones?

from uuid import UUID

def valid_uuid(value):
    try:
        UUID(value, version=4)
    except ValueError:
        raise Invalid(_("Invalid id provided"))

    return value

Decision: let's move forward with enforcing UUIDs. Users relying on custom ids can override their schemas to customize the validators used
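One caveat that may be worth checking against the `valid_uuid` snippet above: as far as I know, passing `version=4` to `uuid.UUID` does not reject other UUID versions; it overwrites the version and variant bits of the parsed value. Non-string input also raises something other than `ValueError`. Below is a minimal sketch of a stricter check. `Invalid` is a local stand-in for CKAN's exception and `strict_uuid4_validator` is a hypothetical name, not the validator merged in this PR.

```python
import uuid
from typing import Any


class Invalid(Exception):
    """Local stand-in for CKAN's Invalid exception (assumption)."""


def strict_uuid4_validator(value: Any) -> Any:
    """Hypothetical validator: accept only strings that parse as a v4 UUID."""
    try:
        # No `version` argument, so the version bits are read, not overwritten.
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Non-string input raises AttributeError or TypeError, not ValueError.
        raise Invalid("Invalid id provided")
    if parsed.version != 4:
        raise Invalid("Invalid id provided")
    # The value is returned unchanged (no case normalization).
    return value
```

With this sketch a version 1 style string such as `13857d04-a35b-1834-93f1-5dd98d401d44` is rejected, whereas `uuid.UUID(..., version=4)` would accept it and silently turn it into a version 4 value.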

Group and User schemas. The thing is a mess but basically the problem is that groups use the same schema for creation and updates so using a validator like empty_if_not_sysadmin is not going to work. There is an ~~insane~~ really confusing number of methods used to get the schemas, probably inherited from an ancient CKAN time (form_to_db_schema_options(), form_to_db_schema(), form_to_db_schema_api_create()...). Sean already simplified the dataset schemas a very long time ago, and it looks like we could do the same here. I'll keep digging.

Decision: Aim to simplify the methods on the group plugins and align them with the simplified dataset version. Check how that affects scheming. Need to check what's possible for users.


Following the pattern in the IDatasetForm, consolidate all schemas around three variants:

* `create_group_schema` (base)
* `update_group_schema`
* `show_group_schema`

All of them with defaults in `ckan/logic/schema.py`. The old methods have been kept for backwards compatibility.
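To illustrate the three-variant pattern, here is a rough, CKAN-free sketch. The validators (`not_empty`, `empty`), the `validate` helper and the single-argument validator signature are simplifications invented for the example; the real defaults live in `ckan/logic/schema.py` and use CKAN's own validators.

```python
from typing import Any, Callable, Dict, List

Validator = Callable[[Any], Any]
Schema = Dict[str, List[Validator]]


def not_empty(value: Any) -> Any:
    """Hypothetical validator: reject missing or empty values."""
    if value in (None, ""):
        raise ValueError("Missing value")
    return value


def empty(value: Any) -> Any:
    """Hypothetical validator: the field must not be supplied."""
    if value not in (None, ""):
        raise ValueError("This field cannot be set")
    return value


def create_group_schema() -> Schema:
    # Base schema: in this simplified sketch nobody may pass an id on create.
    return {"id": [empty], "name": [not_empty]}


def update_group_schema() -> Schema:
    # Start from the base and change only what differs on update.
    schema = create_group_schema()
    schema["id"] = [not_empty]
    return schema


def show_group_schema() -> Schema:
    # No id validation when showing a group.
    schema = create_group_schema()
    schema["id"] = []
    return schema


def validate(data: Dict[str, Any], schema: Schema) -> Dict[str, Any]:
    """Run each field's validators in order (not CKAN's real validate())."""
    result = {}
    for key, validators in schema.items():
        value = data.get(key)
        for validator in validators:
            value = validator(value)
        result[key] = value
    return result
```

Because update and show derive from the create schema, a plugin that customizes the base only has to override the fields that differ in the other two.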

Because we can not use different schemas for resources for update and create, the empty_if_not_sysadmin validator prevents resource updates for normal users.

amercader · 2024-02-20T12:22:11Z

Update on the current status:

* UUIDs are now enforced in id fields (main commit is c48bc71, with some followups just to fix tests)
* Simplified schemas handling in IGroupForm (8922684)
* User schemas just required minor tweaks. I'll leave simplifying them and implementing proper IUserForm support for a future PR
* Resources need a special case because of the way they are created and updated (several action paths that all end up in package_update). Package updates will use the update_package_schema, but it is difficult (impossible?) to use separate schemas for newly added resources (i.e. a create schema) and for existing ones (i.e. an update one), which might even be present in the same input data_dict. So we can not use the pattern of a create schema with an "id can not exist" / "empty if not sysadmin" validator and an update schema with an "id must exist" validator. For resources alone, I think we will need to remove the empty_if_not_sysadmin validator. The safeguards provided by resource_id_does_not_exist should be enough (this validator should actually be called resource_id_does_not_exist_in_another_dataset)
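The resource case can be sketched as follows: one package_update payload mixes existing and new resources, so the only check that works for both is "this id must not belong to another dataset". The in-memory `EXISTING_RESOURCES` mapping, the function name and the error text are hypothetical; the real validator queries the database.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the resource table: resource id -> dataset id.
EXISTING_RESOURCES: Dict[str, str] = {
    "c246f35c-7eae-4f05-b971-99ed3a878b07": "dataset-a",
    "13857d04-a35b-4834-93f1-5dd98d401d44": "dataset-b",
}


def check_resource_ids(
    package_id: str, resources: List[Dict[str, Any]]
) -> List[str]:
    """Return one error per resource whose id belongs to another dataset."""
    errors = []
    for resource in resources:
        resource_id = resource.get("id")
        if resource_id is None:
            # New resource without an id: nothing to check here.
            continue
        owner = EXISTING_RESOURCES.get(resource_id)
        if owner is not None and owner != package_id:
            errors.append(
                f"Resource id {resource_id} already exists in another dataset"
            )
    return errors
```

An existing resource of the same dataset and a brand-new resource both pass, so a single schema can serve creates and updates.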

smotornyuk · 2024-03-18T13:08:40Z

ckan/logic/validators.py

-        raise Invalid(_('Invalid length for resource id'))
+def uuid_validator(value: Any) -> Any:
+    try:
+        uuid.UUID(value, version=4)


UUIDs are case-insensitive. How about returning str(uuid.UUID(value, version=4)) to normalize the value? In this way, we'll avoid creating entries with the same UUID written in different cases (we have a similar problem for user emails)

💯 this. A decade ago we imported an organization list with uppercase UUIDs and now they're mixed with new orgs created with lowercase UUIDs. So untidy.

This makes sense, but if we change the value returned we might change the case of some existing upper case UUIDs when editing objects, wouldn't that cause potential issues?

To add to this, this would only be an issue for resources and extras (both use the same schema for creates and updates for reasons explained above). The rest of the entities don't use uuid_validator on updates. So if an old system created resource ids like 13857D04-A35B-4834-93F1-5DD98D401D44, when updating the resource in the new CKAN version it would change to 13857d04-a35b-4834-93f1-5dd98d401d44. Would that be an issue assuming it is mentioned in the changelog?
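For reference, this is what the normalization does to the example id, using only the standard library. The sketch omits the `version=4` argument from the proposal, which makes no difference for an id that is already a version 4 UUID.

```python
import uuid

old_id = "13857D04-A35B-4834-93F1-5DD98D401D44"
normalized = str(uuid.UUID(old_id))

print(normalized)            # 13857d04-a35b-4834-93f1-5dd98d401d44
print(normalized == old_id)  # False: plain string comparison is case-sensitive
print(uuid.UUID(normalized) == uuid.UUID(old_id))  # True: same UUID value
```

Anything that stores or compares ids as plain strings will treat the two spellings as different ids.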

Confirmed, Bad Things™ happen when running this code on existing upper case resource ids:

ckanapi action package_create name=test_resource owner_org=test
ckanapi action resource_create package_id=test_resource id=C246F35C-7EAE-4F05-B971-99ED3A878B07

# Switch to code running with the uuid_validator version normalizing the value:
ckanapi action resource_update id=C246F35C-7EAE-4F05-B971-99ED3A878B07 description=test

ERROR [ckan.logic] Resource C246F35C-7EAE-4F05-B971-99ED3A878B07 exists but it is not found in the package it should belong to.
[...]
  File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/action/update.py", line 119, in resource_update
    resource = _get_action('resource_show')(context, {'id': id})
  File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/__init__.py", line 578, in wrapped
    result = _action(context, data_dict, **kw)
  File "/home/adria/dev/pyenvs/ckan-py3.9/ckan/ckan/logic/action/get.py", line 1098, in resource_show
    raise NotFound(_('Resource was not found.'))
ckan.logic.NotFound: Resource was not found.

Looking at the DB, the existing upper case id resource was deleted and a new resource was created for the lower case version. There's a resource_show call at the end of resource_update which fails because it tries to search for the old upper case id (which is now deleted) in the dataset dict and fails.
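The failure mode can be reproduced with a toy model. This is not CKAN's save code, just a dict keyed by resource id with a hypothetical `update_with_normalized_id` helper that mimics "delete the old row, create one under the lower case id".

```python
import uuid
from typing import Any, Dict


def update_with_normalized_id(
    store: Dict[str, Dict[str, Any]], resource_id: str, **changes: Any
) -> str:
    """Toy update that saves the resource under the normalized id."""
    normalized = str(uuid.UUID(resource_id))
    resource = store.pop(resource_id)  # the row under the old id goes away
    resource.update(changes)
    resource["id"] = normalized
    store[normalized] = resource       # a row under the new id appears
    return normalized


old_id = "C246F35C-7EAE-4F05-B971-99ED3A878B07"
store = {old_id: {"id": old_id, "description": ""}}

new_id = update_with_normalized_id(store, old_id, description="test")

print(old_id in store)  # False: a follow-up lookup by the old id finds nothing
print(new_id in store)  # True: the data now lives under the lower case id
```

A caller that still holds the upper case id, like the final resource_show call described above, ends up with a "not found" result.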

So all in all, I don't think we can introduce this change without an extra database migration that lower cases all ids, and even then we might break external systems that rely on the existing upper case ids (although this might be fine if we document it)

So we're not going ahead with normalizing UUIDs for this iteration, right?

I think that if we want to normalize UUIDs we need to introduce a migration that changes all existing UUIDs to lowercase, or at least the resource ones to prevent the error I described, and clearly document it. I don't think it's worth doing given the potential for breakages. But what do you think?

Agreed, that clean up can be part of a future release. Maybe migrate our columns to the proper UUID type at the same time for better performance.

amercader · 2024-05-09T08:57:52Z

@wardi @smotornyuk any more comments on this?


amercader · 2024-05-14T10:44:55Z

Not sure why a ton of tests started failing after merging master 😭 I'll investigate

amercader · 2024-05-14T12:34:08Z

@smotornyuk these typing errors are related to the same changes we discussed in https://github.com/ckan/ckan/pull/7976/files#r1597518095

But in this case the methods like create_group_schema() are expected to be present. Do we need to add these methods to IGroupForm (right now they come from the DefaultGroupForm)? or just type: ignore?

amercader · 2024-05-15T11:22:56Z

All is green again, this is ready to go

wardi · 2024-05-15T13:08:45Z

changes/8069.migration

@@ -0,0 +1,3 @@
+* Only sysadmins can now set the ``id`` field of Datasets, Groups, Organizations, Users, Resource Views and Extras
+* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can uextend the relevant schema to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead.


Suggested change

-* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can uextend the relevant schema to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead.
+* If provided, the value of the ``id`` field needs to be a valid UUID v4 string. Sites using custom ids that are not UUIDs can extend the relevant schema or validate method to override the validation on the ``id`` field, but are strongly encouraged to use a separate custom field to store the custom id instead.

wardi · 2024-05-15T20:52:17Z

ckan/lib/plugins.py

    def form_to_db_schema_options(self,
                                  options: dict[str, Any]) -> dict[str, Any]:
-        ''' This allows us to select different schemas for different
-        purpose eg via the web interface or via the api or creation vs
+        ''' [Deprecated] This allows us to select different schemas for


Not only deprecated, these won't be called at all any more even as a fall-back, right?

(looks like ckanext-scheming uses the validate method of IGroupForm, so it won't be affected by the change)

wardi · 2024-05-15T20:57:11Z

ckan/logic/schema.py

+
+    schema = default_extras_schema()
+
+    schema["id"] = [ignore]


does this strip the ids out of the extras on show? If so I love it.

wardi · 2024-05-15T23:26:56Z

ckan/logic/validators.py

+
+    result = session.query(model.ResourceView).get(value)
+    if result:
+        raise Invalid(_('ResourceView id already exists'))


Note that these don't completely protect us. If a user sends multiple requests to create something with the same id, it's possible for one to overwrite the other. To avoid that we would need to pass a flag through to the save code so that Postgres only creates the record; then, on commit, we'd reliably get an error if there was a conflict on one of the id fields. That's going to be too late for adding an error to the validation, but at least we could prevent bad things from happening.

OTOH maybe all that's not worth the effort if the only users that can pass ids are sysadmins.
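A small sketch of the "create only, let the database reject conflicts" idea. It uses an in-memory SQLite database from the standard library instead of Postgres, and the table layout and `create_view` helper are made up for the example; CKAN's save code goes through SQLAlchemy and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource_view (id TEXT PRIMARY KEY, title TEXT)")


def create_view(view_id: str, title: str) -> bool:
    """Insert only; return False if another request already used this id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO resource_view (id, title) VALUES (?, ?)",
                (view_id, title),
            )
    except sqlite3.IntegrityError:
        # The primary key constraint caught the duplicate id.
        return False
    return True
```

Unlike a "query first, then save" validator, the constraint is checked by the database at write time, so two concurrent requests with the same id can not both succeed and the second one never overwrites the first.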

shashigharti and others added 21 commits October 19, 2023 08:31

Add validation and test to Resource

0ff9e93

Fix for failing test

482b796

Rename resource_id_validator function to id_validator

19937f0

Add validation for Resource View

7669403

Add validator to check unique id value

df1e66f

Add validation to group

fd771ce

change test name

1b580c3

Added validation for package extra

3308546

Added id validation for group extras

7d6c4de

Add validation in tag

f08e4d8

Add id validator for activity

946a07d

Merge branch '7604/tighten-validation-for-id-fields' of https://githu…

a230be1

…b.com/shashigharti/ckan into shashigharti-7604/tighten-validation-for-id-fields

Add id_validator to datasets

4f85b61

Tag ids can not be set by users

0c9daaa

Activity ids can not be set via the API

dcf84e8

Consolidate order of the validators

393e17f

Separate validator for user id

e3d6200

Separate schema for extras show

0ed63cc

Better message for id validator

bb8e533

Prevent further validation on empty_if_not_sysadmin

9fbd0b9

No id validation on group show

18c19f5

amercader marked this pull request as draft February 9, 2024 10:56

amercader added 8 commits February 9, 2024 12:33

User id or name when updating

09c00c9

Fix validators for resource update

6af7cd0

Fix id length in test

ea1d824

Enforce actual UUIDs as ids

c48bc71

Fix more uuid tests

22e4cb5

Fix more uuid tests

24da1f1

Fix schema for user form

3eca61b

amercader added 2 commits February 20, 2024 13:10

Remove empty_if_not_sysadmin from resource ids

5868a7b

Because we can not use different schemas for resources for update and create, this validator prevents resource updates for normal users.

Lint

920a8df

amercader marked this pull request as ready for review February 20, 2024 12:57

wardi assigned wardi, smotornyuk and EricSoroos Feb 20, 2024

amercader added this to the CKAN 2.11 milestone Feb 20, 2024

Add changelog

dc6bbad

smotornyuk reviewed Mar 18, 2024


amercader mentioned this pull request Apr 4, 2024

Tighten validation for id fields #7868

Closed

4 tasks

Merge branch 'master' into shashigharti-7604/tighten-validation-for-i…

d9968bb

…d-fields

Re-add schema calls lost during last merge

48cde8e

amercader added 2 commits May 15, 2024 13:03

Add schema methods to IGroupForm

0c7daf9

lint

3758551

wardi reviewed May 15, 2024
