Decode and deduplicate tags during ingestion #4143

Draft
obulat wants to merge 2 commits into main from fix/decode-and-deduplicate-tags-during-ingestion
Conversation

obulat
Contributor

@obulat obulat commented Apr 17, 2024

Fixes

Fixes #4125 by @obulat

Description

This is an alternative to #4142 that decodes and deduplicates tags in the cleanup process during ingestion instead of the API serializer. I opened it as a reply to @sarayourfriend's comment in #4142.

This might make the cleanup process longer than usual, so I would prefer to add these changes once we have the process of saving the updated data to a TSV (#3912) and updating the upstream catalog with it (#3415) from the Data normalization project.

It is important to fix the encoding quickly because it can cause a gray Nuxt error screen for pages that contain tags with character sequences that cannot be URI-encoded (such as udadd from "ciudaddelassiencias").
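For concreteness, here is a minimal sketch of what "decode and deduplicate" means here (the helper names are illustrative, not the actual ingestion-server cleanup code):

```python
def decode_tag_name(name: str) -> str:
    # Some tag names were stored with literal escape sequences such as
    # "caf\xc3\xa9" (escaped UTF-8 bytes). Decode the escapes and reassemble
    # the bytes as UTF-8; on any failure (for example a sequence that decodes
    # to a lone surrogate), return the input unchanged so a single bad tag
    # cannot break the whole batch.
    if "\\x" not in name and "\\u" not in name:
        return name
    try:
        return (
            name.encode("ascii", "backslashreplace")
            .decode("unicode_escape")
            .encode("latin1")
            .decode("utf-8")
        )
    except (UnicodeDecodeError, UnicodeEncodeError):
        return name


def clean_tags(tags: list[dict] | None) -> list[dict] | None:
    # Decode every name and drop case-insensitive duplicates, keeping the
    # first occurrence (and therefore its provider/accuracy metadata).
    if not tags:
        return tags
    seen: set[str] = set()
    cleaned = []
    for tag in tags:
        name = decode_tag_name(tag.get("name", "")).strip()
        key = name.lower()
        if not name or key in seen:
            continue
        seen.add(key)
        cleaned.append({**tag, "name": name})
    return cleaned
```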

Update

I realized that we need to fix the incorrect decoding of combinations such as udada in the frontend in any case, to fix all strings and not only the tags. That's why I opened #4144.

So the question is now: should we add tag decoding/deduplication to the cleanup process during data refresh now, or only after the data normalization process is merged, @WordPress/openverse-catalog? I'm afraid this update could make the data refresh process too long, so it might be better to finish the data normalization project and only add this cleanup after the other cleanup processes are removed.

Outdated alternatives, for the record

There are three alternatives for fixing this:
1. Add one more fix to the frontend. We already do some decoding in the frontend, so we can add one more step there. Then, after the data normalization project pieces are in place, we can add the decoding/deduplicating of tags to the tag cleanup (this PR). After updating the upstream catalog data, we can remove the decoding/deduplicating from both the frontend and the ingestion.
2. Add the decoding and deduplicating of tags to the API serializer, and remove the frontend decoding/deduplicating. After the data normalization process is ready, add the decoding/deduplicating of tags to the tag cleanup (this PR). After updating the upstream catalog data, we can remove the decoding/deduplicating from both the API and the ingestion.
3. Only add the cleanup to the ingestion server. If the data cleanup process does not become too much longer, use it until the data normalization pieces are ready, and remove the cleanup steps from the frontend.

I would prefer alternative 3, but I'm a bit hesitant because I don't know how much longer the data refresh would become.
After writing out these steps, I think I'd prefer alternative 1 because it requires fewer changes (we already do the cleanup in the frontend).
What do you think?

Testing Instructions

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Apr 17, 2024
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Apr 17, 2024
@obulat obulat force-pushed the fix/decode-and-deduplicate-tags-during-ingestion branch from 227569d to 368e00d on April 17, 2024 08:36
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 17, 2024
@sarayourfriend
Contributor

It occurs to me after reading what you said about the data refresh: at the point at which we are adding this to the cleanup, is it actually faster to do the cleanup in the catalog once and for all?

A hot-fix on the frontend is, I think, preferred. These searches are behind unstable parameters, so I don't feel like we need to fix them in the API now (it's an edge-case bug, okay, that's fine, the method is labelled as potentially weird/changing/etc).

If the frontend hot-fix holds us over to be able to do the clean up in the catalog, then it wouldn't affect data refresh timing, right?

I can't remember if this is a very hard task, though, doing an iterative cleanup of records in the catalog. Would it take longer to do in the catalog than in a single data refresh? I guess that's my main question.

Curious to hear from @WordPress/openverse-catalog, but based on what you've said, I think putting the simpler fix into the frontend, where we already decode anyway (and maybe pass the tag list through Set to deduplicate?) is a better stop-gap if we can then just do this at the catalog level, even if it takes a couple of weeks for a DAG to do the iterations.

@obulat
Contributor Author

obulat commented Apr 17, 2024

It occurs to me after reading what you said about the data refresh: at the point at which we are adding this to the cleanup, is it actually faster to do the cleanup in the catalog once and for all?

Cleaning up the catalog is very difficult because we have so much data. Even selecting items based on fields other than the indexed fields can be extremely slow. That's why @krysal is working on the data normalization project: it will save the data from the cleanup process during ingestion in the shape of identifier,<cleaned field>. Selecting rows by identifier and updating them should be faster because identifier is indexed in the catalog.
Once the two pieces (saving the cleaned data from ingestion, and updating the catalog using the saved TSV from step 1) are ready, we can do a faster catalog cleanup once and for all. Then we can remove the hot-fixes from the frontend and the API.
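To illustrate, the second piece could look roughly like this (the table and column names are assumptions, not the real DAG or schema): read the identifier,<cleaned field> TSV and apply it in batches, so every write goes through the indexed identifier column and each transaction stays short.

```python
import csv

BATCH_SIZE = 10_000


def apply_cleaned_tags(conn, tsv_path: str) -> None:
    # conn is a DB-API connection to the catalog database (e.g. psycopg2).
    with open(tsv_path, newline="") as tsv, conn.cursor() as cur:
        batch = []
        for identifier, cleaned_tags_json in csv.reader(tsv, delimiter="\t"):
            batch.append((cleaned_tags_json, identifier))
            if len(batch) >= BATCH_SIZE:
                _flush(cur, batch)
                conn.commit()  # commit per batch: short transactions, short-lived row locks
                batch.clear()
        if batch:
            _flush(cur, batch)
            conn.commit()


def _flush(cur, batch):
    # Each UPDATE touches exactly one row via the identifier index, so the
    # total cost scales with the number of rows that actually changed.
    cur.executemany(
        "UPDATE image SET tags = %s::jsonb WHERE identifier = %s",
        batch,
    )
```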

A hot-fix on the frontend is, I think, preferred. These searches are behind unstable parameters, so I don't feel like we need to fix them in the API now (it's an edge-case bug, okay, that's fine, the method is labelled as potentially weird/changing/etc).

I looked through the code again, and agree with you now. I opened the frontend PR #4144, and will close the API one. Thank you for reviewing and improving it! I've added some changes you suggested to the frontend fix.

If the frontend hot-fix holds us over to be able to do the clean up in the catalog, then it wouldn't affect data refresh timing, right?

I don't understand what "hot-fix holds us over" means here, sorry. The data refresh timing will not be affected by the frontend hot fix, but the fix in this PR will affect it. That's why I think we should not add these changes until we get the data normalization working, after which we will only need to run this cleanup once during data refresh.

@sarayourfriend
Contributor

sarayourfriend commented Apr 17, 2024

I don't understand what "hot-fix holds us over" means here, sorry. The data refresh timing will not be affected by the frontend hot fix, but the fix in this PR will affect it. That's why I think we should not add these changes until we get the data normalization working, after which we will only need to run this cleanup once during data refresh.

I'm not familiar with the details of the data normalisation project, sorry if my ignorance here is causing confusing suggestions. I just mean that the frontend fix is, to me, an acceptable temporary solution provided we are able to fix this at the catalog level once, rather than doing it on every data refresh.

Cleaning up the catalog is very difficult because we have so much data

The data refresh deals with the same amount of data, doesn't it? And does so by iterating through all fields using the identifier as the key. Couldn't the catalog do the same, accepting that a large number of rows would have noops for this process (rather than trying to exclude them from the query which isn't possible due to the lack of usable indexes)? Like I said, I'm not familiar with the data normalisation project, but why isn't it possible to iterate through all records in the catalog (rather than trying to select the relevant records) and apply a transformation in place, using the identifier as the key field (which is indexed)? Maybe this question is covered in the data normalisation project, so I will take some time to read up on it today or early next week.

@krysal
Member

krysal commented Apr 17, 2024

@obulat I'll prioritize #3912, and it will be my next issue to work on.

@sarayourfriend The root of the problem is that the catalog DB doesn't have indices like the API DB, so it's much slower to find and update rows. It's possible to do it still, but it takes much longer when we don't have the identifiers of the rows we want to update in advance.

@sarayourfriend
Contributor

sarayourfriend commented Apr 17, 2024

It's possible to do it still, but it takes much longer when we don't have the identifiers of the rows we want to update in advance.

Understood. But the ingestion server approach iterates through every record every data refresh as it is. Is the data refresh able to do this faster than a DAG, for some reason? Even if it's slow for some other reason, doing it once in the catalog is surely less time and CPU cycles even in the near term than doing it every single data refresh? Of course it is nice to avoid iterating through rows we know don't need transformations, but the data refresh doesn't do that either, it checks each and every record in Python (only skipping, I suppose, the relatively small number of deindexed works).

@obulat
Contributor Author

obulat commented Apr 18, 2024

Understood. But the ingestion server approach iterates through every record every data refresh as it is. Is the data refresh able to do this faster than a DAG, for some reason?

I believe that we don't have to worry about writes and database locks with data refresh, whereas the catalog database should always be writeable by DAGs.

@AetherUnbound
Contributor

@obulat and @krysal - is this still something we want to accomplish in this manner in the near term? Should we prioritize this before the removal of the ingestion server?

@stacimc
Contributor

stacimc commented May 27, 2024

Jumping in here, so apologies if I'm missing a lot of context -- is this still broadly the plan for deploying these changes:

we can first add this step to the data refresh process' cleanup step, save the cleaned up data, run the csv batched update dag to add this data to the catalog, and then remove this cleanup from the data refresh.

(Taken from this comment.)

Just checking that the intention is for this to be included for one data refresh and then removed (which I think addresses some of @sarayourfriend's concerns). This is also necessary in order not to break with our stated goals of removing cleanup steps from the data refresh in the Ingestion Server Removal project, per @AetherUnbound's point about prioritizing this quickly.

@krysal
Member

krysal commented May 27, 2024

@obulat and @krysal - is this still something we want to accomplish in this manner in the near term? Should we prioritize this before the removal of the ingestion server?

Yes, and yes, for the reasons that @stacimc stated, but to clarify, we expect all the cleanup steps in the ingestion server to become unnecessary once this milestone is done. @obulat has also warned us about adding this additional step before more progress is made, because the data refresh would potentially take much longer to complete; this is why the PR is in draft mode.

@obulat obulat force-pushed the fix/decode-and-deduplicate-tags-during-ingestion branch from a4ddb03 to 54883d6 on May 30, 2024 04:56
@obulat obulat force-pushed the fix/decode-and-deduplicate-tags-during-ingestion branch from 54883d6 to 38a60ad on May 31, 2024 07:24
@sarayourfriend
Contributor

I believe that we don't have to worry about writes and database locks with data refresh, whereas the catalog database should always be writeable by DAGs.

I don't want to derail this, but this has come up so many times, and I'm very confused by some of the decisions that have been made. I've stayed out of the discussion for lack of familiarity with the catalog database, but I recently learned about the batched update DAG, and am more confused than ever about the reasons and rationale I've heard for avoiding updates to data in the catalog database, or why there is so much complexity in these types of operations on the data. I do not understand where the problem is coming from, what causes locking to the table, or how any locking that does exist could be a problem, if the batched update DAG works (which we know it does).

My understanding is that the batched update DAG iterates through the catalog database in batches, and applies transformations to rows. It does not lock a table to update a row (keyed by an indexed field, such as the primary key), otherwise the batched update DAG would never work. Furthermore, Postgres doesn't even have locks on writes, and you have to explicitly request a lock if you want one. Even when a row is locked, it is only locked to other writers, not to queryers (https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS).

Provided we aren't making changes to the columns or data types, what is the specific limitation we are running up against when making changes to the catalog database that causes so much time and effort to be spent on what appear to be, on their face, trivial data transformation tasks?

The existence of this limitation has been referenced many times, but I've never seen a clear explanation for why iterating through the catalog database and updating rows one by one would cause any problems. I've recently learned about the batched update DAG, and am left even more confused, because it clearly is possible to iterate through the catalog database and make updates to rows. Sometimes locks are referenced, sometimes a lack of indexes is referenced, but both of these seem to misunderstand the nature of the problem. There are no indexes that can cause the types of problems described, and there is no such problem of locking in Postgres that would affect this particular problem (iterating through each row individually or in batches, and making changes to the row, using the primary key as the identifier).

How much complexity are we adding to which parts of our system, and how many hoops are we jumping through, to accomplish something that could be a simple Python operator that takes a row and gives back the cleaned version, run by a DAG that goes through each work one by one? Even if it took a week for each pass (which I doubt, considering the data refresh doesn't take nearly that long, but we could go slower to avoid hogging resources on the catalog box), I don't see where the limitation is coming from that prevents this from working.

The batched update DAG seems to prove that it's possible, and that the overhead of it is non-existent or negligible. What am I missing? What prevents us from moving these in-Python transformations from the ingestion server into a Python operator?
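For what it's worth, here is a minimal sketch of the pattern I mean (the table and column names are assumptions, and clean_tags stands in for whatever per-row transformation is needed): walk the table in primary-key order, transform each row in Python, and write back only the rows that changed, keyed on the indexed identifier.

```python
import json


def iterate_and_clean(conn, transform, batch_size: int = 5_000) -> None:
    # Keyset pagination over the primary key: each SELECT and each batch of
    # UPDATEs is a short transaction, and only the touched rows are ever locked.
    last_pk = 0
    with conn.cursor() as cur:
        while True:
            cur.execute(
                "SELECT id, identifier, tags FROM image"
                " WHERE id > %s ORDER BY id LIMIT %s",
                (last_pk, batch_size),
            )
            rows = cur.fetchall()
            if not rows:
                break
            for _pk, identifier, tags in rows:
                cleaned = transform(tags)
                if cleaned != tags:  # no-op rows cost a read, not a write
                    cur.execute(
                        "UPDATE image SET tags = %s::jsonb WHERE identifier = %s",
                        (json.dumps(cleaned), identifier),
                    )
            conn.commit()
            last_pk = rows[-1][0]
```

Wrapped in a DAG task (and throttled with a pause between batches if needed), this would spread the work out without holding any long-lived locks.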

I am deeply confused by the way the conversations and decisions about catalog data transformations and the cleanup steps have been made so far, and the limitations that are referenced offhand. I do not wish to derail anything, or to imply that anything is being intentionally ignored or that this problem is easy or obvious. However, I am worried that there have been grave and deep misunderstandings of the technologies and problems we're working with, and that in fact those misunderstandings are even in conflict with actual things we can and do already do (batched update DAG). I am in particular worried that these misunderstandings have cost us dearly in project planning and implementation time, and have also caused difficulties and misunderstandings in the communication about these problems and the tools we're working with. I, at least, certainly feel uncertain in my understanding of what is happening, and feel like the explanations and rationales I am reading (and the way I have interpreted them) is in direct conflict with what I know to be true about the tools we have, and the problems we're trying to solve.

To be clear, to me the fundamental problem is: how do we iterate through all the catalog data and make selective transformations of some of the rows. This framing assumes that all rows are iterated through every time, rather than the theoretically more efficient version that only iterates through the rows that need transformation. However, that inefficiency is also solvable because we have indexed versions of our data where we can retrieve immutable keyed information (the identifier), so long as the queryable aspects are available in either the API database or Elasticsearch.

Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server
Projects
Status: 🚧 Draft
Development

Successfully merging this pull request may close these issues.

Incorrectly decoded tag names cause URIError: URI malformed
6 participants