
Limit the number of concurrent dead link requests #4333

Closed
wants to merge 3 commits

Conversation

sarayourfriend
Contributor

@sarayourfriend commented on May 15, 2024

Fixes

Fixes #4332 by @sarayourfriend

Description

Limits the number of concurrent dead link requests by blocking on a semaphore before each request is sent; the semaphore enforces the limit (see the sketch at the end of this section).

This is a draft because I need to think about how to write tests for this. Manual testing appears to prove this works in the happy path, but I'm not convinced that's enough!

I would also love input on the maximum to enforce.

The new logs are set to info for now, but I'll reduce them to debug after testing. Keeping them at info makes them easier to find; with debug logs enabled there is too much noise to sort through to find the semaphore ones.
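
For concreteness, the approach amounts to roughly the following sketch. The helper name `_head_with_limit` and the hard-coded limit are placeholders for illustration, not the actual code in this PR:

```python
import asyncio
import aiohttp

# Placeholder value; in the PR the real limit comes from a Django setting.
MAX_CONCURRENT_REQUESTS = 100


async def _head_with_limit(
    url: str, session: aiohttp.ClientSession, semaphore: asyncio.Semaphore
) -> tuple[str, int]:
    # Block until a semaphore slot is free, so at most MAX_CONCURRENT_REQUESTS
    # HEAD requests are in flight for this search request.
    async with semaphore:
        try:
            async with session.head(url, allow_redirects=False) as response:
                return url, response.status
        except aiohttp.ClientError:
            return url, -1


async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    # The semaphore must be created inside the running event loop's context,
    # which is why it lives here rather than at module level.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession() as session:
        tasks = [_head_with_limit(url, session, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
```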

Testing Instructions

Run the API locally with just api/up and run just logs web to view the logs.

Make a search request with filter_dead=True and a large page size. The large page size helps illustrate the issue.

The following patch makes this easier to test by bypassing cached values and reducing the number of unrelated logs that make it hard to follow the semaphore logs:

diff --git a/api/api/utils/check_dead_links/__init__.py b/api/api/utils/check_dead_links/__init__.py
index 0afeebe63..dcb4623b5 100644
--- a/api/api/utils/check_dead_links/__init__.py
+++ b/api/api/utils/check_dead_links/__init__.py
@@ -133,8 +133,8 @@ def check_dead_links(
     # Anything that isn't in the cache needs to be validated via HEAD request.
     to_verify = {}
     for idx, url in enumerate(image_urls):
-        if cached_statuses[idx] is None:
-            to_verify[url] = idx
+        # if cached_statuses[idx] is None:
+        to_verify[url] = idx
     logger.debug(f"len(to_verify)={len(to_verify)}")
 
     verified = _make_head_requests(to_verify.keys())
@@ -180,19 +180,20 @@ def check_dead_links(
         status_mapping = provider_status_mappings[provider]
 
         if status in status_mapping.unknown:
-            logger.warning(
-                "Image validation failed due to rate limiting or blocking. "
-                f"url={image_urls[idx]} "
-                f"status={status} "
-                f"provider={provider} "
-            )
+            ...
+            # logger.warning(
+            #     "Image validation failed due to rate limiting or blocking. "
+            #     f"url={image_urls[idx]} "
+            #     f"status={status} "
+            #     f"provider={provider} "
+            # )
         elif status not in status_mapping.live:
-            logger.info(
-                "Deleting broken image from results "
-                f"id={results[del_idx]['identifier']} "
-                f"status={status} "
-                f"provider={provider} "
-            )
+            # logger.info(
+            #     "Deleting broken image from results "
+            #     f"id={results[del_idx]['identifier']} "
+            #     f"status={status} "
+            #     f"provider={provider} "
+            # )
             # remove the result, mutating in place
             del results[del_idx]
             # update the result's position in the mask to indicate it is dead

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend added the 🟨 priority: medium (Not blocking but should be addressed soon) label on May 15, 2024
@github-actions (bot) added the 🧱 stack: api (Related to the Django API) label on May 15, 2024
@sarayourfriend added the 💻 aspect: code (Concerns the software code in the repository), 🧰 goal: internal improvement (Improvement that benefits maintainers, not users), and 🧱 stack: api (Related to the Django API) labels on May 15, 2024
@sarayourfriend force-pushed the fix/limit-concurrent-deadlink-checks branch from 11533b2 to 01fd7a0 on May 22, 2024 at 01:24
@sarayourfriend force-pushed the fix/limit-concurrent-deadlink-checks branch from 01fd7a0 to 968ca11 on May 22, 2024 at 03:00
@sarayourfriend force-pushed the fix/limit-concurrent-deadlink-checks branch from 968ca11 to 27e7e1f on May 22, 2024 at 03:04
Contributor Author

@sarayourfriend left a comment


I'm undrafting this now because it works, and I need input from others on whether it's worth moving forward with this approach.

The primary difference from my initial version is that this is a request-local limit rather than a worker-global limit. That is, the concurrency of dead link requests is capped per search request, so the total per-worker concurrency still depends on the worker-level limit on simultaneous requests. uvicorn already enforces that with its own flow control, so in theory our own tool for preventing runaway sockets in our code should work together with the worker's overall limit.

When we configure the environment variable to enable the limit, we need to choose the value with that context in mind.

Pulling some production stats and estimating per-worker requirements based on the canary service (which is a single worker and was sent a proportional number of requests), these are the numbers we should keep in mind (a rough worst-case estimate follows the list):

  • 50 requests per second
  • 2 second head request timeout
  • 400 results max pagination depth, ~300 p95, ~170 average
  • 50% assumed dead link ratio
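
A rough back-of-envelope reading of those numbers (my own arithmetic, treating them as per-worker figures): with 50 requests per second and a 2 second HEAD timeout, roughly 100 search requests can be mid-validation at once, and if each one fires HEAD requests for a full 400-result page, the unbounded worst case is on the order of 40,000 concurrent sockets per worker, which is the kind of runaway this cap is meant to bound.

```python
# Back-of-envelope estimate from the stats above (assumes every search request
# validates its full page of results at once; real traffic will be lower).
requests_per_second = 50
head_timeout_seconds = 2
max_page_depth = 400      # worst case
avg_page_depth = 170      # average

searches_in_flight = requests_per_second * head_timeout_seconds   # ~100
worst_case_sockets = searches_in_flight * max_page_depth          # ~40,000
typical_sockets = searches_in_flight * avg_page_depth             # ~17,000

print(worst_case_sockets, typical_sockets)
```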

        return url, -1


# https://stackoverflow.com/q/55259755
@async_to_sync
async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    tasks = [asyncio.ensure_future(_head(url)) for url in urls]
    session = await get_aiohttp_session()
Contributor Author


Moving the session retrieval here significantly reduces the number of log lines: get_aiohttp_session logs whether it's reusing or creating a session for the loop on every call. We could even save money on log storage, not to mention the real cost of calling get_aiohttp_session itself, which isn't free anyway.

It does not have any effect on this flow control issue, however (just to clarify).

Comment on lines 69 to 71
if not settings.LINK_VALIDATION_MAX_CONCURRENT_REQUESTS:
    yield
    return
Contributor Author


The default of None for the setting means this won't have any effect when deployed until we actually configure a value via the environment variable. I'll share my thoughts elsewhere on what an appropriate starting point for this should be.
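
Roughly, that guard reads like the disabled branch of an async context manager, something along these lines (a sketch of my reading, not the exact code in the PR):

```python
import asyncio
from contextlib import asynccontextmanager

from django.conf import settings


@asynccontextmanager
async def _limit_concurrency(semaphore: asyncio.Semaphore):
    # With the default of None (or 0), the limit is disabled and this context
    # manager is a no-op, so deployments are unaffected until the environment
    # variable is configured.
    if not settings.LINK_VALIDATION_MAX_CONCURRENT_REQUESTS:
        yield
        return
    async with semaphore:
        yield
```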

@@ -55,5 +55,6 @@ IS_PROXIED=False
#SENTRY_DSN=

FILTER_DEAD_LINKS_BY_DEFAULT=False
LINK_VALIDATION_MAX_CONCURRENT_REQUESTS=10
Contributor Author


Default for local usage to make it easier to test.

@sarayourfriend marked this pull request as ready for review on May 22, 2024 at 04:07
@sarayourfriend requested review from a team as code owners on May 22, 2024 at 04:07
Member

@krysal left a comment


I'll look at this again tomorrow. I haven't worked much with semaphores, but this looks interesting (and challenging at the same time!).

Comment on lines 94 to +99
# https://stackoverflow.com/q/55259755
@async_to_sync
async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    tasks = [asyncio.ensure_future(_head(url)) for url in urls]
    # Limits total concurrent dead link requests per search request
    # Must be created (and therefore bound to loop) within `_make_head_request`
    # This prevents
Member


This prevents... ?


Besides, I notice the link in the comment mentions that aiohttp already limits the number of simultaneous connections to 100.

It achieves this by setting a default limit on the TCPConnector object that is used by ClientSession.

Is this something you've already looked at, so this new code would be a second limit? Or could it be a simpler/more direct way to limit the number of requests?
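
For reference, this is the aiohttp behaviour being described, as far as I understand it: the connector's limit defaults to 100 simultaneous connections and can be tuned or disabled.

```python
import aiohttp


async def make_session() -> aiohttp.ClientSession:
    # aiohttp's default TCPConnector already caps simultaneous connections at
    # 100 across the whole session. limit=0 disables the cap, and
    # limit_per_host caps connections to any single host (0 means unlimited).
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=0)
    return aiohttp.ClientSession(connector=connector)
```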

Contributor Author


Oh, I wasn't aware of that aiohttp feature. I need to look into it again! Maybe there's a simpler way to do this than manually tracking with a semaphore? 100 is rather far below the limit I would have proposed, so that's interesting 🤔

Contributor Author


So yeah, looking at it, it should already be using a limit of 100 like you said.

This makes me far less confident this change would be meaningful. I think I'll close this PR; instead, we should think about what logging we could add that might reveal where the issues are coming from.

What do you think, @krysal? Any ideas for logs we could add to try to reveal this?

Labels
💻 aspect: code (Concerns the software code in the repository)
🧰 goal: internal improvement (Improvement that benefits maintainers, not users)
🟨 priority: medium (Not blocking but should be addressed soon)
🧱 stack: api (Related to the Django API)
Projects
Status: 2‍⃣ Needs 2 Reviews
Development

Successfully merging this pull request may close these issues.

Limit the number of concurrent dead link requests
2 participants