
Cache _has_init calls to avoid repeated stats #2429

Merged

Conversation

correctmost
Contributor

Description

_has_init can end up checking for the presence of the same files over and over.

For example, when running pylint's import-error checks on a codebase like yt-dlp, ~43,000 redundant stats were performed prior to caching.

Closes pylint-dev/pylint#9613.
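
For illustration, a minimal sketch of the caching approach, assuming a bounded functools.lru_cache around the helper. The real change lives in astroid/modutils.py and may differ in detail (the extension list and maxsize here are illustrative):

    from functools import lru_cache
    import os

    @lru_cache(maxsize=4096)  # bounded, so entries are evicted instead of growing forever
    def _has_init(directory: str) -> str | None:
        """Return the path to an __init__ file if `directory` is a package, else None."""
        base = os.path.join(directory, "__init__")
        for ext in (".py", ".pyi", ".pyc", ".pyo"):
            candidate = base + ext
            if os.path.isfile(candidate):  # this stat now happens at most once per directory
                return candidate
        return None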

Performance

I ran pylint with just the import-error checks on the yt-dlp codebase. With 2c38c02 and
pylint-dev/pylint@7521eb1, linting takes ~34.1 seconds. With the fix applied, linting takes ~33.8 seconds.
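
For reference, a comparable run limited to the import checks can be reproduced with standard pylint options (the target path is illustrative):

    time pylint --disable=all --enable=import-error yt_dlp/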

Type of Changes

Type: 🔨 Refactoring


codecov bot commented May 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.78%. Comparing base (2c38c02) to head (b73457f).
Report is 1 commit behind head on main.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #2429   +/-   ##
=======================================
  Coverage   92.78%   92.78%           
=======================================
  Files          94       94           
  Lines       11098    11099    +1     
=======================================
+ Hits        10297    10298    +1     
  Misses        801      801           
Flag      Coverage Δ
linux     92.59% <100.00%> (+<0.01%) ⬆️
pypy      92.78% <100.00%> (+<0.01%) ⬆️
windows   92.68% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files                 Coverage Δ
astroid/manager.py    89.58% <ø> (ø)
astroid/modutils.py   88.19% <100.00%> (+0.04%) ⬆️

@DanielNoord
Collaborator

Wouldn't it be better to cache check_modpath_has_init and file_info_from_modpath?

@correctmost
Contributor Author

Wouldn't it be better to cache check_modpath_has_init and file_info_from_modpath?

Hmm, I'm not sure :).

These are the stats for the two functions with this PR applied:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 33329    0.054    0.000   11.868    0.000 astroid/modutils.py:340(file_info_from_modpath)
  1173    0.003    0.000    0.010    0.000 astroid/modutils.py:238(check_modpath_has_init)
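
Figures like these come from Python's built-in profiler; one way to collect similar output (the paths are illustrative):

    python -m cProfile -o pylint.prof -m pylint --disable=all --enable=import-error yt_dlp/
    python -c "import pstats; pstats.Stats('pylint.prof').sort_stats('cumtime').print_stats('modutils')"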

check_modpath_has_init does not seem too costly compared to other functions. file_info_from_modpath has a lot of callees, which makes it harder to determine if caching is safe for the entire function. Both functions also accept non-hashable lists, which would require me to introduce some tuple conversion code for the arguments.
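
For illustration, the kind of tuple-conversion shim that would be required; the names and maxsize below are hypothetical:

    from functools import lru_cache

    def file_info_from_modpath(modpath: list[str]):
        # The public function keeps accepting a list...
        return _file_info_cached(tuple(modpath))

    @lru_cache(maxsize=1024)
    def _file_info_cached(modpath: tuple[str, ...]):
        ...  # ...while the cached implementation takes a hashable tuple key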

I think my preference is to use caching on smaller function blocks because it makes it easier to identify side effects or statefulness that could make caching unsafe.

Would it be okay to revisit this in future tickets / PRs if file_info_from_modpath shows up as a hot spot? (I have another patch in the works that should reduce the cumulative runtime of the function by avoiding even more stats.)

@DanielNoord
Collaborator

DanielNoord commented May 13, 2024

I'm not sure what other maintainers think of this.

Personally, I'm not a big fan of slapping lru_cache everywhere, as I think it actually makes it harder to reason about what is and isn't cached, and you just multiply the usual issues with caches. Adding caches a little higher up in the call chain makes more sense to me.

@Pierre-Sassoulas
Member

it actually makes it harder to reason about what is and isn't cached and you just multiply the usual issues with caches.

Disclaimer: I'm not a cache expert. I don't know what the usual issues with caches are, so I'm not sure what's wrong with caching or why we need to reason about it. We had some issues with caching (memory leaks) in the past, but now we're using lru_cache with a maximum number of entries and we also have an API to clear it, so what would the issues be? If we see that a function is called often enough that the caching overhead is worth it, and then do a big rework, we might have to reconsider the caching?

That being said, can't we just sed for lru_cache and get a full list? Also, caching the smaller functions could have a benefit even if we also cache the higher-level functions, if the higher-level function calls the lower-level functions multiple times and those multiple calls are costlier than the caching overhead, right?
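
For context, a sketch of the pattern described above: functools.lru_cache with a bounded maxsize, plus the clearing and introspection API every wrapper exposes (the function below is a made-up stand-in):

    from functools import lru_cache

    @lru_cache(maxsize=512)       # bounded: least-recently-used entries are evicted
    def expensive_lookup(key: str) -> str:
        return key.upper()        # stand-in for real work

    expensive_lookup("a")
    expensive_lookup("a")
    print(expensive_lookup.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=512, currsize=1)
    expensive_lookup.cache_clear()        # the "API to clear it"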

@DanielNoord
Collaborator

We had some issues with caching (memory leaks) in the past, but now we're using lru_cache with a maximum number of entries and we also have an API to clear it, so what would the issues be?

We can of course still have caches in which the objects get too large, or caches that never hit. Every cache adds additional "maintainer" overhead as you need to reason about whether it is worth it. That's why I would prefer to put them a little higher up to have more effect and reduce their overall number.

That being said, can't we just sed for lru_cache and get a full list? Also, caching the smaller functions could have a benefit even if we also cache the higher-level functions, if the higher-level function calls the lower-level functions multiple times and those multiple calls are costlier than the caching overhead, right?

Yes, of course this is all true. I just pointed out that since this is only called by two functions and is an internal function, it might make more sense to look a little higher up.

@Pierre-Sassoulas
Member

Every cache adds additional "maintainer" overhead as you need to reason about whether it is worth it.

I see, thank you for explaining. While it's true, I agree with @correctmost's point that caching small functions is simpler to reason about than caching big functions. I'm not sure that the overall maintenance burden would be better with 2 big caches vs. 4 or 5 small ones. Also, as they said:

Both functions also accept non-hashable lists, which would require me to introduce some tuple conversion code for the arguments.

Overall, I'm neutral on the issue, but I think it shouldn't block the amazing work done by @correctmost atm. There are a lot of provided benchmarks, so we can clearly see that this caching is required -- for the moment at least. This could have gone into 3.2.0 if I hadn't waited to answer 😅

(I have another patch in the works that should reduce the cumulative runtime of the function by avoiding even more stats.)

And this one too.

@Pierre-Sassoulas Pierre-Sassoulas left a comment
Member

Thank you for the amazing benchmarks @correctmost

@Pierre-Sassoulas Pierre-Sassoulas merged commit e43e045 into pylint-dev:main May 17, 2024
21 checks passed
@jacobtylerwalls
Member

Sliding in only for this tangent:

Disclaimer: I'm not a cache expert. I don't know what the usual issues with caches are, so I'm not sure what's wrong with caching or why we need to reason about it.

Looks like we made the right change here, but I think healthy cache skepticism is always worth saluting. 🫡

Someone I never worked with directly, but who left an influence at a place I used to work, shared a custom Slack alert he wrote to notify on mentions of "architecturally significant events" like "trivial", "should work", "cache", and "redis". 🤣


Successfully merging this pull request may close these issues.

E0401 (import-error) checks perform repeated _has_init and stat calls