
Cache _has_init calls to avoid repeated stats #2429

Merged

Conversation

correctmost
Contributor

Description

_has_init can end up checking for the presence of the same files over and over.

For example, when running pylint's import-error checks on a codebase like yt-dlp, ~43,000 redundant stats were performed prior to caching.

Closes pylint-dev/pylint#9613.
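
For illustration, a minimal sketch of the caching approach, assuming a bounded functools.lru_cache around the helper. The real change lives in astroid/modutils.py and may differ in detail (the extension list and maxsize here are illustrative):

    from functools import lru_cache
    import os

    @lru_cache(maxsize=4096)  # bounded, so entries are evicted instead of growing forever
    def _has_init(directory: str) -> str | None:
        """Return the path to an __init__ file if `directory` is a package, else None."""
        base = os.path.join(directory, "__init__")
        for ext in (".py", ".pyi", ".pyc", ".pyo"):
            candidate = base + ext
            if os.path.isfile(candidate):  # this stat now happens at most once per directory
                return candidate
        return None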

Performance

I ran pylint with just the import-error checks on the yt-dlp codebase. With 2c38c02 and
pylint-dev/pylint@7521eb1, linting takes ~34.1 seconds. With the fix applied, linting takes ~33.8 seconds.
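
For reference, a comparable run limited to the import checks can be reproduced with standard pylint options (the target path is illustrative):

    time pylint --disable=all --enable=import-error yt_dlp/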

Type of Changes

Type: 🔨 Refactoring


codecov bot commented May 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.78%. Comparing base (2c38c02) to head (b73457f).
Report is 1 commit behind head on main.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #2429   +/-   ##
=======================================
  Coverage   92.78%   92.78%           
=======================================
  Files          94       94           
  Lines       11098    11099    +1     
=======================================
+ Hits        10297    10298    +1     
  Misses        801      801           
Flag      Coverage Δ
linux     92.59% <100.00%> (+<0.01%) ⬆️
pypy      92.78% <100.00%> (+<0.01%) ⬆️
windows   92.68% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files                 Coverage Δ
astroid/manager.py    89.58% <ø> (ø)
astroid/modutils.py   88.19% <100.00%> (+0.04%) ⬆️

@DanielNoord
Collaborator

Wouldn't it be better to cache check_modpath_has_init and file_info_from_modpath?

@correctmost
Contributor Author

Wouldn't it be better to cache check_modpath_has_init and file_info_from_modpath?

Hmm, I'm not sure :).

These are the stats for the two functions with this PR applied:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 33329    0.054    0.000   11.868    0.000 astroid/modutils.py:340(file_info_from_modpath)
  1173    0.003    0.000    0.010    0.000 astroid/modutils.py:238(check_modpath_has_init)
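
Figures like these come from Python's built-in profiler; one way to collect similar output (the paths are illustrative):

    python -m cProfile -o pylint.prof -m pylint --disable=all --enable=import-error yt_dlp/
    python -c "import pstats; pstats.Stats('pylint.prof').sort_stats('cumtime').print_stats('modutils')"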

check_modpath_has_init does not seem too costly compared to other functions. file_info_from_modpath has a lot of callees, which makes it harder to determine if caching is safe for the entire function. Both functions also accept non-hashable lists, which would require me to introduce some tuple conversion code for the arguments.
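
For illustration, the kind of tuple-conversion shim that would be required; the names and maxsize below are hypothetical:

    from functools import lru_cache

    def file_info_from_modpath(modpath: list[str]):
        # The public function keeps accepting a list...
        return _file_info_cached(tuple(modpath))

    @lru_cache(maxsize=1024)
    def _file_info_cached(modpath: tuple[str, ...]):
        ...  # ...while the cached implementation takes a hashable tuple key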

I think my preference is to use caching on smaller function blocks because it makes it easier to identify side effects or statefulness that could make caching unsafe.

Would it be okay to revisit this in future tickets / PRs if file_info_from_modpath shows up as a hot spot? (I have another patch in the works that should reduce the cumulative runtime of the function by avoiding even more stats.)

@DanielNoord
Collaborator

DanielNoord commented May 13, 2024

I'm not sure what other maintainers think of this.

Personally, I'm not a big fan of slapping lru_cache everywhere, as I think it actually makes it harder to reason about what is and isn't cached, and you just multiply the usual issues with caches. Adding caches a little higher up in the call chain makes more sense to me.

@Pierre-Sassoulas
Member

it actually makes it harder to reason about what is and isn't cached and you just multiply the usual issues with caches.

Disclaimer: I'm not a cache expert. I don't know what the usual issues with caches are, so I'm not sure what's wrong with caching or why we need to reason about it. We had some issues with caching (memory leaks) in the past, but now we're using lru_cache with a maximum number of entries and we also have an API to clear it, so what would the issues be? If we see that a function is called often enough that the caching overhead is worth it, and then do a big rework, we might have to reconsider the caching?

That being said, can't we just sed for lru_cache and get a full list? Also, caching the smaller functions could have a benefit even if we also cache the higher-level functions, if the higher-level function calls the lower-level functions multiple times and those multiple calls are costlier than the caching overhead, right?
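
For context, a sketch of the pattern described above: functools.lru_cache with a bounded maxsize, plus the clearing and introspection API every wrapper exposes (the function below is a made-up stand-in):

    from functools import lru_cache

    @lru_cache(maxsize=512)       # bounded: least-recently-used entries are evicted
    def expensive_lookup(key: str) -> str:
        return key.upper()        # stand-in for real work

    expensive_lookup("a")
    expensive_lookup("a")
    print(expensive_lookup.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=512, currsize=1)
    expensive_lookup.cache_clear()        # the "API to clear it"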

@DanielNoord
Collaborator

We had some issues with caching (memory leaks) in the past, but now we're using lru_cache with a maximum number of entries and we also have an API to clear it, so what would the issues be?

We can of course still have caches in which the objects get too large, or caches that never hit. Every cache adds additional "maintainer" overhead as you need to reason about whether it is worth it. That's why I would prefer to put them a little higher up to have more effect and reduce their overall number.

That being said, can't we just sed for lru_cache and get a full list? Also, caching the smaller functions could have a benefit even if we also cache the higher-level functions, if the higher-level function calls the lower-level functions multiple times and those multiple calls are costlier than the caching overhead, right?

Yes, of course this is all true. I just pointed out that since this is only called by two functions and is an internal function, it might make more sense to look a little higher up.

@Pierre-Sassoulas
Member

Every cache adds additional "maintainer" overhead as you need to reason about whether it is worth it.

I see, thank you for explaining. While it's true, I agree with @correctmost's point that caching small functions is simpler to reason about than caching big functions. I'm not sure that the overall maintenance burden would be better with 2 big caches vs. 4 or 5 small ones. Also, as they said:

Both functions also accept non-hashable lists, which would require me to introduce some tuple conversion code for the arguments.

Overall, I'm neutral on the issue, but I think it shouldn't block the amazing work done by @correctmost atm. There are a lot of provided benchmarks, so we can clearly see that this caching is required -- for the moment at least. This could have gone into 3.2.0 if I hadn't waited to answer 😅

(I have another patch in the works that should reduce the cumulative runtime of the function by avoiding even more stats.)

And this one too.

@Pierre-Sassoulas Pierre-Sassoulas left a comment
Member

Thank you for the amazing benchmarks @correctmost

@Pierre-Sassoulas Pierre-Sassoulas merged commit e43e045 into pylint-dev:main May 17, 2024
21 checks passed
@jacobtylerwalls
Member

Sliding in only for this tangent:

Disclaimer: I'm not a cache expert. I don't know what the usual issues with caches are, so I'm not sure what's wrong with caching or why we need to reason about it.

Looks like we made the right change here, but I think healthy cache skepticism is always worth saluting. 🫡

Someone I never worked with directly, but who left an influence at a place I used to work, shared a custom Slack alert he wrote to notify on mentions of "architecturally significant events" like "trivial", "should work", "cache", and "redis". 🤣


Successfully merging this pull request may close these issues.

E0401 (import-error) checks perform repeated _has_init and stat calls