Making AstroidManager interactable from multiple processes #2048

Open
DanielNoord opened this issue Mar 9, 2023 · 8 comments

@DanielNoord
Collaborator

I know, another issue about multiprocessing...

I was reading up on multiprocessing and sharing caches among processes, and it seems as if we need something like multiprocessing.SyncManager to allow a cache to be shared between multiple processes.
I think it would be worthwhile to explore whether we can make AstroidManager a SyncManager that needs to be instantiated by whoever needs it rather than having it be a singleton.
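
For illustration, roughly what I have in mind (a sketch with placeholder strings rather than real astroid trees, which would themselves need to be picklable to live in the proxy):

from multiprocessing import Manager, Pool

def parse_one(args):
    # `cache` is a DictProxy to the dict hosted by the SyncManager process,
    # so every worker reads and writes the same mapping.
    path, cache = args
    if path not in cache:
        cache[path] = f"tree for {path}"  # stand-in for a real astroid tree
    return cache[path]

if __name__ == "__main__":
    with Manager() as manager:            # multiprocessing.Manager() returns a SyncManager
        shared_cache = manager.dict()     # shared, proxy-backed cache
        files = ["a.py", "b.py", "a.py"]  # placeholder file list
        with Pool(2) as pool:
            pool.map(parse_one, [(f, shared_cache) for f in files])
        print(dict(shared_cache))         # only two entries: 'a.py' was already cached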

However, before spending considerable time exploring this, I want to see whether others have already tried this or have other ideas about allowing pylint and astroid to "parse modules and nodes in multiple processes while keeping a cache between those processes".

Tagging @jacobtylerwalls in particular, as they have tinkered with this as well.

@jacobtylerwalls
Member

If it were me, I'd want to just step through and profile the entire current implementation. I've yet to do that, so I really don't have a forecast at this point. My worry is that we're building a lot of unnecessary astroid ASTs. If five Python modules import pandas, are we building an AST for pandas five times?
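
For example, one way to get that picture (a rough sketch; the target path is a placeholder):

import cProfile
import pstats

from pylint import lint

profiler = cProfile.Profile()
profiler.enable()
lint.Run(["path/to/some_package"], exit=False)  # placeholder target
profiler.disable()

# Restrict the report to astroid frames to see how often trees get rebuilt,
# e.g. how many calls land in ast_from_file / ast_from_module_name.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("astroid")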

@jacobtylerwalls
Member

jacobtylerwalls commented May 14, 2023

Thanks for the link. I think with the SyncManager we'll run into trouble with unpicklable objects. I did see this, though:

Better to inherit than pickle/unpickle
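
Roughly what that guideline would look like for the astroid cache (a sketch, assuming the fork start method, which is POSIX-only):

import multiprocessing as mp

from astroid.manager import AstroidManager

def in_cache(module_name):
    # Runs in a forked worker: the cache warmed by the parent is inherited
    # with the process image, so nothing is pickled or sent over a pipe.
    return module_name in AstroidManager().astroid_cache

if __name__ == "__main__":
    AstroidManager().ast_from_module_name("json")  # warm the cache in the parent
    ctx = mp.get_context("fork")
    with ctx.Pool(2) as pool:
        print(pool.map(in_cache, ["json", "never.parsed"]))  # [True, False]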

Do you mind if I give this a spin?

jacobtylerwalls self-assigned this May 14, 2023
@DanielNoord
Collaborator Author

I was able to pickle most objects with cloudpickle.
Not sure if passing the brain helps that much? Isn't the issue more about having to pass the astroid_cache?
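
Roughly the kind of round-trip I mean (a sketch, not the exact code I used; "most" because some trees still refuse to serialize):

import astroid
import cloudpickle  # third-party: pip install cloudpickle

module = astroid.parse("import json\nvalue = json.loads('{}')\n")
try:
    payload = cloudpickle.dumps(module)   # plain pickle tends to struggle with these
    restored = cloudpickle.loads(payload)
    print(type(restored).__name__, f"{len(payload)} bytes")
except Exception as exc:                  # the trees that still don't round-trip
    print("could not pickle this tree:", exc)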

@jacobtylerwalls
Member

Not sure if passing the brain helps that much? Isn't the issue more about having to pass the astroid_cache?

That's what I was referring to:

brain: AstroidManagerBrain = {
    "astroid_cache": {},

@jacobtylerwalls
Member

Perhaps:

  • parallelize _get_asts()
  • collect those results into one set (where results = both the trees and the astroid cache formed along the way)
  • send that to the workers when parallelizing the lint runs

That way worker processes never have to update shared state. 🤔
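
As a sketch of those three steps (stand-in names on my part, with cloudpickle doing the serialization as mentioned above; lint_one is a placeholder for pylint's per-file check):

import cloudpickle  # third-party; some astroid trees don't survive plain pickle
from concurrent.futures import ProcessPoolExecutor

import astroid
from astroid.manager import AstroidManager

def build_ast(path):
    """Phase 1 worker: parse one file and ship the tree back to the parent."""
    tree = AstroidManager().ast_from_file(path)
    return tree.name, cloudpickle.dumps(tree)  # serializing is the costly part

def lint_one(job):
    """Phase 2 worker: seed the local cache instead of re-parsing, then lint."""
    modname, all_blobs = job
    cache = {name: cloudpickle.loads(blob) for name, blob in all_blobs.items()}
    AstroidManager().astroid_cache.update(cache)
    tree = cache[modname]
    ...  # run the checkers against `tree` (elided here)

if __name__ == "__main__":
    files = [astroid.__file__]                    # stand-in for pylint's file list
    with ProcessPoolExecutor() as pool:
        blobs = dict(pool.map(build_ast, files))          # 1. parallel _get_asts()
    jobs = [(name, blobs) for name in blobs]              # 2. one collected set
    with ProcessPoolExecutor() as pool:
        list(pool.map(lint_one, jobs))                    # 3. parallel lint runs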

@DanielNoord
Collaborator Author

Perhaps:

  • parallelize _get_asts()
  • collect those results into one set (where results = both the trees and the astroid cache formed along the way)
  • send that to the workers when parallelizing the lint runs

That way worker processes never have to update shared state. 🤔

I did this, but it gives only a minimal gain. The creation of an astroid.Module for a module via ast_from_file isn't the main performance bottleneck. I created an astroidd daemon (for lack of a better name) but saw almost no significant performance gain. Another issue is that you can't pickle all astroid.Module objects, and even when you can, almost all of the performance benefit of "caching" ast_from_file is lost during pickling.

What we should "cache"/parallelize is node.infer(), as that is where we start inferring ImportNodes, which seem to take the most time currently. However, during _get_asts we don't really know which ImportNodes we will see, and it also seems like bad design to just follow all ImportNodes at the start even though we might not need them.
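
A small, self-contained illustration of what I mean (a sketch with the stdlib json module standing in for a heavier import):

import time

import astroid

SRC = "import json\nvalue = json.loads('{}')\n"

start = time.perf_counter()
module = astroid.parse(SRC)
print(f"build the tree:   {time.perf_counter() - start:.4f}s")

name = module.body[-1].value.func.expr   # the `json` Name inside json.loads(...)
start = time.perf_counter()
inferred = next(name.infer())            # following the import makes astroid parse json too
print(f"infer the import: {time.perf_counter() - start:.4f}s")
print(inferred)                          # the astroid Module built for `json`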

@jacobtylerwalls
Member

jacobtylerwalls commented Jun 15, 2023

Do you have an experimental branch where you tried this stuff? If the bottleneck really is somewhere else, it would be instructive to profile it and find out where it's lurking instead. EDIT: ah, you did say you expect the bottleneck to be in inferring import nodes.

@DanielNoord
Collaborator Author

Sadly I don't, as the code wasn't achieving what I was trying to do, so I deleted it 😅
