Map PCA script or function? #20

Open
alisiafadini opened this issue Aug 17, 2022 · 10 comments
Labels: wishlist (New requested features)

Comments

@alisiafadini

Maybe there is one of these already, but I haven't found it yet. I think it could be useful to have either a small script or a function that applies PCA to, e.g., a set of time-ordered MTZs/maps. One could use it both for denoising and to inspect the output components, depending on the specific case. I already have some bits for this too, and I'm sure I'm not the only one, so we could potentially consolidate?

@kmdalton
Member

@JBGreisman definitely has some code that computes the SVD of a stack of structure factors. AFAIK, it hasn't made it into rs-booster yet. I think that is more or less equivalent?

@JBGreisman
Member

I certainly have some code locally for analyzing (n_reflections x m_dataset) type stacks of structure factors by SVD. I hadn't thought of adding them to rs-booster because it ultimately boils down to a one-liner to np.linalg.svd(), but if there is interest I'd be happy to write that up in a way that takes multiple input MTZs using a -i type interface.

I've certainly used this in the past for "denoising" and feature detection, so it could be useful.
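For illustration, the "one-liner" on an (n_reflections x m_datasets) stack might look like the sketch below. The stack here is synthetic stand-in data, not real structure factors, and the variable names are assumptions made for this example only.

```python
import numpy as np

# Synthetic stand-in for a stack of structure factors:
# 200 reflections (rows) measured in 6 datasets (columns).
rng = np.random.default_rng(0)
n_reflections, m_datasets = 200, 6
signal = np.outer(rng.normal(size=n_reflections), np.linspace(0.0, 1.0, m_datasets))
stack = signal + 0.05 * rng.normal(size=(n_reflections, m_datasets))

# Thin SVD: stack == U @ diag(s) @ Vt
# Columns of U are modes over reflections, rows of Vt are modes over datasets,
# and s holds the singular values in descending order.
U, s, Vt = np.linalg.svd(stack, full_matrices=False)

print(U.shape, s.shape, Vt.shape)
```

With `full_matrices=False`, `U` keeps only as many columns as there are datasets, which is usually what is wanted when n_reflections is much larger than m_datasets.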

@JBGreisman added the wishlist label Aug 17, 2022
@kmdalton
Member

i think there's merit in putting a good SVD implementation out there. if only to help standardize things. i'm sure the hekstra lab will use it also. having it available as a command line script with a parser will certainly help with the learning curve.

i'm not sure exactly what the difference between SVD and PCA is in this context. is there merit in having both?

@alisiafadini
Author

Yes, my thinking was to help standardize and also just give a framework for new users who may not think to use it/may not be as familiar with these options to start with. I personally tried both SVD and PCA. Ended up using PCA but essentially was not seeing a difference in my examples. Traditionally, Marius Schmidt and others have used SVD but I don't see a theoretical issue with PCA.
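On the SVD versus PCA question: PCA is conventionally computed as an SVD of the mean-centered data, so the two differ mainly in the centering step and in how the variances are scaled. The sketch below checks this numerically on synthetic data. Treating each dataset as one observation (rows) and each reflection as one variable (columns) is an assumption made for this example.

```python
import numpy as np

# Synthetic data: 6 datasets (observations) x 50 reflections (variables),
# with a constant offset so that the mean is clearly non-zero.
rng = np.random.default_rng(1)
m_datasets, n_reflections = 6, 50
X = rng.normal(size=(m_datasets, n_reflections)) + 5.0

# PCA via SVD: center each variable, then take the SVD.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
variances_from_svd = s**2 / (m_datasets - 1)

# PCA via eigendecomposition of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1][:m_datasets]  # largest first

# SVD without centering: the leading singular value also absorbs the mean.
s_raw = np.linalg.svd(X, compute_uv=False)

print(np.allclose(variances_from_svd, eigvals))
print(s_raw[0], s[0])
```

If the mean is small relative to the variation between datasets, centered and uncentered results will look very similar, which may explain seeing no difference in practice.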

@JBGreisman
Member

Sounds good -- I'll plan to add a script in the near future. From there we can decide if there are other useful decomposition methods that are worth supporting.

@JBGreisman self-assigned this Aug 18, 2022
@kmdalton
Member

Maybe we can have a --method="{SVD,PCA}" flag to toggle. That would make it easy to add other decompositions later.

@JBGreisman
Member

rs.decompose -m/--method={SVD,PCA} -i input.mtz column_key makes some sense to me, where it takes multiple -i entries and decomposes the "inner" join of common reflections among the given files.
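As a rough sketch of how such an interface could be parsed, the example below uses only the standard library `argparse`. Nothing here is an existing rs-booster command: the long option `--input`, the `SVD` default, and the helper names `build_parser` and `common_reflections` are hypothetical, and plain dicts keyed by Miller index stand in for loaded MTZ columns.

```python
import argparse


def build_parser():
    """Hypothetical parser for the proposed rs.decompose interface."""
    parser = argparse.ArgumentParser(prog="rs.decompose")
    # Each -i takes two values (MTZ file, column key) and may be repeated.
    parser.add_argument(
        "-i", "--input", nargs=2, action="append", required=True,
        metavar=("MTZ", "COLUMN"), help="input MTZ file and column key",
    )
    parser.add_argument(
        "-m", "--method", choices=["SVD", "PCA"], default="SVD",
        help="decomposition method",
    )
    return parser


def common_reflections(datasets):
    """Inner join: Miller indices that are present in every dataset."""
    shared = set(datasets[0])
    for data in datasets[1:]:
        shared &= set(data)
    return sorted(shared)


args = build_parser().parse_args(
    ["-m", "PCA", "-i", "a.mtz", "F", "-i", "b.mtz", "F"]
)
print(args.method, args.input)

# Toy stand-ins for two loaded columns, keyed by (h, k, l).
a = {(1, 0, 0): 1.0, (0, 1, 0): 2.0}
b = {(0, 1, 0): 3.0, (0, 0, 1): 4.0}
print(common_reflections([a, b]))
```

Using `choices` for the method keeps the flag easy to extend if other decompositions are added later.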

I'm open to other names as well... I can't decide if I like rs.decompose or not.

@alisiafadini
Author

What would you have it output? Just everything from the PCA/SVD function?
Maybe there could be a --reconstruct option that writes out denoised MTZs? rs.decompose is pretty intuitive – I'll have a think whether there is a better alternative

@JBGreisman
Member

I was thinking of using --n_components=5 or something like that to specify how many modes to output. Each mode could be written out as a column of an MTZ, and the explained variance (or singular values) could be printed.

I agree that --reconstruct=0,1,2,3 could be a good interface for specifying how to output the denoised MTZ that uses the first 4 components.

When I do this analysis I often first subtract the mean from each stack of reflections. Without doing this, the first mode in SVD will always be the mean. Do you think the subtraction of the mean is a fine default behavior? I usually think of this sort of analysis as being to find differences among a set of structure factors.
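A minimal sketch of the mean subtraction and of a reconstruction from selected modes is shown below. It assumes the mean is taken per reflection across datasets and added back after reconstruction. The function names `decompose` and `reconstruct` are hypothetical, and the data are synthetic.

```python
import numpy as np


def decompose(stack, subtract_mean=True):
    """SVD of an (n_reflections x m_datasets) stack.

    If subtract_mean is True, the per-reflection mean across datasets is
    removed first, so the modes describe differences among the datasets.
    """
    if subtract_mean:
        mean = stack.mean(axis=1, keepdims=True)
    else:
        mean = np.zeros((stack.shape[0], 1))
    U, s, Vt = np.linalg.svd(stack - mean, full_matrices=False)
    return mean, U, s, Vt


def reconstruct(mean, U, s, Vt, components):
    """Rebuild the stack from the selected modes and add the mean back."""
    idx = list(components)
    return mean + (U[:, idx] * s[idx]) @ Vt[idx, :]


rng = np.random.default_rng(2)
stack = rng.normal(size=(100, 8)) + 10.0

mean, U, s, Vt = decompose(stack)

# Parse a "--reconstruct=0,1,2,3" style value into mode indices.
components = [int(c) for c in "0,1,2,3".split(",")]
denoised = reconstruct(mean, U, s, Vt, components)

# Using every mode recovers the input up to floating-point error.
full = reconstruct(mean, U, s, Vt, range(len(s)))
print(denoised.shape, np.allclose(full, stack))
```

Returning the mean separately keeps it out of the modes while still allowing the reconstructed columns to be written back on the original scale.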

@alisiafadini
Author

I think it makes sense to subtract the mean. I don't usually do it myself (I mostly just expect the first mode to be the mean and look at the others), but I don't think any information would be lost by what you propose @JBGreisman. Maybe instead of --reconstruct=0,1,2,3 there could be --reconstruct=0.05, where the specified value is the explained-variance threshold at which you choose to cut? I was thinking it could be more general than specifying the number of modes. Thoughts?
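One possible reading of the threshold idea is to keep every mode whose share of the total variance is at least the given value. The sketch below follows that interpretation, which is an assumption, as is the function name `select_by_explained_variance`. The explained-variance ratio is computed from the squared singular values, which is meaningful as a variance when the stack was mean-centered before the SVD.

```python
import numpy as np


def select_by_explained_variance(s, threshold):
    """Indices of modes whose explained-variance ratio is >= threshold."""
    s = np.asarray(s, dtype=float)
    ratio = s**2 / np.sum(s**2)
    return np.flatnonzero(ratio >= threshold)


# Toy singular values in descending order.
singular_values = [10.0, 3.0, 1.0, 0.5]
kept = select_by_explained_variance(singular_values, 0.05)
print(kept)
```

For these toy values the ratios are roughly 0.907, 0.082, 0.009, and 0.002, so a 0.05 threshold keeps the first two modes. A cumulative threshold (keep modes until a given fraction of the variance is reached) would be an equally plausible alternative.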
