Map PCA script or function? #20

alisiafadini · 2022-08-17T10:48:45Z

Maybe there is one of these already, but I haven't found it yet. I think it could be useful to have either a small script or a function that can apply PCA to, e.g., a set of time-ordered MTZs/maps. One could use it both for denoising and for inspecting the output components, depending on the specific case. I already have some bits for this too, and I'm sure I'm not the only one, so potentially we could consolidate?

kmdalton · 2022-08-17T13:48:17Z

@JBGreisman definitely has some code that computes the SVD of a stack of structure factors. AFAIK, it hasn't made it into rs-booster yet. I think that is more or less equivalent?

JBGreisman · 2022-08-17T21:23:37Z

I certainly have some code locally for analyzing (n_reflections x m_dataset) type stacks of structure factors by SVD. I hadn't thought of adding them to rs-booster because it ultimately boils down to a one-liner to np.linalg.svd(), but if there is interest I'd be happy to write that up in a way that takes multiple input MTZs using a -i type interface.

I've certainly used this in the past for "denoising" and feature detection, so could be useful
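
For illustration, the one-liner mentioned above might look like the following minimal sketch. The random array is only a stand-in for a real (n_reflections x m_datasets) stack of structure factors, and the variable names are assumptions made for the example, not code from rs-booster.

```python
import numpy as np

# Hypothetical stand-in for real data: rows are reflections, columns are datasets.
rng = np.random.default_rng(0)
n_reflections, m_datasets = 200, 6
stack = rng.normal(size=(n_reflections, m_datasets))

# Thin SVD, so that stack == U @ diag(S) @ Vt.
# U:  (n_reflections, m_datasets) modes over reflections
# S:  (m_datasets,) singular values, sorted in descending order
# Vt: (m_datasets, m_datasets) weight of each dataset in each mode
U, S, Vt = np.linalg.svd(stack, full_matrices=False)

print(U.shape, S.shape, Vt.shape)
```

With full_matrices=False, the left singular vectors have one column per dataset, which is usually what is wanted when there are many more reflections than datasets.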

kmdalton · 2022-08-18T02:47:35Z

I think there's merit in putting a good SVD implementation out there, if only to help standardize things. I'm sure the Hekstra lab will use it also. Having it available as a command-line script with a parser will certainly help with the learning curve.

I'm not sure exactly what the difference between SVD and PCA is in this context. Is there merit in having both?

alisiafadini · 2022-08-18T14:33:05Z

Yes, my thinking was to help standardize and also just give a framework for new users who may not think to use it/may not be as familiar with these options to start with. I personally tried both SVD and PCA. Ended up using PCA but essentially was not seeing a difference in my examples. Traditionally, Marius Schmidt and others have used SVD but I don't see a theoretical issue with PCA.
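
On the SVD versus PCA question, a small numerical check may help: PCA is equivalent to an SVD of the mean-centered data, so the two differ mainly in whether the mean is removed first. The sketch below uses synthetic data and assumes reflections are treated as samples and datasets as features; the transposed convention works the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic (n_reflections x m_datasets) stack with a non-zero mean.
stack = rng.normal(size=(200, 6)) + 5.0

# Center each column (dataset) on its mean over reflections.
centered = stack - stack.mean(axis=0, keepdims=True)

# PCA via eigendecomposition of the covariance matrix between datasets.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# SVD of the same centered matrix.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
svd_variance = S**2 / (centered.shape[0] - 1)

# The explained variances agree, and the components agree up to sign.
print(np.allclose(eigvals, svd_variance))
print(np.allclose(np.abs(eigvecs.T), np.abs(Vt)))
```

If the stack is not centered before the SVD, the results generally differ from PCA, which is consistent with seeing little difference between the two once the mean mode is set aside.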

JBGreisman · 2022-08-18T18:24:19Z

Sounds good -- I'll plan to add a script in the near future. From there we can decide if there are other useful decomposition methods that are worth supporting.

kmdalton · 2022-08-18T18:26:04Z

Maybe we can have a --method="{SVD,PCA}" flag to toggle. That would make it easy to add other decompositions later.

JBGreisman · 2022-08-18T18:30:23Z

rs.decompose -m/--method={SVD,PCA} -i input.mtz column_key makes some sense to me, where it takes multiple -i entries and decomposes the "inner" join of common reflections among the given files.

I'm open to other names as well... I can't decide if I like rs.decompose or not.
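
A rough sketch of how the interface proposed above could be parsed, using only argparse and pandas. Everything here is hypothetical: the script does not exist yet, the file names are placeholders, and the two small Series stand in for columns that would really be read from MTZ files and indexed by Miller indices.

```python
import argparse
import pandas as pd


def build_parser():
    """Hypothetical parser for the proposed rs.decompose interface."""
    parser = argparse.ArgumentParser(prog="rs.decompose")
    parser.add_argument("-m", "--method", choices=["SVD", "PCA"], default="SVD")
    # Each -i takes an MTZ path and a column key, and may be repeated.
    parser.add_argument(
        "-i", "--input", nargs=2, action="append",
        metavar=("MTZ", "COLUMN"), required=True,
    )
    return parser


args = build_parser().parse_args(
    ["-m", "PCA", "-i", "a.mtz", "F", "-i", "b.mtz", "F"]
)

# Stand-ins for the columns read from each file, indexed by (H, K, L).
names = ["H", "K", "L"]
a = pd.Series(
    [10.0, 20.0, 30.0], name="a.mtz",
    index=pd.MultiIndex.from_tuples([(1, 0, 0), (0, 1, 0), (0, 0, 1)], names=names),
)
b = pd.Series(
    [21.0, 31.0, 41.0], name="b.mtz",
    index=pd.MultiIndex.from_tuples([(0, 1, 0), (0, 0, 1), (1, 1, 1)], names=names),
)

# "Inner" join: keep only the reflections present in every input.
stack = pd.concat([a, b], axis=1, join="inner")
print(args.method, args.input)
print(stack)
```

Using action="append" with nargs=2 collects each file and column key as a pair, so adding further inputs or further choices for the method does not change the parsing logic.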

alisiafadini · 2022-08-19T15:59:40Z

What would you have it output? Just everything from the PCA/SVD function?
Maybe there could be a --reconstruct option that writes out denoised MTZs? rs.decompose is pretty intuitive – I'll have a think about whether there is a better alternative.

JBGreisman · 2022-08-19T18:46:02Z

I was thinking of using --n_components=5 or something like that to specify how many modes to output. Each mode could be written as a column of an MTZ, and the explained variance (or singular values) could be printed.

I agree that --reconstruct=0,1,2,3 could be a good interface for specifying how to output the denoised MTZ; in this example it would use the first four components.

When I do this analysis I often first subtract the mean from each stack of reflections. Without doing this, the first mode in SVD will always be the mean. Do you think the subtraction of the mean is a fine default behavior? I usually think of this sort of analysis as being to find differences among a set of structure factors.
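
The mean subtraction and the --reconstruct=0,1,2,3 idea can be sketched together as follows. This assumes "subtracting the mean" refers to the per-reflection mean across datasets; the synthetic data and the reconstruct helper are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reflections, m_datasets = 300, 8

# Synthetic stack: a large shared mean, a small rank-2 signal, and noise.
mean_sf = rng.uniform(50, 150, size=(n_reflections, 1))
signal = rng.normal(size=(n_reflections, 2)) @ rng.normal(size=(2, m_datasets))
noise = 0.1 * rng.normal(size=(n_reflections, m_datasets))
stack = mean_sf + signal + noise

# Without centering, the first left singular vector tracks the mean.
U_raw, _, _ = np.linalg.svd(stack, full_matrices=False)
row_mean = stack.mean(axis=1, keepdims=True)
cosine = abs(U_raw[:, 0] @ row_mean.ravel()) / np.linalg.norm(row_mean)

# Subtract the per-reflection mean, then decompose the differences.
diffs = stack - row_mean
U, S, Vt = np.linalg.svd(diffs, full_matrices=False)


def reconstruct(U, S, Vt, components, mean):
    """Rebuild the stack from the selected modes and add the mean back."""
    idx = list(components)
    return mean + (U[:, idx] * S[idx]) @ Vt[idx, :]


# Equivalent of the proposed --reconstruct=0,1,2,3.
denoised = reconstruct(U, S, Vt, [0, 1, 2, 3], row_mean)
print(cosine, denoised.shape)
```

Adding the mean back after truncation keeps the reconstructed values on the same scale as the inputs, so they could be written out in place of the original columns.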

alisiafadini · 2022-08-23T14:56:32Z

I think it makes sense to subtract the mean. I don't usually do it myself (I mostly just expect the first mode to be the mean and look at the others), but I don't think any information would be lost by what you propose, @JBGreisman. Maybe instead of --reconstruct=0,1,2,3 there could be --reconstruct=0.05, where the specified value is the explained-variance threshold at which you choose to cut? I was thinking it could be more general than specifying the number of modes. Thoughts?
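
One possible reading of the threshold idea is to keep every mode whose fraction of the total explained variance is at least the given value. The sketch below assumes that interpretation; the select_components helper and the example singular values are hypothetical.

```python
import numpy as np


def select_components(singular_values, threshold):
    """Return indices of modes whose explained-variance fraction >= threshold."""
    variance = np.asarray(singular_values, dtype=float) ** 2
    ratio = variance / variance.sum()
    return np.flatnonzero(ratio >= threshold), ratio


# Made-up singular values, for illustration only.
S = np.array([10.0, 5.0, 1.0, 0.5])
keep, ratio = select_components(S, threshold=0.05)
print(keep)  # [0 1]
```

A cumulative variant, keeping modes until some total fraction of the variance is reached, would be an equally reasonable reading and only changes the comparison inside the helper.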

JBGreisman added the "wishlist" (New requested features) label Aug 17, 2022

JBGreisman self-assigned this Aug 18, 2022
