
Improving the lightkurve binning function by replacing the aggregate_downsample function #1328

Open · wants to merge 9 commits into main
Conversation

tylerapritchard
Collaborator

This is a work-in-progress PR exploring improvements to binning functionality, accuracy, and speed. It replaces the astropy aggregate_downsample call with a custom version that allows per-column aggregate functions, and adds further optimization via reduceat.
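The core idea, per-column aggregate functions over time bins, can be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation; `bin_columns` and its arguments are hypothetical names.

```python
import numpy as np

def bin_columns(time, columns, n_bins, aggregators):
    """Bin each column with its own aggregate function (hypothetical sketch).

    time        : sorted array of time stamps
    columns     : dict mapping column name -> data array
    n_bins      : number of equal-width time bins
    aggregators : dict mapping column name -> callable(values) -> scalar
    """
    edges = np.linspace(time.min(), time.max(), n_bins + 1)
    # index of the first sample falling in each bin
    starts = np.searchsorted(time, edges[:-1])
    ends = np.append(starts[1:], len(time))
    binned = {}
    for name, values in columns.items():
        agg = aggregators[name]
        binned[name] = np.array([
            agg(values[i:j]) if j > i else np.nan  # empty bins become NaN
            for i, j in zip(starts, ends)
        ])
    return binned
```

The inner loop is where a `reduceat`-style vectorized reduction (as discussed below in this thread) replaces the per-bin Python call for common aggregators like sum and mean.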

@tylerapritchard tylerapritchard added 🗑️ binning-problem This is specifically an issue about lightkurve's `bin` function 💪 WIP This is a work in progress! labels May 2, 2023
@mirca
Member

mirca commented May 2, 2023

I just wanted to point out that there are some issues with the computation of bins from the method 'knuth' in

if (bins not in ('blocks', 'knuth', 'scott', 'freedman') and

See astropy/astropy#8326 (comment) and astropy/astropy#11357. This might be something we would want to consider changing within a newer implementation of bin for lightkurve.

@tylerapritchard
Collaborator Author

This should now be functional and working to zeroth order; it essentially implements the defaults of the current version.

Timing tests suggest a ~2.5x speedup:

  • Initial test: total time 13.8932 s
  • Current test: total time 5.51083 s

@dfm
Collaborator

dfm commented May 2, 2023

See this repo for some faster accumulation functions!

@dfm
Collaborator

dfm commented May 2, 2023

Using that C library linked above, I get another order of magnitude speed up on my system! I'll look into contributing that upstream to astropy.
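For context on what such accumulation functions replace: NumPy's built-in `np.add.reduceat` already performs grouped reductions in a single vectorized call, and the compiled library mentioned above provides faster equivalents. A minimal illustration of the reduceat pattern (not dfm's library):

```python
import numpy as np

# Grouped sums over contiguous bins defined by start indices.
flux = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
starts = np.array([0, 2, 5])                     # start index of each bin
sums = np.add.reduceat(flux, starts)             # sum within each bin
counts = np.diff(np.append(starts, flux.size))   # samples per bin
means = sums / counts                            # per-bin means
```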

@dfm
Collaborator

dfm commented May 2, 2023

Here are the benchmarks:

  • on main: 5.84 s
  • on tylerapritchard/replace_aggregate_downsample: 2.82 s
  • incorporating the (highly incomplete) C library above: 164 ms

@tylerapritchard
Collaborator Author

tylerapritchard commented May 2, 2023

Made further tweaks with @dfm! Current changes from the previous default behavior:

  • Previously, if the flux_err column of a light curve was full of NaNs, the binned light curve's flux error was replaced with the standard deviation of the binned flux points. This could be misleading: the value was labeled an "error" but could contain astrophysical signal from intrinsic flux variance. To expose this more clearly to users, binning now adds a 'flux_bin_std' column that contains this value, and binning errors that are all NaN will now return NaN. This could break users' current workflows!

  • Previously, lightcurve.bin executed two aggregate_downsample calls because errors must be treated differently than flux values, and astropy.timeseries.aggregate_downsample requires the same binning function to be used across all columns. We have created a minimally modified version of this function that lets the user supply a function per column while otherwise preserving the behavior of aggregate_downsample.

  • We are using this modified function so that aggregate_downsample is called a single time, with different binning logic for different columns. We have supplied potentially sensible defaults per column:
    - flux: mean (SAME as the current default)
    - flux_err: average root sum of squares (SAME as the current default)
    - flux_bin_std: standard deviation of the flux in the bin (NEW column)
    - quality: bitwise OR of the quality flags (NEW; previously the mean of the quality flags)
    - cadenceno: median cadence value (NEW; previously the integer-rounded mean of the cadence values)

  • As @dfm shows above, this is a pretty huge effect! Long term, the hope is to upstream the downsample.py changes to astropy and get them to incorporate the compiled-C repository for reduceat. However, the local version is already a significant increase in speed and accuracy, and is potentially worth supporting until then. Should we include a (potentially optional) dependency on the C code above to get a 10x speedup in the short term as well?
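The per-column defaults described above can be summarized in a small table of aggregators. This is a hedged sketch of the described behavior, not the PR's code; `rss_mean` and `DEFAULT_AGGREGATORS` are hypothetical names, and `flux_bin_std` is derived from the flux values in each bin rather than from a column of its own.

```python
import numpy as np

def rss_mean(err):
    """Error on the binned mean: root sum of squares of the per-point
    errors, divided by the number of valid points. All-NaN input stays NaN."""
    err = np.asarray(err, dtype=float)
    if np.all(np.isnan(err)):
        return np.nan
    return np.sqrt(np.nansum(err ** 2)) / np.sum(~np.isnan(err))

# Hypothetical defaults mirroring the list above.
DEFAULT_AGGREGATORS = {
    "flux": np.nanmean,                        # same as the current default
    "flux_err": rss_mean,                      # same as the current default
    "flux_bin_std": np.nanstd,                 # new: std of flux in the bin
    "quality": np.bitwise_or.reduce,           # new: OR of the quality flags
    "cadenceno": lambda c: int(np.median(c)),  # new: median cadence number
}
```

The bitwise OR for quality flags means a binned point carries every flag raised by any of its constituent cadences, which is arguably more conservative than the old mean-of-flags behavior.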

Some things that might be worth discussion (and what else?):

  • Are these sensible defaults? The cadenceno and quality-flag combining were new decisions and could certainly be different.
  • Is the new treatment of light curves without error bars, and the potential breaking of existing workflows, acceptable?
  • One of our concerns with the current bin function was that it was opaque to users. What else do we need to make clearer, and how? There are the lightkurve.meta keys; what can/should we add there?
  • A more proper, general treatment of errors would involve time-weighted errors. This could be done with minimal overhead but would be a more complicated change; we'll open a separate issue about that.
  • As @mirca mentions above, there are more issues with calculating bins from astropy. How should we handle this? Is there something to upstream?

@christinahedges
Collaborator

christinahedges commented Jun 6, 2023

This looks really great @tylerapritchard! Let's update:

  • Changelog
  • Tests
  • Docstrings update
  • Update API docs for new downsample module?

Seems like you have good ideas on some sane defaults. This needs a minor version update when we merge it.
