
Improving the lightkurve binning function by replacing the aggregate_downsample function #1328

Open · wants to merge 9 commits into main
Conversation

tylerapritchard
Collaborator

This is a work-in-progress PR exploring improvements to binning functionality, accuracy, and speed. It replaces the astropy aggregate_downsample call with a custom version that allows per-column aggregate functions, and adds further optimization via reduceat.
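The core idea, per-column aggregate functions over time bins, can be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation; `bin_columns` and its arguments are hypothetical names.

```python
import numpy as np

def bin_columns(time, columns, n_bins, aggregators):
    """Bin each column with its own aggregate function (hypothetical sketch).

    time        : sorted array of time stamps
    columns     : dict mapping column name -> data array
    n_bins      : number of equal-width time bins
    aggregators : dict mapping column name -> callable(values) -> scalar
    """
    edges = np.linspace(time.min(), time.max(), n_bins + 1)
    # index of the first sample falling in each bin
    starts = np.searchsorted(time, edges[:-1])
    ends = np.append(starts[1:], len(time))
    binned = {}
    for name, values in columns.items():
        agg = aggregators[name]
        binned[name] = np.array([
            agg(values[i:j]) if j > i else np.nan  # empty bins become NaN
            for i, j in zip(starts, ends)
        ])
    return binned
```

The inner loop is where a `reduceat`-style vectorized reduction (as discussed below in this thread) replaces the per-bin Python call for common aggregators like sum and mean.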

@tylerapritchard tylerapritchard added 🗑️ binning-problem This is specifically an issue about lightkurve's `bin` function 💪 WIP This is a work in progress! labels May 2, 2023
@mirca
Member

mirca commented May 2, 2023

I just wanted to point out that there are some issues with the computation of bins from the method 'knuth' in

if (bins not in ('blocks', 'knuth', 'scott', 'freedman') and

See astropy/astropy#8326 (comment) and astropy/astropy#11357. This might be something we would want to consider changing within a newer implementation of bin for lightkurve.

@tylerapritchard
Collaborator Author

This should now be functional and working to zeroth order; it essentially implements the defaults of the current version.

Timing tests suggest a ~2.5x speedup:

  • Initial test: total time 13.8932 s
  • Current test: total time 5.51083 s

@dfm
Collaborator

dfm commented May 2, 2023

See this repo for some faster accumulation functions!

@dfm
Collaborator

dfm commented May 2, 2023

Using that C library linked above, I get another order of magnitude speed up on my system! I'll look into contributing that upstream to astropy.
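For context on what such accumulation functions replace: NumPy's built-in `np.add.reduceat` already performs grouped reductions in a single vectorized call, and the compiled library mentioned above provides faster equivalents. A minimal illustration of the reduceat pattern (not dfm's library):

```python
import numpy as np

# Grouped sums over contiguous bins defined by start indices.
flux = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
starts = np.array([0, 2, 5])                     # start index of each bin
sums = np.add.reduceat(flux, starts)             # sum within each bin
counts = np.diff(np.append(starts, flux.size))   # samples per bin
means = sums / counts                            # per-bin means
```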

@dfm
Collaborator

dfm commented May 2, 2023

Here are the benchmarks:

  • on main: 5.84 s
  • on tylerapritchard/replace_aggregate_downsample: 2.82 s
  • incorporating the (highly incomplete) C library above: 164 ms

@tylerapritchard
Collaborator Author

tylerapritchard commented May 2, 2023

Made further tweaks with @dfm! Current changes from the previous default behavior:

  • Previously, if the flux_err column of a light curve was full of NaNs, the binned light curve's flux error was replaced with the standard deviation of the binned flux points. This could be misleading: the value was labeled an "error" but could contain astrophysical signal from intrinsic flux variance. To expose this more clearly to users, binning now adds a 'flux_bin_std' column that contains this value, and binning errors that are all NaN will now return NaN. This could break users' current workflows!

  • Previously, lightcurve.bin executed two aggregate_downsample calls because errors must be treated differently than flux values, and astropy.timeseries.aggregate_downsample requires the same binning function to be used across all columns. We have created a minimally modified version of this function that lets the user supply a function per column while otherwise preserving the behavior of aggregate_downsample.

  • We are using this modified function so that aggregate_downsample is called a single time, with different binning logic for different columns. We have supplied potentially sensible defaults per column:
    - flux: mean (SAME as the current default)
    - flux_err: average root sum of squares (SAME as the current default)
    - flux_bin_std: standard deviation of the flux in the bin (NEW column)
    - quality: bitwise OR of the quality flags (NEW; previously the mean of the quality flags)
    - cadenceno: median cadence value (NEW; previously the integer-rounded mean of the cadence values)

  • As @dfm shows above, this is a pretty huge effect! Long term, the hope is to upstream the downsample.py changes to astropy and get them to incorporate the compiled-C repository for reduceat. However, the local version is already a significant increase in speed and accuracy, and is potentially worth supporting until then. Should we include a (potentially optional) dependency on the C code above to get a 10x speedup in the short term as well?
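The per-column defaults described above can be summarized in a small table of aggregators. This is a hedged sketch of the described behavior, not the PR's code; `rss_mean` and `DEFAULT_AGGREGATORS` are hypothetical names, and `flux_bin_std` is derived from the flux values in each bin rather than from a column of its own.

```python
import numpy as np

def rss_mean(err):
    """Error on the binned mean: root sum of squares of the per-point
    errors, divided by the number of valid points. All-NaN input stays NaN."""
    err = np.asarray(err, dtype=float)
    if np.all(np.isnan(err)):
        return np.nan
    return np.sqrt(np.nansum(err ** 2)) / np.sum(~np.isnan(err))

# Hypothetical defaults mirroring the list above.
DEFAULT_AGGREGATORS = {
    "flux": np.nanmean,                        # same as the current default
    "flux_err": rss_mean,                      # same as the current default
    "flux_bin_std": np.nanstd,                 # new: std of flux in the bin
    "quality": np.bitwise_or.reduce,           # new: OR of the quality flags
    "cadenceno": lambda c: int(np.median(c)),  # new: median cadence number
}
```

The bitwise OR for quality flags means a binned point carries every flag raised by any of its constituent cadences, which is arguably more conservative than the old mean-of-flags behavior.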

Some things that might be worth discussion (and what else?):

  • Are these sensible defaults? The cadenceno and quality-flag combining were new decisions and could certainly be different.
  • Is the new treatment of light curves without error bars, and the potential breaking of existing workflows, acceptable?
  • One of our concerns with the current bin function was that it was opaque to users. What else do we need to make clearer, and how? There are the lightkurve.meta keys; what can/should we add there?
  • A more proper, general treatment of errors would involve time-weighted errors. This could be done with minimal overhead but would be a more complicated change; we'll open a separate issue about that.
  • As @mirca mentions above, there are more issues with calculating bins from astropy. How should we handle this? Is there something to upstream?

@christinahedges
Collaborator

christinahedges commented Jun 6, 2023

This looks really great @tylerapritchard! Let's update:

  • Changelog
  • Tests
  • Docstrings update
  • Update API docs for new downsample module?

Seems like you have good ideas on some sane defaults. This needs a minor version update when we merge it.
