Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FES incorrectly assigns samples larger than bin bounds, also does not handle empty bins correctly #511

Open
Lnaden opened this issue Jun 27, 2023 · 1 comment

Comments

@Lnaden
Copy link
Contributor

Lnaden commented Jun 27, 2023

This comment is for @mrshirts and @jchodera with relation to the implementation and theory in fes.py for samples

I managed to accidentally recreate the bug in #502 and have attached a minimal script to throw the error. However, in investigating what happened, I found 2 different bugs in the logic I wanted to discuss what the "correct" way to handle is.

Bug 1: Empty Bins

#502 seems to throw an error when a bin in the bin does not have a sample. In theory, the logic here is supposed to ensure every bin has a sample but, because of the logic slightly above it to assign bin_labels it will only ever check bins with samples, so the sanity logic can never trigger, nor should it for bootstrap samples. This will not manifest until the user gets the histogram here because its iterating over the N-D grid of bins generated here

I'm not sure what the correct approach is here, as suggested in #502, we could just flag any bin without samples at the line and set a np.nan as we do right below checking the bin label, but I don't know if thats technically or scientifically correct here since we can estimate bins without samples using MBAR (although in an FES histogram bin, it might make sense to nan the bin because it doesn't have samples.

Thoughts?

Bug 2: Samples > Bin upper bound are assigned to the maximum bin

The logic used in calls to np.digitize are assigning samples greater than the upper bound of a dimension to the maximum bin instead of omitting it. This leads to a bias on the bins closest to the maximum in any dimension. This does NOT throw an error through, even though its scientifically wrong.

The main function in question is np.digitize which does bin assignment for a sequence of samples and bin edges.

The two problematic lines in question are this line

bin_n[:, d] = np.digitize(x_n[:, d], bins[d]) - 1

and this line

fes_ref_grid[d] = np.digitize(fes_reference[d], bins[d]) - 1

Where a value is measured, checked against bins, and then its digitize index is shifted down one to say that "sample X is assigned to bin index Y." Which is fine for any sample less than the max bin edge for given dimension d. The logic of digitize is X < bins.min() --> 0 and X >= bins.max() --> len(bins). This means out of bounds on the low end is 0 which downshifted is index -1 that is checked against in various places in the code. However out of bounds on the high end is downshifted to len(bens) -1 which is a valid index for bins and does NOT throw an error, assigning out of bounds samples to a bin.

My proposed fix for this with minimal effort is to run the following logic:

bin_assignment =  np.digitize(x_n[:, d], bins[d])
bin_assignments[np.where(bin_assignments == len(bins[d]))] = 0  # Shift all out-of-upper-bounds bins to invalid
bin_n[:, d] = bin_assignments - 1  # reassign to bin index

and

fes_ref_bin =  np.digitize(fes_reference[d], bins[d]) 
fes_ref_bin[np.where(fes_ref_bin == len(bins[d]))] = 0  # Shift all out-of-upper-bounds bins to invalid
fes_ref_grid[d] = fes_ref_bin - 1  # reassign to bin index

The other two digitize calls in the code are captured here and also suffer this problem because they are allowing assignment of things out of bounds to the bin. They can be fixed in similar ways.

Questions: Am I correct in identifying this is a bug? Is the proposed solution a good fix for it?

Also cc'ing @mikemhenry since he wanted to be tagged in this.

@Lnaden
Copy link
Contributor Author

Lnaden commented Jun 27, 2023

Script to reproduce conditions of both bugs is here:

import numpy as np
from pymbar import FES

u_kn = np.zeros((3, 3))
N_k = np.ones(3)
bin_edge = np.linspace(1, 3, 10)
x_n = np.linspace(0, 3, 3) + 0.1  # The top and bottom sample should not be assigned bins, the upper one is.
u_n = np.ones(3)

histogram_parameters = dict()
histogram_parameters["bin_edges"] = bin_edge

fes = FES(u_kn, N_k)
fes.generate_fes(u_n, x_n, fes_type="histogram", histogram_parameters=histogram_parameters)
bin_centers = bin_edge + (bin_edge[1] - bin_edge[0])/2

results = fes.get_fes(
    bin_centers,
    reference_point="from-specified",
    fes_reference=70,  # This should throw an error, it doesn't.
    uncertainty_method="analytical",
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant