
Add alternative argument for determining one-tailed or two-tailed permutation tests #199

Open
JosiahParry opened this issue Feb 13, 2022 · 20 comments

@JosiahParry

JosiahParry commented Feb 13, 2022

It appears that when a p-value exceeds 0.5, a different method of p-value calculation is used rather than the simulation p-value formula (M + 1) / (R + 1).

The example below manually calculates the p-value for each observation.

Can the p-value calculation method be documented or corrected?

import libpysal
import numpy as np
import geopandas as gpd
from esda.join_counts_local import Join_Counts_Local

fp = libpysal.examples.root + "/guerry/" + "Guerry.shp"

guerry_ds = gpd.read_file(fp)
guerry_ds['SELECTED'] = 0
guerry_ds.loc[(guerry_ds['Donatns'] > 10997), 'SELECTED'] = 1

w = libpysal.weights.Queen.from_dataframe(guerry_ds)

LJC_uni = Join_Counts_Local(connectivity=w, keep_simulations=True).fit(guerry_ds['SELECTED'])

# Calculate p-values manually, for the observations that have a reported p-value
index = np.where(~np.isnan(LJC_uni.p_sim))[0].tolist()

sims = LJC_uni.rjoins          # I think these are the simulations
sims_p = sims[index]
obs_p = LJC_uni.LJC[index]     # observed local join counts
ps = LJC_uni.p_sim[index]      # p-values reported by esda
nsim = LJC_uni.permutations
sims_p[0]

def p_sim_calc(obs, sims, nsim=999):
    # simulation p-value: (M + 1) / (R + 1)
    return (sum(obs <= sims) + 1) / (nsim + 1)

p_sim_calc(obs_p[0], sims_p[0])

p = list()
for i in range(len(ps)):
    x = p_sim_calc(obs_p[i], sims_p[i])
    p.append(x)

print(p, "\n", ps)

Note that when the manually calculated p-value exceeds 0.5, the reported simulated p-value is less than 0.5.

#> [0.409, 0.026, 0.026, 0.662, 0.652, 0.286, 0.662, 0.513,  0.03, 0.131, 0.051] 
#> [0.409, 0.026, 0.026, 0.339, 0.349, 0.286, 0.339, 0.488, 0.03, 0.131, 0.051]
@JosiahParry
Author

After giving this a more thorough look, it appears that the local join count is choosing the lesser of the two p-values from the "greater" or "lesser" alternatives. Is there a way to select either the lesser or greater alternative when calculating the p-values? Choosing the lesser of the two feels somewhat off to me.

@ljwolf
Member

ljwolf commented Feb 14, 2022

Thanks for the report! This is the intended behavior of our two-sided permutation test. The LJC does this, as do all of our simulation p-values (any .p_sim) across esda. Your request to

select either lesser or greater alternatives

amounts to a one-tailed permutation test. We haven't implemented a one-tailed version, but we're happy to implement this with an alternative keyword argument.

To try to explain why we're doing this, maybe the following will be helpful.

For most autocorrelation statistics (especially local stats) we don't have a prior hypothesis about the direction of the test. So, it's reasonable to default to a "two-sided test:" we want to know how extreme a given test statistic is under a specific null hypothesis.

Now, we need a way to assess extremity. We don't have anything "analytical" (like, equations & the like), so we synthesize replicates. We simulate from the "conditional randomization null hypothesis" using permutation to build up a set of "fake" local test statistics. We just wrote Sauer et al. (2021) (preprint) with plenty of discussion on the topic, but the original outlines come from Anselin (1995).

Computationally, this means we have a target set of replicates and a test statistic. We want to figure out how "extreme" that test statistic is, given the distribution of replicates. So, we want the fraction of values that are on the "other side" of the test statistic from the rest of the distribution... but how do we define "other side?" There are three ways.

  1. Make assumptions and derive an expected value. This lets us know what "side" of the distribution the test statistic is on. Indeed, this generally even lets us test against a standard normal distribution, so we don't even need the replicates!
  2. "Fold" the distribution; center the distribution of replicates, take the absolute value of the test statistic and the replicates, then count the fraction of larger replicates as the p value. This is justifiable; it gives you p-values between 0 and 1. But, this results in extreme values on the left side of the distribution "counting" as extreme values on the right side (and vice versa), and tests a slightly different null (about |s|, not s). Further, this is sensitive to the folding point (mean vs. median).
  3. Our nonparametric counting strategy. If the test statistic is on the "left" side of the distribution, there are fewer observations to its left. So, count the fraction of replicates to the left as the p-value. Alternatively, if the test statistic is on the "right" side of the distribution, count the fraction to the right. This is the flipping you see: by picking the smaller p-value, we're counting the relevant fraction of observations that fall outside of the test statistic.

Where we can, we implement 1 and 3. In Moran_Local, this is p_z_norm and p_sim. We don't implement the folded variant.

from esda.moran import Moran_Local
import numpy
numpy.random.seed(112112)

# guerry_ds and w are as constructed in the example above
lmo = Moran_Local(guerry_ds.Donatns, w)

# here, the local replicate matrix is the .rlisas attribute
folded_replicates = numpy.abs(lmo.rlisas - numpy.median(lmo.rlisas, axis=1, keepdims=True))
folded_p_sim = (folded_replicates >= numpy.abs(lmo.Is[:,None])).mean(axis=1)

print((folded_p_sim < .01).sum()) # 0 are clusters using folding
print((lmo.p_sim < .01).sum()) # 9 are clusters using "outside" counting
print((lmo.p_z_sim*2 < .01).sum()) # 2 are clusters assuming a normal approximation
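For reference, here is a hand-rolled version of the "outside" counting in option 3, using the lmo object from the snippet above; up to how ties are counted, this should reproduce lmo.p_sim.

# count replicates in each tail relative to the observed local statistic
n_right = (lmo.rlisas >= lmo.Is[:, None]).sum(axis=1)
n_left = (lmo.rlisas <= lmo.Is[:, None]).sum(axis=1)
# flip into whichever tail the statistic falls, then apply (M + 1) / (R + 1)
outside_p = (numpy.minimum(n_left, n_right) + 1) / (lmo.permutations + 1)
print(numpy.abs(outside_p - lmo.p_sim).max())  # should be at (or very near) zero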

The nonparametric counting strategy was implemented by @sjsrey back in the early days of the library. I think this is a faithful interpretation of why he did it this way, but I don't think we've actually spoken about it explicitly 😮

Note: edits to code example for repro & correctness on the folding

@JosiahParry
Author

Thank you so much for the added color! Much appreciated.
Changing the issue title. Close at your discretion :)

@JosiahParry JosiahParry changed the title Local Join Count p-values possibly incorrect Add alternative argument for determining one-tailed or two-tailed permutation tests Feb 14, 2022
@jGaboardi
Member

We'll go ahead and close this then, and we can reopen if needed in the future.

@ljwolf
Member

ljwolf commented May 11, 2023

I've had the time to investigate the differences here after ruminating on some email exchanges with @rsbivand, and I think that we may want to report 2*p_sim, not p_sim, if we keep this approach.

For justification, we can think about the folding strategy (p_f) and the "outside" counting (p_o) that is currently implemented. The simplest "folding" strategy takes the absolute value of the random LISAs and counts how many are at least as large as the absolute value of the test statistic.

def folded_p(replicates, observed):
    # count replicates whose absolute value is at least as large as |observed|
    n_extreme = (numpy.abs(replicates) >= numpy.abs(observed.reshape(-1, 1))).sum(axis=1) + 1
    return n_extreme / (replicates.shape[1] + 1)
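For instance, with a Moran_Local result lmo as in the snippets in this thread, the folded p-values for every site would be:

p_f = folded_p(lmo.rlisas, lmo.Is)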

Using this very simple method, we get a correlation between the p-values of 99%, but the "outside" p_o values are half as big:

[Figure: pvals_and_agreement_scatter]

I think this may lead to an overstatement of the number of "significant" local statistics at any given significance threshold. This overstatement seems to be proportional to the number of significant observations, but this largely goes away when we use 2*p_o:

[Figure: pval_ecdf_and_rands]

I think the agreement between p_f and 2*p_o (yellow line in the right facet) is about what we would expect between two different simulations using the same p-value method, but I am testing this currently. Regardless, you can see in the left pane that folding and outside counting are in much stronger agreement when we double the p-values from outside counting, 2*p_o.

from libpysal import weights, examples
from esda.moran import Moran_Local
import geopandas, numpy

data = geopandas.read_file(examples.get_path("election.shp"))

# build a Rook contiguity graph and attach islands via their nearest neighbour
rook = weights.Rook.from_dataframe(data)
knn1 = weights.KNN.from_dataframe(data, k=1)
w = weights.attach_islands(rook, knn1)

# swing toward/away from the Democratic candidate between 2012 and 2016
swing = data.pct_dem_16 - data.pct_dem_12
lmo = Moran_Local(swing.values, w, permutations=9999)

# folded ("absolute value") p-values, compared with the reported p_sim
p_abs = (numpy.abs(lmo.rlisas) > numpy.abs(lmo.Is[:, None])).sum(axis=1) + 1.0
p_abs /= lmo.permutations + 1
p_sim = lmo.p_sim
numpy.corrcoef(p_abs, p_sim)[0, 1]
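As a rough check on the overstatement claim (using the objects above; the 0.05 threshold is just illustrative):

alpha = 0.05
print("folded:            ", (p_abs < alpha).sum())
print("outside counting:  ", (p_sim < alpha).sum())
print("2 * outside count.:", (2 * p_sim < alpha).sum())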

@JosiahParry
Author

@ljwolf, thank you very much for sharing! If you and Roger do end up doing a thorough write up, please let me know here. I'm following quite intently!

@sjsrey
Member

sjsrey commented May 11, 2023

@JosiahParry the determination of the pseudo p-values is done with the following:

esda/esda/crand.py

Lines 222 to 224 in 7f83108

low_extreme = (permutations - larger) < larger
larger[low_extreme] = permutations - larger[low_extreme]
p_sim = (larger + 1.0) / (permutations + 1.0)

which is for the local statistics. This follows the logic we use for a global statistic (which is arguably easier to grok):

esda/esda/moran.py

Lines 181 to 190 in 7f83108

if permutations:
    sim = [
        self.__calc(np.random.permutation(self.z)) for i in range(permutations)
    ]
    self.sim = sim = np.array(sim)
    above = sim >= self.I
    larger = above.sum()
    if (self.permutations - larger) < larger:
        larger = self.permutations - larger
    self.p_sim = (larger + 1.0) / (permutations + 1.0)

Your issue makes us realize we need to surface this logic in the documentation. Thanks!

@ljwolf
Member

ljwolf commented May 25, 2023

@sjsrey any perspective on the 2*p_sim issue noted above? I think this would imply the outside counting method we currently implement is a one-tailed procedure, but we're picking and choosing which tail we're looking in on the fly.

@sjsrey
Member

sjsrey commented May 26, 2023

@ljwolf the outside counting method we have is a one-tailed procedure that follows from logic in geoda. It does the selection of the tail for the user. I think we can surface this better in the docs.

For the lisas, I'm not clear on the use of the absolute values? That would seem to suggest you want a two-tailed test, not a directional test?

@ljwolf
Member

ljwolf commented May 27, 2023

Right, so this is my confusion:

My understanding of a "classical" one-tailed test is that it requires you to specify a direction a priori (e.g. alternative="greater"). Outside counting specifies the direction of our one-tailed test on the fly, based on the most likely direction for a significant result. This means our testing procedure is equivalent to a two-sided test (counting the replicates that are more extreme than the observed value) run at the 1 - 2*\alpha confidence level (i.e. at significance level 2*\alpha), which is what the simulations above show.

Now, this is normally the case for directed and undirected tests: alternative="greater" will have more power than alternative="two-sided" in the positive direction, but has no power in the negative direction. In our case, the gain in power is not motivated by the prior about the direction of the effect size, though: it comes from running two directed tests at each site and treating the p-value as if it comes from one directed test. Doing two directed tests and reporting the smaller p-value will increase the power of our test at a constant \alpha. Put another way, outside counting is a test for extremal values, since it flips depending on the direction of the replicate distribution. When you test for extremal values, you are doing a two-sided test, whose proper significance level is \alpha/2.
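To make the power point concrete, here is a toy null simulation (purely synthetic replicates, nothing to do with the esda internals): under a symmetric null, rejecting whenever the flipped p-value falls below \alpha rejects at roughly 2*\alpha.

import numpy
rng = numpy.random.default_rng(112112)
alpha, n_tests, n_perm = 0.05, 2000, 999
rejections = 0
for _ in range(n_tests):
    stat = rng.normal()                     # "observed" statistic drawn from the null
    reps = rng.normal(size=n_perm)          # replicates drawn from the same null
    larger = (reps >= stat).sum()
    flipped = min(larger, n_perm - larger)  # the flip into the smaller tail
    if (flipped + 1) / (n_perm + 1) < alpha:
        rejections += 1
print(rejections / n_tests)                 # roughly 2 * alpha, i.e. about 0.10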

Does that make sense? Maybe I misunderstand.

@sjsrey
Member

sjsrey commented May 27, 2023

I see what you are getting at, and I agree with the points you make.

My interpretation is that the logic for the permutation-based inference was born in the "pre-local" era (and before geoda), where the focus was on a single test statistic:

[Quoted excerpt from Hope (1968) on the Monte Carlo significance test procedure] [1]

Things are more complicated for the local tests, since we now have n such tests and need to allow that for some locations the relevant tail may change.

At first glance, I think a fix might be to add a keyword argument to the global test, alternative="greater|lower|two-sided", and carry on as we currently do, but return 2*p_sim when alternative="two-sided".

At the local level, we interpret all tests as one-tailed with the current logic. We could say the default is directional with an option for two-sided and if the user passes two-sided we return 2*p_sim.
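As a concrete sketch of that proposal for the global statistic (the names are illustrative, not the esda API, and tie handling is simplified):

def global_pseudo_p(observed, sim, alternative="two-sided"):
    # sim: 1-d array of simulated statistics under the null; observed: scalar
    permutations = len(sim)
    larger = (sim >= observed).sum()
    if alternative == "greater":
        extreme = larger
    elif alternative == "lower":
        extreme = (sim <= observed).sum()
    else:  # "two-sided": flip into the smaller tail as we do now, then double
        extreme = min(larger, permutations - larger)
        return min(1.0, 2 * (extreme + 1.0) / (permutations + 1.0))
    return (extreme + 1.0) / (permutations + 1.0)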

Does this make sense to you?

[1] Hope, A. C. A. (1968). A Simplified Monte Carlo Significance Test Procedure. Journal of the Royal Statistical Society. Series B (Methodological), 30(3), 582–598.

@ljwolf
Member

ljwolf commented May 27, 2023

It makes sense, but our "flip" is not documented in Hope (1968). With Hope's algorithm, you can definitely get one-tailed p-values above .5 for a one-tailed test. Our p-value can never be greater than .5. In the quote clipped above, I'm pretty sure the parenthetical "greater (less)" statement is with reference to "the form of the alternative hypothesis," not the "corresponding values." Once we look in both directions, we have to adjust for the fact that our reference set is now 2*permutations, not permutations. Hope notes this:

[Screenshot: the relevant passage from Hope (1968)]

Given the simulations above, I don't think we should continue to offer our current procedure. It does the one-tailed procedure described twice (once with alternative="greater" and again with alternative="lesser") and then picks the smaller p value. Presenting the result from two one-tailed tests as if it were a two-tailed test will double your power.

Another point of evidence: if we were to Bonferroni correct for this repeated one-tailed testing procedure (for example), the significance threshold would become alpha/2... meaning that p_sim*2 gives you the "right" result at alpha.

@ljwolf
Member

ljwolf commented May 27, 2023

I should say: I recognize the issue that specifying the direction of local tests one by one would be impossible... but we could provide similar "auto" behavior by picking the direction of the test using EIc? If the test statistic is above its EIc, we run "greater", otherwise we run "lesser". Then, we have four alternative options:

  • "greater" using our current routine with no flip
  • "lesser" using our current routine with no flip
  • "two-sided" as the percent of replications outside of the test statistic's percentile and 1-percentile (sketched just after this list). This should be the default.
  • "directed" as 2* our current routine, or picking the direction for each site depending on its EIc
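A rough sketch of that percentile-based two-sided rule, assuming test_stat is an (n,) array of observed local statistics and reference_distribution an (n, permutations) array of replicates (none of these names are existing esda API):

import numpy

def percentile_two_sided_p(test_stat, reference_distribution):
    n, permutations = reference_distribution.shape
    p_vals = numpy.empty(n)
    for i in range(n):
        reps = reference_distribution[i]
        percentile = (reps <= test_stat[i]).mean()
        mirror_cut = numpy.quantile(reps, 1 - percentile)
        if percentile <= 0.5:
            # statistic sits in the left tail: count its own tail directly,
            # plus the replicates beyond the mirrored (1 - percentile) quantile
            outside = (reps <= test_stat[i]).sum() + (reps >= mirror_cut).sum()
        else:
            # statistic sits in the right tail: mirror image of the above
            outside = (reps >= test_stat[i]).sum() + (reps <= mirror_cut).sum()
        p_vals[i] = min(1.0, (outside + 1) / (permutations + 1))
    return p_vals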

@sjsrey
Member

sjsrey commented May 28, 2023

I agree on the new logic. I'm a bit hesitant to deprecate the current behavior as it is based on what geoda does. Might we consider a policy="hope|geoda" parameter with the default hope that implements the new logic? If the user specifies policy=geoda it returns the current implementation.

@ljwolf
Member

ljwolf commented May 28, 2023

Sounds like a great idea to me!

@rsbivand

Very interesting discussion! I'll align spdep with your choices, probably with "hope" as default, once esda has been updated.

@JosiahParry
Author

Hey folks! This issue has been brought up again with regard to spdep in a private email to me. Has esda modified the approach to address the alternative hypotheses such that the count of extremes is done in one tail in those cases? I think it would be a worthwhile effort for both esda and spdep to complete this. It almost feels as if an incorrect significance is being reported. For the sake of science, I might even be willing to....write Python! If it would be helpful! ;)

@ljwolf
Member

ljwolf commented Feb 14, 2024

It's not implemented yet, but we agreed above to implement the following:

  • greater using our current routine, checking only the right tail with no flip
  • lesser using our current routine, checking only the left tail with no flip
  • two-sided as our current routine, but doubling the p-value. This should be the default.
  • directed as our current routine.

I also want to test whether a percentile rule will work (find the percentile p of the test statistic in the permutation distribution, then calculate the % of permutations outside of p and 1-p).

I think that doing this should be easy, and should be done in a general function like:

def calculate_significance(test_stat, reference_distribution, method='two-sided'):
   ...

which we then just call within every statistic.
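A minimal sketch of what that shared helper could look like, assuming test_stat is an (n,) array of observed local statistics and reference_distribution an (n, permutations) array of replicates; tie handling is simplified relative to esda.crand, and the method names just follow the list above:

import numpy

def calculate_significance(test_stat, reference_distribution, method="two-sided"):
    test_stat = numpy.asarray(test_stat).reshape(-1, 1)
    reference_distribution = numpy.asarray(reference_distribution)
    permutations = reference_distribution.shape[1]
    larger = (reference_distribution >= test_stat).sum(axis=1)   # right-tail counts
    smaller = (reference_distribution <= test_stat).sum(axis=1)  # left-tail counts
    if method == "greater":
        extreme = larger
    elif method == "lesser":
        extreme = smaller
    elif method == "directed":
        # the current routine: count in whichever tail the statistic falls
        extreme = numpy.minimum(larger, smaller)
    elif method == "two-sided":
        # the current routine, doubled to account for looking in both tails
        directed = (numpy.minimum(larger, smaller) + 1.0) / (permutations + 1.0)
        return numpy.minimum(1.0, 2.0 * directed)
    else:
        raise ValueError(f"method must be 'greater', 'lesser', 'two-sided', or 'directed', not {method!r}")
    return (extreme + 1.0) / (permutations + 1.0)

With a fitted Moran_Local like the lmo above, calculate_significance(lmo.Is, lmo.rlisas, method="directed") should then roughly reproduce lmo.p_sim.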

@JosiahParry
Author

Sounds good! Should this issue be reopened or another made with less baggage?

@ljwolf
Member

ljwolf commented Feb 14, 2024

I should say, @JosiahParry, if you have time to start defining the calculate_significance() function in a separate file like significance.py, I can start pushing on that with you later this week 😄 I would really like this done by the next release.

Keep the discussion here, and I will reopen the issue.
