Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ABOD weighting schemes / not matching the definition? #522

Open
kno10 opened this issue Aug 16, 2023 · 1 comment
Open

ABOD weighting schemes / not matching the definition? #522

kno10 opened this issue Aug 16, 2023 · 1 comment
Assignees

Comments

@kno10
Copy link

kno10 commented Aug 16, 2023

The ABOD implementation of PyOD tends to perform very similar to kNN, because the scoring is dominated by the squared (!) distances.

The following lines:

pyod/pyod/models/abod.py

Lines 50 to 53 in 6c77e27

# wcos = (<a_curr, b_curr>/((|a_curr|*|b_curr|)^2)
wcos = np.dot(a_curr, b_curr) / (
np.linalg.norm(a_curr, 2) ** 2) / (
np.linalg.norm(b_curr, 2) ** 2)

while closely reflecting the original paper, may be reflecting a issue and inconsistency of that paper.
$$ABOF=VAR\left(\frac{\langle AB,AC \rangle}{||AB||^2 \cdot ||AC||^2}\right)$$
uses the product of the squared weights for normalization, which puts an extreme high emphasis on distance (hence causing a high similarity to kNN).
But if you then look how the equation is expanded, another factor is added:
$$=\frac{\sum_B\sum_C \left(\frac{1}{||AB||\cdot ||AC||}\cdot\frac{\langle AB, AC \rangle}{||AB||^2 \cdot ||AC||^2}\right)^2}{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}} - \left(\frac{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}\cdot\frac{\langle AB, AC \rangle}{||AB||^2 \cdot ||AC||^2}}{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}}\right)^2$$
(clearly resembling $E[X^2]-E[X]^2$, with each point weighted by $\frac{1}{||AB||\cdot ||AC||}$, which would match the text).
Note that this would simplify to having even the cubic(!) distances in there.

The IMHO correct version would be the following:
$$ABOF=VAR_{weighted}\left(\frac{\langle AB,AC \rangle}{||AB|| \cdot ||AC||}\right)$$ $$=\frac{\sum_B\sum_C \left(\frac{1}{||AB||\cdot ||AC||}\cdot\frac{\langle AB, AC \rangle}{||AB|| \cdot ||AC||}\right)^2}{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}} - \left(\frac{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}\cdot\frac{\langle AB, AC \rangle}{||AB|| \cdot ||AC||}}{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}}\right)^2$$
which then can further be simplified to:
$$=\frac{\sum_B\sum_C \left(\frac{\langle AB, AC \rangle}{||AB||^2 \cdot ||AC||^2}\right)^2}{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}} - \left(\frac{\sum_B\sum_C \frac{\langle AB, AC \rangle}{||AB||^2 \cdot ||AC||^2}}{\sum_B\sum_C \frac{1}{||AB||\cdot ||AC||}}\right)^2$$
We now got the squares in the equation seen before, one coming from the weighting scheme and one from the angle.
Note that this computation is likely to produce numerical instabilities, though, and I would rather use $1/\sqrt{||AB|| \cdot ||AC||}$ for weighting (which uses the geometric mean distance, instead of the product of distances).

Hence,

  1. there may be an issue in the original ABOD paper
  2. the "Var" is not the standard variance call, but a weighted variance, as the expansion of the equation shows. The current PyOD implementation does not implement this:
    return np.var(wcos_list)
  3. it may be good to allow the user to choose among the two "angle" computations (with the extra square, and the standard definition of angle), and three weighting schemes ((1) unweighted variance, as currently implemented in PyOD, (2) variance weighted with 1/(d1 * d2) as in the ABOD paper, and (3) with the geometric mean distance to the neighbors, 1/sqrt(d1 * d2), which makes the weighting weaker)
  4. the code needs to be carefully optimized for numerial issues, in particular as some of these values can become very small fast due to products of two powers of three.

Note that also, the "norm" function in the top snipped involves a square root that is then followed by a square.

Now I argue that the most meaningful variant is using standard angles is very simple. It is the only version that is invariant to scaling the data set with a constant. If we multiply all lengths by 2,
$$\left(\frac{\langle 2AB,2AC \rangle}{||2AB||^2 \cdot ||2AC||^2}\right)=\left(\frac{4\langle AB,AC \rangle}{4||AB||^2 \cdot 4||AC||^2}\right)=\frac{1}{4} \left(\frac{\langle AB,AC \rangle}{||AB||^2 \cdot ||AC||^2}\right)$$ Removing the extra squares - using the standard angle - resolve this and makes the angle invariant to scaling with a constant.

@kno10 kno10 changed the title ABOD weighting schemes ABOD weighting schemes / not matching the definition? Aug 16, 2023
@yzhao062
Copy link
Owner

This is deeply appreciated! Since I am just moving between states, will need to get back on this once settling down :)

@yzhao062 yzhao062 self-assigned this Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants