BUG: numerical inconsistency in calculating rolling kurtosis #58711

Open
3 tasks done
HaloCollider opened this issue May 14, 2024 · 4 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@HaloCollider

HaloCollider commented May 14, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

series_1 = pd.Series([-1] + ([1] * 19))
print(series_1.kurt())
print(series_1.rolling(20).kurt().max())

series_2 = pd.Series(([-1] * 7) + ([1] * 19))
print(series_2.rolling(20).kurt().max())

series_3 = pd.Series(([-1] * 6) + ([1] * 19))
print(series_3.rolling(20).kurt().max())

Issue Description

I ran into a problem when calculating rolling kurtosis for a specific kind of data.

For series_1 = pd.Series([-1] + ([1] * 19)), I checked the source code and expected its kurtosis to be 20.00000000000001 because of binary rounding error. This holds for series_1.kurt(), but the rolling version behaves differently and returns exactly 20.0.

The numerical inconsistency also shows up with another series, series_2 = pd.Series(([-1] * 7) + ([1] * 19)). This time the max rolling kurtosis is 20.00000000000001, which is not equal to the max rolling kurtosis of series_1. series_3, however, gives 20.0.

You can create similar series to see the different behaviors. What is the rationale for this? Why does pandas sometimes give 20.0?
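
For reference, the exact value for a 20-element window with a single outlier is 20; only the last digits vary with the order of floating-point operations. Below is a minimal pure-Python sketch of a two-pass, bias-corrected sample excess kurtosis. It assumes the usual Fisher definition and is not pandas' actual implementation; the function name `sample_excess_kurtosis` is hypothetical.

```python
def sample_excess_kurtosis(values):
    """Two-pass, bias-corrected sample excess kurtosis (illustrative sketch)."""
    n = len(values)
    mean = sum(values) / n
    # Sums of squared and fourth-power deviations from the mean.
    m2 = sum((v - mean) ** 2 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    adj = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    numer = n * (n + 1) * (n - 1) * m4
    denom = (n - 2) * (n - 3) * m2 ** 2
    return numer / denom - adj


window = [-1] + [1] * 19
# repr() shows every digit, so any last-place rounding error is visible.
print(repr(sample_excess_kurtosis(window)))
```

Because the deviations -1.9 and 0.1 are not exactly representable in binary, the printed value may differ from 20 in the last digit.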

Expected Behavior

Expected all results to be 20.00000000000001.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : 7.4.0
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.4.0
scipy : 1.11.1
snappy :
sqlalchemy : 1.4.39
tables : 3.8.0
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : 0.19.0
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None

@HaloCollider HaloCollider added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 14, 2024
@auderson
Contributor

auderson commented May 14, 2024

The rolling algos in pandas generally use an online-updating version for performance, so results can be expected to differ slightly due to floating-point artifacts.
If you really want consistent results, you can try rolling.apply(lambda x: x.kurt()), but this is much slower.
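
A minimal sketch of that comparison, using series_1 from the report. The variable names are illustrative; which last digits each path prints may depend on the pandas version and platform, so none are asserted here.

```python
import pandas as pd

series = pd.Series([-1] + [1] * 19)

# Built-in rolling kurtosis (online-updating implementation).
online = series.rolling(20).kurt()

# Recompute each window from scratch with Series.kurt(); slower, but it
# follows the same code path as the non-rolling call.
per_window = series.rolling(20).apply(lambda w: w.kurt(), raw=False)

print(repr(online.iloc[-1]), repr(per_window.iloc[-1]))
```

With raw=False (the default), each window is passed to the function as a Series, which is why w.kurt() is available.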

@HaloCollider
Author

HaloCollider commented May 14, 2024

I understand that online updating can cause numerical instability along a time series. But whether the result is exactly 20.0 or slightly larger than 20.0 appears to be a property of the series as a whole. In other words, you never get both 20.0 and a value larger than 20.0 within a single series.

For example:
pd.Series([1] * 19 + [-1] * 7 + [1] * 1).rolling(20).kurt().max() gives 20.00000000000001,
while
pd.Series([1] * 19 + [-1] * 7 + [1] * 2).rolling(20).kurt().max() gives 20.0.
The only difference is one additional 1 at the tail, which should not affect the max kurtosis from the viewpoint of online updating.

(That's also why I'm calling it inconsistency rather than instability.)

@auderson
Contributor

Looks like it's due to a demean operation prior to calculation:

for i in range(0, V):
    val = values_copy[i]
    if val == val:
        nobs_mean += 1
        sum_val += val
mean_val = sum_val / nobs_mean
# Other cases would lead to imprecision for smallest values
if min_val - mean_val > -1e4:
    mean_val = round(mean_val)
    for i in range(0, V):
        values_copy[i] = values_copy[i] - mean_val
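
To see how this step treats the series from this issue, here is a minimal pure-Python sketch that mirrors the loop above. The helper names `c_round` and `demean_like_snippet` are hypothetical. The sketch assumes that `round` in the Cython snippet follows C semantics (halves round away from zero), unlike Python's built-in `round`, which rounds halves to even, and that the mean is taken over the whole input rather than a single window.

```python
import math


def c_round(x):
    """Round half away from zero, as C's round() does (assumed here)."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def demean_like_snippet(values):
    """Return (shift, shifted_values), mirroring the quoted demean loop."""
    valid = [v for v in values if v == v]  # v == v is False only for NaN
    mean_val = sum(valid) / len(valid)
    min_val = min(valid)
    shift = 0.0
    if min_val - mean_val > -1e4:
        shift = c_round(mean_val)
    return shift, [v - shift for v in values]


series_1 = [-1] + [1] * 19       # mean 0.9
series_2 = [-1] * 7 + [1] * 19   # mean 12/26
series_3 = [-1] * 6 + [1] * 19   # mean 13/25 = 0.52

for name, s in [("series_1", series_1), ("series_2", series_2), ("series_3", series_3)]:
    shift, _ = demean_like_snippet(s)
    print(name, "shift =", shift)
```

Under these assumptions, series_1 and series_3 are shifted by 1 (so the values become -2 and 0), while series_2 is left unshifted. That matches the pattern reported above: 20.0 for series_1 and series_3, and 20.00000000000001 for series_2.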

@HaloCollider
Author

Thanks a lot. This solves my issue. I had checked the source earlier but missed the demean operation, so my own version produced consistent results, which is what confused me.

I also found the exact thresholds: the behavior changes when the proportion of 1s in a series is 0.25 or 0.75, i.e., when the mean is -0.5 or 0.5. Distributions whose proportion of 1s lies outside the range 0.25 to 0.75 give 20.0.
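
These thresholds follow from the rounding in the demean step. A short derivation, assuming the series contains only 1 and -1, that $p$ (a symbol introduced here) is the fraction of 1s, and that `round` rounds halves away from zero as in C:

```latex
\mu = p \cdot 1 + (1 - p) \cdot (-1) = 2p - 1,
\qquad
\operatorname{round}(\mu) \neq 0
\iff |2p - 1| \ge \tfrac{1}{2}
\iff p \le \tfrac{1}{4} \ \text{or}\ p \ge \tfrac{3}{4}.
```

The boundary case is consistent with the 28-element example earlier in the thread: 21 of its 28 values are 1, so $p = 0.75$ and the mean is 0.5, and the reported result is 20.0.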
