BUG: numerical inconsistency in calculating rolling kurtosis #58711
Comments
The rolling algos in pandas generally use an online updating version for performance. Results are expected to differ slightly because of floating-point artifacts.
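To make the "online updating" point concrete, here is a minimal pure-Python sketch. It is not the pandas implementation; the helper names and the sample data are made up for illustration. It keeps running power sums that are updated as values enter and leave the window, and compares the result with recomputing every window from scratch.

```python
import math


def kurt_from_sums(n, s1, s2, s3, s4):
    """Sample excess kurtosis from raw power sums (needs n >= 4, nonzero variance)."""
    a = s1 / n                                         # mean
    b = s2 / n - a * a                                 # 2nd central moment
    c = s3 / n - a ** 3 - 3 * a * b                    # 3rd central moment
    d = s4 / n - a ** 4 - 6 * b * a * a - 4 * c * a    # 4th central moment
    k = (n * n - 1.0) * d / (b * b) - 3.0 * (n - 1.0) ** 2
    return k / ((n - 2.0) * (n - 3.0))


def rolling_kurt_online(values, window):
    """Update the power sums incrementally: add the new value, drop the old one."""
    out = []
    s1 = s2 = s3 = s4 = 0.0
    for i, x in enumerate(values):
        s1 += x
        s2 += x ** 2
        s3 += x ** 3
        s4 += x ** 4
        if i >= window:
            old = values[i - window]
            s1 -= old
            s2 -= old ** 2
            s3 -= old ** 3
            s4 -= old ** 4
        if i >= window - 1:
            out.append(kurt_from_sums(float(window), s1, s2, s3, s4))
        else:
            out.append(math.nan)
    return out


def rolling_kurt_recompute(values, window):
    """Rebuild the power sums from scratch for every full window."""
    out = []
    for i in range(len(values)):
        if i < window - 1:
            out.append(math.nan)
            continue
        w = values[i - window + 1 : i + 1]
        sums = [sum(v ** p for v in w) for p in (1, 2, 3, 4)]
        out.append(kurt_from_sums(float(window), *sums))
    return out


# Hypothetical sample data, chosen only so that no window has zero variance.
data = [0.1, 2.5, -1.3, 4.0, 0.7, 3.3, -2.2, 1.9, 0.4, 5.1]
window = 5
online = rolling_kurt_online(data, window)
recomputed = rolling_kurt_recompute(data, window)
for a, b in zip(online, recomputed):
    print(repr(a), repr(b))
```

The two columns agree to many digits but need not match in the last bits, because the incremental sums accumulate rounding from every add and remove step.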
I understand that the online updating method may cause numerical instability in a time-series manner. But For example: (That's also why I'm calling it inconsistency rather than instability.)
Looks like it's due to a demean operation prior to calculation: pandas/pandas/_libs/window/aggregations.pyx Lines 828 to 838 in 283a2dc
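As a rough illustration of why a demean step can matter, the sketch below computes kurtosis from raw power sums with and without subtracting a constant from every value first. It is an assumption-laden toy, not the code in aggregations.pyx: the helper name and the shift of 1.0 are hypothetical. Mathematically the shift changes nothing, but the intermediate floating-point values differ, so the last bits of the result may differ too.

```python
def kurt_from_power_sums(values, shift=0.0):
    """Sample excess kurtosis via raw power sums of (value - shift)."""
    xs = [float(v) - shift for v in values]
    n = float(len(xs))
    a = sum(xs) / n                                      # mean
    b = sum(x ** 2 for x in xs) / n - a * a              # 2nd central moment
    c = sum(x ** 3 for x in xs) / n - a ** 3 - 3 * a * b
    d = sum(x ** 4 for x in xs) / n - a ** 4 - 6 * b * a * a - 4 * c * a
    k = (n * n - 1.0) * d / (b * b) - 3.0 * (n - 1.0) ** 2
    return k / ((n - 2.0) * (n - 3.0))


values = [-1] + [1] * 19           # same data as series_1 in the report

unshifted = kurt_from_power_sums(values)             # no demean
shifted = kurt_from_power_sums(values, shift=1.0)    # hypothetical demean constant

print("unshifted:", repr(unshifted))
print("shifted:  ", repr(shifted))
print("bitwise equal:", unshifted == shifted)
```

If a demean constant depends on the data (for example on the window mean), different series with the same final window can take different floating-point paths, which would be consistent with the behavior reported here.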
Thanks a lot. This solves my issue. Previously I checked the source but missed the demean operation, which made my version produce consistent results that caused confusion. I found the exact thresholds of the proportion of
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
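A sketch along the lines of the description that follows. This is not the reporter's original snippet: the window size of 20 is an assumption, and series_3 is left out because its definition is not given. Exact trailing digits may vary with the pandas version and platform, so the script prints the values rather than asserting them.

```python
import pandas as pd

window = 20  # assumed window size, not stated in the report

series_1 = pd.Series([-1] + ([1] * 19))
series_2 = pd.Series(([-1] * 7) + ([1] * 19))

# Non-rolling kurtosis of the whole series.
full_1 = float(series_1.kurt())

# Largest rolling kurtosis; for both series this window holds one -1 and nineteen 1s.
rolling_1 = float(series_1.rolling(window).kurt().max())
rolling_2 = float(series_2.rolling(window).kurt().max())

for name, value in [
    ("series_1.kurt()", full_1),
    ("series_1 rolling max", rolling_1),
    ("series_2 rolling max", rolling_2),
]:
    print(f"{name}: {value!r}")
```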
Issue Description
I ran into a problem calculating rolling kurtosis for a specific kind of data.
For series_1 = pd.Series([-1] + ([1] * 19)), I checked the source code and expected its kurtosis to be 20.00000000000001 because of binary rounding error. While this holds true for series_1.kurt(), the rolling version behaves oddly and returns an exact 20.0.
The numerical inconsistency also exists when I create another series, series_2 = pd.Series(([-1] * 7) + ([1] * 19)). This time it returns 20.00000000000001, which is not equal to the max rolling kurtosis of series_1. However, series_3 would give 20.0.
You can create similar series to see different behaviors. What is the rationale behind this? Why would pandas sometimes give 20.0?
Expected Behavior
Expected all results to be 20.00000000000001.
Installed Versions
INSTALLED VERSIONS
commit : 0f43794
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : 7.4.0
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.4.0
scipy : 1.11.1
snappy :
sqlalchemy : 1.4.39
tables : 3.8.0
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : 0.19.0
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None