Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: concatenating non overlapping time series with non ns unit leads to dataframe with missing data #58471

Open
2 of 3 tasks
jorgehv opened this issue Apr 29, 2024 · 1 comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@jorgehv
Copy link

jorgehv commented Apr 29, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

idx1 = pd.date_range("2024-01-01", periods=24*12, freq="5min", unit="us")
idx2 = pd.date_range("2024-01-02", periods=24*12, freq="5min", unit="us")
idx3 = pd.date_range("2024-01-03", periods=24*12, freq="5min", unit="us")

ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts3 = pd.Series(range(len(idx3)), index=idx3)

df = pd.concat([ts1, ts2, ts3], axis=1)

print(ts1.index)
# DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 00:05:00',
#                '2024-01-01 00:10:00', '2024-01-01 00:15:00',
#                '2024-01-01 00:20:00', '2024-01-01 00:25:00',
#                '2024-01-01 00:30:00', '2024-01-01 00:35:00',
#                '2024-01-01 00:40:00', '2024-01-01 00:45:00',
#                ...
#                '2024-01-01 23:10:00', '2024-01-01 23:15:00',
#                '2024-01-01 23:20:00', '2024-01-01 23:25:00',
#                '2024-01-01 23:30:00', '2024-01-01 23:35:00',
#                '2024-01-01 23:40:00', '2024-01-01 23:45:00',
#                '2024-01-01 23:50:00', '2024-01-01 23:55:00'],
#               dtype='datetime64[us]', length=288, freq='5min')

print(ts2.index)
# DatetimeIndex(['2024-01-02 00:00:00', '2024-01-02 00:05:00',
#                '2024-01-02 00:10:00', '2024-01-02 00:15:00',
#                '2024-01-02 00:20:00', '2024-01-02 00:25:00',
#                '2024-01-02 00:30:00', '2024-01-02 00:35:00',
#                '2024-01-02 00:40:00', '2024-01-02 00:45:00',
#                ...
#                '2024-01-02 23:10:00', '2024-01-02 23:15:00',
#                '2024-01-02 23:20:00', '2024-01-02 23:25:00',
#                '2024-01-02 23:30:00', '2024-01-02 23:35:00',
#                '2024-01-02 23:40:00', '2024-01-02 23:45:00',
#                '2024-01-02 23:50:00', '2024-01-02 23:55:00'],
#               dtype='datetime64[us]', length=288, freq='5min')

print(ts3.index)
# DatetimeIndex(['2024-01-03 00:00:00', '2024-01-03 00:05:00',
#                '2024-01-03 00:10:00', '2024-01-03 00:15:00',
#                '2024-01-03 00:20:00', '2024-01-03 00:25:00',
#                '2024-01-03 00:30:00', '2024-01-03 00:35:00',
#                '2024-01-03 00:40:00', '2024-01-03 00:45:00',
#                ...
#                '2024-01-03 23:10:00', '2024-01-03 23:15:00',
#                '2024-01-03 23:20:00', '2024-01-03 23:25:00',
#                '2024-01-03 23:30:00', '2024-01-03 23:35:00',
#                '2024-01-03 23:40:00', '2024-01-03 23:45:00',
#                '2024-01-03 23:50:00', '2024-01-03 23:55:00'],
#               dtype='datetime64[us]', length=288, freq='5min')

print(df.index)
# DatetimeIndex(['2024-01-01 00:00:00', '2024-01-02 00:00:00',
#                '2024-01-03 00:00:00', '2024-01-03 00:05:00',
#                '2024-01-03 00:10:00', '2024-01-03 00:15:00',
#                '2024-01-03 00:20:00', '2024-01-03 00:25:00',
#                '2024-01-03 00:30:00', '2024-01-03 00:35:00',
#                ...
#                '2024-01-03 23:20:00', '2024-01-03 23:25:00',
#                '2024-01-03 23:30:00', '2024-01-03 23:35:00',
#                '2024-01-03 23:40:00', '2024-01-03 23:45:00',
#                '2024-01-03 23:50:00', '2024-01-03 23:55:00',
#                '2024-01-04 11:20:00', '2024-01-05 11:20:00'],
#               dtype='datetime64[us]', length=292, freq=None)

print(df.to_string())
#                        0    1      2
# 2024-01-01 00:00:00  0.0  NaN    NaN
# 2024-01-02 00:00:00  NaN  0.0    NaN
# 2024-01-03 00:00:00  NaN  NaN    0.0
# 2024-01-03 00:05:00  NaN  NaN    1.0
# 2024-01-03 00:10:00  NaN  NaN    2.0
# 2024-01-03 00:15:00  NaN  NaN    3.0
# 2024-01-03 00:20:00  NaN  NaN    4.0
# 2024-01-03 00:25:00  NaN  NaN    5.0
# 2024-01-03 00:30:00  NaN  NaN    6.0
# 2024-01-03 00:35:00  NaN  NaN    7.0
# 2024-01-03 00:40:00  NaN  NaN    8.0
# 2024-01-03 00:45:00  NaN  NaN    9.0
# 2024-01-03 00:50:00  NaN  NaN   10.0
# ...
# 2024-01-03 22:50:00  NaN  NaN  274.0
# 2024-01-03 22:55:00  NaN  NaN  275.0
# 2024-01-03 23:00:00  NaN  NaN  276.0
# 2024-01-03 23:05:00  NaN  NaN  277.0
# 2024-01-03 23:10:00  NaN  NaN  278.0
# 2024-01-03 23:15:00  NaN  NaN  279.0
# 2024-01-03 23:20:00  NaN  NaN  280.0
# 2024-01-03 23:25:00  NaN  NaN  281.0
# 2024-01-03 23:30:00  NaN  NaN  282.0
# 2024-01-03 23:35:00  NaN  NaN  283.0
# 2024-01-03 23:40:00  NaN  NaN  284.0
# 2024-01-03 23:45:00  NaN  NaN  285.0
# 2024-01-03 23:50:00  NaN  NaN  286.0
# 2024-01-03 23:55:00  NaN  NaN  287.0
# 2024-01-04 11:20:00  NaN  NaN    NaN
# 2024-01-05 11:20:00  NaN  NaN    NaN

df2 = pd.concat([ts1, ts2], axis=1)
print(df2.index)
# DatetimeIndex(['2024-01-01 00:00:00', '2024-01-02 00:00:00',
#                '2024-01-04 11:20:00', '2024-01-05 11:20:00'],
#               dtype='datetime64[us]', freq=None)

print(df2.to_string())
#                        0    1
# 2024-01-01 00:00:00  0.0  NaN
# 2024-01-02 00:00:00  NaN  0.0
# 2024-01-04 11:20:00  NaN  NaN
# 2024-01-05 11:20:00  NaN  NaN

Issue Description

When trying to concatenate by column few non overlapping timeseries dataframes, if the units of the original dataframes are not 'ns' then the resulting dataframe will have missing data (and lose it's frequency value, in case it's relevant).

The example given has 3 dataframes and for some reason the result has missed most of the data from the first and the second dataframe. In case of concatenating 2 dataframes we end up with almost no data at all.

If we set the units to 'ns' everything works as expected, the resulting df has all the data and kept its frequency='5min'. Every other unit I tried failed with similar results than the example.

Expected Behavior

import pandas as pd

idx1 = pd.date_range("2024-01-01", periods=24*12, freq="5min", unit="ns")
idx2 = pd.date_range("2024-01-02", periods=24*12, freq="5min", unit="ns")
idx3 = pd.date_range("2024-01-03", periods=24*12, freq="5min", unit="ns")

ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts3 = pd.Series(range(len(idx3)), index=idx3)

df = pd.concat([ts1, ts2, ts3], axis=1)

print(ts1.index)
# DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 00:05:00',
#                '2024-01-01 00:10:00', '2024-01-01 00:15:00',
#                '2024-01-01 00:20:00', '2024-01-01 00:25:00',
#                '2024-01-01 00:30:00', '2024-01-01 00:35:00',
#                '2024-01-01 00:40:00', '2024-01-01 00:45:00',
#                ...
#                '2024-01-01 23:10:00', '2024-01-01 23:15:00',
#                '2024-01-01 23:20:00', '2024-01-01 23:25:00',
#                '2024-01-01 23:30:00', '2024-01-01 23:35:00',
#                '2024-01-01 23:40:00', '2024-01-01 23:45:00',
#                '2024-01-01 23:50:00', '2024-01-01 23:55:00'],
#               dtype='datetime64[ns]', length=288, freq='5min')

print(ts2.index)
# DatetimeIndex(['2024-01-02 00:00:00', '2024-01-02 00:05:00',
#                '2024-01-02 00:10:00', '2024-01-02 00:15:00',
#                '2024-01-02 00:20:00', '2024-01-02 00:25:00',
#                '2024-01-02 00:30:00', '2024-01-02 00:35:00',
#                '2024-01-02 00:40:00', '2024-01-02 00:45:00',
#                ...
#                '2024-01-02 23:10:00', '2024-01-02 23:15:00',
#                '2024-01-02 23:20:00', '2024-01-02 23:25:00',
#                '2024-01-02 23:30:00', '2024-01-02 23:35:00',
#                '2024-01-02 23:40:00', '2024-01-02 23:45:00',
#                '2024-01-02 23:50:00', '2024-01-02 23:55:00'],
#               dtype='datetime64[ns]', length=288, freq='5min')

print(ts3.index)
# DatetimeIndex(['2024-01-03 00:00:00', '2024-01-03 00:05:00',
#                '2024-01-03 00:10:00', '2024-01-03 00:15:00',
#                '2024-01-03 00:20:00', '2024-01-03 00:25:00',
#                '2024-01-03 00:30:00', '2024-01-03 00:35:00',
#                '2024-01-03 00:40:00', '2024-01-03 00:45:00',
#                ...
#                '2024-01-03 23:10:00', '2024-01-03 23:15:00',
#                '2024-01-03 23:20:00', '2024-01-03 23:25:00',
#                '2024-01-03 23:30:00', '2024-01-03 23:35:00',
#                '2024-01-03 23:40:00', '2024-01-03 23:45:00',
#                '2024-01-03 23:50:00', '2024-01-03 23:55:00'],
#               dtype='datetime64[ns]', length=288, freq='5min')

print(df.index)
# DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 00:05:00',
#                '2024-01-01 00:10:00', '2024-01-01 00:15:00',
#                '2024-01-01 00:20:00', '2024-01-01 00:25:00',
#                '2024-01-01 00:30:00', '2024-01-01 00:35:00',
#                '2024-01-01 00:40:00', '2024-01-01 00:45:00',
#                ...
#                '2024-01-03 23:10:00', '2024-01-03 23:15:00',
#                '2024-01-03 23:20:00', '2024-01-03 23:25:00',
#                '2024-01-03 23:30:00', '2024-01-03 23:35:00',
#                '2024-01-03 23:40:00', '2024-01-03 23:45:00',
#                '2024-01-03 23:50:00', '2024-01-03 23:55:00'],
#               dtype='datetime64[ns]', length=864, freq='5min')

print(df.to_string())
#                          0      1      2
# 2024-01-01 00:00:00    0.0    NaN    NaN
# 2024-01-01 00:05:00    1.0    NaN    NaN
# 2024-01-01 00:10:00    2.0    NaN    NaN
# 2024-01-01 00:15:00    3.0    NaN    NaN
# 2024-01-01 00:20:00    4.0    NaN    NaN
# 2024-01-01 00:25:00    5.0    NaN    NaN
# 2024-01-01 00:30:00    6.0    NaN    NaN
# 2024-01-01 00:35:00    7.0    NaN    NaN
# ...
# 2024-01-01 23:40:00  284.0    NaN    NaN
# 2024-01-01 23:45:00  285.0    NaN    NaN
# 2024-01-01 23:50:00  286.0    NaN    NaN
# 2024-01-01 23:55:00  287.0    NaN    NaN
# 2024-01-02 00:00:00    NaN    0.0    NaN
# 2024-01-02 00:05:00    NaN    1.0    NaN
# 2024-01-02 00:10:00    NaN    2.0    NaN
# 2024-01-02 00:15:00    NaN    3.0    NaN
# ...
# 2024-01-02 23:35:00    NaN  283.0    NaN
# 2024-01-02 23:40:00    NaN  284.0    NaN
# 2024-01-02 23:45:00    NaN  285.0    NaN
# 2024-01-02 23:50:00    NaN  286.0    NaN
# 2024-01-02 23:55:00    NaN  287.0    NaN
# 2024-01-03 00:00:00    NaN    NaN    0.0
# 2024-01-03 00:05:00    NaN    NaN    1.0
# 2024-01-03 00:10:00    NaN    NaN    2.0
# 2024-01-03 00:15:00    NaN    NaN    3.0
# 2024-01-03 00:20:00    NaN    NaN    4.0
# ...
# 2024-01-03 23:25:00    NaN    NaN  281.0
# 2024-01-03 23:30:00    NaN    NaN  282.0
# 2024-01-03 23:35:00    NaN    NaN  283.0
# 2024-01-03 23:40:00    NaN    NaN  284.0
# 2024-01-03 23:45:00    NaN    NaN  285.0
# 2024-01-03 23:50:00    NaN    NaN  286.0
# 2024-01-03 23:55:00    NaN    NaN  287.0

df2 = pd.concat([ts1, ts2], axis=1)
print(df2.index)
# DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 00:05:00',
#                '2024-01-01 00:10:00', '2024-01-01 00:15:00',
#                '2024-01-01 00:20:00', '2024-01-01 00:25:00',
#                '2024-01-01 00:30:00', '2024-01-01 00:35:00',
#                '2024-01-01 00:40:00', '2024-01-01 00:45:00',
#                ...
#                '2024-01-02 23:10:00', '2024-01-02 23:15:00',
#                '2024-01-02 23:20:00', '2024-01-02 23:25:00',
#                '2024-01-02 23:30:00', '2024-01-02 23:35:00',
#                '2024-01-02 23:40:00', '2024-01-02 23:45:00',
#                '2024-01-02 23:50:00', '2024-01-02 23:55:00'],
#               dtype='datetime64[ns]', length=576, freq='5min')

print(df2.to_string())
#                          0      1
# 2024-01-01 00:00:00    0.0    NaN
# 2024-01-01 00:05:00    1.0    NaN
# 2024-01-01 00:10:00    2.0    NaN
# 2024-01-01 00:15:00    3.0    NaN
# 2024-01-01 00:20:00    4.0    NaN
# 2024-01-01 00:25:00    5.0    NaN
# 2024-01-01 00:30:00    6.0    NaN
# ...
# 2024-01-01 23:25:00  281.0    NaN
# 2024-01-01 23:30:00  282.0    NaN
# 2024-01-01 23:35:00  283.0    NaN
# 2024-01-01 23:40:00  284.0    NaN
# 2024-01-01 23:45:00  285.0    NaN
# 2024-01-01 23:50:00  286.0    NaN
# 2024-01-01 23:55:00  287.0    NaN
# 2024-01-02 00:00:00    NaN    0.0
# 2024-01-02 00:05:00    NaN    1.0
# 2024-01-02 00:10:00    NaN    2.0
# 2024-01-02 00:15:00    NaN    3.0
# 2024-01-02 00:20:00    NaN    4.0
# 2024-01-02 00:25:00    NaN    5.0
# ...
# 2024-01-02 23:15:00    NaN  279.0
# 2024-01-02 23:20:00    NaN  280.0
# 2024-01-02 23:25:00    NaN  281.0
# 2024-01-02 23:30:00    NaN  282.0
# 2024-01-02 23:35:00    NaN  283.0
# 2024-01-02 23:40:00    NaN  284.0
# 2024-01-02 23:45:00    NaN  285.0
# 2024-01-02 23:50:00    NaN  286.0
# 2024-01-02 23:55:00    NaN  287.0


### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.10.13.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 4.18.0-372.9.1.el8.x86_64
Version               : #1 SMP Fri Apr 15 22:12:19 EDT 2022
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : C
LOCALE                : en_US.UTF-8

pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 67.7.2
pip                   : 23.1.2
Cython                : None
pytest                : 8.1.1
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.23.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.4
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 16.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.13.0
sqlalchemy            : None
tables                : None
tabulate              : 0.9.0
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
</details>
@jorgehv jorgehv added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 29, 2024
@nkonts
Copy link

nkonts commented May 2, 2024

I am not sure if it is only related to the unit. Here is an example with ns (and with us as well). Same thing happens with freq="MS". But changing it to freq="D" fixes it.

Also notice that in both outputs the freq attribute of the index changes from ME to None by simply reordering the time series.

Reproduceable Example

import pandas as pd

idx1 = pd.date_range(start="2015-01-31", end="2023-01-31", freq="ME", unit="ns")
idx2 = pd.date_range(start="2014-01-31", end="2024-01-31'", freq="ME", unit="ns")

ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)

# Goes from 2014-01-31 (some values of ts1 are missingg) to 2023-01-31 (the values where ts1 is missing are dropped)
df1 = pd.concat([ts1, ts2], axis=1).sort_index()
print(df1.index)

# DatetimeIndex(['2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
#                '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31',
#                '2014-09-30', '2014-10-31',
#                ...
#                '2022-04-30', '2022-05-31', '2022-06-30', '2022-07-31',
#                '2022-08-31', '2022-09-30', '2022-10-31', '2022-11-30',
#                '2022-12-31', '2023-01-31'],
#               dtype='datetime64[ns]', length=109, freq=None)


# As expected output from 2014-01-31 to 2024-01-31
df2 = pd.concat([ts2, ts1], axis=1).sort_index()
print(df2.index)

# DatetimeIndex(['2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
#                '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31',
#                '2014-09-30', '2014-10-31',
#                ...
#                '2023-04-30', '2023-05-31', '2023-06-30', '2023-07-31',
#                '2023-08-31', '2023-09-30', '2023-10-31', '2023-11-30',
#                '2023-12-31', '2024-01-31'],
#               dtype='datetime64[ns]', length=121, freq='ME')

INSTALLED VERSIONS
------------------
commit                : f538741432edf55c6b9fb5d0d496d2dd1d7c2457
python                : 3.11.5.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.22621
machine               : AMD64
processor             : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : de_DE.cp1252

pandas                : 2.2.0
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 68.0.0
pip                   : 23.2.1
Cython                : 3.0.3
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : 3.1.8
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.2
IPython               : 8.16.1
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.2
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.3.1
gcsfs                 : None
matplotlib            : 3.8.3
numba                 : 0.59.1
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
pyarrow               : 13.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : 2.0.28
tables                : None
tabulate              : None
xarray                : 2024.2.0
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

2 participants